Problem: Currently tasks attempting to allocate more hugetlb memory than is available get a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1]. However, if a task attempts to allocate hugetlb memory only more than its hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call, but will SIGBUS the task when it attempts to fault the memory in.
We have developers interested in using hugetlb_cgroups, and they have expressed dissatisfaction regarding this behavior. We'd like to improve this behavior such that tasks violating the hugetlb_cgroup limits get an error on mmap/shmget time, rather than getting SIGBUS'd when they try to fault the excess memory in.
The underlying problem is that today's hugetlb_cgroup accounting happens at hugetlb memory *fault* time, rather than at *reservation* time. Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and the offending task gets SIGBUS'd.
Proposed Solution: A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This counter has slightly different semantics than hugetlb.xMB.[limit|usage]_in_bytes:
- While usage_in_bytes tracks all *faulted* hugetlb memory, reservation_usage_in_bytes tracks all *reserved* hugetlb memory.
- If a task attempts to reserve more memory than limit_in_bytes allows, the kernel will allow it to do so. But if a task attempts to reserve more memory than reservation_limit_in_bytes, the kernel will fail this reservation.
This proposal is implemented in this patch, with tests to verify functionality and show the usage.
Alternatives considered: 1. A new cgroup, instead of only a new page_counter attached to the existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code duplication with hugetlb_cgroup. Keeping hugetlb related page counters under hugetlb_cgroup seemed cleaner as well.
2. Instead of adding a new counter, we considered adding a sysctl that modifies the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do accounting at reservation time rather than fault time. Adding a new page_counter seems better as userspace could, if it wants, choose to enforce different cgroups differently: one via limit_in_bytes, and another via reservation_limit_in_bytes. This could be very useful if you're transitioning how hugetlb memory is partitioned on your system one cgroup at a time, for example. Also, someone may find usage for both limit_in_bytes and reservation_limit_in_bytes concurrently, and this approach gives them the option to do so.
Caveats: 1. This support is implemented for cgroups-v1. I have not tried hugetlb_cgroups with cgroups v2, and AFAICT it's not supported yet. This is largely because we use cgroups-v1 for now. If required, I can add hugetlb_cgroup support to cgroups v2 in this patch or a follow up. 2. Most complicated bit of this patch I believe is: where to store the pointer to the hugetlb_cgroup to uncharge at unreservation time? Normally the cgroup pointers hang off the struct page. But, with hugetlb_cgroup reservations, one task can reserve a specific page and another task may fault it in (I believe), so storing the pointer in struct page is not appropriate. Proposed approach here is to store the pointer in the resv_map. See patch for details.
Signed-off-by: Mina Almasry almasrymina@google.com
[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html
Changes in v2: - Split the patch into a 5 patch series. - Fixed patch subject.
Mina Almasry (5): hugetlb_cgroup: Add hugetlb_cgroup reservation counter hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations hugetlb_cgroup: add reservation accounting for private mappings hugetlb_cgroup: add accounting for shared mappings hugetlb_cgroup: Add hugetlb_cgroup reservation tests
include/linux/hugetlb.h | 10 +- include/linux/hugetlb_cgroup.h | 19 +- mm/hugetlb.c | 256 ++++++++-- mm/hugetlb_cgroup.c | 153 +++++- tools/testing/selftests/vm/.gitignore | 1 + tools/testing/selftests/vm/Makefile | 4 + .../selftests/vm/charge_reserved_hugetlb.sh | 438 ++++++++++++++++++ .../selftests/vm/write_hugetlb_memory.sh | 22 + .../testing/selftests/vm/write_to_hugetlbfs.c | 252 ++++++++++ 9 files changed, 1087 insertions(+), 68 deletions(-) create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c
-- 2.23.0.rc1.153.gdeed80330f-goog
These counters will track hugetlb reservations rather than hugetlb memory faulted in. This patch only adds the counter, following patches add the charging and uncharging of the counter. --- include/linux/hugetlb.h | 2 +- mm/hugetlb_cgroup.c | 86 +++++++++++++++++++++++++++++++++++++---- 2 files changed, 80 insertions(+), 8 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index edfca42783192..6777b3013345d 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -340,7 +340,7 @@ struct hstate { unsigned int surplus_huge_pages_node[MAX_NUMNODES]; #ifdef CONFIG_CGROUP_HUGETLB /* cgroup control files */ - struct cftype cgroup_files[5]; + struct cftype cgroup_files[9]; #endif char name[HSTATE_NAME_LEN]; }; diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c index 68c2f2f3c05b7..708103663988a 100644 --- a/mm/hugetlb_cgroup.c +++ b/mm/hugetlb_cgroup.c @@ -25,6 +25,10 @@ struct hugetlb_cgroup { * the counter to account for hugepages from hugetlb. */ struct page_counter hugepage[HUGE_MAX_HSTATE]; + /* + * the counter to account for hugepage reservations from hugetlb. + */ + struct page_counter reserved_hugepage[HUGE_MAX_HSTATE]; };
#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val)) @@ -33,6 +37,15 @@ struct hugetlb_cgroup {
static struct hugetlb_cgroup *root_h_cgroup __read_mostly;
+static inline +struct page_counter *get_counter(struct hugetlb_cgroup *h_cg, int idx, + bool reserved) +{ + if (reserved) + return &h_cg->reserved_hugepage[idx]; + return &h_cg->hugepage[idx]; +} + static inline struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s) { @@ -256,28 +269,42 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
enum { RES_USAGE, + RES_RESERVATION_USAGE, RES_LIMIT, + RES_RESERVATION_LIMIT, RES_MAX_USAGE, + RES_RESERVATION_MAX_USAGE, RES_FAILCNT, + RES_RESERVATION_FAILCNT, };
static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css, struct cftype *cft) { struct page_counter *counter; + struct page_counter *reserved_counter; struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)]; + reserved_counter = &h_cg->reserved_hugepage[MEMFILE_IDX(cft->private)];
switch (MEMFILE_ATTR(cft->private)) { case RES_USAGE: return (u64)page_counter_read(counter) * PAGE_SIZE; + case RES_RESERVATION_USAGE: + return (u64)page_counter_read(reserved_counter) * PAGE_SIZE; case RES_LIMIT: return (u64)counter->max * PAGE_SIZE; + case RES_RESERVATION_LIMIT: + return (u64)reserved_counter->max * PAGE_SIZE; case RES_MAX_USAGE: return (u64)counter->watermark * PAGE_SIZE; + case RES_RESERVATION_MAX_USAGE: + return (u64)reserved_counter->watermark * PAGE_SIZE; case RES_FAILCNT: return counter->failcnt; + case RES_RESERVATION_FAILCNT: + return reserved_counter->failcnt; default: BUG(); } @@ -291,6 +318,7 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of, int ret, idx; unsigned long nr_pages; struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of)); + bool reserved = false;
if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */ return -EINVAL; @@ -303,10 +331,16 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of, idx = MEMFILE_IDX(of_cft(of)->private); nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx]));
+ if (MEMFILE_ATTR(of_cft(of)->private) == RES_RESERVATION_LIMIT) { + reserved = true; + } + switch (MEMFILE_ATTR(of_cft(of)->private)) { + case RES_RESERVATION_LIMIT: case RES_LIMIT: mutex_lock(&hugetlb_limit_mutex); - ret = page_counter_set_max(&h_cg->hugepage[idx], nr_pages); + ret = page_counter_set_max(get_counter(h_cg, idx, reserved), + nr_pages); mutex_unlock(&hugetlb_limit_mutex); break; default: @@ -320,18 +354,26 @@ static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { int ret = 0; - struct page_counter *counter; + struct page_counter *counter, *reserved_counter; struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)]; + reserved_counter = &h_cg->reserved_hugepage[ + MEMFILE_IDX(of_cft(of)->private)];
switch (MEMFILE_ATTR(of_cft(of)->private)) { case RES_MAX_USAGE: page_counter_reset_watermark(counter); break; + case RES_RESERVATION_MAX_USAGE: + page_counter_reset_watermark(reserved_counter); + break; case RES_FAILCNT: counter->failcnt = 0; break; + case RES_RESERVATION_FAILCNT: + reserved_counter->failcnt = 0; + break; default: ret = -EINVAL; break; @@ -357,7 +399,7 @@ static void __init __hugetlb_cgroup_file_init(int idx) struct hstate *h = &hstates[idx];
/* format the size */ - mem_fmt(buf, 32, huge_page_size(h)); + mem_fmt(buf, sizeof(buf), huge_page_size(h));
/* Add the limit file */ cft = &h->cgroup_files[0]; @@ -366,28 +408,58 @@ static void __init __hugetlb_cgroup_file_init(int idx) cft->read_u64 = hugetlb_cgroup_read_u64; cft->write = hugetlb_cgroup_write;
- /* Add the usage file */ + /* Add the reservation limit file */ cft = &h->cgroup_files[1]; + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_limit_in_bytes", + buf); + cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_LIMIT); + cft->read_u64 = hugetlb_cgroup_read_u64; + cft->write = hugetlb_cgroup_write; + + /* Add the usage file */ + cft = &h->cgroup_files[2]; snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf); cft->private = MEMFILE_PRIVATE(idx, RES_USAGE); cft->read_u64 = hugetlb_cgroup_read_u64;
+ /* Add the reservation usage file */ + cft = &h->cgroup_files[3]; + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_usage_in_bytes", + buf); + cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_USAGE); + cft->read_u64 = hugetlb_cgroup_read_u64; + /* Add the MAX usage file */ - cft = &h->cgroup_files[2]; + cft = &h->cgroup_files[4]; snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf); cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE); cft->write = hugetlb_cgroup_reset; cft->read_u64 = hugetlb_cgroup_read_u64;
+ /* Add the MAX reservation usage file */ + cft = &h->cgroup_files[5]; + snprintf(cft->name, MAX_CFTYPE_NAME, + "%s.reservation_max_usage_in_bytes", buf); + cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_MAX_USAGE); + cft->write = hugetlb_cgroup_reset; + cft->read_u64 = hugetlb_cgroup_read_u64; + /* Add the failcntfile */ - cft = &h->cgroup_files[3]; + cft = &h->cgroup_files[6]; snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf); cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT); cft->write = hugetlb_cgroup_reset; cft->read_u64 = hugetlb_cgroup_read_u64;
+ /* Add the reservation failcntfile */ + cft = &h->cgroup_files[7]; + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_failcnt", buf); + cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT); + cft->write = hugetlb_cgroup_reset; + cft->read_u64 = hugetlb_cgroup_read_u64; + /* NULL terminate the last cft */ - cft = &h->cgroup_files[4]; + cft = &h->cgroup_files[8]; memset(cft, 0, sizeof(*cft));
WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys, -- 2.23.0.rc1.153.gdeed80330f-goog
Augments hugetlb_cgroup_charge_cgroup to be able to charge hugetlb usage or hugetlb reservation counter.
Adds a new interface to uncharge a hugetlb_cgroup counter via hugetlb_cgroup_uncharge_counter.
Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init, hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.
--- include/linux/hugetlb_cgroup.h | 8 +++-- mm/hugetlb.c | 3 +- mm/hugetlb_cgroup.c | 63 ++++++++++++++++++++++++++++------ 3 files changed, 61 insertions(+), 13 deletions(-)
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h index 063962f6dfc6a..0725f809cd2d9 100644 --- a/include/linux/hugetlb_cgroup.h +++ b/include/linux/hugetlb_cgroup.h @@ -52,7 +52,8 @@ static inline bool hugetlb_cgroup_disabled(void) }
extern int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup **ptr); + struct hugetlb_cgroup **ptr, + bool reserved); extern void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, struct hugetlb_cgroup *h_cg, struct page *page); @@ -60,6 +61,9 @@ extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, struct page *page); extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, struct hugetlb_cgroup *h_cg); +extern void hugetlb_cgroup_uncharge_counter(struct page_counter *p, + unsigned long nr_pages); + extern void hugetlb_cgroup_file_init(void) __init; extern void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage); @@ -83,7 +87,7 @@ static inline bool hugetlb_cgroup_disabled(void)
static inline int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup **ptr) + struct hugetlb_cgroup **ptr, bool reserved) { return 0; } diff --git a/mm/hugetlb.c b/mm/hugetlb.c index ede7e7f5d1ab2..c153bef42e729 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2078,7 +2078,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, gbl_chg = 1; }
- ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg); + ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg, + false); if (ret) goto out_subpool_put;
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c index 708103663988a..119176a0b2ec5 100644 --- a/mm/hugetlb_cgroup.c +++ b/mm/hugetlb_cgroup.c @@ -74,8 +74,10 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg) int idx;
for (idx = 0; idx < hugetlb_max_hstate; idx++) { - if (page_counter_read(&h_cg->hugepage[idx])) + if (page_counter_read(get_counter(h_cg, idx, true)) || + page_counter_read(get_counter(h_cg, idx, false))) { return true; + } } return false; } @@ -86,18 +88,27 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup, int idx;
for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) { - struct page_counter *counter = &h_cgroup->hugepage[idx]; struct page_counter *parent = NULL; + struct page_counter *reserved_parent = NULL; unsigned long limit; int ret;
- if (parent_h_cgroup) - parent = &parent_h_cgroup->hugepage[idx]; - page_counter_init(counter, parent); + if (parent_h_cgroup) { + parent = get_counter(parent_h_cgroup, idx, false); + reserved_parent = get_counter(parent_h_cgroup, idx, + true); + } + page_counter_init(get_counter(h_cgroup, idx, false), parent); + page_counter_init(get_counter(h_cgroup, idx, true), + reserved_parent);
limit = round_down(PAGE_COUNTER_MAX, 1 << huge_page_order(&hstates[idx])); - ret = page_counter_set_max(counter, limit); + + ret = page_counter_set_max(get_counter( + h_cgroup, idx, false), limit); + ret = page_counter_set_max(get_counter( + h_cgroup, idx, true), limit); VM_BUG_ON(ret); } } @@ -127,6 +138,25 @@ static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css) kfree(h_cgroup); }
+static void hugetlb_cgroup_move_parent_reservation(int idx, + struct hugetlb_cgroup *h_cg) +{ + struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg); + + /* Move the reservation counters. */ + if (!parent_hugetlb_cgroup(h_cg)) { + parent = root_h_cgroup; + /* root has no limit */ + page_counter_charge( + &root_h_cgroup->reserved_hugepage[idx], + page_counter_read(get_counter(h_cg, idx, + true))); + } + + /* Take the pages off the local counter */ + page_counter_cancel(get_counter(h_cg, idx, true), + page_counter_read(get_counter(h_cg, idx, true))); +}
/* * Should be called with hugetlb_lock held. @@ -181,6 +211,7 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css) do { for_each_hstate(h) { spin_lock(&hugetlb_lock); + hugetlb_cgroup_move_parent_reservation(idx, h_cg); list_for_each_entry(page, &h->hugepage_activelist, lru) hugetlb_cgroup_move_parent(idx, h_cg, page);
@@ -192,7 +223,7 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css) }
int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup **ptr) + struct hugetlb_cgroup **ptr, bool reserved) { int ret = 0; struct page_counter *counter; @@ -215,8 +246,10 @@ int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, } rcu_read_unlock();
- if (!page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter)) + if (!page_counter_try_charge(get_counter(h_cg, idx, reserved), + nr_pages, &counter)) { ret = -ENOMEM; + } css_put(&h_cg->css); done: *ptr = h_cg; @@ -250,7 +283,8 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, if (unlikely(!h_cg)) return; set_hugetlb_cgroup(page, NULL); - page_counter_uncharge(&h_cg->hugepage[idx], nr_pages); + page_counter_uncharge(get_counter(h_cg, idx, false), nr_pages); + return; }
@@ -263,7 +297,16 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER) return;
- page_counter_uncharge(&h_cg->hugepage[idx], nr_pages); + page_counter_uncharge(get_counter(h_cg, idx, false), nr_pages); +} + +void hugetlb_cgroup_uncharge_counter(struct page_counter *p, + unsigned long nr_pages) +{ + if (hugetlb_cgroup_disabled() || !p) + return; + + page_counter_uncharge(p, nr_pages); return; }
-- 2.23.0.rc1.153.gdeed80330f-goog
Normally the pointer to the cgroup to uncharge hangs off the struct page, and gets queried when it's time to free the page. With hugetlb_cgroup reservations, this is not possible. Because it's possible for a page to be reserved by one task and actually faulted in by another task.
The best place to put the hugetlb_cgroup pointer to uncharge for reservations is in the resv_map. But, because the resv_map has different semantics for private and shared mappings, the code patch to charge/uncharge shared and private mappings is different. This patch implements charging and uncharging for private mappings.
For private mappings, the counter to uncharge is in resv_map->reservation_counter. On initializing the resv_map this is set to NULL. On reservation of a region in private mapping, the tasks hugetlb_cgroup is charged and the hugetlb_cgroup is placed is resv_map->reservation_counter.
On hugetlb_vm_op_close, we uncharge resv_map->reservation_counter.
--- include/linux/hugetlb.h | 8 ++++++ include/linux/hugetlb_cgroup.h | 11 ++++++++ mm/hugetlb.c | 47 ++++++++++++++++++++++++++++++++-- mm/hugetlb_cgroup.c | 12 --------- 4 files changed, 64 insertions(+), 14 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 6777b3013345d..90b3c928d16c1 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -46,6 +46,14 @@ struct resv_map { long adds_in_progress; struct list_head region_cache; long region_cache_count; + #ifdef CONFIG_CGROUP_HUGETLB + /* + * On private mappings, the counter to uncharge reservations is stored + * here. If these fields are 0, then the mapping is shared. + */ + struct page_counter *reservation_counter; + unsigned long pages_per_hpage; +#endif }; extern struct resv_map *resv_map_alloc(void); void resv_map_release(struct kref *ref); diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h index 0725f809cd2d9..1fdde63a4e775 100644 --- a/include/linux/hugetlb_cgroup.h +++ b/include/linux/hugetlb_cgroup.h @@ -25,6 +25,17 @@ struct hugetlb_cgroup; #define HUGETLB_CGROUP_MIN_ORDER 2
#ifdef CONFIG_CGROUP_HUGETLB +struct hugetlb_cgroup { + struct cgroup_subsys_state css; + /* + * the counter to account for hugepages from hugetlb. + */ + struct page_counter hugepage[HUGE_MAX_HSTATE]; + /* + * the counter to account for hugepage reservations from hugetlb. + */ + struct page_counter reserved_hugepage[HUGE_MAX_HSTATE]; +};
static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page) { diff --git a/mm/hugetlb.c b/mm/hugetlb.c index c153bef42e729..235996aef6618 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -711,6 +711,16 @@ struct resv_map *resv_map_alloc(void) INIT_LIST_HEAD(&resv_map->regions);
resv_map->adds_in_progress = 0; +#ifdef CONFIG_CGROUP_HUGETLB + /* + * Initialize these to 0. On shared mappings, 0's here indicate these + * fields don't do cgroup accounting. On private mappings, these will be + * re-initialized to the proper values, to indicate that hugetlb cgroup + * reservations are to be un-charged from here. + */ + resv_map->reservation_counter = NULL; + resv_map->pages_per_hpage = 0; +#endif
INIT_LIST_HEAD(&resv_map->region_cache); list_add(&rg->link, &resv_map->region_cache); @@ -3192,7 +3202,19 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
reserve = (end - start) - region_count(resv, start, end);
- kref_put(&resv->refs, resv_map_release); +#ifdef CONFIG_CGROUP_HUGETLB + /* + * Since we check for HPAGE_RESV_OWNER above, this must a private + * mapping, and these values should be none-zero, and should point to + * the hugetlb_cgroup counter to uncharge for this reservation. + */ + WARN_ON(!resv->reservation_counter); + WARN_ON(!resv->pages_per_hpage); + + hugetlb_cgroup_uncharge_counter( + resv->reservation_counter, + (end - start) * resv->pages_per_hpage); +#endif
if (reserve) { /* @@ -3202,6 +3224,8 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma) gbl_reserve = hugepage_subpool_put_pages(spool, reserve); hugetlb_acct_memory(h, -gbl_reserve); } + + kref_put(&resv->refs, resv_map_release); }
static int hugetlb_vm_op_split(struct vm_area_struct *vma, unsigned long addr) @@ -4516,6 +4540,7 @@ int hugetlb_reserve_pages(struct inode *inode, struct hstate *h = hstate_inode(inode); struct hugepage_subpool *spool = subpool_inode(inode); struct resv_map *resv_map; + struct hugetlb_cgroup *h_cg; long gbl_reserve;
/* This should never happen */ @@ -4549,11 +4574,29 @@ int hugetlb_reserve_pages(struct inode *inode, chg = region_chg(resv_map, from, to);
} else { + /* Private mapping. */ + chg = to - from; + + if (hugetlb_cgroup_charge_cgroup( + hstate_index(h), + chg * pages_per_huge_page(h), + &h_cg, true)) { + return -ENOMEM; + } + resv_map = resv_map_alloc(); if (!resv_map) return -ENOMEM;
- chg = to - from; +#ifdef CONFIG_CGROUP_HUGETLB + /* + * Since this branch handles private mappings, we attach the + * counter to uncharge for this reservation off resv_map. + */ + resv_map->reservation_counter = + &h_cg->reserved_hugepage[hstate_index(h)]; + resv_map->pages_per_hpage = pages_per_huge_page(h); +#endif
set_vma_resv_map(vma, resv_map); set_vma_resv_flags(vma, HPAGE_RESV_OWNER); diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c index 119176a0b2ec5..06e99ae1fec81 100644 --- a/mm/hugetlb_cgroup.c +++ b/mm/hugetlb_cgroup.c @@ -19,18 +19,6 @@ #include <linux/hugetlb.h> #include <linux/hugetlb_cgroup.h>
-struct hugetlb_cgroup { - struct cgroup_subsys_state css; - /* - * the counter to account for hugepages from hugetlb. - */ - struct page_counter hugepage[HUGE_MAX_HSTATE]; - /* - * the counter to account for hugepage reservations from hugetlb. - */ - struct page_counter reserved_hugepage[HUGE_MAX_HSTATE]; -}; - #define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val)) #define MEMFILE_IDX(val) (((val) >> 16) & 0xffff) #define MEMFILE_ATTR(val) ((val) & 0xffff) -- 2.23.0.rc1.153.gdeed80330f-goog
For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives in the resv_map entries, in file_region->reservation_counter.
When a file_region entry is added to the resv_map via region_add, we also charge the appropriate hugetlb_cgroup and put the pointer to that in file_region->reservation_counter. This is slightly delicate since we need to not modify the resv_map until we know that charging the reservation has succeeded. If charging doesn't succeed, we report the error to the caller, so that the kernel fails the reservation.
On region_del, which is when the hugetlb memory is unreserved, we delete the file_region entry in the resv_map, but also uncharge the file_region->reservation_counter.
--- mm/hugetlb.c | 208 +++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 170 insertions(+), 38 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 235996aef6618..d76e3137110ab 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -242,8 +242,72 @@ struct file_region { struct list_head link; long from; long to; +#ifdef CONFIG_CGROUP_HUGETLB + /* + * On shared mappings, each reserved region appears as a struct + * file_region in resv_map. These fields hold the info needed to + * uncharge each reservation. + */ + struct page_counter *reservation_counter; + unsigned long pages_per_hpage; +#endif };
+/* Must be called with resv->lock held. Calling this with dry_run == true will + * count the number of pages added but will not modify the linked list. + */ +static long consume_regions_we_overlap_with(struct file_region *rg, + struct list_head *head, long f, long *t, + struct hugetlb_cgroup *h_cg, + struct hstate *h, + bool dry_run) +{ + long add = 0; + struct file_region *trg = NULL, *nrg = NULL; + + /* Consume any regions we now overlap with. */ + nrg = rg; + list_for_each_entry_safe(rg, trg, rg->link.prev, link) { + if (&rg->link == head) + break; + if (rg->from > *t) + break; + + /* If this area reaches higher then extend our area to + * include it completely. If this is not the first area + * which we intend to reuse, free it. + */ + if (rg->to > *t) + *t = rg->to; + if (rg != nrg) { + /* Decrement return value by the deleted range. + * Another range will span this area so that by + * end of routine add will be >= zero + */ + add -= (rg->to - rg->from); + if (!dry_run) { + list_del(&rg->link); + kfree(rg); + } + } + } + + add += (nrg->from - f); /* Added to beginning of region */ + add += *t - nrg->to; /* Added to end of region */ + + if (!dry_run) { + nrg->from = f; + nrg->to = *t; +#ifdef CONFIG_CGROUP_HUGETLB + nrg->reservation_counter = + &h_cg->reserved_hugepage[hstate_index(h)]; + nrg->pages_per_hpage = pages_per_huge_page(h); +#endif + } + + return add; +} + /* * Add the huge page range represented by [f, t) to the reserve * map. In the normal case, existing regions will be expanded @@ -258,11 +322,13 @@ struct file_region { * Return the number of new huge pages added to the map. This * number is greater than or equal to zero. */ -static long region_add(struct resv_map *resv, long f, long t) +static long region_add(struct hstate *h, struct resv_map *resv, long f, long t) { struct list_head *head = &resv->regions; - struct file_region *rg, *nrg, *trg; - long add = 0; + struct file_region *rg, *nrg; + long add = 0, newadd = 0; + struct hugetlb_cgroup *h_cg = NULL; + int ret = 0;
spin_lock(&resv->lock); /* Locate the region we are either in or before. */ @@ -277,6 +343,23 @@ static long region_add(struct resv_map *resv, long f, long t) * from the cache and use it for this range. */ if (&rg->link == head || t < rg->from) { +#ifdef CONFIG_CGROUP_HUGETLB + /* + * If res->reservation_counter is NULL, then it means this is + * a shared mapping, and hugetlb cgroup accounting should be + * done on the file_region entries inside resv_map. + */ + if (!resv->reservation_counter) { + ret = hugetlb_cgroup_charge_cgroup( + hstate_index(h), + (t - f) * pages_per_huge_page(h), + &h_cg, true); + } + + if (ret) + goto out_locked; +#endif + VM_BUG_ON(resv->region_cache_count <= 0);
resv->region_cache_count--; @@ -286,6 +369,15 @@ static long region_add(struct resv_map *resv, long f, long t)
nrg->from = f; nrg->to = t; + +#ifdef CONFIG_CGROUP_HUGETLB + if (h_cg) { + nrg->reservation_counter = + &h_cg->reserved_hugepage[hstate_index(h)]; + nrg->pages_per_hpage = pages_per_huge_page(h); + } +#endif + list_add(&nrg->link, rg->link.prev);
add += t - f; @@ -296,38 +388,37 @@ static long region_add(struct resv_map *resv, long f, long t) if (f > rg->from) f = rg->from;
- /* Check for and consume any regions we now overlap with. */ - nrg = rg; - list_for_each_entry_safe(rg, trg, rg->link.prev, link) { - if (&rg->link == head) - break; - if (rg->from > t) - break; - - /* If this area reaches higher then extend our area to - * include it completely. If this is not the first area - * which we intend to reuse, free it. */ - if (rg->to > t) - t = rg->to; - if (rg != nrg) { - /* Decrement return value by the deleted range. - * Another range will span this area so that by - * end of routine add will be >= zero - */ - add -= (rg->to - rg->from); - list_del(&rg->link); - kfree(rg); - } +#ifdef CONFIG_CGROUP_HUGETLB + /* Count any regions we now overlap with. */ + add = consume_regions_we_overlap_with(rg, head, f, &t, NULL, NULL, + true); + + if (!resv->reservation_counter) { + ret = hugetlb_cgroup_charge_cgroup( + hstate_index(h), + add * pages_per_huge_page(h), + &h_cg, true); }
- add += (nrg->from - f); /* Added to beginning of region */ - nrg->from = f; - add += t - nrg->to; /* Added to end of region */ - nrg->to = t; + if (ret) + goto out_locked; +#endif + + /* Check for and consume any regions we now overlap with. */ + newadd = consume_regions_we_overlap_with(rg, head, f, &t, h_cg, h, + false); + /* + * If these aren't equal, then there is a bug with + * consume_regions_we_overlap_with, and we're charging the wrong amount + * of memory. + */ + WARN_ON(add != newadd);
out_locked: resv->adds_in_progress--; spin_unlock(&resv->lock); + if (ret) + return ret; VM_BUG_ON(add < 0); return add; } @@ -487,6 +578,10 @@ static long region_del(struct resv_map *resv, long f, long t) struct file_region *rg, *trg; struct file_region *nrg = NULL; long del = 0; +#ifdef CONFIG_CGROUP_HUGETLB + struct page_counter *reservation_counter = NULL; + unsigned long pages_per_hpage = 0; +#endif
retry: spin_lock(&resv->lock); @@ -514,6 +609,14 @@ static long region_del(struct resv_map *resv, long f, long t) nrg = list_first_entry(&resv->region_cache, struct file_region, link); +#ifdef CONFIG_CGROUP_HUGETLB + /* + * Save counter information from the deleted + * node, in case we need to do an uncharge. + */ + reservation_counter = nrg->reservation_counter; + pages_per_hpage = nrg->pages_per_hpage; +#endif list_del(&nrg->link); resv->region_cache_count--; } @@ -543,6 +646,14 @@ static long region_del(struct resv_map *resv, long f, long t)
if (f <= rg->from && t >= rg->to) { /* Remove entire region */ del += rg->to - rg->from; +#ifdef CONFIG_CGROUP_HUGETLB + /* + * Save counter information from the deleted node, + * in case we need to do an uncharge. + */ + reservation_counter = rg->reservation_counter; + pages_per_hpage = rg->pages_per_hpage; +#endif list_del(&rg->link); kfree(rg); continue; @@ -559,6 +670,19 @@ static long region_del(struct resv_map *resv, long f, long t)
spin_unlock(&resv->lock); kfree(nrg); +#ifdef CONFIG_CGROUP_HUGETLB + /* + * If resv->reservation_counter is NULL, then this is shared + * reservation, and the reserved memory is tracked in the file_struct + * entries inside of resv_map. So we need to uncharge the memory here. + */ + if (reservation_counter && pages_per_hpage && del > 0 && + !resv->reservation_counter) { + hugetlb_cgroup_uncharge_counter( + reservation_counter, + del * pages_per_hpage); + } +#endif return del; }
@@ -1930,7 +2054,7 @@ static long __vma_reservation_common(struct hstate *h, ret = region_chg(resv, idx, idx + 1); break; case VMA_COMMIT_RESV: - ret = region_add(resv, idx, idx + 1); + ret = region_add(h, resv, idx, idx + 1); break; case VMA_END_RESV: region_abort(resv, idx, idx + 1); @@ -1938,7 +2062,7 @@ static long __vma_reservation_common(struct hstate *h, break; case VMA_ADD_RESV: if (vma->vm_flags & VM_MAYSHARE) - ret = region_add(resv, idx, idx + 1); + ret = region_add(h, resv, idx, idx + 1); else { region_abort(resv, idx, idx + 1); ret = region_del(resv, idx, idx + 1); @@ -4536,7 +4660,7 @@ int hugetlb_reserve_pages(struct inode *inode, struct vm_area_struct *vma, vm_flags_t vm_flags) { - long ret, chg; + long ret, chg, add; struct hstate *h = hstate_inode(inode); struct hugepage_subpool *spool = subpool_inode(inode); struct resv_map *resv_map; @@ -4624,9 +4748,7 @@ int hugetlb_reserve_pages(struct inode *inode, */ ret = hugetlb_acct_memory(h, gbl_reserve); if (ret < 0) { - /* put back original number of pages, chg */ - (void)hugepage_subpool_put_pages(spool, chg); - goto out_err; + goto out_put_pages; }
/* @@ -4641,7 +4763,12 @@ int hugetlb_reserve_pages(struct inode *inode, * else has to be done for private mappings here */ if (!vma || vma->vm_flags & VM_MAYSHARE) { - long add = region_add(resv_map, from, to); + add = region_add(h, resv_map, from, to); + if (add < 0) { + ret = -ENOMEM; + goto out_acct_memory; + } +
if (unlikely(chg > add)) { /* @@ -4659,10 +4786,15 @@ int hugetlb_reserve_pages(struct inode *inode, } } return 0; +out_acct_memory: + hugetlb_acct_memory(h, -gbl_reserve); +out_put_pages: + /* put back original number of pages, chg */ + (void)hugepage_subpool_put_pages(spool, chg); out_err: if (!vma || vma->vm_flags & VM_MAYSHARE) - /* Don't call region_abort if region_chg failed */ - if (chg >= 0) + /* Don't call region_abort if region_chg or region_add failed */ + if (chg >= 0 && add >= 0) region_abort(resv_map, from, to); if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER)) kref_put(&resv_map->refs, resv_map_release); -- 2.23.0.rc1.153.gdeed80330f-goog
On 8/8/19 4:13 PM, Mina Almasry wrote:
For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives in the resv_map entries, in file_region->reservation_counter.
When a file_region entry is added to the resv_map via region_add, we also charge the appropriate hugetlb_cgroup and put the pointer to that in file_region->reservation_counter. This is slightly delicate since we need to not modify the resv_map until we know that charging the reservation has succeeded. If charging doesn't succeed, we report the error to the caller, so that the kernel fails the reservation.
I wish we did not need to modify these region_() routines as they are already difficult to understand. However, I see no other way with the desired semantics.
On region_del, which is when the hugetlb memory is unreserved, we delete the file_region entry in the resv_map, but also uncharge the file_region->reservation_counter.
mm/hugetlb.c | 208 +++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 170 insertions(+), 38 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 235996aef6618..d76e3137110ab 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -242,8 +242,72 @@ struct file_region { struct list_head link; long from; long to; +#ifdef CONFIG_CGROUP_HUGETLB
- /*
* On shared mappings, each reserved region appears as a struct
* file_region in resv_map. These fields hold the info needed to
* uncharge each reservation.
*/
- struct page_counter *reservation_counter;
- unsigned long pages_per_hpage;
+#endif };
+/* Must be called with resv->lock held. Calling this with dry_run == true will
- count the number of pages added but will not modify the linked list.
- */
+static long consume_regions_we_overlap_with(struct file_region *rg,
struct list_head *head, long f, long *t,
struct hugetlb_cgroup *h_cg,
struct hstate *h,
bool dry_run)
+{
- long add = 0;
- struct file_region *trg = NULL, *nrg = NULL;
- /* Consume any regions we now overlap with. */
- nrg = rg;
- list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
if (&rg->link == head)
break;
if (rg->from > *t)
break;
/* If this area reaches higher then extend our area to
* include it completely. If this is not the first area
* which we intend to reuse, free it.
*/
if (rg->to > *t)
*t = rg->to;
if (rg != nrg) {
/* Decrement return value by the deleted range.
* Another range will span this area so that by
* end of routine add will be >= zero
*/
add -= (rg->to - rg->from);
if (!dry_run) {
list_del(&rg->link);
kfree(rg);
Is it possible that the region struct we are deleting pointed to a reservation_counter? Perhaps even for another cgroup? Just concerned with the way regions are coalesced that we may be deleting counters.
On 8/13/19 4:54 PM, Mike Kravetz wrote:
On 8/8/19 4:13 PM, Mina Almasry wrote:
For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives in the resv_map entries, in file_region->reservation_counter.
When a file_region entry is added to the resv_map via region_add, we also charge the appropriate hugetlb_cgroup and put the pointer to that in file_region->reservation_counter. This is slightly delicate since we need to not modify the resv_map until we know that charging the reservation has succeeded. If charging doesn't succeed, we report the error to the caller, so that the kernel fails the reservation.
I wish we did not need to modify these region_() routines as they are already difficult to understand. However, I see no other way with the desired semantics.
I suspect you have considered this, but what about using the return value from region_chg() in hugetlb_reserve_pages() to charge reservation limits? There is a VERY SMALL race where the value could be too large, but that can be checked and adjusted at region_add time as is done with normal accounting today. If the question is, where would we store the information to uncharge?, then we can hang a structure off the vma. This would be similar to what is done for private mappings. In fact, I would suggest making them both use a new cgroup reserve structure hanging off the vma.
One issue I see is what to do if a vma is split? The private mapping case 'should' handle this today, but I would not be surprised if such code is missing or incorrect.
On Wed, Aug 14, 2019 at 9:46 AM Mike Kravetz mike.kravetz@oracle.com wrote:
On 8/13/19 4:54 PM, Mike Kravetz wrote:
On 8/8/19 4:13 PM, Mina Almasry wrote:
For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives in the resv_map entries, in file_region->reservation_counter.
When a file_region entry is added to the resv_map via region_add, we also charge the appropriate hugetlb_cgroup and put the pointer to that in file_region->reservation_counter. This is slightly delicate since we need to not modify the resv_map until we know that charging the reservation has succeeded. If charging doesn't succeed, we report the error to the caller, so that the kernel fails the reservation.
I wish we did not need to modify these region_() routines as they are already difficult to understand. However, I see no other way with the desired semantics.
I suspect you have considered this, but what about using the return value from region_chg() in hugetlb_reserve_pages() to charge reservation limits? There is a VERY SMALL race where the value could be too large, but that can be checked and adjusted at region_add time as is done with normal accounting today.
I have not actually until now; I didn't consider doing stuff with the resv_map while not holding onto the resv_map->lock. I guess that's the small race you're talking about. Seems fine to me, but I'm more worried about hanging off the vma below.
If the question is, where would we store the information to uncharge?, then we can hang a structure off the vma. This would be similar to what is done for private mappings. In fact, I would suggest making them both use a new cgroup reserve structure hanging off the vma.
I actually did consider hanging off the info to uncharge off the vma, but I didn't for a couple of reasons:
1. region_del is called from hugetlb_unreserve_pages, and I don't have access to the vma there. Maybe there is a way to query the proper vma I don't know about? 2. hugetlb_reserve_pages seems to be able to conduct a reservation with a NULL *vma. Not sure what to do in that case.
Is there a way to get around these that I'm missing here?
FWIW I think tracking is better in resv_map since the reservations are in resv_map themselves. If I do another structure, then for each reservation there will be an entry in resv_map and an entry in the new structure and they need to be kept in sync and I need to handle errors for when they get out of sync.
One issue I see is what to do if a vma is split? The private mapping case 'should' handle this today, but I would not be surprised if such code is missing or incorrect.
-- Mike Kravetz
On 8/15/19 4:04 PM, Mina Almasry wrote:
On Wed, Aug 14, 2019 at 9:46 AM Mike Kravetz mike.kravetz@oracle.com wrote:
On 8/13/19 4:54 PM, Mike Kravetz wrote:
On 8/8/19 4:13 PM, Mina Almasry wrote:
For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives in the resv_map entries, in file_region->reservation_counter.
When a file_region entry is added to the resv_map via region_add, we also charge the appropriate hugetlb_cgroup and put the pointer to that in file_region->reservation_counter. This is slightly delicate since we need to not modify the resv_map until we know that charging the reservation has succeeded. If charging doesn't succeed, we report the error to the caller, so that the kernel fails the reservation.
I wish we did not need to modify these region_() routines as they are already difficult to understand. However, I see no other way with the desired semantics.
I suspect you have considered this, but what about using the return value from region_chg() in hugetlb_reserve_pages() to charge reservation limits? There is a VERY SMALL race where the value could be too large, but that can be checked and adjusted at region_add time as is done with normal accounting today.
I have not actually until now; I didn't consider doing stuff with the resv_map while not holding onto the resv_map->lock. I guess that's the small race you're talking about. Seems fine to me, but I'm more worried about hanging off the vma below.
This race is already handled for other 'reservation like' things in hugetlb_reserve_pages. So, I don't think the race is much of an issue.
If the question is, where would we store the information to uncharge?, then we can hang a structure off the vma. This would be similar to what is done for private mappings. In fact, I would suggest making them both use a new cgroup reserve structure hanging off the vma.
I actually did consider hanging off the info to uncharge off the vma, but I didn't for a couple of reasons:
- region_del is called from hugetlb_unreserve_pages, and I don't have
access to the vma there. Maybe there is a way to query the proper vma I don't know about?
I am still thinking about closely tying cgroup revervation limits/usage to existing reservation accounting. Of most concern (to me) is handling shared mappings. Reservations created for shared mappings are more associated with the inode/file than individual mappings. For example, consider a task which mmaps(MAP_SHARED) a hugetlbfs file. At mmap time reservations are created based on the size of the mmap. Now, if the task unmaps and/or exits the reservations still exist as they are associated with the file rather than the mapping.
Honesty, I think this existing reservation bevahior is wrong or at least not desirable. Because there are outstanding reservations, the number of reserved huge pages can not be used for other purposes. It is also very difficult for a user or admin to determine the source of the reservations. No one is currently complaining about this behavior. This proposal just made me think about it.
Tying cgroup reservation limits/usage to existing reservation accounting will introduce the same issues there. We will need to clearly document the behavior.
- hugetlb_reserve_pages seems to be able to conduct a reservation
with a NULL *vma. Not sure what to do in that case.
Is there a way to get around these that I'm missing here?
You are correct. The !vma case is there for System V shared memory such as a call to shmget(SHM_HUGETLB). In this case, reservations for the entire shared emmory segment are taken at shmget time.
In your model, the caller of shmget who creates the shared memory segment would get charged for all the reservations. Users, (those calling shmat) would not be charged.
FWIW I think tracking is better in resv_map since the reservations are in resv_map themselves. If I do another structure, then for each reservation there will be an entry in resv_map and an entry in the new structure and they need to be kept in sync and I need to handle errors for when they get out of sync.
I think you may be correct. However, this implies that we will need to change the way we do reservation in the resv_map for shared mappings. I will comment on that in reply to patch 4.
On Fri, Aug 16, 2019 at 9:29 AM Mike Kravetz mike.kravetz@oracle.com wrote:
On 8/15/19 4:04 PM, Mina Almasry wrote:
On Wed, Aug 14, 2019 at 9:46 AM Mike Kravetz mike.kravetz@oracle.com wrote:
On 8/13/19 4:54 PM, Mike Kravetz wrote:
On 8/8/19 4:13 PM, Mina Almasry wrote:
For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives in the resv_map entries, in file_region->reservation_counter.
When a file_region entry is added to the resv_map via region_add, we also charge the appropriate hugetlb_cgroup and put the pointer to that in file_region->reservation_counter. This is slightly delicate since we need to not modify the resv_map until we know that charging the reservation has succeeded. If charging doesn't succeed, we report the error to the caller, so that the kernel fails the reservation.
I wish we did not need to modify these region_() routines as they are already difficult to understand. However, I see no other way with the desired semantics.
I suspect you have considered this, but what about using the return value from region_chg() in hugetlb_reserve_pages() to charge reservation limits? There is a VERY SMALL race where the value could be too large, but that can be checked and adjusted at region_add time as is done with normal accounting today.
I have not actually until now; I didn't consider doing stuff with the resv_map while not holding onto the resv_map->lock. I guess that's the small race you're talking about. Seems fine to me, but I'm more worried about hanging off the vma below.
This race is already handled for other 'reservation like' things in hugetlb_reserve_pages. So, I don't think the race is much of an issue.
If the question is, where would we store the information to uncharge?, then we can hang a structure off the vma. This would be similar to what is done for private mappings. In fact, I would suggest making them both use a new cgroup reserve structure hanging off the vma.
I actually did consider hanging off the info to uncharge off the vma, but I didn't for a couple of reasons:
- region_del is called from hugetlb_unreserve_pages, and I don't have
access to the vma there. Maybe there is a way to query the proper vma I don't know about?
I am still thinking about closely tying cgroup revervation limits/usage to existing reservation accounting. Of most concern (to me) is handling shared mappings. Reservations created for shared mappings are more associated with the inode/file than individual mappings. For example, consider a task which mmaps(MAP_SHARED) a hugetlbfs file. At mmap time reservations are created based on the size of the mmap. Now, if the task unmaps and/or exits the reservations still exist as they are associated with the file rather than the mapping.
I'm aware of this behavior, and IMO it seems fine to me. I believe it works the same way with tmfs today. I think a task that creates a file in tmpfs gets charged the memory, and even if the task exits the memory is still charged to its cgroup, and the memory remains charged until the tmpfs file is deleted by someone.
Makes sense to me for hugetlb reservations to work the same way. The memory remains charged until the hugetlbfs file gets deleted. But, if you think of improvement, I'm happy to oblige :)
Honesty, I think this existing reservation bevahior is wrong or at least not desirable. Because there are outstanding reservations, the number of reserved huge pages can not be used for other purposes. It is also very difficult for a user or admin to determine the source of the reservations. No one is currently complaining about this behavior. This proposal just made me think about it.
Tying cgroup reservation limits/usage to existing reservation accounting will introduce the same issues there. We will need to clearly document the behavior.
Yes, seems we're maybe converging on a solution here, so the next patchset will include docs for your review.
- hugetlb_reserve_pages seems to be able to conduct a reservation
with a NULL *vma. Not sure what to do in that case.
Is there a way to get around these that I'm missing here?
You are correct. The !vma case is there for System V shared memory such as a call to shmget(SHM_HUGETLB). In this case, reservations for the entire shared emmory segment are taken at shmget time.
In your model, the caller of shmget who creates the shared memory segment would get charged for all the reservations. Users, (those calling shmat) would not be charged.
FWIW I think tracking is better in resv_map since the reservations are in resv_map themselves. If I do another structure, then for each reservation there will be an entry in resv_map and an entry in the new structure and they need to be kept in sync and I need to handle errors for when they get out of sync.
I think you may be correct. However, this implies that we will need to change the way we do reservation in the resv_map for shared mappings. I will comment on that in reply to patch 4.
-- Mike Kravetz
On Tue, Aug 13, 2019 at 4:54 PM Mike Kravetz mike.kravetz@oracle.com wrote:
On 8/8/19 4:13 PM, Mina Almasry wrote:
For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives in the resv_map entries, in file_region->reservation_counter.
When a file_region entry is added to the resv_map via region_add, we also charge the appropriate hugetlb_cgroup and put the pointer to that in file_region->reservation_counter. This is slightly delicate since we need to not modify the resv_map until we know that charging the reservation has succeeded. If charging doesn't succeed, we report the error to the caller, so that the kernel fails the reservation.
I wish we did not need to modify these region_() routines as they are already difficult to understand. However, I see no other way with the desired semantics.
On region_del, which is when the hugetlb memory is unreserved, we delete the file_region entry in the resv_map, but also uncharge the file_region->reservation_counter.
mm/hugetlb.c | 208 +++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 170 insertions(+), 38 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 235996aef6618..d76e3137110ab 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -242,8 +242,72 @@ struct file_region { struct list_head link; long from; long to; +#ifdef CONFIG_CGROUP_HUGETLB
/*
* On shared mappings, each reserved region appears as a struct
* file_region in resv_map. These fields hold the info needed to
* uncharge each reservation.
*/
struct page_counter *reservation_counter;
unsigned long pages_per_hpage;
+#endif };
+/* Must be called with resv->lock held. Calling this with dry_run == true will
- count the number of pages added but will not modify the linked list.
- */
+static long consume_regions_we_overlap_with(struct file_region *rg,
struct list_head *head, long f, long *t,
struct hugetlb_cgroup *h_cg,
struct hstate *h,
bool dry_run)
+{
long add = 0;
struct file_region *trg = NULL, *nrg = NULL;
/* Consume any regions we now overlap with. */
nrg = rg;
list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
if (&rg->link == head)
break;
if (rg->from > *t)
break;
/* If this area reaches higher then extend our area to
* include it completely. If this is not the first area
* which we intend to reuse, free it.
*/
if (rg->to > *t)
*t = rg->to;
if (rg != nrg) {
/* Decrement return value by the deleted range.
* Another range will span this area so that by
* end of routine add will be >= zero
*/
add -= (rg->to - rg->from);
if (!dry_run) {
list_del(&rg->link);
kfree(rg);
Is it possible that the region struct we are deleting pointed to a reservation_counter? Perhaps even for another cgroup? Just concerned with the way regions are coalesced that we may be deleting counters.
Yep, that needs to be handled I think. Thanks for catching!
-- Mike Kravetz
On 8/15/19 4:08 PM, Mina Almasry wrote:
On Tue, Aug 13, 2019 at 4:54 PM Mike Kravetz mike.kravetz@oracle.com wrote:
mm/hugetlb.c | 208 +++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 170 insertions(+), 38 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 235996aef6618..d76e3137110ab 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -242,8 +242,72 @@ struct file_region { struct list_head link; long from; long to; +#ifdef CONFIG_CGROUP_HUGETLB
/*
* On shared mappings, each reserved region appears as a struct
* file_region in resv_map. These fields hold the info needed to
* uncharge each reservation.
*/
struct page_counter *reservation_counter;
unsigned long pages_per_hpage;
+#endif };
+/* Must be called with resv->lock held. Calling this with dry_run == true will
- count the number of pages added but will not modify the linked list.
- */
+static long consume_regions_we_overlap_with(struct file_region *rg,
struct list_head *head, long f, long *t,
struct hugetlb_cgroup *h_cg,
struct hstate *h,
bool dry_run)
+{
long add = 0;
struct file_region *trg = NULL, *nrg = NULL;
/* Consume any regions we now overlap with. */
nrg = rg;
list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
if (&rg->link == head)
break;
if (rg->from > *t)
break;
/* If this area reaches higher then extend our area to
* include it completely. If this is not the first area
* which we intend to reuse, free it.
*/
if (rg->to > *t)
*t = rg->to;
if (rg != nrg) {
/* Decrement return value by the deleted range.
* Another range will span this area so that by
* end of routine add will be >= zero
*/
add -= (rg->to - rg->from);
if (!dry_run) {
list_del(&rg->link);
kfree(rg);
Is it possible that the region struct we are deleting pointed to a reservation_counter? Perhaps even for another cgroup? Just concerned with the way regions are coalesced that we may be deleting counters.
Yep, that needs to be handled I think. Thanks for catching!
I believe that we will no longer be able to coalesce reserv_map entries for shared mappings. That is because we need to record who is responsible for creating reservation entries.
The tests use both shared and private mapped hugetlb memory, and monitors the hugetlb usage counter as well as the hugetlb reservation counter. They test different configurations such as hugetlb memory usage via hugetlbfs, or MAP_HUGETLB, or shmget/shmat, and with and without MAP_POPULATE.
--- tools/testing/selftests/vm/.gitignore | 1 + tools/testing/selftests/vm/Makefile | 4 + .../selftests/vm/charge_reserved_hugetlb.sh | 438 ++++++++++++++++++ .../selftests/vm/write_hugetlb_memory.sh | 22 + .../testing/selftests/vm/write_to_hugetlbfs.c | 252 ++++++++++ 5 files changed, 717 insertions(+) create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c
diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore index 31b3c98b6d34d..d3bed9407773c 100644 --- a/tools/testing/selftests/vm/.gitignore +++ b/tools/testing/selftests/vm/.gitignore @@ -14,3 +14,4 @@ virtual_address_range gup_benchmark va_128TBswitch map_fixed_noreplace +write_to_hugetlbfs diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile index 9534dc2bc9295..8d37d5409b52c 100644 --- a/tools/testing/selftests/vm/Makefile +++ b/tools/testing/selftests/vm/Makefile @@ -18,6 +18,7 @@ TEST_GEN_FILES += transhuge-stress TEST_GEN_FILES += userfaultfd TEST_GEN_FILES += va_128TBswitch TEST_GEN_FILES += virtual_address_range +TEST_GEN_FILES += write_to_hugetlbfs
TEST_PROGS := run_vmtests
@@ -29,3 +30,6 @@ include ../lib.mk $(OUTPUT)/userfaultfd: LDLIBS += -lpthread
$(OUTPUT)/mlock-random-test: LDLIBS += -lcap + +# Why does adding $(OUTPUT)/ like above not apply this flag..? +write_to_hugetlbfs: CFLAGS += -static diff --git a/tools/testing/selftests/vm/charge_reserved_hugetlb.sh b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh new file mode 100755 index 0000000000000..bf0b6dcec9977 --- /dev/null +++ b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh @@ -0,0 +1,438 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 + +set -e + +cgroup_path=/dev/cgroup/memory +if [[ ! -e $cgroup_path ]]; then + mkdir -p $cgroup_path + mount -t cgroup -o hugetlb,memory cgroup $cgroup_path +fi + +cleanup () { + echo $$ > $cgroup_path/tasks + + set +e + if [[ "$(pgrep write_to_hugetlbfs)" != "" ]]; then + kill -2 write_to_hugetlbfs + # Wait for hugetlbfs memory to get depleted. + sleep 0.5 + fi + set -e + + if [[ -e /mnt/huge ]]; then + rm -rf /mnt/huge/* + umount /mnt/huge || echo error + rmdir /mnt/huge + fi + if [[ -e $cgroup_path/hugetlb_cgroup_test ]]; then + rmdir $cgroup_path/hugetlb_cgroup_test + fi + if [[ -e $cgroup_path/hugetlb_cgroup_test1 ]]; then + rmdir $cgroup_path/hugetlb_cgroup_test1 + fi + if [[ -e $cgroup_path/hugetlb_cgroup_test2 ]]; then + rmdir $cgroup_path/hugetlb_cgroup_test2 + fi + echo 0 > /proc/sys/vm/nr_hugepages + echo CLEANUP DONE +} + +cleanup + +function expect_equal() { + local expected="$1" + local actual="$2" + local error="$3" + + if [[ "$expected" != "$actual" ]]; then + echo "expected ($expected) != actual ($actual): $3" + cleanup + exit 1 + fi +} + +function setup_cgroup() { + local name="$1" + local cgroup_limit="$2" + local reservation_limit="$3" + + mkdir $cgroup_path/$name + + echo writing cgroup limit: "$cgroup_limit" + echo "$cgroup_limit" > $cgroup_path/$name/hugetlb.2MB.limit_in_bytes + + echo writing reseravation limit: "$reservation_limit" + echo "$reservation_limit" > \ + $cgroup_path/$name/hugetlb.2MB.reservation_limit_in_bytes +} + +function write_hugetlbfs_and_get_usage() { + local cgroup="$1" + local size="$2" + local populate="$3" + local write="$4" + local path="$5" + local method="$6" + local private="$7" + local expect_failure="$8" + + # Function return values. + reservation_failed=0 + oom_killed=0 + hugetlb_difference=0 + reserved_difference=0 + + local hugetlb_usage=$cgroup_path/$cgroup/hugetlb.2MB.usage_in_bytes + local reserved_usage=$cgroup_path/$cgroup/hugetlb.2MB.reservation_usage_in_bytes + + local hugetlb_before=$(cat $hugetlb_usage) + local reserved_before=$(cat $reserved_usage) + + echo + echo Starting: + echo hugetlb_usage="$hugetlb_before" + echo reserved_usage="$reserved_before" + echo expect_failure is "$expect_failure" + + set +e + if [[ "$method" == "1" ]] || [[ "$method" == 2 ]] || \ + [[ "$private" == "-r" ]] && [[ "$expect_failure" != 1 ]]; then + bash write_hugetlb_memory.sh "$size" "$populate" "$write" \ + "$cgroup" "$path" "$method" "$private" "-l" & + + local write_result=$? + # This sleep is to make sure that the script above has had enough + # time to do its thing, since it runs in the background. This may + # cause races... + sleep 0.5 + echo write_result is $write_result + else + bash write_hugetlb_memory.sh "$size" "$populate" "$write" \ + "$cgroup" "$path" "$method" "$private" + local write_result=$? + fi + set -e + + if [[ "$write_result" == 1 ]]; then + reservation_failed=1 + fi + + # On linus/master, the above process gets SIGBUS'd on oomkill, with + # return code 135. On earlier kernels, it gets actual oomkill, with return + # code 137, so just check for both conditions incase we're testing against + # an earlier kernel. + if [[ "$write_result" == 135 ]] || [[ "$write_result" == 137 ]]; then + oom_killed=1 + fi + + local hugetlb_after=$(cat $hugetlb_usage) + local reserved_after=$(cat $reserved_usage) + + echo After write: + echo hugetlb_usage="$hugetlb_after" + echo reserved_usage="$reserved_after" + + hugetlb_difference=$(($hugetlb_after - $hugetlb_before)) + reserved_difference=$(($reserved_after - $reserved_before)) +} + +function cleanup_hugetlb_memory() { + set +e + if [[ "$(pgrep write_to_hugetlbfs)" != "" ]]; then + echo kiling write_to_hugetlbfs + killall -2 write_to_hugetlbfs + # Wait for hugetlbfs memory to get depleted. + sleep 0.5 + fi + set -e + + if [[ -e /mnt/huge ]]; then + rm -rf /mnt/huge/* + umount /mnt/huge + rmdir /mnt/huge + fi +} + +function run_test() { + local size="$1" + local populate="$2" + local write="$3" + local cgroup_limit="$4" + local reservation_limit="$5" + local nr_hugepages="$6" + local method="$7" + local private="$8" + local expect_failure="$9" + + # Function return values. + hugetlb_difference=0 + reserved_difference=0 + reservation_failed=0 + oom_killed=0 + + echo nr hugepages = "$nr_hugepages" + echo "$nr_hugepages" > /proc/sys/vm/nr_hugepages + + setup_cgroup "hugetlb_cgroup_test" "$cgroup_limit" "$reservation_limit" + + mkdir -p /mnt/huge + mount -t hugetlbfs \ + -o pagesize=2M,size=256M none /mnt/huge + + write_hugetlbfs_and_get_usage "hugetlb_cgroup_test" "$size" "$populate" \ + "$write" "/mnt/huge/test" "$method" "$private" "$expect_failure" + + cleanup_hugetlb_memory + + local final_hugetlb=$(cat $cgroup_path/hugetlb_cgroup_test/hugetlb.2MB.usage_in_bytes) + local final_reservation=$(cat $cgroup_path/hugetlb_cgroup_test/hugetlb.2MB.reservation_usage_in_bytes) + + expect_equal "0" "$final_hugetlb" "final hugetlb is not zero" + expect_equal "0" "$final_reservation" "final reservation is not zero" +} + +function run_multiple_cgroup_test() { + local size1="$1" + local populate1="$2" + local write1="$3" + local cgroup_limit1="$4" + local reservation_limit1="$5" + + local size2="$6" + local populate2="$7" + local write2="$8" + local cgroup_limit2="$9" + local reservation_limit2="${10}" + + local nr_hugepages="${11}" + local method="${12}" + local private="${13}" + local expect_failure="${14}" + + # Function return values. + hugetlb_difference1=0 + reserved_difference1=0 + reservation_failed1=0 + oom_killed1=0 + + hugetlb_difference2=0 + reserved_difference2=0 + reservation_failed2=0 + oom_killed2=0 + + + echo nr hugepages = "$nr_hugepages" + echo "$nr_hugepages" > /proc/sys/vm/nr_hugepages + + setup_cgroup "hugetlb_cgroup_test1" "$cgroup_limit1" "$reservation_limit1" + setup_cgroup "hugetlb_cgroup_test2" "$cgroup_limit2" "$reservation_limit2" + + mkdir -p /mnt/huge + mount -t hugetlbfs \ + -o pagesize=2M,size=256M none /mnt/huge + + write_hugetlbfs_and_get_usage "hugetlb_cgroup_test1" "$size1" \ + "$populate1" "$write1" "/mnt/huge/test1" "$method" "$private" \ + "$expect_failure" + + hugetlb_difference1=$hugetlb_difference + reserved_difference1=$reserved_difference + reservation_failed1=$reservation_failed + oom_killed1=$oom_killed + + local cgroup1_hugetlb_usage=$cgroup_path/hugetlb_cgroup_test1/hugetlb.2MB.usage_in_bytes + local cgroup1_reservation_usage=$cgroup_path/hugetlb_cgroup_test1/hugetlb.2MB.reservation_usage_in_bytes + local cgroup2_hugetlb_usage=$cgroup_path/hugetlb_cgroup_test2/hugetlb.2MB.usage_in_bytes + local cgroup2_reservation_usage=$cgroup_path/hugetlb_cgroup_test2/hugetlb.2MB.reservation_usage_in_bytes + + local usage_before_second_write=$(cat $cgroup1_hugetlb_usage) + local reservation_usage_before_second_write=$(cat \ + $cgroup1_reservation_usage) + + write_hugetlbfs_and_get_usage "hugetlb_cgroup_test2" "$size2" \ + "$populate2" "$write2" "/mnt/huge/test2" "$method" "$private" \ + "$expect_failure" + + hugetlb_difference2=$hugetlb_difference + reserved_difference2=$reserved_difference + reservation_failed2=$reservation_failed + oom_killed2=$oom_killed + + expect_equal "$usage_before_second_write" \ + "$(cat $cgroup1_hugetlb_usage)" "Usage changed." + expect_equal "$reservation_usage_before_second_write" \ + "$(cat $cgroup1_reservation_usage)" "Reservation usage changed." + + cleanup_hugetlb_memory + + local final_hugetlb=$(cat $cgroup1_hugetlb_usage) + local final_reservation=$(cat $cgroup1_reservation_usage) + + expect_equal "0" "$final_hugetlb" \ + "hugetlbt_cgroup_test1 final hugetlb is not zero" + expect_equal "0" "$final_reservation" \ + "hugetlbt_cgroup_test1 final reservation is not zero" + + local final_hugetlb=$(cat $cgroup2_hugetlb_usage) + local final_reservation=$(cat $cgroup2_reservation_usage) + + expect_equal "0" "$final_hugetlb" \ + "hugetlb_cgroup_test2 final hugetlb is not zero" + expect_equal "0" "$final_reservation" \ + "hugetlb_cgroup_test2 final reservation is not zero" +} + +for private in "" "-r" ; do +for populate in "" "-o"; do +for method in 0 1 2; do + +# Skip mmap(MAP_HUGETLB | MAP_SHARED). Doesn't seem to be supported. +if [[ "$method" == 1 ]] && [[ "$private" == "" ]]; then + continue +fi + +# Skip populated shmem tests. Doesn't seem to be supported. +if [[ "$method" == 2"" ]] && [[ "$populate" == "-o" ]]; then + continue +fi + +cleanup +echo +echo +echo +echo Test normal case. +echo private=$private, populate=$populate, method=$method +run_test $((10 * 1024 * 1024)) "$populate" "" $((20 * 1024 * 1024)) \ + $((20 * 1024 * 1024)) 10 "$method" "$private" "0" + +echo Memory charged to hugtlb=$hugetlb_difference +echo Memory charged to reservation=$reserved_difference + +if [[ "$populate" == "-o" ]]; then + expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference" \ + "Reserved memory charged to hugetlb cgroup." +else + expect_equal "0" "$hugetlb_difference" \ + "Reserved memory charged to hugetlb cgroup." +fi + +expect_equal "$((10 * 1024 * 1024))" "$reserved_difference" \ + "Reserved memory not charged to reservation usage." +echo 'PASS' + +cleanup +echo +echo +echo +echo Test normal case with write. +echo private=$private, populate=$populate, method=$method +run_test $((10 * 1024 * 1024)) "$populate" '-w' $((20 * 1024 * 1024)) \ + $((20 * 1024 * 1024)) 10 "$method" "$private" "0" + +echo Memory charged to hugtlb=$hugetlb_difference +echo Memory charged to reservation=$reserved_difference + +expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference" \ + "Reserved memory charged to hugetlb cgroup." +expect_equal "$((10 * 1024 * 1024))" "$reserved_difference" \ + "Reserved memory not charged to reservation usage." +echo 'PASS' + + +cleanup +echo +echo +echo +echo Test more than reservation case. +echo private=$private, populate=$populate, method=$method +run_test "$((10 * 1024 * 1024))" "$populate" '' "$((20 * 1024 * 1024))" \ + "$((5 * 1024 * 1024))" "10" "$method" "$private" "1" + +expect_equal "1" "$reservation_failed" "Reservation succeeded." +echo 'PASS' + +cleanup + +echo +echo +echo +echo Test more than cgroup limit case. +echo private=$private, populate=$populate, method=$method + +# Not sure if shm memory can be cleaned up when the process gets sigbus'd. +if [[ "$method" != 2 ]]; then + run_test $((10 * 1024 * 1024)) "$populate" "-w" $((5 * 1024 * 1024)) \ + $((20 * 1024 * 1024)) 10 "$method" "$private" "1" + + expect_equal "1" "$oom_killed" "Not oom killed." +fi +echo 'PASS' + +cleanup + +echo +echo +echo +echo Test normal case, multiple cgroups. +echo private=$private, populate=$populate, method=$method +run_multiple_cgroup_test "$((6 * 1024 * 1024))" "$populate" "" \ + "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "$((10 * 1024 * 1024))" \ + "$populate" "" "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "10" \ + "$method" "$private" "0" + +echo Memory charged to hugtlb1=$hugetlb_difference1 +echo Memory charged to reservation1=$reserved_difference1 +echo Memory charged to hugtlb2=$hugetlb_difference2 +echo Memory charged to reservation2=$reserved_difference2 + +expect_equal "$((6 * 1024 * 1024))" "$reserved_difference1" \ + "Incorrect reservations charged to cgroup 1." +expect_equal "$((10 * 1024 * 1024))" "$reserved_difference2" \ + "Incorrect reservation charged to cgroup 2." +if [[ "$populate" == "-o" ]]; then + expect_equal "$((6 * 1024 * 1024))" "$hugetlb_difference1" \ + "Incorrect hugetlb charged to cgroup 1." + expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference2" \ + "Incorrect hugetlb charged to cgroup 2." +else + expect_equal "0" "$hugetlb_difference1" \ + "Incorrect hugetlb charged to cgroup 1." + expect_equal "0" "$hugetlb_difference2" \ + "Incorrect hugetlb charged to cgroup 2." +fi +echo 'PASS' + +cleanup +echo +echo +echo +echo Test normal case with write, multiple cgroups. +echo private=$private, populate=$populate, method=$method +run_multiple_cgroup_test "$((6 * 1024 * 1024))" "$populate" "-w" \ + "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "$((10 * 1024 * 1024))" \ + "$populate" "-w" "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "10" \ + "$method" "$private" "0" + +echo Memory charged to hugtlb1=$hugetlb_difference1 +echo Memory charged to reservation1=$reserved_difference1 +echo Memory charged to hugtlb2=$hugetlb_difference2 +echo Memory charged to reservation2=$reserved_difference2 + +expect_equal "$((6 * 1024 * 1024))" "$hugetlb_difference1" \ + "Incorrect hugetlb charged to cgroup 1." +expect_equal "$((6 * 1024 * 1024))" "$reserved_difference1" \ + "Incorrect reservation charged to cgroup 1." +expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference2" \ + "Incorrect hugetlb charged to cgroup 2." +expect_equal "$((10 * 1024 * 1024))" "$reserved_difference2" \ + "Incorrected reservation charged to cgroup 2." + +echo 'PASS' + +done # private +done # populate +done # method + +umount $cgroup_path +rmdir $cgroup_path diff --git a/tools/testing/selftests/vm/write_hugetlb_memory.sh b/tools/testing/selftests/vm/write_hugetlb_memory.sh new file mode 100644 index 0000000000000..08f5fa5527cfd --- /dev/null +++ b/tools/testing/selftests/vm/write_hugetlb_memory.sh @@ -0,0 +1,22 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 + +set -e + +size=$1 +populate=$2 +write=$3 +cgroup=$4 +path=$5 +method=$6 +private=$7 +want_sleep=$8 + +echo "Putting task in cgroup '$cgroup'" +echo $$ > /dev/cgroup/memory/"$cgroup"/tasks + +echo "Method is $method" + +set +e +./write_to_hugetlbfs -p "$path" -s "$size" "$write" "$populate" -m "$method" \ + "$private" "$want_sleep" diff --git a/tools/testing/selftests/vm/write_to_hugetlbfs.c b/tools/testing/selftests/vm/write_to_hugetlbfs.c new file mode 100644 index 0000000000000..f02a897427a97 --- /dev/null +++ b/tools/testing/selftests/vm/write_to_hugetlbfs.c @@ -0,0 +1,252 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * This program reserves and uses hugetlb memory, supporting a bunch of + * scenorios needed by the charged_reserved_hugetlb.sh test. + */ + +#include <err.h> +#include <errno.h> +#include <signal.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <unistd.h> +#include <fcntl.h> +#include <sys/types.h> +#include <sys/shm.h> +#include <sys/stat.h> +#include <sys/mman.h> + +/* Global definitions. */ +enum method { + HUGETLBFS, + MMAP_MAP_HUGETLB, + SHM, + MAX_METHOD +}; + + +/* Global variables. */ +static const char *self; +static char *shmaddr; +static int shmid; + +/* + * Show usage and exit. + */ +static void exit_usage(void) +{ + + printf("Usage: %s -p <path to hugetlbfs file> -s <size to map> " + "[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB)>] [-l] [-r] " + "[-o] [-w]\n", self); + exit(EXIT_FAILURE); +} + +void sig_handler(int signo) +{ + printf("Received %d.\n", signo); + if (signo == SIGINT) { + printf("Deleting the memory\n"); + if (shmdt((const void *)shmaddr) != 0) { + perror("Detach failure"); + shmctl(shmid, IPC_RMID, NULL); + exit(4); + } + + shmctl(shmid, IPC_RMID, NULL); + printf("Done deleting the memory\n"); + } + exit(2); +} + +int main(int argc, char **argv) +{ + int fd = 0; + int key = 0; + int *ptr = NULL; + int c = 0; + int size = 0; + char path[256] = ""; + enum method method = MAX_METHOD; + int want_sleep = 0, private = 0; + int populate = 0; + int write = 0; + + unsigned long i; + + + if (signal(SIGINT, sig_handler) == SIG_ERR) + err(1, "\ncan't catch SIGINT\n"); + + /* Parse command-line arguments. */ + setvbuf(stdout, NULL, _IONBF, 0); + self = argv[0]; + + while ((c = getopt(argc, argv, "s:p:m:owlr")) != -1) { + switch (c) { + case 's': + size = atoi(optarg); + break; + case 'p': + strncpy(path, optarg, sizeof(path)); + break; + case 'm': + if (atoi(optarg) >= MAX_METHOD) { + errno = EINVAL; + perror("Invalid -m."); + exit_usage(); + } + method = atoi(optarg); + break; + case 'o': + populate = 1; + break; + case 'w': + write = 1; + break; + case 'l': + want_sleep = 1; + break; + case 'r': + private = 1; + break; + default: + errno = EINVAL; + perror("Invalid arg"); + exit_usage(); + } + } + + if (strncmp(path, "", sizeof(path)) != 0) { + printf("Writing to this path: %s\n", path); + } else { + errno = EINVAL; + perror("path not found"); + exit_usage(); + } + + if (size != 0) { + printf("Writing this size: %d\n", size); + } else { + errno = EINVAL; + perror("size not found"); + exit_usage(); + } + + if (!populate) + printf("Not populating.\n"); + else + printf("Populating.\n"); + + if (!write) + printf("Not writing to memory.\n"); + + if (method == MAX_METHOD) { + errno = EINVAL; + perror("-m Invalid"); + exit_usage(); + } else + printf("Using method=%d\n", method); + + if (!private) + printf("Shared mapping.\n"); + else + printf("Private mapping.\n"); + + + switch (method) { + case HUGETLBFS: + printf("Allocating using HUGETLBFS.\n"); + fd = open(path, O_CREAT | O_RDWR, 0777); + if (fd == -1) + err(1, "Failed to open file."); + + ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, + (private ? MAP_PRIVATE : MAP_SHARED) | (populate ? + MAP_POPULATE : 0), fd, 0); + + if (ptr == MAP_FAILED) { + close(fd); + err(1, "Error mapping the file"); + } + break; + case MMAP_MAP_HUGETLB: + printf("Allocating using MAP_HUGETLB.\n"); + ptr = mmap(NULL, size, + PROT_READ | PROT_WRITE, + (private ? (MAP_PRIVATE | MAP_ANONYMOUS) : MAP_SHARED) | + MAP_HUGETLB | (populate ? + MAP_POPULATE : 0), + -1, 0); + + if (ptr == MAP_FAILED) + err(1, "mmap"); + + printf("Returned address is %p\n", ptr); + break; + case SHM: + printf("Allocating using SHM.\n"); + shmid = shmget(key, size, SHM_HUGETLB | IPC_CREAT | SHM_R | + SHM_W); + if (shmid < 0) { + shmid = shmget(++key, size, SHM_HUGETLB | IPC_CREAT | + SHM_R | SHM_W); + if (shmid < 0) + err(1, "shmget"); + + } + printf("shmid: 0x%x, shmget key:%d\n", shmid, key); + + shmaddr = shmat(shmid, NULL, 0); + if (shmaddr == (char *)-1) { + perror("Shared memory attach failure"); + shmctl(shmid, IPC_RMID, NULL); + exit(2); + } + printf("shmaddr: %p\n", shmaddr); + + break; + default: + errno = EINVAL; + err(1, "Invalid method."); + } + + if (write) { + printf("Writing to memory.\n"); + if (method != SHM) { + memset(ptr, 1, size); + } else { + printf("Starting the writes:\n"); + for (i = 0; i < size; i++) { + shmaddr[i] = (char)(i); + if (!(i % (1024 * 1024))) + printf("."); + } + printf("\n"); + + printf("Starting the Check..."); + for (i = 0; i < size; i++) + if (shmaddr[i] != (char)i) { + printf("\nIndex %lu mismatched\n", i); + exit(3); + } + printf("Done.\n"); + + + } + } + + if (want_sleep) { + /* Signal to caller that we're done. */ + printf("DONE\n"); + + /* Hold memory until external kill signal is delivered. */ + while (1) + sleep(100); + } + + close(fd); + + return 0; +} -- 2.23.0.rc1.153.gdeed80330f-goog
(+CC Michal Koutný, cgroups@vger.kernel.org, Aneesh Kumar)
On 8/8/19 4:13 PM, Mina Almasry wrote:
Problem: Currently tasks attempting to allocate more hugetlb memory than is available get a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1]. However, if a task attempts to allocate hugetlb memory only more than its hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call, but will SIGBUS the task when it attempts to fault the memory in.
We have developers interested in using hugetlb_cgroups, and they have expressed dissatisfaction regarding this behavior. We'd like to improve this behavior such that tasks violating the hugetlb_cgroup limits get an error on mmap/shmget time, rather than getting SIGBUS'd when they try to fault the excess memory in.
The underlying problem is that today's hugetlb_cgroup accounting happens at hugetlb memory *fault* time, rather than at *reservation* time. Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and the offending task gets SIGBUS'd.
Proposed Solution: A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This counter has slightly different semantics than hugetlb.xMB.[limit|usage]_in_bytes:
- While usage_in_bytes tracks all *faulted* hugetlb memory,
reservation_usage_in_bytes tracks all *reserved* hugetlb memory.
- If a task attempts to reserve more memory than limit_in_bytes allows,
the kernel will allow it to do so. But if a task attempts to reserve more memory than reservation_limit_in_bytes, the kernel will fail this reservation.
This proposal is implemented in this patch, with tests to verify functionality and show the usage.
Thanks for taking on this effort Mina.
Before looking at the details of the code, it might be helpful to discuss the expected semantics of the proposed reservation limits.
I see you took into account the differences between private and shared mappings. This is good, as the reservation behavior is different for each of these cases. First let's look at private mappings.
For private mappings, the reservation usage will be the size of the mapping. This should be fairly simple. As reservations are consumed in the hugetlbfs code, reservations in the resv_map are removed. I see you have a hook into region_del. So, the expectation is that as reservations are consumed the reservation usage will drop for the cgroup. Correct? The only tricky thing about private mappings is COW because of fork. Current reservation semantics specify that all reservations stay with the parent. If child faults and can not get page, SIGBUS. I assume the new reservation limits will work the same.
I believe tracking reservations for shared mappings can get quite complicated. The hugetlbfs reservation code around shared mappings 'works' on the basis that shared mapping reservations are global. As a result, reservations are more associated with the inode than with the task making the reservation. For example, consider a file of size 4 hugetlb pages. Task A maps the first 2 pages, and 2 reservations are taken. Task B maps all 4 pages, and 2 additional reservations are taken. I am not really sure of the desired semantics here for reservation limits if A and B are in separate cgroups. Should B be charged for 4 or 2 reservations? Also in the example above, after both tasks create their mappings suppose Task B faults in the first page. Does the reservation usage of Task A go down as it originally had the reservation?
It should also be noted that when hugetlbfs reservations are 'consumed' for shared mappings there are no changes to the resv_map. Rather the unmap code compares the contents of the page cache to the resv_map to determine how many reservations were actually consumed. I did not look close enough to determine the code drops reservation usage counts as pages are added to shared mappings.
On Fri, Aug 9, 2019 at 10:54 AM Mike Kravetz mike.kravetz@oracle.com wrote:
(+CC Michal Koutný, cgroups@vger.kernel.org, Aneesh Kumar)
On 8/8/19 4:13 PM, Mina Almasry wrote:
Problem: Currently tasks attempting to allocate more hugetlb memory than is available get a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1]. However, if a task attempts to allocate hugetlb memory only more than its hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call, but will SIGBUS the task when it attempts to fault the memory in.
We have developers interested in using hugetlb_cgroups, and they have expressed dissatisfaction regarding this behavior. We'd like to improve this behavior such that tasks violating the hugetlb_cgroup limits get an error on mmap/shmget time, rather than getting SIGBUS'd when they try to fault the excess memory in.
The underlying problem is that today's hugetlb_cgroup accounting happens at hugetlb memory *fault* time, rather than at *reservation* time. Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and the offending task gets SIGBUS'd.
Proposed Solution: A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This counter has slightly different semantics than hugetlb.xMB.[limit|usage]_in_bytes:
- While usage_in_bytes tracks all *faulted* hugetlb memory,
reservation_usage_in_bytes tracks all *reserved* hugetlb memory.
- If a task attempts to reserve more memory than limit_in_bytes allows,
the kernel will allow it to do so. But if a task attempts to reserve more memory than reservation_limit_in_bytes, the kernel will fail this reservation.
This proposal is implemented in this patch, with tests to verify functionality and show the usage.
Thanks for taking on this effort Mina.
No problem! Thanks for reviewing!
Before looking at the details of the code, it might be helpful to discuss the expected semantics of the proposed reservation limits.
I see you took into account the differences between private and shared mappings. This is good, as the reservation behavior is different for each of these cases. First let's look at private mappings.
For private mappings, the reservation usage will be the size of the mapping. This should be fairly simple. As reservations are consumed in the hugetlbfs code, reservations in the resv_map are removed. I see you have a hook into region_del. So, the expectation is that as reservations are consumed the reservation usage will drop for the cgroup. Correct?
I assume by 'reservations are consumed' you mean when a reservation goes from just 'reserved' to actually in use (as in the task is writing to the hugetlb page or something). If so, then the answer is no, that is not correct. When reservations are consumed, the reservation usage stays the same. I.e. the reservation usage tracks hugetlb memory (reserved + used) you could say. This is 100% the intention, as we want to know on mmap time if there is enough 'free' (that is unreserved and unused) memory left over in the cgroup to satisfy the mmap call.
The hooks in region_add and region_del are to account shared mappings only. There is a check in those code blocks that makes sure the code is only engaged in shared mappings. The commit messages of patches 3/5 and 4/5 go into more details regarding this.
The only tricky thing about private mappings is COW because of fork. Current reservation semantics specify that all reservations stay with the parent. If child faults and can not get page, SIGBUS. I assume the new reservation limits will work the same.
Although I did not explicitly try it, yes. It should work the same. The additional reservation due to the COW will get charged to whatever cgroup the fork is in. If the task can't get a page it gets SIGBUS'd. If there is not enough room to charge the cgroup it's in, then the charge will fail, which I assume will trigger error path that also leads to SIGBUS.
I believe tracking reservations for shared mappings can get quite complicated. The hugetlbfs reservation code around shared mappings 'works' on the basis that shared mapping reservations are global. As a result, reservations are more associated with the inode than with the task making the reservation.
FWIW, I found it not too bad. And my tests at least don't detect an anomaly around shared mappings. The key I think is that I'm tracking cgroup to uncharge on the file_region entry inside the resv_map, so we know who allocated each file_region entry exactly and we can uncharge them when the entry is region_del'd.
For example, consider a file of size 4 hugetlb pages. Task A maps the first 2 pages, and 2 reservations are taken. Task B maps all 4 pages, and 2 additional reservations are taken. I am not really sure of the desired semantics here for reservation limits if A and B are in separate cgroups. Should B be charged for 4 or 2 reservations?
Task A's cgroup is charged 2 pages to its reservation usage. Task B's cgroup is charged 2 pages to its reservation usage.
This is analogous to how shared memory accounting is done for things like tmpfs, and I see no strong reason right now to deviate. I.e. the task that made the reservation is charged with it, and others use it without getting charged.
Also in the example above, after both tasks create their mappings suppose Task B faults in the first page. Does the reservation usage of Task A go down as it originally had the reservation?
Reservation usage never goes down when pages are consumed. Yes, I would have this problem if I was planning to decrement reservation usage when pages are put into use, but, the goal is to find out if there is 'free' memory (unreserved + unused) in the cgroup at mmap time, so we want a counter that tracks (reserved + used).
It should also be noted that when hugetlbfs reservations are 'consumed' for shared mappings there are no changes to the resv_map. Rather the unmap code compares the contents of the page cache to the resv_map to determine how many reservations were actually consumed. I did not look close enough to determine the code drops reservation usage counts as pages are added to shared mappings.
I think this concern also goes away if reservation usage doesn't go down when pages are consumed, but let me know if you still have doubts.
Thanks for taking a look so far!
-- Mike Kravetz
On 8/9/19 12:42 PM, Mina Almasry wrote:
On Fri, Aug 9, 2019 at 10:54 AM Mike Kravetz mike.kravetz@oracle.com wrote:
On 8/8/19 4:13 PM, Mina Almasry wrote:
Problem: Currently tasks attempting to allocate more hugetlb memory than is available get a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1]. However, if a task attempts to allocate hugetlb memory only more than its hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call, but will SIGBUS the task when it attempts to fault the memory in.
<snip>
I believe tracking reservations for shared mappings can get quite complicated. The hugetlbfs reservation code around shared mappings 'works' on the basis that shared mapping reservations are global. As a result, reservations are more associated with the inode than with the task making the reservation.
FWIW, I found it not too bad. And my tests at least don't detect an anomaly around shared mappings. The key I think is that I'm tracking cgroup to uncharge on the file_region entry inside the resv_map, so we know who allocated each file_region entry exactly and we can uncharge them when the entry is region_del'd.
For example, consider a file of size 4 hugetlb pages. Task A maps the first 2 pages, and 2 reservations are taken. Task B maps all 4 pages, and 2 additional reservations are taken. I am not really sure of the desired semantics here for reservation limits if A and B are in separate cgroups. Should B be charged for 4 or 2 reservations?
Task A's cgroup is charged 2 pages to its reservation usage. Task B's cgroup is charged 2 pages to its reservation usage.
OK, Suppose Task B's cgroup allowed 2 huge pages reservation and 2 huge pages allocation. The mmap would succeed, but Task B could potentially need to allocate more than 2 huge pages. So, when faulting in more than 2 huge pages B would get a SIGBUS. Correct? Or, am I missing something?
Perhaps reservation charge should always be the same as map size/maximum allocation size?
On Sat, Aug 10, 2019 at 11:58 AM Mike Kravetz mike.kravetz@oracle.com wrote:
On 8/9/19 12:42 PM, Mina Almasry wrote:
On Fri, Aug 9, 2019 at 10:54 AM Mike Kravetz mike.kravetz@oracle.com wrote:
On 8/8/19 4:13 PM, Mina Almasry wrote:
Problem: Currently tasks attempting to allocate more hugetlb memory than is available get a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1]. However, if a task attempts to allocate hugetlb memory only more than its hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call, but will SIGBUS the task when it attempts to fault the memory in.
<snip> >> I believe tracking reservations for shared mappings can get quite complicated. >> The hugetlbfs reservation code around shared mappings 'works' on the basis >> that shared mapping reservations are global. As a result, reservations are >> more associated with the inode than with the task making the reservation. > > FWIW, I found it not too bad. And my tests at least don't detect an > anomaly around shared mappings. The key I think is that I'm tracking > cgroup to uncharge on the file_region entry inside the resv_map, so we > know who allocated each file_region entry exactly and we can uncharge > them when the entry is region_del'd. > >> For example, consider a file of size 4 hugetlb pages. >> Task A maps the first 2 pages, and 2 reservations are taken. Task B maps >> all 4 pages, and 2 additional reservations are taken. I am not really sure >> of the desired semantics here for reservation limits if A and B are in separate >> cgroups. Should B be charged for 4 or 2 reservations? > > Task A's cgroup is charged 2 pages to its reservation usage. > Task B's cgroup is charged 2 pages to its reservation usage.
OK, Suppose Task B's cgroup allowed 2 huge pages reservation and 2 huge pages allocation. The mmap would succeed, but Task B could potentially need to allocate more than 2 huge pages. So, when faulting in more than 2 huge pages B would get a SIGBUS. Correct? Or, am I missing something?
Perhaps reservation charge should always be the same as map size/maximum allocation size?
I'm thinking this would work similar to how other shared memory like tmpfs is accounted for right now. I.e. if a task conducts an operation that causes memory to be allocated then that task is charged for that memory, and if another task uses memory that has already been allocated and charged by another task, then it can use the memory without being charged.
So in case of hugetlb memory, if a task is mmaping memory that causes a new reservation to be made, and new entries to be created in the resv_map for the shared mapping, then that task gets charged. If the task is mmaping memory that is already reserved or faulted, then it reserves or faults it without getting charged.
In the example above, in chronological order: - Task A mmaps 2 hugetlb pages, gets charged 2 hugetlb reservations. - Task B mmaps 4 hugetlb pages, gets charged only 2 hugetlb reservations because the first 2 are charged already and can be used without incurring a charge. - Task B accesses 4 hugetlb pages, gets charged *4* hugetlb faults, since none of the 4 pages are faulted in yet. If the task is only allowed 2 hugetlb page faults then it will actually get a SIGBUS. - Task A accesses 4 hugetlb pages, gets charged no faults, since all the hugetlb faults is charged to Task B.
So, yes, I can see a scenario where userspace still gets SIGBUS'd, but I think that's fine because: 1. Notice that the SIGBUS is due to the faulting limit, and not the reservation limit, so we're not regressing the status quo per say. Folks using the fault limit today understand the SIGBUS risk. 2. the way I expect folks to use this is to use 'reservation limits' to partition the available hugetlb memory on the machine using it and forgo using the existing fault limits. Using both at the same time I think would be a superuser feature for folks that really know what they are doing, and understand the risk of SIGBUS that comes with using the existing fault limits. 3. I expect userspace to in general handle this correctly because there are similar challenges with all shared memory and accounting of it, even in tmpfs, I think.
I would not like to charge the full reservation to every process that does the mmap. Think of this, much more common scenario: Task A and B are supposed to collaborate on a 10 hugetlb pages of data. Task B should not access any hugetlb memory other than the memory it is working on with Task A, so:
1. Task A is put in a cgroup with 10 hugetlb pages reservation limit. 2. Task B is put in a cgroup with 0 hugetlb pages of reservation limit. 3. Task A mmaps 10 hugetlb pages of hugetlb memory, and notifies Task B that it is done. 4. Task B, due to programmer error, tries to mmap hugetlb memory beyond what Task A set up for it, it gets denied at mmap time by the cgroup reservation limit. 5. Task B mmaps the same 10 hugetlb pages of memory and starts working on them. The mmap succeeds because Task B is not charged anything.
If we were charging the full reservation to both Tasks A and B, then both A and B would have be in cgroups that allow 10 pages of hugetlb reservations, and the mmap in step 4 would succeed, and we accidentally overcommitted the amount of hugetlb memory available.
-- Mike Kravetz
On 8/10/19 3:01 PM, Mina Almasry wrote:
On Sat, Aug 10, 2019 at 11:58 AM Mike Kravetz mike.kravetz@oracle.com wrote:
On 8/9/19 12:42 PM, Mina Almasry wrote:
On Fri, Aug 9, 2019 at 10:54 AM Mike Kravetz mike.kravetz@oracle.com wrote:
On 8/8/19 4:13 PM, Mina Almasry wrote:
Problem: Currently tasks attempting to allocate more hugetlb memory than is available get a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1]. However, if a task attempts to allocate hugetlb memory only more than its hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call, but will SIGBUS the task when it attempts to fault the memory in.
<snip> >> I believe tracking reservations for shared mappings can get quite complicated. >> The hugetlbfs reservation code around shared mappings 'works' on the basis >> that shared mapping reservations are global. As a result, reservations are >> more associated with the inode than with the task making the reservation. > > FWIW, I found it not too bad. And my tests at least don't detect an > anomaly around shared mappings. The key I think is that I'm tracking > cgroup to uncharge on the file_region entry inside the resv_map, so we > know who allocated each file_region entry exactly and we can uncharge > them when the entry is region_del'd. > >> For example, consider a file of size 4 hugetlb pages. >> Task A maps the first 2 pages, and 2 reservations are taken. Task B maps >> all 4 pages, and 2 additional reservations are taken. I am not really sure >> of the desired semantics here for reservation limits if A and B are in separate >> cgroups. Should B be charged for 4 or 2 reservations? > > Task A's cgroup is charged 2 pages to its reservation usage. > Task B's cgroup is charged 2 pages to its reservation usage.
OK, Suppose Task B's cgroup allowed 2 huge pages reservation and 2 huge pages allocation. The mmap would succeed, but Task B could potentially need to allocate more than 2 huge pages. So, when faulting in more than 2 huge pages B would get a SIGBUS. Correct? Or, am I missing something?
Perhaps reservation charge should always be the same as map size/maximum allocation size?
I'm thinking this would work similar to how other shared memory like tmpfs is accounted for right now. I.e. if a task conducts an operation that causes memory to be allocated then that task is charged for that memory, and if another task uses memory that has already been allocated and charged by another task, then it can use the memory without being charged.
So in case of hugetlb memory, if a task is mmaping memory that causes a new reservation to be made, and new entries to be created in the resv_map for the shared mapping, then that task gets charged. If the task is mmaping memory that is already reserved or faulted, then it reserves or faults it without getting charged.
In the example above, in chronological order:
- Task A mmaps 2 hugetlb pages, gets charged 2 hugetlb reservations.
- Task B mmaps 4 hugetlb pages, gets charged only 2 hugetlb
reservations because the first 2 are charged already and can be used without incurring a charge.
- Task B accesses 4 hugetlb pages, gets charged *4* hugetlb faults,
since none of the 4 pages are faulted in yet. If the task is only allowed 2 hugetlb page faults then it will actually get a SIGBUS.
- Task A accesses 4 hugetlb pages, gets charged no faults, since all
the hugetlb faults is charged to Task B.
So, yes, I can see a scenario where userspace still gets SIGBUS'd, but I think that's fine because:
- Notice that the SIGBUS is due to the faulting limit, and not the
reservation limit, so we're not regressing the status quo per say. Folks using the fault limit today understand the SIGBUS risk. 2. the way I expect folks to use this is to use 'reservation limits' to partition the available hugetlb memory on the machine using it and forgo using the existing fault limits. Using both at the same time I think would be a superuser feature for folks that really know what they are doing, and understand the risk of SIGBUS that comes with using the existing fault limits. 3. I expect userspace to in general handle this correctly because there are similar challenges with all shared memory and accounting of it, even in tmpfs, I think.
Ok, that helps explain your use case. I agree that it would be difficult to use both fault and reservation limits together. Especially in the case of shared mappings.
linux-kselftest-mirror@lists.linaro.org