This is an updated patchset rebasing onto -torvalds master (post 6.11-rc1), and addressing comments from Michal and Tejun.
As requested by Tejun and Johannes, I've removed the explicit check for the string "reset", so it now allows any non-empty string. (Empty strings get filtered before our write handler executes)
I've also made several of the field reads and writes atomic with {READ,WRITE}_ONCE, and adjusted more of the types to be unsigned.
Documentation/admin-guide/cgroup-v2.rst | 22 ++-- include/linux/cgroup-defs.h | 5 + include/linux/cgroup.h | 3 + include/linux/memcontrol.h | 5 + include/linux/page_counter.h | 11 +- kernel/cgroup/cgroup-internal.h | 2 + kernel/cgroup/cgroup.c | 7 + mm/memcontrol.c | 116 +++++++++++++++-- mm/page_counter.c | 30 +++-- tools/testing/selftests/cgroup/cgroup_util.c | 22 ++++ tools/testing/selftests/cgroup/cgroup_util.h | 2 + tools/testing/selftests/cgroup/test_memcontrol.c | 229 +++++++++++++++++++++++++++++++-- 12 files changed, 419 insertions(+), 35 deletions(-)
[1]: https://lore.kernel.org/cgroups/20240724161942.3448841-3-davidf@vimeo.com/T/
Other mechanisms for querying the peak memory usage of either a process or v1 memory cgroup allow for resetting the high watermark. Restore parity with those mechanisms, but with a less racy API.
For example: - Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets the high watermark. - writing "5" to the clear_refs pseudo-file in a processes's proc directory resets the peak RSS.
This change is an evolution of a previous patch, which mostly copied the cgroup v1 behavior, however, there were concerns about races/ownership issues with a global reset, so instead this change makes the reset filedescriptor-local.
Writing any non-empty string to the memory.peak and memory.swap.peak pseudo-files reset the high watermark to the current usage for subsequent reads through that same FD.
Notably, following Johannes's suggestion, this implementation moves the O(FDs that have written) behavior onto the FD write(2) path. Instead, on the page-allocation path, we simply add one additional watermark to conditionally bump per-hierarchy level in the page-counter.
Additionally, this takes Longman's suggestion of nesting the page-charging-path checks for the two watermarks to reduce the number of common-case comparisons.
This behavior is particularly useful for work scheduling systems that need to track memory usage of worker processes/cgroups per-work-item. Since memory can't be squeezed like CPU can (the OOM-killer has opinions), these systems need to track the peak memory usage to compute system/container fullness when binpacking workitems.
Most notably, Vimeo's use-case involves a system that's doing global binpacking across many Kubernetes pods/containers, and while we can use PSI for some local decisions about overload, we strive to avoid packing workloads too tightly in the first place. To facilitate this, we track the peak memory usage. However, since we run with long-lived workers (to amortize startup costs) we need a way to track the high watermark while a work-item is executing. Polling runs the risk of missing short spikes that last for timescales below the polling interval, and peak memory tracking at the cgroup level is otherwise perfect for this use-case.
As this data is used to ensure that binpacked work ends up with sufficient headroom, this use-case mostly avoids the inaccuracies surrounding reclaimable memory.
Suggested-by: Johannes Weiner hannes@cmpxchg.org Suggested-by: Waiman Long longman@redhat.com Signed-off-by: David Finkel davidf@vimeo.com --- Documentation/admin-guide/cgroup-v2.rst | 22 +++-- include/linux/cgroup-defs.h | 5 + include/linux/cgroup.h | 3 + include/linux/memcontrol.h | 5 + include/linux/page_counter.h | 11 ++- kernel/cgroup/cgroup-internal.h | 2 + kernel/cgroup/cgroup.c | 7 ++ mm/memcontrol.c | 116 ++++++++++++++++++++++-- mm/page_counter.c | 30 ++++-- 9 files changed, 174 insertions(+), 27 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 86311c2907cd3..f0499884124d2 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1333,11 +1333,14 @@ The following nested keys are defined. all the existing limitations and potential future extensions.
memory.peak - A read-only single value file which exists on non-root - cgroups. + A read-write single value file which exists on non-root cgroups. + + The max memory usage recorded for the cgroup and its descendants since + either the creation of the cgroup or the most recent reset for that FD.
- The max memory usage recorded for the cgroup and its - descendants since the creation of the cgroup. + A write of any non-empty string to this file resets it to the + current memory usage for subsequent reads through the same + file descriptor.
memory.oom.group A read-write single value file which exists on non-root @@ -1663,11 +1666,14 @@ The following nested keys are defined. Healthy workloads are not expected to reach this limit.
memory.swap.peak - A read-only single value file which exists on non-root - cgroups. + A read-write single value file which exists on non-root cgroups. + + The max swap usage recorded for the cgroup and its descendants since + the creation of the cgroup or the most recent reset for that FD.
- The max swap usage recorded for the cgroup and its - descendants since the creation of the cgroup. + A write of any non-empty string to this file resets it to the + current memory usage for subsequent reads through the same + file descriptor.
memory.swap.max A read-write single value file which exists on non-root diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index ae04035b6cbe5..7fc2d0195f560 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -775,6 +775,11 @@ struct cgroup_subsys {
extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
+struct cgroup_of_peak { + unsigned long value; + struct list_head list; +}; + /** * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups * @tsk: target task diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index c60ba0ab14627..3e0563753cc3e 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -11,6 +11,7 @@
#include <linux/sched.h> #include <linux/nodemask.h> +#include <linux/list.h> #include <linux/rculist.h> #include <linux/cgroupstats.h> #include <linux/fs.h> @@ -854,4 +855,6 @@ static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
struct cgroup *task_get_cgroup1(struct task_struct *tsk, int hierarchy_id);
+struct cgroup_of_peak *of_peak(struct kernfs_open_file *of); + #endif /* _LINUX_CGROUP_H */ diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0e5bf25d324f0..cc74d73d3b065 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -193,6 +193,11 @@ struct mem_cgroup { struct page_counter memsw; /* v1 only */ };
+ /* registered local peak watchers */ + struct list_head memory_peaks; + struct list_head swap_peaks; + spinlock_t peaks_lock; + /* Range enforcement for interrupt charges */ struct work_struct high_work;
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h index 904c52f97284f..898f562c0b838 100644 --- a/include/linux/page_counter.h +++ b/include/linux/page_counter.h @@ -26,6 +26,8 @@ struct page_counter { atomic_long_t children_low_usage;
unsigned long watermark; + /* Latest cg2 reset watermark */ + unsigned long local_watermark; unsigned long failcnt;
/* Keep all the read most fields in a separete cacheline. */ @@ -78,7 +80,14 @@ int page_counter_memparse(const char *buf, const char *max,
static inline void page_counter_reset_watermark(struct page_counter *counter) { - counter->watermark = page_counter_read(counter); + unsigned long usage = page_counter_read(counter); + + /* + * Update local_watermark first, so it's always <= watermark + * (modulo CPU/compiler re-ordering) + */ + counter->local_watermark = usage; + counter->watermark = usage; }
void page_counter_calculate_protection(struct page_counter *root, diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index 520b90dd97eca..c964dd7ff967a 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -81,6 +81,8 @@ struct cgroup_file_ctx { struct { struct cgroup_pidlist *pidlist; } procs1; + + struct cgroup_of_peak peak; };
/* diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index c8e4b62b436a4..0a97cb2ef1245 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1972,6 +1972,13 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param return -EINVAL; }
+struct cgroup_of_peak *of_peak(struct kernfs_open_file *of) +{ + struct cgroup_file_ctx *ctx = of->priv; + + return &ctx->peak; +} + static void apply_cgroup_root_flags(unsigned int root_flags) { if (current->nsproxy->cgroup_ns == &init_cgroup_ns) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 9603717886877..2663e2108cdbe 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -25,6 +25,7 @@ * Copyright (C) 2020 Alibaba, Inc, Alex Shi */
+#include <linux/cgroup-defs.h> #include <linux/page_counter.h> #include <linux/memcontrol.h> #include <linux/cgroup.h> @@ -41,6 +42,7 @@ #include <linux/rcupdate.h> #include <linux/limits.h> #include <linux/export.h> +#include <linux/list.h> #include <linux/mutex.h> #include <linux/rbtree.h> #include <linux/slab.h> @@ -3558,6 +3560,9 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
INIT_WORK(&memcg->high_work, high_work_func); vmpressure_init(&memcg->vmpressure); + INIT_LIST_HEAD(&memcg->memory_peaks); + INIT_LIST_HEAD(&memcg->swap_peaks); + spin_lock_init(&memcg->peaks_lock); memcg->socket_pressure = jiffies; memcg1_memcg_init(memcg); memcg->kmemcg_id = -1; @@ -3950,14 +3955,91 @@ static u64 memory_current_read(struct cgroup_subsys_state *css, return (u64)page_counter_read(&memcg->memory) * PAGE_SIZE; }
-static u64 memory_peak_read(struct cgroup_subsys_state *css, - struct cftype *cft) +#define OFP_PEAK_UNSET (((-1UL))) + +static int peak_show(struct seq_file *sf, void *v, struct page_counter *pc) { - struct mem_cgroup *memcg = mem_cgroup_from_css(css); + struct cgroup_of_peak *ofp = of_peak(sf->private); + u64 fd_peak = READ_ONCE(ofp->value), peak; + + /* User wants global or local peak? */ + if (fd_peak == OFP_PEAK_UNSET) + peak = pc->watermark; + else + peak = max(fd_peak, READ_ONCE(pc->local_watermark)); + + seq_printf(sf, "%llu\n", peak * PAGE_SIZE); + return 0; +} + +static int memory_peak_show(struct seq_file *sf, void *v) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(sf)); + + return peak_show(sf, v, &memcg->memory); +} + +static int peak_open(struct kernfs_open_file *of) +{ + struct cgroup_of_peak *ofp = of_peak(of); + + ofp->value = OFP_PEAK_UNSET; + return 0; +} + +static void peak_release(struct kernfs_open_file *of) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + struct cgroup_of_peak *ofp = of_peak(of); + + if (ofp->value == OFP_PEAK_UNSET) { + /* fast path (no writes on this fd) */ + return; + } + spin_lock(&memcg->peaks_lock); + list_del(&ofp->list); + spin_unlock(&memcg->peaks_lock); +} + +static ssize_t peak_write(struct kernfs_open_file *of, char *buf, size_t nbytes, + loff_t off, struct page_counter *pc, + struct list_head *watchers) +{ + unsigned long usage; + struct cgroup_of_peak *peer_ctx; + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + struct cgroup_of_peak *ofp = of_peak(of); + + spin_lock(&memcg->peaks_lock); + + usage = page_counter_read(pc); + WRITE_ONCE(pc->local_watermark, usage); + + list_for_each_entry(peer_ctx, watchers, list) + if (usage > peer_ctx->value) + WRITE_ONCE(peer_ctx->value, usage); + + /* initial write, register watcher */ + if (ofp->value == -1) + list_add(&ofp->list, watchers); + + WRITE_ONCE(ofp->value, usage); + spin_unlock(&memcg->peaks_lock); + + return nbytes; +} + +static ssize_t memory_peak_write(struct kernfs_open_file *of, char *buf, + size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
- return (u64)memcg->memory.watermark * PAGE_SIZE; + return peak_write(of, buf, nbytes, off, &memcg->memory, + &memcg->memory_peaks); }
+#undef OFP_PEAK_UNSET + static int memory_min_show(struct seq_file *m, void *v) { return seq_puts_memcg_tunable(m, @@ -4307,7 +4389,10 @@ static struct cftype memory_files[] = { { .name = "peak", .flags = CFTYPE_NOT_ON_ROOT, - .read_u64 = memory_peak_read, + .open = peak_open, + .release = peak_release, + .seq_show = memory_peak_show, + .write = memory_peak_write, }, { .name = "min", @@ -5099,12 +5184,20 @@ static u64 swap_current_read(struct cgroup_subsys_state *css, return (u64)page_counter_read(&memcg->swap) * PAGE_SIZE; }
-static u64 swap_peak_read(struct cgroup_subsys_state *css, - struct cftype *cft) +static int swap_peak_show(struct seq_file *sf, void *v) { - struct mem_cgroup *memcg = mem_cgroup_from_css(css); + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(sf)); + + return peak_show(sf, v, &memcg->swap); +} + +static ssize_t swap_peak_write(struct kernfs_open_file *of, char *buf, + size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
- return (u64)memcg->swap.watermark * PAGE_SIZE; + return peak_write(of, buf, nbytes, off, &memcg->swap, + &memcg->swap_peaks); }
static int swap_high_show(struct seq_file *m, void *v) @@ -5188,7 +5281,10 @@ static struct cftype swap_files[] = { { .name = "swap.peak", .flags = CFTYPE_NOT_ON_ROOT, - .read_u64 = swap_peak_read, + .open = peak_open, + .release = peak_release, + .seq_show = swap_peak_show, + .write = swap_peak_write, }, { .name = "swap.events", diff --git a/mm/page_counter.c b/mm/page_counter.c index 0153f5bb31611..ad9bdde5d5d20 100644 --- a/mm/page_counter.c +++ b/mm/page_counter.c @@ -79,9 +79,22 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages) /* * This is indeed racy, but we can live with some * inaccuracy in the watermark. + * + * Notably, we have two watermarks to allow for both a globally + * visible peak and one that can be reset at a smaller scope. + * + * Since we reset both watermarks when the global reset occurs, + * we can guarantee that watermark >= local_watermark, so we + * don't need to do both comparisons every time. + * + * On systems with branch predictors, the inner condition should + * be almost free. */ - if (new > READ_ONCE(c->watermark)) - WRITE_ONCE(c->watermark, new); + if (new > READ_ONCE(c->local_watermark)) { + WRITE_ONCE(c->local_watermark, new); + if (new > READ_ONCE(c->watermark)) + WRITE_ONCE(c->watermark, new); + } } }
@@ -129,12 +142,13 @@ bool page_counter_try_charge(struct page_counter *counter, goto failed; } propagate_protected_usage(c, new); - /* - * Just like with failcnt, we can live with some - * inaccuracy in the watermark. - */ - if (new > READ_ONCE(c->watermark)) - WRITE_ONCE(c->watermark, new); + + /* see comment on page_counter_charge */ + if (new > READ_ONCE(c->local_watermark)) { + WRITE_ONCE(c->local_watermark, new); + if (new > READ_ONCE(c->watermark)) + WRITE_ONCE(c->watermark, new); + } } return true;
On Mon, Jul 29, 2024 at 10:37:42AM -0400, David Finkel wrote:
Other mechanisms for querying the peak memory usage of either a process or v1 memory cgroup allow for resetting the high watermark. Restore parity with those mechanisms, but with a less racy API.
For example:
- Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets the high watermark.
- writing "5" to the clear_refs pseudo-file in a processes's proc directory resets the peak RSS.
This change is an evolution of a previous patch, which mostly copied the cgroup v1 behavior, however, there were concerns about races/ownership issues with a global reset, so instead this change makes the reset filedescriptor-local.
Writing any non-empty string to the memory.peak and memory.swap.peak pseudo-files reset the high watermark to the current usage for subsequent reads through that same FD.
Notably, following Johannes's suggestion, this implementation moves the O(FDs that have written) behavior onto the FD write(2) path. Instead, on the page-allocation path, we simply add one additional watermark to conditionally bump per-hierarchy level in the page-counter.
Additionally, this takes Longman's suggestion of nesting the page-charging-path checks for the two watermarks to reduce the number of common-case comparisons.
This behavior is particularly useful for work scheduling systems that need to track memory usage of worker processes/cgroups per-work-item. Since memory can't be squeezed like CPU can (the OOM-killer has opinions), these systems need to track the peak memory usage to compute system/container fullness when binpacking workitems.
Most notably, Vimeo's use-case involves a system that's doing global binpacking across many Kubernetes pods/containers, and while we can use PSI for some local decisions about overload, we strive to avoid packing workloads too tightly in the first place. To facilitate this, we track the peak memory usage. However, since we run with long-lived workers (to amortize startup costs) we need a way to track the high watermark while a work-item is executing. Polling runs the risk of missing short spikes that last for timescales below the polling interval, and peak memory tracking at the cgroup level is otherwise perfect for this use-case.
As this data is used to ensure that binpacked work ends up with sufficient headroom, this use-case mostly avoids the inaccuracies surrounding reclaimable memory.
Suggested-by: Johannes Weiner hannes@cmpxchg.org Suggested-by: Waiman Long longman@redhat.com Signed-off-by: David Finkel davidf@vimeo.com
Acked-by: Johannes Weiner hannes@cmpxchg.org
On Mon, Jul 29, 2024 at 10:37:42AM GMT, David Finkel davidf@vimeo.com wrote:
... Documentation/admin-guide/cgroup-v2.rst | 22 +++-- include/linux/cgroup-defs.h | 5 + include/linux/cgroup.h | 3 + include/linux/memcontrol.h | 5 + include/linux/page_counter.h | 11 ++- kernel/cgroup/cgroup-internal.h | 2 + kernel/cgroup/cgroup.c | 7 ++ mm/memcontrol.c | 116 ++++++++++++++++++++++-- mm/page_counter.c | 30 ++++-- 9 files changed, 174 insertions(+), 27 deletions(-)
Thanks for the fixups, you can add
Reviewed-by: Michal Koutný mkoutny@suse.com
(I believe the test failure in the other thread is unrelated.)
Extend two existing tests to cover extracting memory usage through the newly mutable memory.peak and memory.swap.peak handlers.
In particular, make sure to exercise adding and removing watchers with overlapping lifetimes so the less-trivial logic gets tested.
Signed-off-by: David Finkel davidf@vimeo.com --- tools/testing/selftests/cgroup/cgroup_util.c | 22 ++ tools/testing/selftests/cgroup/cgroup_util.h | 2 + .../selftests/cgroup/test_memcontrol.c | 229 +++++++++++++++++- 3 files changed, 245 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c index 432db923bced0..1e2d46636a0ca 100644 --- a/tools/testing/selftests/cgroup/cgroup_util.c +++ b/tools/testing/selftests/cgroup/cgroup_util.c @@ -141,6 +141,16 @@ long cg_read_long(const char *cgroup, const char *control) return atol(buf); }
+long cg_read_long_fd(int fd) +{ + char buf[128]; + + if (pread(fd, buf, sizeof(buf), 0) <= 0) + return -1; + + return atol(buf); +} + long cg_read_key_long(const char *cgroup, const char *control, const char *key) { char buf[PAGE_SIZE]; @@ -183,6 +193,18 @@ int cg_write(const char *cgroup, const char *control, char *buf) return ret == len ? 0 : ret; }
+/* + * Returns fd on success, or -1 on failure. + * (fd should be closed with close() as usual) + */ +int cg_open(const char *cgroup, const char *control, int flags) +{ + char path[PATH_MAX]; + + snprintf(path, sizeof(path), "%s/%s", cgroup, control); + return open(path, flags); +} + int cg_write_numeric(const char *cgroup, const char *control, long value) { char buf[64]; diff --git a/tools/testing/selftests/cgroup/cgroup_util.h b/tools/testing/selftests/cgroup/cgroup_util.h index e8d04ac9e3d23..19b131ee77072 100644 --- a/tools/testing/selftests/cgroup/cgroup_util.h +++ b/tools/testing/selftests/cgroup/cgroup_util.h @@ -34,9 +34,11 @@ extern int cg_read_strcmp(const char *cgroup, const char *control, extern int cg_read_strstr(const char *cgroup, const char *control, const char *needle); extern long cg_read_long(const char *cgroup, const char *control); +extern long cg_read_long_fd(int fd); long cg_read_key_long(const char *cgroup, const char *control, const char *key); extern long cg_read_lc(const char *cgroup, const char *control); extern int cg_write(const char *cgroup, const char *control, char *buf); +extern int cg_open(const char *cgroup, const char *control, int flags); int cg_write_numeric(const char *cgroup, const char *control, long value); extern int cg_run(const char *cgroup, int (*fn)(const char *cgroup, void *arg), diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c index 41ae8047b8895..f54c1f75b6da7 100644 --- a/tools/testing/selftests/cgroup/test_memcontrol.c +++ b/tools/testing/selftests/cgroup/test_memcontrol.c @@ -161,13 +161,15 @@ static int alloc_pagecache_50M_check(const char *cgroup, void *arg) /* * This test create a memory cgroup, allocates * some anonymous memory and some pagecache - * and check memory.current and some memory.stat values. + * and checks memory.current, memory.peak, and some memory.stat values. */ -static int test_memcg_current(const char *root) +static int test_memcg_current_peak(const char *root) { int ret = KSFT_FAIL; - long current; + long current, peak, peak_reset; char *memcg; + bool fd2_closed = false, fd3_closed = false, fd4_closed = false; + int peak_fd = -1, peak_fd2 = -1, peak_fd3 = -1, peak_fd4 = -1;
memcg = cg_name(root, "memcg_test"); if (!memcg) @@ -180,15 +182,108 @@ static int test_memcg_current(const char *root) if (current != 0) goto cleanup;
+ peak = cg_read_long(memcg, "memory.peak"); + if (peak != 0) + goto cleanup; + if (cg_run(memcg, alloc_anon_50M_check, NULL)) goto cleanup;
+ peak = cg_read_long(memcg, "memory.peak"); + if (peak < MB(50)) + goto cleanup; + + /* + * We'll open a few FDs for the same memory.peak file to exercise the free-path + * We need at least three to be closed in a different order than writes occurred to test + * the linked-list handling. + */ + peak_fd = cg_open(memcg, "memory.peak", O_RDWR | O_APPEND | O_CLOEXEC); + + if (peak_fd == -1) + goto cleanup; + + peak_fd2 = cg_open(memcg, "memory.peak", O_RDWR | O_APPEND | O_CLOEXEC); + + if (peak_fd2 == -1) + goto cleanup; + + peak_fd3 = cg_open(memcg, "memory.peak", O_RDWR | O_APPEND | O_CLOEXEC); + + if (peak_fd3 == -1) + goto cleanup; + + /* any non-empty string resets, but make it clear */ + static const char reset_string[] = "reset\n"; + + peak_reset = write(peak_fd, reset_string, sizeof(reset_string)); + if (peak_reset != sizeof(reset_string)) + goto cleanup; + + peak_reset = write(peak_fd2, reset_string, sizeof(reset_string)); + if (peak_reset != sizeof(reset_string)) + goto cleanup; + + peak_reset = write(peak_fd3, reset_string, sizeof(reset_string)); + if (peak_reset != sizeof(reset_string)) + goto cleanup; + + /* Make sure a completely independent read isn't affected by our FD-local reset above*/ + peak = cg_read_long(memcg, "memory.peak"); + if (peak < MB(50)) + goto cleanup; + + fd2_closed = true; + if (close(peak_fd2)) + goto cleanup; + + peak_fd4 = cg_open(memcg, "memory.peak", O_RDWR | O_APPEND | O_CLOEXEC); + + if (peak_fd4 == -1) + goto cleanup; + + peak_reset = write(peak_fd4, reset_string, sizeof(reset_string)); + if (peak_reset != sizeof(reset_string)) + goto cleanup; + + peak = cg_read_long_fd(peak_fd); + if (peak > MB(30) || peak < 0) + goto cleanup; + if (cg_run(memcg, alloc_pagecache_50M_check, NULL)) goto cleanup;
+ peak = cg_read_long(memcg, "memory.peak"); + if (peak < MB(50)) + goto cleanup; + + /* Make sure everything is back to normal */ + peak = cg_read_long_fd(peak_fd); + if (peak < MB(50)) + goto cleanup; + + peak = cg_read_long_fd(peak_fd4); + if (peak < MB(50)) + goto cleanup; + + fd3_closed = true; + if (close(peak_fd3)) + goto cleanup; + + fd4_closed = true; + if (close(peak_fd4)) + goto cleanup; + ret = KSFT_PASS;
cleanup: + close(peak_fd); + if (!fd2_closed) + close(peak_fd2); + if (!fd3_closed) + close(peak_fd3); + if (!fd4_closed) + close(peak_fd4); cg_destroy(memcg); free(memcg);
@@ -817,13 +912,17 @@ static int alloc_anon_50M_check_swap(const char *cgroup, void *arg)
/* * This test checks that memory.swap.max limits the amount of - * anonymous memory which can be swapped out. + * anonymous memory which can be swapped out. Additionally, it verifies that + * memory.swap.peak reflects the high watermark and can be reset. */ -static int test_memcg_swap_max(const char *root) +static int test_memcg_swap_max_peak(const char *root) { int ret = KSFT_FAIL; char *memcg; - long max; + long max, peak; + + /* any non-empty string resets */ + static const char reset_string[] = "foobarbaz";
if (!is_swap_enabled()) return KSFT_SKIP; @@ -840,6 +939,45 @@ static int test_memcg_swap_max(const char *root) goto cleanup; }
+ int swap_peak_fd = cg_open(memcg, "memory.swap.peak", + O_RDWR | O_APPEND | O_CLOEXEC); + + if (swap_peak_fd == -1) + goto cleanup; + + int mem_peak_fd = cg_open(memcg, "memory.peak", O_RDWR | O_APPEND | O_CLOEXEC); + + if (mem_peak_fd == -1) + goto cleanup; + + if (cg_read_long(memcg, "memory.swap.peak")) + goto cleanup; + + if (cg_read_long_fd(swap_peak_fd)) + goto cleanup; + + /* switch the swap and mem fds into local-peak tracking mode*/ + int peak_reset = write(swap_peak_fd, reset_string, sizeof(reset_string)); + + if (peak_reset != sizeof(reset_string)) + goto cleanup; + + if (cg_read_long_fd(swap_peak_fd)) + goto cleanup; + + if (cg_read_long(memcg, "memory.peak")) + goto cleanup; + + if (cg_read_long_fd(mem_peak_fd)) + goto cleanup; + + peak_reset = write(mem_peak_fd, reset_string, sizeof(reset_string)); + if (peak_reset != sizeof(reset_string)) + goto cleanup; + + if (cg_read_long_fd(mem_peak_fd)) + goto cleanup; + if (cg_read_strcmp(memcg, "memory.max", "max\n")) goto cleanup;
@@ -862,6 +1000,61 @@ static int test_memcg_swap_max(const char *root) if (cg_read_key_long(memcg, "memory.events", "oom_kill ") != 1) goto cleanup;
+ peak = cg_read_long(memcg, "memory.peak"); + if (peak < MB(29)) + goto cleanup; + + peak = cg_read_long(memcg, "memory.swap.peak"); + if (peak < MB(29)) + goto cleanup; + + peak = cg_read_long_fd(mem_peak_fd); + if (peak < MB(29)) + goto cleanup; + + peak = cg_read_long_fd(swap_peak_fd); + if (peak < MB(29)) + goto cleanup; + + /* + * open, reset and close the peak swap on another FD to make sure + * multiple extant fds don't corrupt the linked-list + */ + peak_reset = cg_write(memcg, "memory.swap.peak", (char *)reset_string); + if (peak_reset) + goto cleanup; + + peak_reset = cg_write(memcg, "memory.peak", (char *)reset_string); + if (peak_reset) + goto cleanup; + + /* actually reset on the fds */ + peak_reset = write(swap_peak_fd, reset_string, sizeof(reset_string)); + if (peak_reset != sizeof(reset_string)) + goto cleanup; + + peak_reset = write(mem_peak_fd, reset_string, sizeof(reset_string)); + if (peak_reset != sizeof(reset_string)) + goto cleanup; + + peak = cg_read_long_fd(swap_peak_fd); + if (peak > MB(10)) + goto cleanup; + + /* + * The cgroup is now empty, but there may be a page or two associated + * with the open FD accounted to it. + */ + peak = cg_read_long_fd(mem_peak_fd); + if (peak > MB(1)) + goto cleanup; + + if (cg_read_long(memcg, "memory.peak") < MB(29)) + goto cleanup; + + if (cg_read_long(memcg, "memory.swap.peak") < MB(29)) + goto cleanup; + if (cg_run(memcg, alloc_anon_50M_check_swap, (void *)MB(30))) goto cleanup;
@@ -869,9 +1062,29 @@ static int test_memcg_swap_max(const char *root) if (max <= 0) goto cleanup;
+ peak = cg_read_long(memcg, "memory.peak"); + if (peak < MB(29)) + goto cleanup; + + peak = cg_read_long(memcg, "memory.swap.peak"); + if (peak < MB(29)) + goto cleanup; + + peak = cg_read_long_fd(mem_peak_fd); + if (peak < MB(29)) + goto cleanup; + + peak = cg_read_long_fd(swap_peak_fd); + if (peak < MB(19)) + goto cleanup; + ret = KSFT_PASS;
cleanup: + if (close(mem_peak_fd)) + ret = KSFT_FAIL; + if (close(swap_peak_fd)) + ret = KSFT_FAIL; cg_destroy(memcg); free(memcg);
@@ -1295,7 +1508,7 @@ struct memcg_test { const char *name; } tests[] = { T(test_memcg_subtree_control), - T(test_memcg_current), + T(test_memcg_current_peak), T(test_memcg_min), T(test_memcg_low), T(test_memcg_high), @@ -1303,7 +1516,7 @@ struct memcg_test { T(test_memcg_max), T(test_memcg_reclaim), T(test_memcg_oom_events), - T(test_memcg_swap_max), + T(test_memcg_swap_max_peak), T(test_memcg_sock), T(test_memcg_oom_group_leaf_events), T(test_memcg_oom_group_parent_events),
Hello.
On Mon, Jul 29, 2024 at 10:37:43AM GMT, David Finkel davidf@vimeo.com wrote:
Extend two existing tests to cover extracting memory usage through the newly mutable memory.peak and memory.swap.peak handlers.
BTW do the tests pass for you?
I gave it a try (v6.11-rc1+your patches)
$ grep "not ok 2" -B30 test.strace
... 315 15:19:13.990351 read(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 315 15:19:13.994457 read(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 315 15:19:13.998562 read(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 315 15:19:13.998652 read(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 315 15:19:14.002759 read(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 315 15:19:14.006864 openat(AT_FDCWD, "/sys/fs/cgroup/memcg_test/memory.current", O_RDONLY) = 7 315 15:19:14.006989 read(7, "270336\n", 127) = 7 315 15:19:14.011114 close(7) = 0 315 15:19:14.015262 close(6) = 0 315 15:19:14.015448 exit_group(-1) = ? 315 15:19:14.019753 +++ exited with 255 +++ 313 15:19:14.019820 <... wait4 resumed>[{WIFEXITED(s) && WEXITSTATUS(s) == 255}], 0, NULL) = 315 313 15:19:14.019878 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=315, si_uid=0, si_status=255, si_utime=1 /* 0.01 s */, - 313 15:19:14.019926 close(3) = 0 313 15:19:14.020001 close(5) = 0 313 15:19:14.020072 close(4) = 0 313 15:19:14.024173 rmdir("/sys/fs/cgroup/memcg_test") = 0 313 15:19:14.028517 write(1, "not ok 2 test_memcg_current_peak"..., 33) = 33
grep "^315 .*read.*4096" -c test.strace 12800
Hopefully, unrelated to your changes. I ran this within initrd (rapido image) so it may be an issue how rootfs pagecache is undercharged (due to sharing?), instead of 50M, there's only ~256k.
To verify, I also tried with memory.peak patch reverted, failing differently:
... 238 15:30:29.034623 openat(AT_FDCWD, "/sys/fs/cgroup/memcg_test/memory.current", O_RDONLY) = 3 238 15:30:29.034766 read(3, "52801536\n", 127) = 9 238 15:30:29.038895 close(3) = 0 238 15:30:29.043048 openat(AT_FDCWD, "/sys/fs/cgroup/memcg_test/memory.stat", O_RDONLY) = 3 238 15:30:29.043230 read(3, "anon 52436992\nfile 0\nkernel 1105"..., 4095) = 870 238 15:30:29.047379 close(3) = 0 238 15:30:29.051491 munmap(0x7f2473600000, 52432896) = 0 238 15:30:29.058516 exit_group(0) = ? 238 15:30:29.062992 +++ exited with 0 +++ 237 15:30:29.067054 <... wait4 resumed>[{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 238 237 15:30:29.067136 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=238, si_uid=0, si_status=0, si_utime=1 /* 0.01 s */, si- 237 15:30:29.067210 openat(AT_FDCWD, "/sys/fs/cgroup/memcg_test/memory.peak", O_RDONLY) = 3 237 15:30:29.071349 read(3, "52805632\n", 127) = 9 237 15:30:29.075470 close(3) = 0 237 15:30:29.075562 openat(AT_FDCWD, "/sys/fs/cgroup/memcg_test/memory.peak", O_RDWR|O_APPEND|O_CLOEXEC) = 3 237 15:30:29.079712 openat(AT_FDCWD, "/sys/fs/cgroup/memcg_test/memory.peak", O_RDWR|O_APPEND|O_CLOEXEC) = 4 237 15:30:29.083848 openat(AT_FDCWD, "/sys/fs/cgroup/memcg_test/memory.peak", O_RDWR|O_APPEND|O_CLOEXEC) = 5 237 15:30:29.083970 write(3, "reset\n\0", 7) = -1 EINVAL (Invalid argument) 237 15:30:29.088095 close(3) = 0 237 15:30:29.092209 close(4) = 0 237 15:30:29.092295 close(5) = 0 237 15:30:29.096398 close(-1) = -1 EBADF (Bad file descriptor) 237 15:30:29.100497 rmdir("/sys/fs/cgroup/memcg_test") = 0 237 15:30:29.100760 write(1, "not ok 2 test_memcg_current_peak"..., 33) = 33
This failure makes sense but it reminded me -- could you please modify the test so that it checks write permission of memory.peak and skips the reset testing on old(er) kernels? That'd be in accordance with other cgroup selftest that maintain partial backwards compatibility when possible.
Thanks, Michal
Thanks for checking!
On Tue, Jul 30, 2024 at 11:46 AM Michal Koutný mkoutny@suse.com wrote:
Hello.
On Mon, Jul 29, 2024 at 10:37:43AM GMT, David Finkel davidf@vimeo.com wrote:
Extend two existing tests to cover extracting memory usage through the newly mutable memory.peak and memory.swap.peak handlers.
BTW do the tests pass for you?
Yeah, my tests pass when running on top of an ext2 mount. At least one of the existing tests failed when running out of tmpfs, so I've been testing with an ext2 mount in UML.
I gave it a try (v6.11-rc1+your patches)
$ grep "not ok 2" -B30 test.strace
... 315 15:19:13.990351 read(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 315 15:19:13.994457 read(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 315 15:19:13.998562 read(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 315 15:19:13.998652 read(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 315 15:19:14.002759 read(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 315 15:19:14.006864 openat(AT_FDCWD, "/sys/fs/cgroup/memcg_test/memory.current", O_RDONLY) = 7 315 15:19:14.006989 read(7, "270336\n", 127) = 7 315 15:19:14.011114 close(7) = 0 315 15:19:14.015262 close(6) = 0 315 15:19:14.015448 exit_group(-1) = ? 315 15:19:14.019753 +++ exited with 255 +++ 313 15:19:14.019820 <... wait4 resumed>[{WIFEXITED(s) && WEXITSTATUS(s) == 255}], 0, NULL) = 315 313 15:19:14.019878 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=315, si_uid=0, si_status=255, si_utime=1 /* 0.01 s */, - 313 15:19:14.019926 close(3) = 0 313 15:19:14.020001 close(5) = 0 313 15:19:14.020072 close(4) = 0 313 15:19:14.024173 rmdir("/sys/fs/cgroup/memcg_test") = 0 313 15:19:14.028517 write(1, "not ok 2 test_memcg_current_peak"..., 33) = 33
grep "^315 .*read.*4096" -c test.strace 12800
Hopefully, unrelated to your changes. I ran this within initrd (rapido image) so it may be an issue how rootfs pagecache is undercharged (due to sharing?), instead of 50M, there's only ~256k.
To verify, I also tried with memory.peak patch reverted, failing differently:
... 238 15:30:29.034623 openat(AT_FDCWD, "/sys/fs/cgroup/memcg_test/memory.current", O_RDONLY) = 3 238 15:30:29.034766 read(3, "52801536\n", 127) = 9 238 15:30:29.038895 close(3) = 0 238 15:30:29.043048 openat(AT_FDCWD, "/sys/fs/cgroup/memcg_test/memory.stat", O_RDONLY) = 3 238 15:30:29.043230 read(3, "anon 52436992\nfile 0\nkernel 1105"..., 4095) = 870 238 15:30:29.047379 close(3) = 0 238 15:30:29.051491 munmap(0x7f2473600000, 52432896) = 0 238 15:30:29.058516 exit_group(0) = ? 238 15:30:29.062992 +++ exited with 0 +++ 237 15:30:29.067054 <... wait4 resumed>[{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 238 237 15:30:29.067136 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=238, si_uid=0, si_status=0, si_utime=1 /* 0.01 s */, si- 237 15:30:29.067210 openat(AT_FDCWD, "/sys/fs/cgroup/memcg_test/memory.peak", O_RDONLY) = 3 237 15:30:29.071349 read(3, "52805632\n", 127) = 9 237 15:30:29.075470 close(3) = 0 237 15:30:29.075562 openat(AT_FDCWD, "/sys/fs/cgroup/memcg_test/memory.peak", O_RDWR|O_APPEND|O_CLOEXEC) = 3 237 15:30:29.079712 openat(AT_FDCWD, "/sys/fs/cgroup/memcg_test/memory.peak", O_RDWR|O_APPEND|O_CLOEXEC) = 4 237 15:30:29.083848 openat(AT_FDCWD, "/sys/fs/cgroup/memcg_test/memory.peak", O_RDWR|O_APPEND|O_CLOEXEC) = 5 237 15:30:29.083970 write(3, "reset\n\0", 7) = -1 EINVAL (Invalid argument) 237 15:30:29.088095 close(3) = 0 237 15:30:29.092209 close(4) = 0 237 15:30:29.092295 close(5) = 0 237 15:30:29.096398 close(-1) = -1 EBADF (Bad file descriptor) 237 15:30:29.100497 rmdir("/sys/fs/cgroup/memcg_test") = 0 237 15:30:29.100760 write(1, "not ok 2 test_memcg_current_peak"..., 33) = 33
This failure makes sense but it reminded me -- could you please modify the test so that it checks write permission of memory.peak and skips the reset testing on old(er) kernels? That'd be in accordance with other cgroup selftest that maintain partial backwards compatibility when possible.
Sure, I'll add that to the tests in a bit.
Thanks, Michal
linux-kselftest-mirror@lists.linaro.org