On Wed, Jul 17, 2024 at 1:04 PM Johannes Weiner hannes@cmpxchg.org wrote:
On Tue, Jul 16, 2024 at 06:44:11AM -1000, Tejun Heo wrote:
Hello,
On Tue, Jul 16, 2024 at 03:48:17PM +0200, Michal Hocko wrote: ...
This behavior is particularly useful for work scheduling systems that need to track memory usage of worker processes/cgroups per work item. Since memory can't be squeezed like CPU can (the OOM killer has opinions), these systems need to track peak memory usage to compute system/container fullness when bin-packing work items.
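For illustration only, here is a minimal sketch of that per-work-item pattern against today's read-only interface; the worker cgroup path is hypothetical and the helper is made up:

#include <stdio.h>

/* Read memory.peak for a worker cgroup, e.g. after each work item. */
static long read_peak(const char *cgroup)
{
        char path[256];
        long peak = -1;
        FILE *f;

        snprintf(path, sizeof(path), "%s/memory.peak", cgroup);
        f = fopen(path, "r");
        if (!f)
                return -1;
        if (fscanf(f, "%ld", &peak) != 1)
                peak = -1;
        fclose(f);
        return peak;
}

int main(void)
{
        /* Hypothetical worker cgroup path. */
        long peak = read_peak("/sys/fs/cgroup/workers/worker0");

        if (peak >= 0)
                printf("peak usage: %ld bytes\n", peak);
        return 0;
}

Without a reset, the only way to start a fresh measurement today is to tear down and recreate the worker cgroup between items.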
Swap still has a bad rep, but there's nothing drastically worse about it than page cache, i.e. if you're under memory pressure, you get thrashing one way or another. If there's no swap, the system is just memlocking anon memory even when it is a lot colder than page cache, so I'm skeptical that no swap + mostly anon + kernel OOM kills is a good strategy in general, especially given that system behavior is not very predictable under OOM conditions.
As mentioned down the email thread, I consider the usefulness of the peak value rather limited. It is misleading when memory is reclaimed. But fundamentally I do not oppose unifying the write behavior to reset values.
The removal of resets was intentional. The problem was that it wasn't clear who owned those counters, and there was no way of telling who reset what when. It was easy to accidentally end up with multiple entities that each thought they could get timed measurements by resetting.
So, in general, I don't think this is a great idea. memory.peak has the shortcoming that its meaningfulness declines quickly over time. This is expected; the rationale for adding memory.peak, IIRC, was that it was difficult to tell the memory usage of a short-lived cgroup.
If we want to allow peak measurement over time periods, I wonder whether we could do something similar to pressure triggers, i.e. let users register watchers so that each user can define their own watch periods. This is more involved but more useful and less error-inducing than adding a reset to a single global counter.
Johannes, what do you think?
I'm also not a fan of the ability to reset globally.
I seem to remember a scheme we discussed some time ago to do local state tracking without adding overhead to the page_counter fast path. The new data that needs to be tracked is pc->local_peak (in the page_counter) and fd->peak (in the watcher's file state):
1. Usage peak is tracked in pc->watermark, and now also in pc->local_peak.
2. Somebody opens memory.peak. Initialize fd->peak = -1.
3. If they write, set fd->peak = pc->local_peak = usage.
4. Usage grows.
5. They read(). A conventional reader has fd->peak == -1, so we return pc->watermark. If the fd has been written to, return max(fd->peak, pc->local_peak).
6. Usage drops.
7. New watcher opens and writes. Bring up all existing watchers' fd->peak (that aren't -1) to pc->local_peak *iff* the latter is bigger. Then set the new fd->peak = pc->local_peak = current usage, as in 3.
8. See 5. again for read() from each watcher.
This way, all fds can start tracking new local peaks at arbitrary points with a write(). The operation in the charging fast path stays cheap. The write() is O(existing watchers), which seems reasonable. And it's fully backward compatible with conventional open() + read() users.
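To illustrate the bookkeeping, here is a small standalone C model of the steps above. It is not kernel code: the struct and function names are made up, and locking and memcg plumbing are omitted.

#include <stdio.h>

#define NO_LOCAL_PEAK   (-1L)   /* fd has never been written to */

struct pc_model {
        long usage;
        long watermark;         /* global high-water mark (pc->watermark) */
        long local_peak;        /* shared baseline for fd-local watchers */
};

struct fd_model {
        long peak;              /* NO_LOCAL_PEAK until the first write() */
};

/* Charging fast path: only two compares are added. */
static void charge(struct pc_model *pc, long pages)
{
        pc->usage += pages;
        if (pc->usage > pc->watermark)
                pc->watermark = pc->usage;
        if (pc->usage > pc->local_peak)
                pc->local_peak = pc->usage;
}

static void uncharge(struct pc_model *pc, long pages)
{
        pc->usage -= pages;
}

/* write() (steps 3 and 7): preserve existing watchers' peaks, then reset. */
static void peak_write(struct pc_model *pc, struct fd_model *fds, int nr_fds,
                       struct fd_model *me)
{
        for (int i = 0; i < nr_fds; i++)
                if (fds[i].peak != NO_LOCAL_PEAK && fds[i].peak < pc->local_peak)
                        fds[i].peak = pc->local_peak;
        me->peak = pc->local_peak = pc->usage;
}

/* read() (step 5): conventional readers still see the global watermark. */
static long peak_read(struct pc_model *pc, struct fd_model *me)
{
        if (me->peak == NO_LOCAL_PEAK)
                return pc->watermark;
        return me->peak > pc->local_peak ? me->peak : pc->local_peak;
}

int main(void)
{
        struct pc_model pc = { .usage = 100, .watermark = 100, .local_peak = 100 };
        struct fd_model fds[2] = { { NO_LOCAL_PEAK }, { NO_LOCAL_PEAK } };

        peak_write(&pc, fds, 2, &fds[0]);       /* watcher 0 starts a window */
        charge(&pc, 50);                        /* usage grows to 150 */
        uncharge(&pc, 70);                      /* usage drops to 80 */
        peak_write(&pc, fds, 2, &fds[1]);       /* watcher 1 starts a new window */

        printf("watcher 0: %ld\n", peak_read(&pc, &fds[0]));    /* 150 */
        printf("watcher 1: %ld\n", peak_read(&pc, &fds[1]));    /* 80 */
        return 0;
}

Note how watcher 0 still reads 150 after the usage drop and after watcher 1's reset, while watcher 1 only sees the peak since its own write.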
I spent some time today attempting to implement this. Here's a branch on github that compiles, and I think is close to what you described, but is definitely still a WIP:
https://github.com/torvalds/linux/compare/master...dfinkel:linux:memcg2_memo...
Since there seems to be significant agreement that this approach is the better long-term kernel interface, if that holds, I can factor out some of the changes tomorrow so it supports both memory.peak and memory.swap.peak, fix the tests, and clean up any remaining style issues.
Also, if there are opinions on whether cgroup_lock is a reasonable way of handling this synchronization, or whether I should add a more appropriate spinlock or mutex to either the page_counter or the memcg, I'm all ears.
(I can mail the WIP change as-is if that's preferred to GitHub.)
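For reference, a rough userspace sketch of how a watcher could use the proposed write-to-reset semantics from the numbered scheme above. The write payload and cgroup path are placeholders, and none of this is a committed interface; with the current kernel, memory.peak is read-only and this open() would fail.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        char buf[64];
        ssize_t n;
        /* Hypothetical cgroup path; assumes a writable memory.peak. */
        int fd = open("/sys/fs/cgroup/workers/worker0/memory.peak", O_RDWR);

        if (fd < 0)
                return 1;

        /* Start a local measurement window (step 3 of the scheme). */
        if (write(fd, "reset", strlen("reset")) < 0) {
                close(fd);
                return 1;
        }

        /* ... run one work item ... */

        /* Read the peak observed since our write (step 5 of the scheme). */
        lseek(fd, 0, SEEK_SET);
        n = read(fd, buf, sizeof(buf) - 1);
        if (n > 0) {
                buf[n] = '\0';
                printf("peak since reset: %s", buf);
        }
        close(fd);
        return 0;
}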