Hello,
On Tue, Jul 16, 2024 at 03:48:17PM +0200, Michal Hocko wrote: ...
This behavior is particularly useful for work scheduling systems that need to track memory usage of worker processes/cgroups per-work-item. Since memory can't be squeezed like CPU can (the OOM-killer has opinions), these systems need to track the peak memory usage to compute system/container fullness when binpacking workitems.
Swap still has bad reps but there's nothing drastically worse about it than page cache. ie. If you're under memory pressure, you get thrashing one way or another. If there's no swap, the system is just memlocking anon memory even when they are a lot colder than page cache, so I'm skeptical that no swap + mostly anon + kernel OOM kills is a good strategy in general especially given that the system behavior is not very predictable under OOM conditions.
As mentioned down the email thread, I consider usefulness of peak value rather limited. It is misleading when memory is reclaimed. But fundamentally I do not oppose to unifying the write behavior to reset values.
The removal of resets was intentional. The problem was that it wasn't clear who owned those counters and there's no way of telling who reset what when. It was easy to accidentally end up with multiple entities that think they can get timed measurement by resetting.
So, in general, I don't think this is a great idea. There are shortcomings to how memory.peak behaves in that its meaningfulness quickly declines over time. This is expected and the rationale behind adding memory.peak, IIRC, was that it was difficult to tell the memory usage of a short-lived cgroup.
If we want to allow peak measurement of time periods, I wonder whether we could do something similar to pressure triggers - ie. let users register watchers so that each user can define their own watch periods. This is more involved but more useful and less error-inducing than adding reset to a single counter.
Johannes, what do you think?
Thanks.