Thanks for the response, Johannes. Some replies inline.
On Tue, Nov 26, 2024 at 11:26 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Tue, Nov 26, 2024 at 06:57:19PM -0800, Yuanchu Xie wrote:
> > This patch series provides workingset reporting of user pages in lruvecs, whose coldness can be tracked by accessed bits and fd references. However, the concept of workingset applies generically to all types of memory, which could be kernel slab caches, discardable userspace caches (databases), or CXL.mem. Therefore, data sources might come from slab shrinkers, device drivers, or userspace. Another interesting idea might be hugepage workingset, so that we can measure the proportion of hugepages backing cold memory. However, with architectures like arm, there may be too many hugepage sizes, leading to a combinatorial explosion when exporting stats to userspace. Nonetheless, the kernel should provide a set of workingset interfaces that is generic enough to accommodate the various use cases, and extensible to potential future use cases.
> Doesn't DAMON already provide this information?
> CCing SJ.
Thanks for the CC. DAMON was really good at visualizing memory access frequencies the last time I tried it out! For server use cases, DAMON would benefit from integration with cgroups. The key, then, would be a standard interface for exporting a cgroup's working set to the user, one that works across different backing implementations: DAMON, MGLRU, or the active/inactive LRU.
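To make that concrete, a per-cgroup report could look something like the sketch below. The file name, bucket boundaries, and fields here are illustrative assumptions on my part, not a settled interface:

    $ cat /sys/fs/cgroup/job/memory.workingset.page_age
    1000 anon=1073741824 file=536870912
    10000 anon=268435456 file=805306368
    100000 anon=134217728 file=268435456

i.e. a histogram of bytes whose last access falls within each age bucket (in milliseconds). Any backend - MGLRU generations, DAMON regions, or active/inactive list scanning - could fill this in at whatever granularity it supports.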
> > Use cases
> > Job scheduling
> > On overcommitted hosts, workingset information improves efficiency and reliability by giving the job scheduler better stats on the exact memory requirements of each job. Efficiency improves because more jobs can be packed onto the same host or NUMA node; reliability improves because the scheduler can ensure each node has sufficient memory and does not enter direct reclaim or the kernel OOM path. With workingset information and job priority, userspace OOM killing or a proactive reclaim policy can kick in before the system comes under memory pressure. If the job shape is very different from the machine shape, knowing the workingset per node can also help inform page allocation policies.
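To sketch what that enables: a bin-packing check can budget against warm bytes instead of RSS. The helper below is hypothetical and assumes a per-cgroup page-age histogram like the one sketched earlier:

    # Hypothetical fit check: admit a job onto a node if its measured
    # workingset fits once memory idle for >60s is treated as headroom.
    def node_fits(job_workingset_bytes, node_capacity_bytes, resident_jobs):
        warm = sum(job.bytes_younger_than(60) for job in resident_jobs)
        return job_workingset_bytes <= node_capacity_bytes - warm

where bytes_younger_than() would sum the histogram buckets at or below the given age.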
> > Proactive reclaim
> > Workingset information allows a container manager to proactively reclaim memory while not impacting a job's performance. While PSI may provide a reactive measure of when proactive reclaim has reclaimed too much, workingset reporting allows the policy to be more accurate and flexible.
> I'm not sure about "more accurate".
> Access frequency is only half the picture. Whether you need to keep memory with a given access frequency resident depends on the speed of the backing device.
> There is memory compression; there is swap on flash; swap on crappy flash; swapfiles that share IOPS with co-located filesystems. There is zswap+writeback, where average refault speed can vary dramatically.
> You can of course offload much more to a fast zswap backend than to a swapfile on a struggling flash drive, with comparable app performance.
> So I think you'd be hard pressed to achieve a high level of accuracy in the use cases you list without taking the (often highly dynamic) cost of paging / memory transfer into account.
> There is a more detailed discussion of this in a paper we wrote on proactive reclaim/offloading - in 2.5 Hardware Heterogeneity:
> https://www.cs.cmu.edu/~dskarlat/publications/tmo_asplos22.pdf
Yes, PSI takes the paging cost into account. I'm not claiming that workingset reporting provides a superset of the information, rather that it can complement PSI. Sorry for the bad wording here.
> > Ballooning (similar to proactive reclaim)
> > The last patch of the series extends the virtio-balloon device to report the guest workingset. Balloon policies benefit from workingset information to more precisely determine the size of the memory balloon. On end-user devices where memory is scarce and overcommitted, balloon sizing across multiple VMs running on the same device can be orchestrated with workingset reports from each one. On the server side, workingset reporting allows the balloon controller to inflate the balloon without causing too much file cache to be reclaimed in the guest.
The ballooning use case is an important one. Having working set information would allow us to inflate a balloon of the right size in the guest.
> > Promotion/Demotion
> > If different mechanisms are used for promotion and demotion, workingset information can help connect the two and avoid pages being migrated back and forth. For example, consider a hot-page promotion threshold defined as a reaccess distance of N seconds (promote pages accessed more often than once every N seconds). The threshold N should be set so that ~80% (e.g.) of pages on the fast memory node pass it, and that calculation can be done with workingset reports. To be directly useful for promotion policies, the workingset report interfaces need to be extended to report hotness and gather hotness information from the devices[1]. ...
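To illustrate the threshold calculation, here is a minimal sketch using the hypothetical age-bucketed histogram from earlier, not an interface this series defines:

    # Pick N so that ~80% of fast-node bytes have a reaccess distance <= N.
    def pick_promotion_threshold(histogram, target=0.80):
        # histogram: (age_seconds, bytes) buckets sorted by ascending age;
        # each entry is the fast-node memory last accessed in that bucket.
        total = sum(nbytes for _, nbytes in histogram)
        covered = 0
        for age, nbytes in histogram:
            covered += nbytes
            if covered >= target * total:
                return age
        return histogram[-1][0]

    hist = [(1, 2 << 30), (5, 6 << 30), (15, 4 << 30), (30, 2 << 30), (60, 2 << 30)]
    print(pick_promotion_threshold(hist))  # -> 30; ~87% of bytes pass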
> > Benchmarks
> > Ghait Ouled Amar Ben Cheikh has implemented a simple policy and ran Linux compile and redis benchmarks from openbenchmarking.org. The policy and runner are referred to as WMO (Workload Memory Optimization). The results were based on v3 of the series; v4 doesn't change the core of the working set reporting and just adds the ballooning counterpart.
> > The timed Linux kernel compilation benchmark shows improvements in peak memory usage with a policy of "swap out all bytes colder than 10 seconds, every 40 seconds". A swapfile is configured on an SSD.
> > peak memory usage (with WMO): 4982.61328 MiB
> > peak memory usage (control): 9569.1367 MiB
> > peak memory reduction: 47.9%
> > Benchmark                                            | Experimental     | Control        | Experimental_Std_Dev | Control_Std_Dev
> > Timed Linux Kernel Compilation - allmodconfig (sec) | 708.486 (95.91%) | 679.499 (100%) | 0.6%                 | 0.1%
> > (seconds; fewer is better)
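The policy itself is simple enough to sketch. The page-age file below is the hypothetical format from earlier, while memory.reclaim is the existing cgroup v2 interface; the actual WMO runner is presumably more involved, but this is the shape of it:

    # Every 40s, reclaim however many bytes have been idle for >10s.
    import time

    CGROUP = "/sys/fs/cgroup/job"

    def bytes_colder_than(threshold_ms):
        cold = 0
        with open(f"{CGROUP}/memory.workingset.page_age") as f:
            for line in f:  # "age_ms anon=<bytes> file=<bytes>" (made up)
                age_ms, anon, file = line.split()
                if int(age_ms) > threshold_ms:
                    cold += int(anon.split("=")[1]) + int(file.split("=")[1])
        return cold

    while True:
        cold = bytes_colder_than(10_000)
        if cold:
            # A short reclaim makes this write fail; real code would
            # tolerate that instead of crashing.
            with open(f"{CGROUP}/memory.reclaim", "w") as f:
                f.write(str(cold))
        time.sleep(40)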
> You can do this with a recent (>2018) upstream kernel and ~100 lines of python [1]. It also works on both LRU implementations.
> [1] https://github.com/facebookincubator/senpai
> We use this approach in virtually the entire Meta fleet, to offload unneeded memory, estimate available capacity for job scheduling, plan future capacity needs, and provide accurate memory usage feedback to application developers.
> It works over a wide variety of CPU and storage configurations with no specific tuning.
> The paper I referenced above provides a detailed breakdown of how it all works together.
> I would be curious to see a more in-depth comparison to the prior art in this space. At first glance, your proposal seems more complex and less robust/versatile, at least for offloading and capacity gauging.
We have implemented TMO PSI-based proactive reclaim and compared it to a kstaled-based reclaimer (reclaiming based on a 2-minute working set and refaults). The PSI-based reclaimer was able to save more memory, but it also caused refault spikes and a much higher decompression rate. Overall, the test workloads performed better with the kstaled-based reclaimer, so the conclusion was that it's a trade-off. Since we have some app classes where we don't want to induce pressure but still want to reclaim proactively, there's a missing piece. I do agree there isn't a good in-depth comparison with prior art, though.
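For reference, the PSI-feedback loop we compared against behaves conceptually like the sketch below (a simplification in the spirit of senpai, not the actual senpai or kstaled code; the tunables are made up):

    # Probe memory.high downward while stalls stay under budget,
    # back off when pressure exceeds it.
    import time

    CGROUP = "/sys/fs/cgroup/job"
    STALL_BUDGET_US = 100  # tolerated stall growth per 6s period (made up)

    def stall_total_us():
        # cgroup v2 memory.pressure "some" line ends in total=<microseconds>
        with open(f"{CGROUP}/memory.pressure") as f:
            return int(f.readline().rsplit("total=", 1)[1])

    limit = int(open(f"{CGROUP}/memory.current").read())
    last = stall_total_us()
    while True:
        time.sleep(6)
        now = stall_total_us()
        # Grow the limit under pressure, shrink it slowly otherwise.
        limit = int(limit * (1.05 if now - last > STALL_BUDGET_US else 0.99))
        last = now
        with open(f"{CGROUP}/memory.high", "w") as f:
            f.write(str(limit))

The point of the comparison was that this only backs off after pressure (and thus refaults) has already materialized, whereas an age-based policy can avoid touching the warm set in the first place.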