On Fri, 6 Dec 2024 11:57:55 -0800 Yuanchu Xie yuanchu@google.com wrote:
Thanks for the response Johannes. Some replies inline.
On Tue, Nov 26, 2024 at 11:26 PM Johannes Weiner hannes@cmpxchg.org wrote:
On Tue, Nov 26, 2024 at 06:57:19PM -0800, Yuanchu Xie wrote:
This patch series provides workingset reporting of user pages in lruvecs, whose coldness can be tracked via accessed bits and fd references. However, the concept of a workingset applies generically to all types of memory, including kernel slab caches, discardable userspace caches (databases), or CXL.mem. Therefore, data sources might come from slab shrinkers, device drivers, or userspace. Another interesting idea is a hugepage workingset, so that we can measure the proportion of hugepages backing cold memory. However, with architectures like arm, there may be too many hugepage sizes, leading to a combinatorial explosion when exporting stats to userspace. Nonetheless, the kernel should provide a set of workingset interfaces that is generic enough to accommodate the various use cases, and extensible to potential future ones.
Doesn't DAMON already provide this information?
CCing SJ.
Thanks for the CC. DAMON was really good at visualizing the memory access frequencies last time I tried it out!
Thank you for this kind acknowledgement, Yuanchu!
For server use cases, DAMON would benefit from integrations with cgroups. The key then would be a standard interface for exporting a cgroup's working set to the user.
I can see two ways to make DAMON support cgroups for now. The first is making another DAMON operations set implementation for cgroups. I shared a rough idea for this before, probably at the kernel summit, but I haven't had a chance to prioritize it so far. Please let me know if you need more details. The second is extending DAMOS filters to provide more detailed statistics per DAMON region, and adding another DAMOS action that does nothing but account those detailed statistics. Using the new DAMOS action, users would be able to know how much of specific DAMON-found regions is filtered out by the given filter. Because we have a DAMOS filter type for cgroups, we can know how much of the workingset (or, warm memory) belongs to a specific cgroup. This can be applied not only to cgroups, but to any DAMOS filter type that exists (e.g., anonymous page, young page).
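To make the second way concrete, here is a minimal sketch (in Python, purely illustrative) of how a userspace tool might derive a per-cgroup workingset share from a do-nothing DAMOS scheme's statistics plus a cgroup filter. The sysfs path layout mirrors the DAMON sysfs interface; the exact stat file name for filter-passed bytes, and treating filter-passed bytes divided by tried bytes as the cgroup's workingset share, are assumptions of this sketch rather than settled interface:

```python
import os

# Assumed sysfs location of the do-nothing ("stat") DAMOS scheme; the
# kdamond/context/scheme indices depend on how DAMON was configured.
SCHEME_DIR = "/sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0"

def read_stat(name, base=SCHEME_DIR):
    """Read one integer scheme stat (e.g. 'sz_tried') from sysfs."""
    with open(os.path.join(base, "stats", name)) as f:
        return int(f.read())

def workingset_share(sz_tried, sz_filter_passed):
    """Fraction of DAMON-found warm bytes that passed the cgroup filter,
    i.e. the share of the workingset belonging to that cgroup."""
    if sz_tried == 0:
        return 0.0
    return sz_filter_passed / sz_tried
```

A monitor would poll `read_stat("sz_tried")` and the filter-passed counter periodically and feed the ratio into whatever dashboard or policy consumes per-cgroup workingset data.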
I believe the second way is simpler to implement while providing information that is sufficient for most possible use cases. I was planning to do this anyway.
It would be good to have something that works across the different backing implementations: DAMON, MGLRU, or the active/inactive LRU.
I think we can do this using the filter statistics, with new filter types. For example, we could add a new DAMOS filter that filters a page based on which range of MGLRU generations it falls in, or on whether it belongs to the active or inactive LRU lists.
Use cases
[...]
Access frequency is only half the picture. Whether you need to keep memory with a given frequency resident depends on the speed of the backing device.
[...]
Benchmarks
Ghait Ouled Amar Ben Cheikh has implemented a simple policy and run the Linux compile and redis benchmarks from openbenchmarking.org. The policy and runner are referred to as WMO (Workload Memory Optimization). The results were based on v3 of the series, but v4 doesn't change the core of the working set reporting and just adds the ballooning counterpart.
The timed Linux kernel compilation benchmark shows improvements in peak memory usage with a policy of "swap out all bytes colder than 10 seconds every 40 seconds". A swapfile is configured on SSD.
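The "swap out all bytes colder than 10 seconds every 40 seconds" policy could be sketched roughly as below. This is purely illustrative: the (age_seconds, nr_bytes) histogram format stands in for whatever the working set reporting interface actually exposes, while `memory.reclaim` is the real cgroup v2 knob (Linux 5.19+) a policy like this would drive:

```python
import time

COLD_THRESHOLD_SECS = 10
INTERVAL_SECS = 40

def cold_bytes(histogram, threshold_secs=COLD_THRESHOLD_SECS):
    """Sum bytes in all age bins at or above the coldness threshold.

    `histogram` is a list of (age_seconds, nr_bytes) bins, one per
    reported page-age interval (an assumed format).
    """
    return sum(nr for age, nr in histogram if age >= threshold_secs)

def reclaim(memcg_path, nr_bytes):
    """Ask the kernel to reclaim nr_bytes from a cgroup v2 memcg via
    memory.reclaim; errors are ignored here for brevity."""
    if nr_bytes <= 0:
        return
    try:
        with open(f"{memcg_path}/memory.reclaim", "w") as f:
            f.write(str(nr_bytes))
    except OSError:
        pass
```

A runner would loop forever: read the histogram, call `reclaim(path, cold_bytes(hist))`, then `time.sleep(INTERVAL_SECS)`.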
[...]
You can do this with a recent (>2018) upstream kernel and ~100 lines of python [1]. It also works on both LRU implementations.
[1] https://github.com/facebookincubator/senpai
We use this approach in virtually the entire Meta fleet, to offload unneeded memory, estimate available capacity for job scheduling, plan future capacity needs, and provide accurate memory usage feedback to application developers.
It works over a wide variety of CPU and storage configurations with no specific tuning.
The paper I referenced above provides a detailed breakdown of how it all works together.
I would be curious to see a more in-depth comparison to the prior art in this space. At first glance, your proposal seems more complex and less robust/versatile, at least for offloading and capacity gauging.
We have implemented TMO PSI-based proactive reclaim and compared it to a kstaled-based reclaimer (reclaiming based on a 2-minute working set and refaults). The PSI-based reclaimer was able to save more memory, but it also caused spikes of refaults and a much higher decompression rate. Overall the test workloads had better performance with the kstaled-based reclaimer. The conclusion was that it is a trade-off.
I agree it is only half of the picture, and there can be a tradeoff. Motivated by those previous works, DAMOS provides PSI-based aggressiveness auto-tuning to use both signals.
I do agree there's not a good in-depth comparison with prior art though.
I would be more than happy to help with comparison work against DAMON's current implementation and future plans, and with any possible collaborations.
Thanks, SJ