On Wed, Nov 29, 2023 at 8:21 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
On Mon, Nov 27, 2023 at 03:46:00PM -0800, Nhat Pham wrote:
Currently, we only shrink the zswap pool when the user-defined limit is hit. This means that if we set the limit too high, cold data that are unlikely to be used again will reside in the pool, wasting precious memory. It is hard to predict how much zswap space will be needed ahead of time, as this depends on the workload (specifically, on factors such as memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap, that is initiated when there is memory pressure. The shrinker does not have any parameter that must be tuned by the user, and can be opted in or out on a per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent overshrinking (i.e. evicting warm pages that might be refaulted into memory), we build in the following heuristics:
- Estimate the number of warm pages residing in zswap, and attempt to protect this region of the zswap LRU.
- Scale the number of freeable objects by an estimate of the memory saving factor. The better zswap compresses the data, the fewer pages we will evict to swap (as we will otherwise incur IO for relatively small memory saving).
- During reclaim, if the shrinker encounters a page that is also being brought into memory, the shrinker will cautiously terminate its shrinking action, as this is a sign that it is touching the warmer region of the zswap LRU.
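The first two heuristics can be sketched roughly as follows. This is a userspace model under stated assumptions, not the code from the patch: `struct zswap_lru_model`, `model_freeable()` and all field names are hypothetical, and the scaling is simply taken to be compressed size over uncompressed size.

```c
#include <stdint.h>

/*
 * Hypothetical userspace model of the count-side heuristics. The names
 * are illustrative only and are not the patch's identifiers.
 */
struct zswap_lru_model {
	uint64_t lru_size;         /* entries on this memcg/node zswap LRU */
	uint64_t nr_protected;     /* estimated warm entries to leave alone */
	uint64_t compressed_pages; /* memory zswap actually uses for the data */
	uint64_t stored_pages;     /* uncompressed size of the stored data */
};

/* Number of entries this model would report as freeable. */
static uint64_t model_freeable(const struct zswap_lru_model *m)
{
	uint64_t unprotected;

	/* Nothing stored, or the whole list is considered warm. */
	if (m->stored_pages == 0 || m->lru_size <= m->nr_protected)
		return 0;

	unprotected = m->lru_size - m->nr_protected;

	/*
	 * Scale by compressed/uncompressed: the better the compression,
	 * the smaller the ratio and the fewer entries are offered for
	 * writeback. Assumes the product fits in 64 bits.
	 */
	return unprotected * m->compressed_pages / m->stored_pages;
}
```

With 1000 entries, 200 of them protected, and a 4:1 compression ratio, this model offers 200 entries instead of 800.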
As a proof of concept, we ran the following synthetic benchmark: build the Linux kernel in a memory-limited cgroup, and allocate some cold data in tmpfs to see if the shrinker could write them out and improve the overall performance. Depending on the amount of cold data generated, we observe from 14% to 35% reduction in kernel CPU time used in the kernel builds.
I think this is a really important piece of functionality for zswap.
Memory compression has been around and in use for a long time, but the question of zswap vs swap sizing has been in the room since the hybrid mode was first proposed. Because depending on the reuse patterns, putting zswap with a static size limit in front of an existing swap file can be a net negative for performance as it consumes more memory.
It's great to finally see a solution to this which makes zswap *much* more general purpose. And something that distributions might want to turn on per default when swap is configured.
Actually to the point where I think there should be a config option to enable the shrinker per default. Maybe not right away, but in a few releases when this feature has racked up some more production time.
Sure thingy - how does everyone feel about this?
@@ -687,6 +687,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 					&page_allocated, false);
 	if (unlikely(page_allocated))
 		swap_readpage(page, false, NULL);
+	zswap_lruvec_swapin(page);
The "lruvec" in the name vs the page parameter is a bit odd. zswap_page_swapin() would be slightly better, but it still also sounds like it would cause an actual swapin of some sort.
zswap_record_swapin()?
Hmm that sounds good to me. I'm not very good with naming, if that's not already evident :)
@@ -520,6 +575,95 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
 	return entry;
 }
+/*********************************
+* shrinker functions
+**********************************/
+static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
+				       spinlock_t *lock, void *arg);
+
+static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
+		struct shrink_control *sc)
+{
+	struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
+	unsigned long shrink_ret, nr_protected, lru_size;
+	struct zswap_pool *pool = shrinker->private_data;
+	bool encountered_page_in_swapcache = false;
+
+	nr_protected =
+		atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
+	lru_size = list_lru_shrink_count(&pool->list_lru, sc);
+
+	/*
+	 * Abort if the shrinker is disabled or if we are shrinking into the
+	 * protected region.
+	 */
+	if (!zswap_shrinker_enabled || nr_protected >= lru_size - sc->nr_to_scan) {
+		sc->nr_scanned = 0;
+		return SHRINK_STOP;
+	}
I'm scratching my head at the protection check. zswap_shrinker_count() already takes protection into account, so sc->nr_to_scan should only be what is left on the list after excluding the protected area.
Do we even get here if the whole list is protected? Is this to protect against concurrent shrinking of the list through multiple shrinkers or swapins? If so, a comment would be nice :)
Yep, this is mostly for concurrent shrinkers. Please fact-check me, but IIUC if we have too many reclaimers all calling upon the zswap shrinker (before any of them can make substantial progress), we can have a situation where the total number of objects freed by the reclaimers will eat into the protection area of the zswap LRU (even if the number of freeable objects is scaled down by the compression ratio, and further scaled down internally in the shrinker/vmscan code). I've observed this tendency when there are a) a lot of worker threads in my benchmark and b) memory pressure. This is a crude/racy way to alleviate the issue.
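To make the scenario concrete, here is a small single-threaded model of two reclaimers that both sampled the freeable count before either made progress. It is only a sketch: `struct lru_model` and `model_scan()` are hypothetical names, the interleaving is simulated sequentially, and the comparison is written as an addition so that the model cannot wrap around when `nr_to_scan` exceeds the list size (the hunk above uses the subtraction form).

```c
/* Hypothetical model of one memcg/node zswap LRU. */
struct lru_model {
	unsigned long lru_size;     /* entries currently on the LRU */
	unsigned long nr_protected; /* entries we want to keep in zswap */
};

/*
 * Model of the scan-side guard. Returns the number of entries written
 * back, or 0 when the scan is aborted (standing in for SHRINK_STOP).
 */
static unsigned long model_scan(struct lru_model *lru,
				unsigned long nr_to_scan)
{
	/*
	 * Equivalent to nr_protected >= lru_size - nr_to_scan as long as
	 * nr_to_scan <= lru_size, but safe against unsigned wraparound.
	 */
	if (lru->nr_protected + nr_to_scan >= lru->lru_size)
		return 0;

	lru->lru_size -= nr_to_scan;
	return nr_to_scan;
}
```

Starting from 1000 entries with 600 protected, two reclaimers that were each told to scan 300 behave differently: the first one proceeds and leaves 700 entries, while the second one is stopped because another 300 would cut into the protected 600.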
I think this is actually a wider problem than just zswap and zswap shrinker - we need better reclaimer throttling logic IMO. Perhaps this check should be done higher up the stack - something along the lines of having each reclaimer "register" its intention (number of objects it wants to reclaim) to a particular shrinker, allowing the shrinker to deny a reclaimer if there is already a strong reclaim driving force. Or some other throttling heuristics based on the number of freeable objects and the reclaimer registration data.
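Very roughly, the registration idea could look like the sketch below. Nothing like this exists in the patch or, as far as I know, in the shrinker core; `struct reclaim_budget`, `reclaim_register()` and `reclaim_unregister()` are made-up names, and C11 atomics in userspace stand in for whatever primitives the kernel would actually use.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-shrinker bookkeeping of in-flight reclaim requests. */
struct reclaim_budget {
	atomic_ulong pending; /* objects already claimed by reclaimers */
};

/*
 * Try to reserve nr objects out of the currently freeable ones.
 * Returns false (the reclaimer is denied) if the reservation would push
 * the total of in-flight requests past the freeable count.
 */
static bool reclaim_register(struct reclaim_budget *b, unsigned long nr,
			     unsigned long freeable)
{
	unsigned long cur = atomic_load(&b->pending);

	do {
		if (nr > freeable || cur > freeable - nr)
			return false;
		/* On failure, cur is reloaded and the check is repeated. */
	} while (!atomic_compare_exchange_weak(&b->pending, &cur, cur + nr));

	return true;
}

/* Release a reservation once the reclaimer has finished scanning. */
static void reclaim_unregister(struct reclaim_budget *b, unsigned long nr)
{
	atomic_fetch_sub(&b->pending, nr);
}
```

A reclaimer would call `reclaim_register()` before scanning and `reclaim_unregister()` afterwards, so a burst of concurrent reclaimers gets turned away up front instead of each one discovering the problem mid-scan.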
Otherwise, this looks great to me!
Just nitpicks, no show stoppers:
Acked-by: Johannes Weiner <hannes@cmpxchg.org>