Re: [PATCH v3 2/5] zswap: make shrinking memcg-aware

19 Oct 2023

On Wed, Oct 18, 2023 at 4:47 PM Nhat Pham <nphamcs@gmail.com> wrote:
...
On Wed, Oct 18, 2023 at 4:20 PM Yosry Ahmed <yosryahmed@google.com> wrote:
...
On Tue, Oct 17, 2023 at 4:21 PM Nhat Pham <nphamcs@gmail.com> wrote:
...
From: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Currently, we only have a single global LRU for zswap. This makes it
impossible to perform workload-specific shrinking - an memcg cannot
determine which pages in the pool it owns, and often ends up writing
pages from other memcgs. This issue has been previously observed in
practice and mitigated by simply disabling memcg-initiated shrinking:
https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u
This patch fully resolves the issue by replacing the global zswap LRU
with memcg- and NUMA-specific LRUs, and modifying the reclaim logic:
a) When a store attempt hits an memcg limit, it now triggers a
   synchronous reclaim attempt that, if successful, allows the new
   hotter page to be accepted by zswap.
b) If the store attempt instead hits the global zswap limit, it will
   trigger an asynchronous reclaim attempt, in which an memcg is
   selected for reclaim in a round-robin-like fashion.
Could you explain the rationale behind the difference in behavior here
between the global limit and the memcg limit?
The global limit hit reclaim behavior was previously asynchronous too.
We just added the round-robin part because now the zswap LRU is
cgroup-aware :)
For the cgroup limit hit, however, we cannot make it asynchronous,
as it is a bit hairy to add a per-cgroup shrink_work. So, we just
perform the reclaim synchronously.
The question is whether it makes sense to make the global limit
reclaim synchronous too. That is a task of its own IMO.
Let's add such context to the commit log, and perhaps an XXX comment
in the code asking whether we should consider doing the reclaim
synchronously for the global limit too.
...
(FWIW, this somewhat mirrors the direct reclaimer vs. kswapd
story to me, but don't quote me too hard on this).
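
To make the two paths discussed above concrete, here is a toy userspace model. It is only a sketch: it is not kernel code, every toy_* name is made up for illustration, and rejecting the store while global reclaim is deferred is an assumption of the model rather than something stated in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_MEMCGS 3

/* Hypothetical toy types; none of these names exist in mm/zswap.c. */
struct toy_memcg {
	size_t stored;	/* pages this memcg currently has in the toy pool */
	size_t limit;	/* per-memcg limit, in pages */
};

struct toy_pool {
	struct toy_memcg memcgs[TOY_NR_MEMCGS];
	size_t global_limit;		/* pool-wide limit, in pages */
	size_t next_memcg;		/* round-robin cursor for global reclaim */
	bool shrink_work_pending;	/* stands in for a queued work item */
};

static size_t toy_total(const struct toy_pool *pool)
{
	size_t total = 0;

	for (size_t i = 0; i < TOY_NR_MEMCGS; i++)
		total += pool->memcgs[i].stored;
	return total;
}

/* "Write back" one page owned by this memcg; false if it owns nothing. */
static bool toy_shrink_memcg(struct toy_memcg *memcg)
{
	if (memcg->stored == 0)
		return false;
	memcg->stored--;
	return true;
}

/*
 * Deferred global reclaim: visit memcgs round-robin, starting at the
 * cursor, until one page has been written back.
 */
static bool toy_shrink_worker(struct toy_pool *pool)
{
	pool->shrink_work_pending = false;
	for (size_t tried = 0; tried < TOY_NR_MEMCGS; tried++) {
		struct toy_memcg *memcg = &pool->memcgs[pool->next_memcg];

		pool->next_memcg = (pool->next_memcg + 1) % TOY_NR_MEMCGS;
		if (toy_shrink_memcg(memcg))
			return true;
	}
	return false;
}

/* Returns true if the page was accepted into the toy pool. */
static bool toy_store(struct toy_pool *pool, size_t idx)
{
	struct toy_memcg *memcg = &pool->memcgs[idx];

	/* (a) memcg limit hit: reclaim synchronously, from this memcg only. */
	if (memcg->stored >= memcg->limit && !toy_shrink_memcg(memcg))
		return false;

	/* (b) global limit hit: defer reclaim (assumption: reject the store). */
	if (toy_total(pool) >= pool->global_limit) {
		pool->shrink_work_pending = true;
		return false;
	}

	memcg->stored++;
	return true;
}
```

In this model a store from a memcg at its own limit pays for reclaim inline and then succeeds, while a store that hits the global limit only flags work for later; a caller would run toy_shrink_worker() afterwards to stand in for the worker thread.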
[..]
...
...
...
        /* Hold a reference to prevent a free during writeback */
        zswap_entry_get(entry);
        spin_unlock(&tree->lock);

-       ret = zswap_writeback_entry(entry, tree);
+       writeback_result = zswap_writeback_entry(entry, tree);

        spin_lock(&tree->lock);
-       if (ret) {
-               /* Writeback failed, put entry back on LRU */
-               spin_lock(&pool->lru_lock);
-               list_move(&entry->lru, &pool->lru);
-               spin_unlock(&pool->lru_lock);
+       if (writeback_result) {
+               zswap_reject_reclaim_fail++;
+               memcg = get_mem_cgroup_from_entry(entry);
+               spin_lock(lock);
+               /* we cannot use zswap_lru_add here, because it increments node's lru count */
+               list_lru_putback(&entry->pool->list_lru, item, entry_to_nid(entry), memcg);
+               spin_unlock(lock);
+               mem_cgroup_put(memcg);
+               ret = LRU_RETRY;
                goto put_unlock;
        }
+       zswap_written_back_pages++;

Why is this moved here from zswap_writeback_entry()? Also why is
zswap_reject_reclaim_fail incremented here instead of inside
zswap_writeback_entry()?
Domenico should know this better than me, but my understanding
is that moving it here protects concurrent modifications of
zswap_written_back_pages with the tree lock.
Was writeback single-threaded in the past? This counter is non-atomic,
and doesn't seem to be protected by any locks...
There definitely can be concurrent stores now though - with
a synchronous reclaim from cgroup-limit hit and another
from the old shrink worker.
(and with the new zswap shrinker, concurrent reclaim is
the expectation!)
The comment above the stats definition states that they are left
unprotected purposefully. If we want to fix that let's do it
separately. If this patch makes it significantly worse such that it
would cause a regression, let's at least do it in a separate patch.
The diff here is too large already.
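
For reference, if exact counts were ever wanted, one userspace analogue is an atomic increment. The sketch below uses C11 atomics and pthreads; the toy_* names and the thread and iteration counts are made up, and this is not what the patch or the existing zswap stats do (those are deliberately left unprotected, as noted above).

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_NR_THREADS	4
#define TOY_INCREMENTS	100000

/* Hypothetical stand-in for a stats counter shared by concurrent reclaimers. */
static atomic_uint_fast64_t toy_written_back_pages;

/* Each thread plays a reclaimer that bumps the counter once per writeback. */
static void *toy_writeback_thread(void *arg)
{
	(void)arg;
	for (int i = 0; i < TOY_INCREMENTS; i++)
		atomic_fetch_add_explicit(&toy_written_back_pages, 1,
					  memory_order_relaxed);
	return NULL;
}

/* Run all reclaimers concurrently and return the final counter value. */
static uint64_t toy_run_concurrent_writeback(void)
{
	pthread_t threads[TOY_NR_THREADS];

	atomic_store(&toy_written_back_pages, 0);
	for (int i = 0; i < TOY_NR_THREADS; i++) {
		if (pthread_create(&threads[i], NULL, toy_writeback_thread, NULL))
			abort();
	}
	for (int i = 0; i < TOY_NR_THREADS; i++)
		pthread_join(threads[i], NULL);

	return atomic_load(&toy_written_back_pages);
}
```

Because each increment is a single atomic read-modify-write, no updates are lost regardless of how the threads interleave, so toy_run_concurrent_writeback() returns TOY_NR_THREADS * TOY_INCREMENTS. A plain `counter++` from several threads gives no such guarantee. On older toolchains this may need to be built with -pthread.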
...
zswap_reject_reclaim_fail was previously incremented in
shrink_worker I think. We need it to be incremented
for the shrinker as well, so might as well move it here.
Wouldn't moving it inside zswap_writeback_entry() near incrementing
zswap_written_back_pages make it easier to follow?
