 
            On 27.03.25 12:16, Jinjiang Tu wrote:
在 2025/3/26 20:46, David Hildenbrand 写道:
On 26.03.25 13:42, Jinjiang Tu wrote:
Hi,
Hi!
We notiched a 12.3% performance regression for LibMicro pwrite testcase due to commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before adding to LRU batch").
The testcase is executed as follows, and the file is tmpfs file. pwrite -E -C 200 -L -S -W -N "pwrite_t1k" -s 1k -I 500 -f $TFILE
Do we know how much that reflects real workloads? (IOW, how much should we care)
No, it's hard to say.
this testcase writes 1KB (only one page) to the tmpfs and repeats this step for many times. The Flame graph shows the performance regression comes from folio_mark_accessed() and workingset_activation().
folio_mark_accessed() is called for the same page for many times. Before this patch, each call will add the page to cpu_fbatches.activate. When the fbatch is full, the fbatch is drained and the page is promoted to active list. And then, folio_mark_accessed() does nothing in later calls.
But after this patch, the folio clear lru flags after it is added to cpu_fbatches.activate. After then, folio_mark_accessed will never call folio_activate() again due to the page is without lru flag, and the fbatch will not be full and the folio will not be marked active, later folio_mark_accessed() calls will always call workingset_activation(), leading to performance regression.
Would there be a good place to drain the LRU to effectively get that processed? (we can always try draining if the LRU flag is not set)
Maybe we could drain the search the cpu_fbatches.activate of the local cpu in __lru_cache_activate_folio()? Drain other fbatches is meaningless .
So the current behavior is that folio_mark_accessed() will end up calling folio_activate() once, and then __lru_cache_activate_folio() until the LRU was drained (which can take a looong time).
The old behavior was that folio_mark_accessed() would keep calling folio_activate() until the LRU was drained simply because it was full of "all the same pages" ?. Only *after* the LRU was drained, folio_mark_accessed() would actually not do anything (desired behavior).
So the overhead comes primarily from __lru_cache_activate_folio() searching through the list "more" repeatedly because the LRU does get drained less frequently; and it would never find it in there in this case.
So ... it used to be suboptimal before, now it's more suboptimal I guess?! :)
We'd need a way to better identify "this folio is already queued for activation". Searching that list as well would be one option, but the hole "search the list" is nasty.
Maybe we can simply set the folio as active when staging it for activation, after having cleared the LRU flag? We already do that during folio_add.
As the LRU flag was cleared, nobody should be messing with that folio until the cache was drained and the move was successful.
Pretty sure this doesn't work, but just to throw out an idea:
From c26e1c0ceda6c818826e5b89dc7c7c9279138f80 Mon Sep 17 00:00:00 2001 From: David Hildenbrand david@redhat.com Date: Tue, 1 Apr 2025 16:31:56 +0200 Subject: [PATCH] test
Signed-off-by: David Hildenbrand david@redhat.com --- mm/swap.c | 21 ++++++++++++++++----- 1 file changed, 16 insertions(+), 5 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c index fc8281ef42415..bbf9aa76db87f 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -175,6 +175,8 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn) folios_put(fbatch); }
+static void lru_activate(struct lruvec *lruvec, struct folio *folio); + static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch, struct folio *folio, move_fn_t move_fn, bool on_lru, bool disable_irq) @@ -191,6 +193,10 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch, else local_lock(&cpu_fbatches.lock);
+ /* We'll only perform the actual list move deferred. */ + if (move_fn == lru_activate) + folio_set_active(folio); + if (!folio_batch_add(this_cpu_ptr(fbatch), folio) || folio_test_large(folio) || lru_cache_disabled()) folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn); @@ -299,12 +305,14 @@ static void lru_activate(struct lruvec *lruvec, struct folio *folio) { long nr_pages = folio_nr_pages(folio);
- if (folio_test_active(folio) || folio_test_unevictable(folio)) - return; - + /* + * We set the folio active after clearing the LRU flag, and set the + * LRU flag only after moving it to the right list. + */ + VM_WARN_ON_ONCE(!folio_test_active(folio)); + VM_WARN_ON_ONCE(folio_test_unevictable(folio));
lruvec_del_folio(lruvec, folio); - folio_set_active(folio); lruvec_add_folio(lruvec, folio); trace_mm_lru_activate(folio);
@@ -342,7 +350,10 @@ void folio_activate(struct folio *folio) return;
lruvec = folio_lruvec_lock_irq(folio); - lru_activate(lruvec, folio); + if (!folio_test_unevictable(folio)) { + folio_set_active(folio); + lru_activate(lruvec, folio); + } unlock_page_lruvec_irq(lruvec); folio_set_lru(folio); }