Re: [PATCH] mm: avoid unconditional one-tick sleep when swapcache_prepare fails

23 Oct 2024

Kairui Song ryncsn@gmail.com writes:
...
On Wed, Oct 9, 2024 at 8:55 AM Huang, Ying ying.huang@intel.com wrote:
...
Barry Song 21cnbao@gmail.com writes:
...
On Thu, Oct 3, 2024 at 8:35 AM Huang, Ying ying.huang@intel.com wrote:
...
Barry Song 21cnbao@gmail.com writes:
...
On Wed, Oct 2, 2024 at 8:43 AM Huang, Ying ying.huang@intel.com wrote:
...
Barry Song 21cnbao@gmail.com writes:
> On Tue, Oct 1, 2024 at 7:43 AM Huang, Ying ying.huang@intel.com wrote:
>>
>> Barry Song 21cnbao@gmail.com writes:
>>
>> > On Sun, Sep 29, 2024 at 3:43 PM Huang, Ying ying.huang@intel.com wrote:
>> >>
>> >> Hi, Barry,
>> >>
>> >> Barry Song 21cnbao@gmail.com writes:
>> >>
>> >> > From: Barry Song v-songbaohua@oppo.com
>> >> >
>> >> > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
>> >> > introduced an unconditional one-tick sleep when `swapcache_prepare()`
>> >> > fails, which has led to reports of UI stuttering on latency-sensitive
>> >> > Android devices. To address this, we can use a waitqueue to wake up
>> >> > tasks that fail `swapcache_prepare()` sooner, instead of always
>> >> > sleeping for a full tick. While tasks may occasionally be woken by an
>> >> > unrelated `do_swap_page()`, this method is preferable to two scenarios:
>> >> > rapid re-entry into page faults, which can cause livelocks, and
>> >> > multiple millisecond sleeps, which visibly degrade user experience.
>> >>
>> >> In general, I think that this works.  Why not extend the solution to
>> >> cover schedule_timeout_uninterruptible() in __read_swap_cache_async()
>> >> too?  We can call wake_up() when we clear SWAP_HAS_CACHE.  To avoid
>> >
>> > Hi Ying,
>> > Thanks for your comments.
>> > I feel extending the solution to __read_swap_cache_async() should be done
>> > in a separate patch. On phones, I've never encountered any issues reported
>> > on that path, so it might be better suited for an optimization rather than a
>> > hotfix?
>>
>> Yes.  It's fine to do that in another patch as optimization.
>
> Ok. I'll prepare a separate patch for optimizing that path.
Thanks!
>>
>> >> overhead to call wake_up() when there's no task waiting, we can use an
>> >> atomic to count waiting tasks.
>> >
>> > I'm not sure it's worth adding the complexity, as wake_up() on an empty
>> > waitqueue should have a very low cost on its own?
>>
>> wake_up() needs to call spin_lock_irqsave() unconditionally on a global
>> shared lock.  On systems with many CPUs (such as servers), this may cause
>> severe lock contention.  Even the cache ping-pong alone may hurt
>> performance significantly.
>
> I understand that cache synchronization was a significant issue before
> qspinlock, but it seems to be less of a concern after its implementation.
Unfortunately, qspinlock cannot eliminate the cache ping-pong issue, as
discussed in the following thread.
https://lore.kernel.org/lkml/20220510192708.GQ76023@worktop.programming.kick...
> However, using a global atomic variable would still trigger cache broadcasts,
> correct?
We can only change the atomic variable to non-zero when
swapcache_prepare() returns non-zero, and call wake_up() when the atomic
variable is non-zero.  Because swapcache_prepare() returns 0 most times,
the atomic variable is 0 most times.  If we don't change the value of
atomic variable, cache ping-pong will not be triggered.
Yes. This can be implemented by adding another atomic variable.
Just realized that we don't need another atomic variable for this;
calling waitqueue_active() before wake_up() should be enough.
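
To make the point above concrete, here is a minimal userspace sketch of the "check before wake" idea, not kernel code. All names (toy_wq, toy_wake_up_if_active, and so on) are hypothetical, and the lock counter exists only for illustration. The guarded wake path only reads the waiter count when nobody is waiting, so it never writes to the shared lock in the common case. Note that in the kernel, the comments on waitqueue_active() describe the memory-barrier requirements of such a lockless check; this sketch sidesteps that by using sequentially consistent C11 atomics.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a waitqueue head; not kernel code. */
struct toy_wq {
	atomic_flag lock;          /* plays the role of the shared spinlock */
	atomic_int nr_waiters;     /* number of tasks currently queued */
	unsigned long lock_taken;  /* instrumentation: lock acquisitions */
	unsigned long wakeups;     /* instrumentation: waiters woken */
};

static void toy_wq_init(struct toy_wq *wq)
{
	atomic_flag_clear(&wq->lock);
	atomic_init(&wq->nr_waiters, 0);
	wq->lock_taken = 0;
	wq->wakeups = 0;
}

static void toy_wq_lock(struct toy_wq *wq)
{
	while (atomic_flag_test_and_set(&wq->lock))
		;	/* spin until the flag is released */
	wq->lock_taken++;
}

static void toy_wq_unlock(struct toy_wq *wq)
{
	atomic_flag_clear(&wq->lock);
}

/* Waiter side: register on the queue before going to sleep. */
static void toy_wq_add_waiter(struct toy_wq *wq)
{
	toy_wq_lock(wq);
	atomic_fetch_add(&wq->nr_waiters, 1);
	toy_wq_unlock(wq);
}

/* Unconditional wake: always takes the lock, even with nobody waiting. */
static void toy_wake_up(struct toy_wq *wq)
{
	toy_wq_lock(wq);
	wq->wakeups += (unsigned long)atomic_exchange(&wq->nr_waiters, 0);
	toy_wq_unlock(wq);
}

/* Guarded wake: read-only check first; returns true if the lock was taken. */
static bool toy_wake_up_if_active(struct toy_wq *wq)
{
	if (atomic_load(&wq->nr_waiters) == 0)
		return false;	/* common case: no write to shared state */
	toy_wake_up(wq);
	return true;
}
```

With an empty queue, toy_wake_up_if_active() returns without touching the lock, while toy_wake_up() still acquires it once per call.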
...
...
Hi, Kairui,
Do you have some test cases to test parallel zram swap-in?  If so, that
can be used to verify whether cache ping-pong is an issue and whether it
can be fixed via a global atomic variable.

Yes, Kairui, please run a test on your machine with lots of cores before
and after adding a global atomic variable as suggested by Ying. I am
sorry I don't have a server machine.
If it turns out that cache ping-pong can be an issue, another
approach would be a waitqueue hash:

Yes.  A waitqueue hash may help reduce lock contention.  And we can have
both waitqueue_active() and a waitqueue hash if necessary.  As the first
step, waitqueue_active() appears simpler.
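
For readers unfamiliar with the waitqueue-hash idea mentioned above, the following is a rough userspace sketch and not the code proposed in this thread. It assumes a fixed table of 64 buckets and a multiplicative hash over a 64-bit swap entry value; the table size, the multiplier, and every toy_* name are assumptions made for illustration. Waiters and wakers for the same entry pick the same bucket, so unrelated entries mostly stop sharing one lock.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_WQ_HASH_BITS 6
#define TOY_WQ_HASH_SIZE (1u << TOY_WQ_HASH_BITS)

/* Hypothetical per-bucket state; a real bucket would hold a waitqueue head. */
struct toy_wq_bucket {
	int nr_waiters;
};

static struct toy_wq_bucket toy_wq_table[TOY_WQ_HASH_SIZE];

/* Multiplicative hash: multiply by a large odd constant, keep the top bits. */
static unsigned int toy_hash_entry(uint64_t entry_val)
{
	uint64_t h = entry_val * 0x61C8864680B583EBull;

	return (unsigned int)(h >> (64 - TOY_WQ_HASH_BITS));
}

/* Map a swap entry value to the bucket its waiters and wakers share. */
static struct toy_wq_bucket *toy_wq_for_entry(uint64_t entry_val)
{
	return &toy_wq_table[toy_hash_entry(entry_val)];
}
```

The trade-off is the usual one for hashed locks: more buckets mean less contention but more memory, and two different entries can still collide and produce a spurious wakeup.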

Hi Andrew,
If there are no objections, can you please squash the below change? Oven
has already tested the change and the original issue was still fixed with
it. If you want me to send v2 instead, please let me know.

From a5ca401da89f3b628c3a0147e54541d0968654b2 Mon Sep 17 00:00:00 2001
From: Barry Song v-songbaohua@oppo.com
Date: Tue, 8 Oct 2024 20:18:27 +0800
Subject: [PATCH] mm: wake_up only when swapcache_wq waitqueue is active

wake_up() acquires the waitqueue spinlock even if the waitqueue is
empty. This might involve cache sync overhead. Let's only call
wake_up() when the waitqueue is active.

Suggested-by: "Huang, Ying" ying.huang@intel.com
Signed-off-by: Barry Song v-songbaohua@oppo.com
---
mm/memory.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index fe21bd3beff5..4adb2d0bcc7a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4623,7 +4623,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	/* Clear the swap cache pin for direct swapin after PTL unlock */
 	if (need_clear_cache) {
 		swapcache_clear(si, entry, nr_pages);
-		wake_up(&swapcache_wq);
+		if (waitqueue_active(&swapcache_wq))
+			wake_up(&swapcache_wq);
 	}
 	if (si)
 		put_swap_device(si);
@@ -4641,7 +4642,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	}
 	if (need_clear_cache) {
 		swapcache_clear(si, entry, nr_pages);
-		wake_up(&swapcache_wq);
+		if (waitqueue_active(&swapcache_wq))
+			wake_up(&swapcache_wq);
 	}
 	if (si)
 		put_swap_device(si);

Hi, Kairui,
Do you have time to give this patch (combined with the previous patch
from Barry) a test to check whether the overhead introduced in the
previous patch has been eliminated?

Hi Ying, Barry,
I rebased on the mm tree and ran more tests with the latest patch:
Before the two patches:
make -j96 (64k): 33814.45 35061.25 35667.54 36618.30 37381.60 37678.75
make -j96: 20456.03 20460.36 20511.55 20584.76 20751.07 20780.79
make -j64: 7490.83 7515.55 7535.30 7544.81 7564.77 7583.41
After adding waitqueue:
make -j96 (64k): 33190.60 35049.57 35732.01 36263.81 37154.05 37815.50
make -j96: 20373.27 20382.96 20428.78 20459.73 20534.59 20548.48
make -j64: 7469.18 7522.57 7527.38 7532.69 7543.36 7546.28
After adding waitqueue with waitqueue_active() check:
make -j96 (64k): 33321.03 35039.68 35552.86 36474.95 37502.76 37549.04
make -j96: 20601.39 20639.08 20692.81 20693.91 20701.35 20740.71
make -j64: 7538.63 7542.27 7564.86 7567.36 7594.14 7600.96
So I think this is just a noise-level performance change; it should be
OK either way.
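
One way to sanity-check the "noise level" reading is to compare the means of the six runs per configuration. The sketch below does that for the baseline and the final variant (waitqueue plus waitqueue_active() check), using only the numbers quoted above. The helper names are hypothetical, and the 1% threshold used to call something noise is an assumption, not something stated in this thread.

```c
#include <assert.h>
#include <stddef.h>

#define NR_RUNS 6

/* Figures copied from the results above: baseline vs. final variant. */
static const double j96_64k_before[NR_RUNS] = {
	33814.45, 35061.25, 35667.54, 36618.30, 37381.60, 37678.75 };
static const double j96_64k_checked[NR_RUNS] = {
	33321.03, 35039.68, 35552.86, 36474.95, 37502.76, 37549.04 };
static const double j96_before[NR_RUNS] = {
	20456.03, 20460.36, 20511.55, 20584.76, 20751.07, 20780.79 };
static const double j96_checked[NR_RUNS] = {
	20601.39, 20639.08, 20692.81, 20693.91, 20701.35, 20740.71 };
static const double j64_before[NR_RUNS] = {
	7490.83, 7515.55, 7535.30, 7544.81, 7564.77, 7583.41 };
static const double j64_checked[NR_RUNS] = {
	7538.63, 7542.27, 7564.86, 7567.36, 7594.14, 7600.96 };

static double toy_mean(const double *v, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Relative change of the mean of "after" versus "before" (0.01 == 1%). */
static double toy_rel_change(const double *before, const double *after,
			     size_t n)
{
	double base = toy_mean(before, n);

	return (toy_mean(after, n) - base) / base;
}

/* Assumed threshold: treat a mean shift below 1% as noise. */
static int toy_is_noise(double rel_change)
{
	double mag = rel_change < 0.0 ? -rel_change : rel_change;

	return mag < 0.01;
}
```

Under that assumed threshold, all three configurations come out as noise, and the spread within each six-run series is larger than the shift between the means.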
Thanks for your test results.  There should be bottlenecks in other
places.
--
Best Regards,
Huang, Ying

    
