On Fri, Jul 25, 2025 at 11:52 AM Baolin Wang baolin.wang@linux.alibaba.com wrote:
On 2025/7/25 02:16, Kairui Song wrote:
On Fri, Jul 25, 2025 at 1:02 AM Kairui Song ryncsn@gmail.com wrote:
On Thu, Jul 10, 2025 at 11:37 AM Kairui Song ryncsn@gmail.com wrote:
From: Kairui Song kasong@tencent.com
The current swap-in code assumes that, when a swap entry in a shmem mapping is order 0, its cached folios (if present) must be order 0 too, which turns out to not always be correct.
The problem is that shmem_split_large_entry is called before verifying that the folio will eventually be swapped in; one possible race is:
CPU1                                      CPU2
shmem_swapin_folio
/* swap in of order > 0 swap entry S1 */
  folio = swap_cache_get_folio
  /* folio = NULL */
  order = xa_get_order
  /* order > 0 */
  folio = shmem_swap_alloc_folio
  /* mTHP alloc failure, folio = NULL */
  <... Interrupted ...>
                                          shmem_swapin_folio
                                          /* S1 is swapped in */
                                          shmem_writeout
                                          /* S1 is swapped out, folio cached */
shmem_split_large_entry(..., S1)
/* S1 is split, but the folio covering it has order > 0 now */
Now any following swapin of S1 will hang: `xa_get_order` returns 0, but the folio lookup will return a folio with order > 0. The `xa_get_order(&mapping->i_pages, index) != folio_order(folio)` check is then always true, so swap-in keeps returning -EEXIST and retrying forever.
This looks fragile, so fix it by allowing a larger folio to be seen in the swap cache, and by checking that the whole shmem mapping range covered by the swapin holds the expected swap values when inserting the folio. Also drop the redundant tree walks before the insertion.
This actually improves performance, as it avoids two redundant XArray tree walks in the hot path. The only side effect is that, in the failure path, shmem may redundantly reallocate a few folios, causing slight temporary memory pressure.
Also worth noting: it may seem that the order and value checks before inserting would help reduce lock contention, but that is not true. The swap cache layer ensures that a raced swapin will either see a swap cache folio or fail the swapin (the SWAP_HAS_CACHE bit is set even when the swap cache is bypassed), so holding the folio lock and checking the folio flags is already good enough to avoid lock contention. The chance that a folio passes the swap entry value check while the shmem mapping slot has changed should be very low.
Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
 mm/shmem.c | 30 +++++++++++++++++++++---------
 1 file changed, 21 insertions(+), 9 deletions(-)
Hi All,
Just found some issues here with this patch...
diff --git a/mm/shmem.c b/mm/shmem.c
index 334b7b4a61a0..e3c9a1365ff4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -884,7 +884,9 @@ static int shmem_add_to_page_cache(struct folio *folio,
 				   pgoff_t index, void *expected, gfp_t gfp)
 {
 	XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
-	long nr = folio_nr_pages(folio);
+	unsigned long nr = folio_nr_pages(folio);
+	swp_entry_t iter, swap;
+	void *entry;
 
 	VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -896,14 +898,24 @@ static int shmem_add_to_page_cache(struct folio *folio,
 
 	gfp &= GFP_RECLAIM_MASK;
 	folio_throttle_swaprate(folio, gfp);
+	swap = iter = radix_to_swp_entry(expected);
 
 	do {
 		xas_lock_irq(&xas);
I missed an xas_reset here; better to reset the iter value too.
-		if (expected != xas_find_conflict(&xas)) {
-			xas_set_err(&xas, -EEXIST);
-			goto unlock;
+		xas_for_each_conflict(&xas, entry) {
+			/*
+			 * The range must either be empty, or filled with
+			 * expected swap entries. Shmem swap entries are never
+			 * partially freed without split of both entry and
+			 * folio, so there shouldn't be any holes.
+			 */
+			if (!expected || entry != swp_to_radix_entry(iter)) {
+				xas_set_err(&xas, -EEXIST);
+				goto unlock;
+			}
+			iter.val += 1 << xas_get_order(&xas);
 		}
-		if (expected && xas_find_conflict(&xas)) {
+		if (expected && iter.val - nr != swap.val) {
 			xas_set_err(&xas, -EEXIST);
 			goto unlock;
 		}
@@ -2323,7 +2335,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			error = -ENOMEM;
 			goto failed;
 		}
-	} else if (order != folio_order(folio)) {
+	} else if (order > folio_order(folio)) {
 		/*
 		 * Swap readahead may swap in order 0 folios into swapcache
 		 * asynchronously, while the shmem mapping can still stores
@@ -2348,15 +2360,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
 		}
+	} else if (order < folio_order(folio)) {
+		swap.val = round_down(swap.val, 1 << folio_order(folio));
 	}
 
 alloced:
 	/* We have to do this with folio locked to prevent races */
 	folio_lock(folio);
 	if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
-	    folio->swap.val != swap.val ||
-	    !shmem_confirm_swap(mapping, index, swap) ||
-	    xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
And this part is incorrect. This `shmem_confirm_swap(mapping, index, swap)` can't simply be omitted. Some of the functions below, which run before shmem_add_to_page_cache, must not be called on folios that might already have been mapped by others. This shmem_confirm_swap check ensures that doesn't happen.
OK, thanks for the reminder. But could you elaborate a bit? Which functions should not be called, and what problem might that cause?
Sure. First, shmem_add_to_page_cache itself overwrites folio->mapping and folio->index before verifying the mapping.
So even if the folio is still a valid swap cache folio and folio->swap.val matches swap.val, a parallel swapin could already have swapped in and freed this swap entry, and the folio we got may now be part of anon memory:
CPU1                                      CPU2
/* Start swap in of swap entry S1 */
shmem_swapin_folio
  /* Interrupted */
                                          /* Raced swap in of swap entry S1 */
                                          shmem_swapin_folio
                                          /* Swapin done, S1 is freed */

                                          /* Anon swapout of folio A using S1 */
                                          pageout(folio) != PAGE_SUCCESS
                                          /* Now anon folio A is in swap cache */
  folio = swap_cache_get_folio
  /* Got folio A */
  if (!folio_test_swapcache(folio) ||
      folio->swap.val != swap.val)
          error = -EEXIST;
  /* Check passed, folio A is using S1 as swap entry */
  shmem_add_to_page_cache
    folio->mapping = mapping
  /* BUG: folio->mapping is an anon mapping, info lost */
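Concretely, the clobbering happens because shmem_add_to_page_cache sets up the folio before the mapping slots are verified under the xa lock; roughly (from the start of the current function body, abbreviated):

	folio_ref_add(folio, nr);
	folio->mapping = mapping;	/* for an anon folio this overwrites the anon_vma pointer */
	folio->index = index;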
And I managed to trigger this issue; it results in at least an RSS counter error like this:
[ 1944.374356] BUG: Bad rss-counter state mm:ffff0000c1539640 type:MM_ANONPAGES val:1
[ 1944.374384] BUG: Bad rss-counter state mm:ffff0000c1539640 type:MM_SHMEMPAGES val:-1
Clearly it will trigger even more issues.
Other helpers like arch_swap_restore and shmem_replace_folio seem to be OK, but if the folio is no longer part of shmem they'd better stay off it too. So as a safety measure I think we'd better add the shmem_confirm_swap back, and checking only the first swap entry is good enough.
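For reference, shmem_confirm_swap is only a cheap single-slot peek, roughly (as it currently looks in mm/shmem.c):

static bool shmem_confirm_swap(struct address_space *mapping,
			       pgoff_t index, swp_entry_t swap)
{
	/* Lockless check that the slot still holds the expected swap entry */
	return xa_load(&mapping->i_pages, index) == swp_to_radix_entry(swap);
}

So re-adding it for the first entry only costs one extra xa_load in the swapin path.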
It may seem like a small change, but it leads to minor conflicts in one or two of the following commits, and the benchmark results will change too. So I think I'll have to send a V6.
We can remove this `shmem_confirm_swap`, but not in this series I think; maybe after it. That needs re-arranging some functions, with some cleanups for shmem_add_to_page_cache and others.
+	    folio->swap.val != swap.val) {
 		error = -EEXIST;
 		goto unlock;
 	}
-- 
2.50.0
In summary, I'll squash the following patch into it and rebase the later commits:
diff --git a/mm/shmem.c b/mm/shmem.c
index e3c9a1365ff4..4ca0b665b79e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -898,9 +898,11 @@ static int shmem_add_to_page_cache(struct folio *folio,
 
 	gfp &= GFP_RECLAIM_MASK;
 	folio_throttle_swaprate(folio, gfp);
-	swap = iter = radix_to_swp_entry(expected);
+	swap = radix_to_swp_entry(expected);
 
 	do {
+		iter = swap;
+		xas_reset(&xas);
Correction: the xas_reset here is not needed, but the `iter = swap` is.
Indeed, my tests do not cover the scenario where xas_nomem() returns true.
 		xas_lock_irq(&xas);
 		xas_for_each_conflict(&xas, entry) {
 			/*
@@ -2365,9 +2367,16 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	}
 
 alloced:
And it needs `nr_pages = folio_nr_pages(folio); index = round_down(index, nr_pages);` here...
IIUC, the index alignment should move into the 'order < folio_order(folio)' branch?
Ok, I'll move it here. It should be fine either way.
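(Just to illustrate that suggestion, not the final patch: folded into the branch it would look something like the below, reusing the nr_pages/index names from the existing function; the exact placement can be settled in V6.)

	} else if (order < folio_order(folio)) {
		/* A larger cached folio covers this entry: align the swap
		 * entry and the index to the folio boundary before the
		 * locked re-check and insertion. */
		swap.val = round_down(swap.val, 1 << folio_order(folio));
		nr_pages = folio_nr_pages(folio);
		index = round_down(index, nr_pages);
	}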
-	/* We have to do this with folio locked to prevent races */
+	/*
+	 * We have to do this with folio locked to prevent races.
+	 * The shmem_confirm_swap below only checks if the first swap
+	 * entry matches the folio, which is enough to ensure the folio
+	 * is not used outside of shmem, as shmem swap entries
+	 * and swap cache folios are never partially freed.
+	 */
 	folio_lock(folio);
 	if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
+	    !shmem_confirm_swap(mapping, index, swap) ||
 	    folio->swap.val != swap.val) {
 		error = -EEXIST;
 		goto unlock;
And I'll do some cleanup afterwards to get rid of this shmem_confirm_swap. What do you think?
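For reference, putting the fragments above together, the retry loop in shmem_add_to_page_cache should end up looking roughly like this after the squash (the folio store and statistics updates inside the loop are elided, so this is a sketch rather than the exact function body):

	swap = radix_to_swp_entry(expected);

	do {
		iter = swap;
		xas_lock_irq(&xas);
		xas_for_each_conflict(&xas, entry) {
			/*
			 * The range must either be empty, or filled with
			 * expected swap entries. Shmem swap entries are never
			 * partially freed without split of both entry and
			 * folio, so there shouldn't be any holes.
			 */
			if (!expected || entry != swp_to_radix_entry(iter)) {
				xas_set_err(&xas, -EEXIST);
				goto unlock;
			}
			iter.val += 1 << xas_get_order(&xas);
		}
		if (expected && iter.val - nr != swap.val) {
			xas_set_err(&xas, -EEXIST);
			goto unlock;
		}
		/* ... xas_store() the folio and update mapping stats ... */
unlock:
		xas_unlock_irq(&xas);
	} while (xas_nomem(&xas, gfp));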