Lisa accidentally dropped all Tos/CCs, so here is her mail with my reply forwarded:
-------- Forwarded Message --------
Subject: Re: [RFC PATCH RESEND 1/3] mm: memory_failure: Fix MF_DELAYED handling on truncation during failure
Date: Mon, 20 Oct 2025 14:35:45 +0200
From: David Hildenbrand <david@redhat.com>
To: Lisa Wang <wyihan@google.com>
On 17.10.25 17:31, Lisa Wang wrote:
On Thu, Oct 16, 2025 at 10:18:17PM +0200, David Hildenbrand wrote:
On 15.10.25 20:58, Lisa Wang wrote:
The .error_remove_folio a_ops callback is used by different filesystems to handle folio truncation upon discovery of a memory failure in the memory associated with the given folio.
Currently, MF_DELAYED is treated as an error, causing "Failed to punch page" to be written to the console. MF_DELAYED is then relayed to the caller of truncate_error_folio() as MF_FAILED. This further causes memory_failure() to return -EBUSY, which then always causes a SIGBUS.
This also implies that regardless of whether the thread's memory corruption kill policy is PR_MCE_KILL_EARLY or PR_MCE_KILL_LATE, a memory failure within guest_memfd memory will always cause a SIGBUS.
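For reference (not part of this patch): the kill policy is a per-thread setting made via prctl(2); a minimal userspace illustration:

	#include <stdio.h>
	#include <sys/prctl.h>

	int main(void)
	{
		/*
		 * PR_MCE_KILL_EARLY: SIGBUS as soon as poison is found in the
		 * task's address space; PR_MCE_KILL_LATE: SIGBUS only when the
		 * poisoned page is actually touched. The -EBUSY path described
		 * above defeats this distinction for guest_memfd.
		 */
		if (prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_LATE, 0, 0))
			perror("prctl(PR_MCE_KILL)");
		return 0;
	}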
Update truncate_error_folio() to return MF_DELAYED to the caller if the .error_remove_folio() callback reports MF_DELAYED.
Generalize the comment: MF_DELAYED means memory failure was handled and some other part of memory failure will be handled later (e.g. a next access will result in the process being killed). Specifically for guest_memfd, a next access by the guest will result in an error returned to the userspace VMM.
With delayed handling, the filemap continues to hold refcounts on the folio. Hence, take that into account when checking for extra refcounts in me_pagecache_clean(). This is aligned with the implementation in me_swapcache_dirty(), where, if a folio is still in the swap cache, extra_pins is set to true.
Signed-off-by: Lisa Wang <wyihan@google.com>
---
 mm/memory-failure.c | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index df6ee59527dd..77f665c16a73 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -922,9 +922,11 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
  * by the m-f() handler immediately.
  *
  * MF_DELAYED - The m-f() handler marks the page as PG_hwpoisoned'ed.
- * The page is unmapped, and is removed from the LRU or file mapping.
- * An attempt to access the page again will trigger page fault and the
- * PF handler will kill the process.
+ * It means memory_failure was handled (e.g. removed from file mapping or the
+ * LRU) and some other part of memory failure will be handled later (e.g. a
+ * next access will result in the process being killed). Specifically for
+ * guest_memfd, a next access by the guest will result in an error returned
+ * to the userspace VMM.
  *
  * MF_RECOVERED - The m-f() handler marks the page as PG_hwpoisoned'ed.
  * The page has been completely isolated, that is, unmapped, taken out of
@@ -999,6 +1001,9 @@ static int truncate_error_folio(struct folio *folio, unsigned long pfn,
 	if (mapping->a_ops->error_remove_folio) {
 		int err = mapping->a_ops->error_remove_folio(mapping, folio);
 
+		if (err == MF_DELAYED)
+			return err;
+
 		if (err != 0)
 			pr_info("%#lx: Failed to punch page: %d\n", pfn, err);
 		else if (!filemap_release_folio(folio, GFP_NOIO))
@@ -1108,18 +1113,19 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
 		goto out;
 	}
 
-	/*
-	 * The shmem page is kept in page cache instead of truncating
-	 * so is expected to have an extra refcount after error-handling.
-	 */
-	extra_pins = shmem_mapping(mapping);
-
 	/*
 	 * Truncation is a bit tricky. Enable it per file system for now.
 	 *
 	 * Open: to take i_rwsem or not for this? Right now we don't.
 	 */
 	ret = truncate_error_folio(folio, page_to_pfn(p), mapping);
+
+	/*
+	 * The shmem page, or any page with MF_DELAYED error handling, is kept
+	 * in page cache instead of truncating, so is expected to have an extra
+	 * refcount after error-handling.
+	 */
+	extra_pins = shmem_mapping(mapping) || ret == MF_DELAYED;
 out:
 	folio_unlock(folio);
 
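Aside, for readers following along: extra_pins feeds the final refcount sanity check. Roughly, paraphrasing has_extra_refcount() from mm/memory-failure.c (the exact form varies by kernel version):

	static bool has_extra_refcount(struct page_state *ps, struct page *p,
				       bool extra_pins)
	{
		/* Subtract the reference memory_failure() itself holds. */
		int count = page_count(p) - 1;

		/* With delayed handling the page cache legitimately keeps refs. */
		if (extra_pins)
			count -= folio_nr_pages(page_folio(p));

		if (count > 0) {
			pr_err("%#lx: %s still referenced by %d users\n",
			       page_to_pfn(p), action_page_types[ps->type], count);
			return true;	/* unexpected pins => MF_FAILED */
		}
		return false;
	}

Without the extra_pins adjustment above, the refcount the filemap keeps on a delayed-handling folio would be misreported as an unexpected pin.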
Hello David,
Thank you for reviewing these patches!
Well, to do it cleanly, shouldn't we let shmem_error_remove_folio() also return MF_DELAYED and remove this shmem special case?
I agree shmem_error_remove_folio() should probably also return MF_DELAYED. MF_DELAYED sounds right because shmem does not truncate, and hence it should not call filemap_release_folio() to release fs-specific metadata on a folio.
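Concretely, the change under discussion would be tiny; a sketch (not a tested patch) against mm/shmem.c, where the callback currently just returns 0:

	/*
	 * Sketch only: report that the folio is kept in the page cache and
	 * the rest of the handling is deferred to the next access.
	 */
	static int shmem_error_remove_folio(struct address_space *mapping,
					    struct folio *folio)
	{
		return MF_DELAYED;
	}

With the MF_DELAYED early return added in this patch, truncate_error_folio() would then skip filemap_release_folio() for shmem as well.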
Just to clarify for others, this is the code we are talking about in truncate_error_folio():
	if (mapping->a_ops->error_remove_folio) {
		int err = mapping->a_ops->error_remove_folio(mapping, folio);

		if (err != 0)
			pr_info("%#lx: Failed to punch page: %d\n", pfn, err);
		else if (!filemap_release_folio(folio, GFP_NOIO))
			pr_info("%#lx: failed to release buffers\n", pfn);
		else
			ret = MF_RECOVERED;
	}
	...
There's no bug today in memory failure handling when shmem ends up calling filemap_release_folio(), because
Right, because shmem error_remove_folio() will currently return 0.
shmem does not have folio->private
  => filemap_release_folio() is a no-op anyway
  => filemap_release_folio() returns true
  => truncate_error_folio() returns MF_RECOVERED
Agreed.
=> truncate_error_folio()'s caller then turns MF_RECOVERED into an eventual return value of 0.
Yes.
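For completeness, the path that makes this a no-op looks roughly like this (paraphrased from mm/filemap.c; details differ across kernel versions):

	bool filemap_release_folio(struct folio *folio, gfp_t gfp)
	{
		struct address_space * const mapping = folio_mapping(folio);

		BUG_ON(!folio_test_locked(folio));
		/*
		 * shmem folios carry no folio->private / buffers, so there is
		 * nothing to release and we report success immediately.
		 */
		if (!folio_needs_release(folio))
			return true;
		if (folio_test_writeback(folio))
			return false;

		if (mapping && mapping->a_ops->release_folio)
			return mapping->a_ops->release_folio(folio, gfp);
		return try_to_free_buffers(folio);
	}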
Or is there a good reason shmem_error_remove_folio() wants to return 0 -- and maybe guest_memfd would also want to do that?
The tradeoff is that if I change shmem_error_remove_folio()'s return value, mf_stats will change.
But it actually sounds like the right thing to do, no? Who cares about the stats being delayed vs. recovered?
I'd be happy to update shmem_error_remove_folio() to return MF_DELAYED as well, but is it okay that the userspace-visible behavior in the form of statistics will change?
They never really were "recovered", but always "delayed", correct? In that case, it almost sounds like a fix.
Just reading the code, the reason for the inconsistency is unclear.
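For reference on the stats question: each result category is accounted per node and exposed under /sys/devices/system/node/nodeN/memory_failure/. Roughly, paraphrasing update_per_node_mf_stats() in mm/memory-failure.c (exact code varies by version):

	static void update_per_node_mf_stats(unsigned long pfn,
					     enum mf_result result)
	{
		struct memory_failure_stats *mf_stats =
			&NODE_DATA(pfn_to_nid(pfn))->mf_stats;

		switch (result) {
		case MF_RECOVERED:
			++mf_stats->recovered;	/* what shmem reports today */
			break;
		case MF_DELAYED:
			++mf_stats->delayed;	/* what it would report instead */
			break;
		default:
			/* MF_IGNORED and MF_FAILED are counted likewise. */
			break;
		}
		++mf_stats->total;
	}

So the userspace-visible difference is which of the two sysfs counters gets bumped.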
Another option is to add kvm_gmem_mapping() like shmem_mapping().
Oh no.
As an alternative we could introduce a new MF_* to handle this case.
But it almost sounds like MF_DELAYED does exactly what we want, so I would suggest to try that first to see if there is any pushback/good reason to let shmem result in "recovered" when it's really "delayed".
BTW, the behavior from truncate (recovered) -> keep (delayed) was added in
commit a7605426666196c5a460dd3de6f8dac1d3c21f00
Author: Yang Shi <shy828301@gmail.com>
Date:   Fri Jan 14 14:05:19 2022 -0800

    mm: shmem: don't truncate page if memory failure happens

    The current behavior of memory failure is to truncate the page cache
    regardless of dirty or clean. If the page is dirty the later access
    will get the obsolete data from disk without any notification to the
    users. This may cause silent data loss. It is even worse for shmem
    since shmem is in-memory filesystem, truncating page cache means
    discarding data blocks. The later read would return all zero.

    The right approach is to keep the corrupted page in page cache, any
    later access would return error for syscalls or SIGBUS for page fault,
    until the file is truncated, hole punched or removed. The regular
    storage backed filesystems would be more complicated so this patch
    is focused on shmem. This also unblock the support for soft
    offlining shmem THP.