The patch titled Subject: mm/hwpoison: fix incorrect "not recovered" report for recovered clean pages has been added to the -mm mm-unstable branch. Its filename is mm-hwpoison-fix-incorrect-not-recovered-report-for-recovered-clean-pages.patch
This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches...
This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days
------------------------------------------------------ From: Shuai Xue xueshuai@linux.alibaba.com Subject: mm/hwpoison: fix incorrect "not recovered" report for recovered clean pages Date: Mon, 17 Feb 2025 14:33:34 +0800
When an uncorrected memory error is consumed there is a race between the CMCI from the memory controller reporting an uncorrected error with a UCNA signature, and the core reporting and SRAR signature machine check when the data is about to be consumed.
If the CMCI wins that race, the page is marked poisoned when uc_decode_notifier() calls memory_failure(). For dirty pages, memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag, converting the PTE to a hwpoison entry. As a result, kill_accessing_process():
- call walk_page_range() and return 1 regardless of whether try_to_unmap() succeeds or fails, - call kill_proc() to make sure a SIGBUS is sent - return -EHWPOISON to indicate that SIGBUS is already sent to the process and kill_me_maybe() doesn't have to send it again.
However, for clean pages, the TTU_HWPOISON flag is cleared, leaving the PTE unchanged and not converted to a hwpoison entry. Conversely, for clean pages where PTE entries are not marked as hwpoison, kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to send a SIGBUS.
Console log looks like this:
Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered Memory failure: 0x827ca68: already hardware poisoned mce: Memory error not recovered
To fix it, return 0 for "corrupted page was clean", preventing an unnecessary SIGBUS.
Link: https://lkml.kernel.org/r/20250217063335.22257-5-xueshuai@linux.alibaba.com Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"") Signed-off-by: Shuai Xue xueshuai@linux.alibaba.com Cc: stable@vger.kernel.org Cc: Acked-by:Thomas Gleixner tglx@linutronix.de Cc: Baolin Wang baolin.wang@linux.alibaba.com Cc: Borislav Betkov bp@alien8.de Cc: Dave Hansen dave.hansen@linux.intel.com Cc: "H. Peter Anvin" hpa@zytor.com Cc: Ingo Molnar mingo@redhat.com Cc: Jane Chu jane.chu@oracle.com Cc: Jarkko Sakkinen jarkko@kernel.org Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Josh Poimboeuf jpoimboe@kernel.org Cc: linmiaohe linmiaohe@huawei.com Cc: "Luck, Tony" tony.luck@intel.com Cc: Naoya Horiguchi nao.horiguchi@gmail.com Cc: Peter Zijlstra peterz@infradead.org Cc: Ruidong Tian tianruidong@linux.alibaba.com Cc: Yazen Ghannam yazen.ghannam@amd.com Signed-off-by: Andrew Morton akpm@linux-foundation.org ---
mm/memory-failure.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-)
--- a/mm/memory-failure.c~mm-hwpoison-fix-incorrect-not-recovered-report-for-recovered-clean-pages +++ a/mm/memory-failure.c @@ -881,12 +881,17 @@ static int kill_accessing_process(struct mmap_read_lock(p->mm); ret = walk_page_range(p->mm, 0, TASK_SIZE, &hwpoison_walk_ops, (void *)&priv); + /* + * ret = 1 when CMCI wins, regardless of whether try_to_unmap() + * succeeds or fails, then kill the process with SIGBUS. + * ret = 0 when poison page is a clean page and it's dropped, no + * SIGBUS is needed. + */ if (ret == 1 && priv.tk.addr) kill_proc(&priv.tk, pfn, flags); - else - ret = 0; mmap_read_unlock(p->mm); - return ret > 0 ? -EHWPOISON : -EFAULT; + + return ret > 0 ? -EHWPOISON : 0; }
/* _
Patches currently in -mm which might be from xueshuai@linux.alibaba.com are
x86-mce-collect-error-message-for-severities-below-mce_panic_severity.patch x86-mce-dump-error-msg-from-severities.patch x86-mce-add-ex_type_efault_reg-as-in-kernel-recovery-context-to-fix-copy-from-user-operations-regression.patch mm-hwpoison-fix-incorrect-not-recovered-report-for-recovered-clean-pages.patch mm-memory-failure-move-return-value-documentation-to-function-declaration.patch
linux-stable-mirror@lists.linaro.org