This is a request for adding the following patches to stable 5.10.y.
Poisoned shmem and hugetlb pages are removed from the pagecache. Subsequent access to the offset in the file results in a NEW zero filled page. Application code does not get notified of the data loss, and the only 'clue' is a message in the system log. Data loss has been experienced by real users.
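For illustration only (not one of the patches): a minimal reproducer sketch of the user-visible symptom. The use of memfd_create() and madvise(MADV_HWPOISON) to simulate the hardware error, and the file name, are assumptions of the sketch; it needs CAP_SYS_ADMIN and CONFIG_MEMORY_FAILURE, and error handling is omitted.

/* hwpoison-shmem-repro.c -- illustrative sketch only.
 * Assumes a 4KiB page size; needs CAP_SYS_ADMIN and CONFIG_MEMORY_FAILURE
 * for MADV_HWPOISON.  Error handling is omitted for brevity.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd = memfd_create("poison-test", 0);	/* shmem backed file */

	memset(buf, 0xaa, sizeof(buf));
	pwrite(fd, buf, sizeof(buf), 0);		/* page 0 now holds 0xaa */

	/* Simulate an uncorrected memory error on page 0. */
	char *map = mmap(NULL, sizeof(buf), PROT_READ, MAP_SHARED, fd, 0);
	if (madvise(map, sizeof(buf), MADV_HWPOISON))
		perror("madvise(MADV_HWPOISON)");
	munmap(map, sizeof(buf));

	/* Unpatched: pread() succeeds and buf is all zeroes -- the 0xaa data
	 * is silently gone.  With this series: pread() fails with EIO.
	 */
	ssize_t ret = pread(fd, buf, sizeof(buf), 0);
	printf("pread: %zd (%s), first byte: 0x%02x\n", ret,
	       ret < 0 ? strerror(errno) : "ok", (unsigned char)buf[0]);
	return 0;
}

On an unpatched kernel the final pread() is expected to succeed and return zeroes; with the series applied it should fail with EIO.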
This was addressed upstream. Most commits were marked for backports, but some were not. This was discussed here [1] and here [2].
Patches apply cleanly to v5.4.224 and pass tests checking for this specific data loss issue. LTP mm tests show no regressions.
All patches except patch 4, "mm: hwpoison: handle non-anonymous THP correctly", required a small amount of change to apply correctly, mostly for context.
linux-mm Cc'ed as it would be great to get at least an ACK from others familiar with this issue.
[1] https://lore.kernel.org/linux-mm/Y2UTUNBHVY5U9si2@monkey/
[2] https://lore.kernel.org/stable/20221114131403.GA3807058@u2004/

James Houghton (1):
  hugetlbfs: don't delete error page from pagecache

Yang Shi (5):
  mm: hwpoison: remove the unnecessary THP check
  mm: filemap: check if THP has hwpoisoned subpage for PMD page fault
  mm: hwpoison: refactor refcount check handling
  mm: hwpoison: handle non-anonymous THP correctly
  mm: shmem: don't truncate page if memory failure happens

 fs/hugetlbfs/inode.c       |  13 ++--
 include/linux/page-flags.h |  23 ++++++
 mm/huge_memory.c           |   2 +
 mm/hugetlb.c               |   4 +
 mm/memory-failure.c        | 153 ++++++++++++++++++++++++-------------
 mm/memory.c                |   9 +++
 mm/page_alloc.c            |   4 +-
 mm/shmem.c                 |  51 +++++++++++--
 8 files changed, 191 insertions(+), 68 deletions(-)
From: Yang Shi <shy828301@gmail.com>

commit c7cb42e94473aafe553c0f2a3d8ca904599399ed upstream.
When handling a THP, hwpoison checked whether the THP was in the allocation or free stage, since hwpoison might otherwise mistreat it as a hugetlb page. After commit 415c64c1453a ("mm/memory-failure: split thp earlier in memory error handling") the problem has been fixed, so this check is no longer needed. Remove it. The side effect of the removal is that hwpoison may report an unsplit THP instead of an unknown error for shmem THP, which does not seem like a big deal.
The following patch, "mm: filemap: check if THP has hwpoisoned subpage for PMD page fault", depends on this; it fixes the problem that a shmem THP with hwpoisoned subpage(s) could wrongly be PMD mapped. So this patch needs to be backported to -stable as well.
Link: https://lkml.kernel.org/r/20211020210755.23964-2-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/memory-failure.c | 14 --------------
 1 file changed, 14 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index aef267c6a724..ae8b60e5f939 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -956,20 +956,6 @@ static int get_hwpoison_page(struct page *page)
 {
 	struct page *head = compound_head(page);
 
-	if (!PageHuge(head) && PageTransHuge(head)) {
-		/*
-		 * Non anonymous thp exists only in allocation/free time. We
-		 * can't handle such a case correctly, so let's give it up.
-		 * This should be better than triggering BUG_ON when kernel
-		 * tries to touch the "partially handled" page.
-		 */
-		if (!PageAnon(head)) {
-			pr_err("Memory failure: %#lx: non anonymous thp\n",
-				page_to_pfn(page));
-			return 0;
-		}
-	}
-
 	if (get_page_unless_zero(head)) {
 		if (head == compound_head(page))
 			return 1;
From: Yang Shi <shy828301@gmail.com>

commit eac96c3efdb593df1a57bb5b95dbe037bfa9a522 upstream.
When handling a shmem page fault, a THP with a corrupted subpage could be PMD mapped if certain conditions are satisfied. But the kernel is supposed to send SIGBUS when trying to map a hwpoisoned page.
There are two paths that may do the PMD map: fault-around and the regular fault path.
Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths") things were even worse in the fault-around path: the THP could be PMD mapped as long as the VMA fit, regardless of which subpage was accessed and corrupted. After this commit, the THP can still be PMD mapped as long as the head page is not corrupted.
In the regular fault path, the THP can be PMD mapped as long as the corrupted subpage is not the one being accessed and the VMA fits.
This loophole could be fixed by iterating every subpage to check whether any of them is hwpoisoned, but that is somewhat costly in the page fault path.
So introduce a new page flag called HasHWPoisoned on the first tail page. It indicates that the THP has hwpoisoned subpage(s). It is set once any subpage of the THP is found hwpoisoned by memory failure and the refcount has been bumped successfully, and cleared when the THP is freed or split.
The soft offline path doesn't need this, since the soft offline handler just marks a subpage hwpoisoned once the subpage has been migrated successfully. But shmem THP didn't get split then migrated at all.
Link: https://lkml.kernel.org/r/20211020210755.23964-3-shy828301@gmail.com
Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/page-flags.h | 23 +++++++++++++++++++++++
 mm/huge_memory.c           |  2 ++
 mm/memory-failure.c        | 14 ++++++++++++++
 mm/memory.c                |  9 +++++++++
 mm/page_alloc.c            |  4 +++-
 5 files changed, 51 insertions(+), 1 deletion(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4f6ba9379112..3f431736cfc0 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -169,6 +169,15 @@ enum pageflags {
 	/* Compound pages. Stored in first tail page's flags */
 	PG_double_map = PG_workingset,
 
+#ifdef CONFIG_MEMORY_FAILURE
+	/*
+	 * Compound pages. Stored in first tail page's flags.
+	 * Indicates that at least one subpage is hwpoisoned in the
+	 * THP.
+	 */
+	PG_has_hwpoisoned = PG_mappedtodisk,
+#endif
+
 	/* non-lru isolated movable page */
 	PG_isolated = PG_reclaim,
 
@@ -588,6 +597,20 @@ static inline void ClearPageCompound(struct page *page)
 }
 #endif
 
+#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
+/*
+ * PageHasHWPoisoned indicates that at least one subpage is hwpoisoned in the
+ * compound page.
+ *
+ * This flag is set by hwpoison handler. Cleared by THP split or free page.
+ */
+PAGEFLAG(HasHWPoisoned, has_hwpoisoned, PF_SECOND)
+	TESTSCFLAG(HasHWPoisoned, has_hwpoisoned, PF_SECOND)
+#else
+PAGEFLAG_FALSE(HasHWPoisoned)
+	TESTSCFLAG_FALSE(HasHWPoisoned)
+#endif
+
 #define PG_head_mask ((1UL << PG_head))
 
 #ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cb7b0aead709..cda0ea6c8623 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2464,6 +2464,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
+	ClearPageHasHWPoisoned(head);
+
 	for (i = nr - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
 		/* Some pages can be beyond i_size: drop them from page cache */
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ae8b60e5f939..b6ab841d1db7 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1367,6 +1367,20 @@ int memory_failure(unsigned long pfn, int flags)
 	}
 
 	if (PageTransHuge(hpage)) {
+		/*
+		 * The flag must be set after the refcount is bumped
+		 * otherwise it may race with THP split.
+		 * And the flag can't be set in get_hwpoison_page() since
+		 * it is called by soft offline too and it is just called
+		 * for !MF_COUNT_INCREASE. So here seems to be the best
+		 * place.
+		 *
+		 * Don't need care about the above error handling paths for
+		 * get_hwpoison_page() since they handle either free page
+		 * or unhandlable page. The refcount is bumped iff the
+		 * page is a valid handlable page.
+		 */
+		SetPageHasHWPoisoned(hpage);
 		if (try_to_split_thp_page(p, "Memory Failure") < 0) {
 			action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
 			return -EBUSY;
diff --git a/mm/memory.c b/mm/memory.c
index cbc0a163d705..1e9f96caabbe 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3814,6 +3814,15 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 	if (compound_order(page) != HPAGE_PMD_ORDER)
 		return ret;
 
+	/*
+	 * Just backoff if any subpage of a THP is corrupted otherwise
+	 * the corrupted page may mapped by PMD silently to escape the
+	 * check. This kind of THP just can be PTE mapped. Access to
+	 * the corrupted subpage should trigger SIGBUS as expected.
+	 */
+	if (unlikely(PageHasHWPoisoned(page)))
+		return ret;
+
 	/*
 	 * Archs like ppc64 need additonal space to store information
 	 * related to pte entry. Use the preallocated table for that.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a56f2b9df5a0..8eb745920cac 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1232,8 +1232,10 @@ static __always_inline bool free_pages_prepare(struct page *page,
 
 	VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);
 
-	if (compound)
+	if (compound) {
 		ClearPageDoubleMap(page);
+		ClearPageHasHWPoisoned(page);
+	}
 	for (i = 1; i < (1 << order); i++) {
 		if (compound)
 			bad += free_tail_pages_check(page, page + i);
From: Yang Shi <shy828301@gmail.com>
commit dd0f230a0a80ff396c7ce587f16429f2a8131344 upstream.
Memory failure will report failure if the page still has an extra pinned refcount, other than from hwpoison, after the handler is done. Actually the check is not necessary for all handlers, so move the check into the specific handlers. This will make the following patch, which keeps the shmem page in the page cache, easier.
There may be an expected extra pin in some cases, for example, when the page is dirty and in the swapcache.
Link: https://lkml.kernel.org/r/20211020210755.23964-5-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/memory-failure.c | 107 +++++++++++++++++++++++++++++---------------
 1 file changed, 72 insertions(+), 35 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index b6ab841d1db7..0f77ec1985dc 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -650,12 +650,42 @@ static int truncate_error_page(struct page *p, unsigned long pfn,
 	return ret;
 }
 
+struct page_state {
+	unsigned long mask;
+	unsigned long res;
+	enum mf_action_page_type type;
+	int (*action)(struct page_state *ps, struct page *p);
+};
+
+/*
+ * Return true if page is still referenced by others, otherwise return
+ * false.
+ *
+ * The extra_pins is true when one extra refcount is expected.
+ */
+static bool has_extra_refcount(struct page_state *ps, struct page *p,
+			       bool extra_pins)
+{
+	int count = page_count(p) - 1;
+
+	if (extra_pins)
+		count -= 1;
+
+	if (count > 0) {
+		pr_err("%#lx: %s still referenced by %d users\n",
+		       page_to_pfn(p), action_page_types[ps->type], count);
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * Error hit kernel page.
  * Do nothing, try to be lucky and not touch this instead. For a few cases we
  * could be more sophisticated.
  */
-static int me_kernel(struct page *p, unsigned long pfn)
+static int me_kernel(struct page_state *ps, struct page *p)
 {
 	return MF_IGNORED;
 }
@@ -663,18 +693,19 @@ static int me_kernel(struct page *p, unsigned long pfn)
 /*
  * Page in unknown state. Do nothing.
  */
-static int me_unknown(struct page *p, unsigned long pfn)
+static int me_unknown(struct page_state *ps, struct page *p)
 {
-	pr_err("Memory failure: %#lx: Unknown page state\n", pfn);
+	pr_err("Memory failure: %#lx: Unknown page state\n", page_to_pfn(p));
 	return MF_FAILED;
 }
 
 /*
  * Clean (or cleaned) page cache page.
  */
-static int me_pagecache_clean(struct page *p, unsigned long pfn)
+static int me_pagecache_clean(struct page_state *ps, struct page *p)
 {
 	struct address_space *mapping;
+	int ret;
 
 	delete_from_lru_cache(p);
 
@@ -705,7 +736,12 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
 	 *
 	 * Open: to take i_mutex or not for this? Right now we don't.
 	 */
-	return truncate_error_page(p, pfn, mapping);
+	ret = truncate_error_page(p, page_to_pfn(p), mapping);
+
+	if (has_extra_refcount(ps, p, false))
+		ret = MF_FAILED;
+
+	return ret;
 }
 
 /*
@@ -713,7 +749,7 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn)
  * Issues: when the error hit a hole page the error is not properly
  * propagated.
  */
-static int me_pagecache_dirty(struct page *p, unsigned long pfn)
+static int me_pagecache_dirty(struct page_state *ps, struct page *p)
 {
 	struct address_space *mapping = page_mapping(p);
 
@@ -757,7 +793,7 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn)
 		mapping_set_error(mapping, -EIO);
 	}
 
-	return me_pagecache_clean(p, pfn);
+	return me_pagecache_clean(ps, p);
 }
 
 /*
@@ -779,26 +815,38 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn)
  * Clean swap cache pages can be directly isolated. A later page fault will
  * bring in the known good data from disk.
  */
-static int me_swapcache_dirty(struct page *p, unsigned long pfn)
+static int me_swapcache_dirty(struct page_state *ps, struct page *p)
 {
+	bool extra_pins = false;
+	int ret;
+
 	ClearPageDirty(p);
 	/* Trigger EIO in shmem: */
 	ClearPageUptodate(p);
 
-	if (!delete_from_lru_cache(p))
-		return MF_DELAYED;
-	else
-		return MF_FAILED;
+	ret = delete_from_lru_cache(p) ? MF_FAILED : MF_DELAYED;
+
+	if (ret == MF_DELAYED)
+		extra_pins = true;
+
+	if (has_extra_refcount(ps, p, extra_pins))
+		ret = MF_FAILED;
+
+	return ret;
 }
 
-static int me_swapcache_clean(struct page *p, unsigned long pfn)
+static int me_swapcache_clean(struct page_state *ps, struct page *p)
 {
+	int ret;
+
 	delete_from_swap_cache(p);
 
-	if (!delete_from_lru_cache(p))
-		return MF_RECOVERED;
-	else
-		return MF_FAILED;
+	ret = delete_from_lru_cache(p) ? MF_FAILED : MF_RECOVERED;
+
+	if (has_extra_refcount(ps, p, false))
+		ret = MF_FAILED;
+
+	return ret;
 }
 
 /*
@@ -807,7 +855,7 @@ static int me_swapcache_clean(struct page *p, unsigned long pfn)
  * - Error on hugepage is contained in hugepage unit (not in raw page unit.)
  *   To narrow down kill region to one page, we need to break up pmd.
 */
-static int me_huge_page(struct page *p, unsigned long pfn)
+static int me_huge_page(struct page_state *ps, struct page *p)
 {
 	int res = 0;
 	struct page *hpage = compound_head(p);
@@ -818,7 +866,7 @@ static int me_huge_page(struct page *p, unsigned long pfn)
 
 	mapping = page_mapping(hpage);
 	if (mapping) {
-		res = truncate_error_page(hpage, pfn, mapping);
+		res = truncate_error_page(hpage, page_to_pfn(p), mapping);
 	} else {
 		unlock_page(hpage);
 		/*
@@ -833,6 +881,9 @@ static int me_huge_page(struct page *p, unsigned long pfn)
 		lock_page(hpage);
 	}
 
+	if (has_extra_refcount(ps, p, false))
+		res = MF_FAILED;
+
 	return res;
 }
 
@@ -858,12 +909,7 @@ static int me_huge_page(struct page *p, unsigned long pfn)
 #define slab (1UL << PG_slab)
 #define reserved (1UL << PG_reserved)
 
-static struct page_state {
-	unsigned long mask;
-	unsigned long res;
-	enum mf_action_page_type type;
-	int (*action)(struct page *p, unsigned long pfn);
-} error_states[] = {
+static struct page_state error_states[] = {
 	{ reserved, reserved, MF_MSG_KERNEL, me_kernel },
 	/*
 	 * free pages are specially detected outside this table:
@@ -923,18 +969,9 @@ static int page_action(struct page_state *ps, struct page *p,
 			unsigned long pfn)
 {
 	int result;
-	int count;
 
-	result = ps->action(p, pfn);
+	result = ps->action(ps, p);
 
-	count = page_count(p) - 1;
-	if (ps->action == me_swapcache_dirty && result == MF_DELAYED)
-		count--;
-	if (count > 0) {
-		pr_err("Memory failure: %#lx: %s still referenced by %d users\n",
-		       pfn, action_page_types[ps->type], count);
-		result = MF_FAILED;
-	}
 	action_result(pfn, ps->type, result);
 
 	/* Could do more checks here if page looks ok */
From: Yang Shi <shy828301@gmail.com>
commit 4966455d9100236fd6dd72b0cd00818435fdb25d upstream.
Currently hwpoison doesn't handle non-anonymous THPs, but THP support for tmpfs and read-only file cache has been available since v4.8. They can be offlined by splitting the THP, just like anonymous THPs.
Link: https://lkml.kernel.org/r/20211020210755.23964-7-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/memory-failure.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 0f77ec1985dc..1d37de089008 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1154,14 +1154,11 @@ static int identify_page_state(unsigned long pfn, struct page *p,
 static int try_to_split_thp_page(struct page *page, const char *msg)
 {
 	lock_page(page);
-	if (!PageAnon(page) || unlikely(split_huge_page(page))) {
+	if (unlikely(split_huge_page(page))) {
 		unsigned long pfn = page_to_pfn(page);
 
 		unlock_page(page);
-		if (!PageAnon(page))
-			pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
-		else
-			pr_info("%s: %#lx: thp split failed\n", msg, pfn);
+		pr_info("%s: %#lx: thp split failed\n", msg, pfn);
 		put_page(page);
 		return -EBUSY;
 	}
From: Yang Shi <shy828301@gmail.com>
commit a7605426666196c5a460dd3de6f8dac1d3c21f00 upstream.
The current behavior of memory failure is to truncate the page cache regardless of whether the page is dirty or clean. If the page is dirty, later accesses will get stale data from disk without any notification to the users. This may cause silent data loss. It is even worse for shmem: since shmem is an in-memory filesystem, truncating the page cache means discarding the data blocks, and a later read would return all zeroes.
The right approach is to keep the corrupted page in the page cache; any later access will return an error for syscalls, or SIGBUS for a page fault, until the file is truncated, hole punched or removed. Regular storage-backed filesystems would be more complicated, so this patch is focused on shmem. This also unblocks support for soft offlining shmem THP.
[akpm@linux-foundation.org: coding style fixes]
[arnd@arndb.de: fix uninitialized variable use in me_pagecache_clean()]
  Link: https://lkml.kernel.org/r/20211022064748.4173718-1-arnd@kernel.org
[Fix invalid pointer dereference in shmem_read_mapping_page_gfp() with a
 slight different implementation from what Ajay Garg <ajaygargnsit@gmail.com>
 and Muchun Song <songmuchun@bytedance.com> proposed and reworked the error
 handling of shmem_write_begin() suggested by Linus]
  Link: https://lore.kernel.org/linux-mm/20211111084617.6746-1-ajaygargnsit@gmail.co...
Link: https://lkml.kernel.org/r/20211020210755.23964-6-shy828301@gmail.com
Link: https://lkml.kernel.org/r/20211116193247.21102-1-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ajay Garg <ajaygargnsit@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Andy Lavr <andy.lavr@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/memory-failure.c | 10 ++++++++-
 mm/shmem.c          | 51 +++++++++++++++++++++++++++++++++++++++------
 2 files changed, 54 insertions(+), 7 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 1d37de089008..7d96be8e93b7 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -56,6 +56,7 @@
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include <linux/page-isolation.h>
+#include <linux/shmem_fs.h>
 #include "internal.h"
 #include "ras/ras_event.h"
 
@@ -705,6 +706,7 @@ static int me_unknown(struct page_state *ps, struct page *p)
 static int me_pagecache_clean(struct page_state *ps, struct page *p)
 {
 	struct address_space *mapping;
+	bool extra_pins;
 	int ret;
 
 	delete_from_lru_cache(p);
@@ -731,6 +733,12 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
 		return MF_FAILED;
 	}
 
+	/*
+	 * The shmem page is kept in page cache instead of truncating
+	 * so is expected to have an extra refcount after error-handling.
+	 */
+	extra_pins = shmem_mapping(mapping);
+
 	/*
 	 * Truncation is a bit tricky. Enable it per file system for now.
 	 *
@@ -738,7 +746,7 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
 	 */
 	ret = truncate_error_page(p, page_to_pfn(p), mapping);
 
-	if (has_extra_refcount(ps, p, false))
+	if (has_extra_refcount(ps, p, extra_pins))
 		ret = MF_FAILED;
 
 	return ret;
diff --git a/mm/shmem.c b/mm/shmem.c
index d3d8c5e7a296..8b539033dfe2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2521,6 +2521,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 	struct inode *inode = mapping->host;
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	pgoff_t index = pos >> PAGE_SHIFT;
+	int ret = 0;
 
 	/* i_mutex is held by caller */
 	if (unlikely(info->seals & (F_SEAL_GROW |
@@ -2531,7 +2532,19 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 			return -EPERM;
 	}
 
-	return shmem_getpage(inode, index, pagep, SGP_WRITE);
+	ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
+
+	if (ret)
+		return ret;
+
+	if (PageHWPoison(*pagep)) {
+		unlock_page(*pagep);
+		put_page(*pagep);
+		*pagep = NULL;
+		return -EIO;
+	}
+
+	return 0;
 }
 
 static int
@@ -2618,6 +2631,12 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			if (sgp == SGP_CACHE)
 				set_page_dirty(page);
 			unlock_page(page);
+
+			if (PageHWPoison(page)) {
+				put_page(page);
+				error = -EIO;
+				break;
+			}
 		}
 
 		/*
@@ -3210,7 +3229,8 @@ static const char *shmem_get_link(struct dentry *dentry,
 		page = find_get_page(inode->i_mapping, 0);
 		if (!page)
 			return ERR_PTR(-ECHILD);
-		if (!PageUptodate(page)) {
+		if (PageHWPoison(page) ||
+		    !PageUptodate(page)) {
 			put_page(page);
 			return ERR_PTR(-ECHILD);
 		}
@@ -3218,6 +3238,13 @@ static const char *shmem_get_link(struct dentry *dentry,
 		error = shmem_getpage(inode, 0, &page, SGP_READ);
 		if (error)
 			return ERR_PTR(error);
+		if (!page)
+			return ERR_PTR(-ECHILD);
+		if (PageHWPoison(page)) {
+			unlock_page(page);
+			put_page(page);
+			return ERR_PTR(-ECHILD);
+		}
 		unlock_page(page);
 	}
 	set_delayed_call(done, shmem_put_link, page);
@@ -3866,6 +3893,13 @@ static void shmem_destroy_inodecache(void)
 	kmem_cache_destroy(shmem_inode_cachep);
 }
 
+/* Keep the page in page cache instead of truncating it */
+static int shmem_error_remove_page(struct address_space *mapping,
+				   struct page *page)
+{
+	return 0;
+}
+
 static const struct address_space_operations shmem_aops = {
 	.writepage	= shmem_writepage,
 	.set_page_dirty	= __set_page_dirty_no_writeback,
@@ -3876,7 +3910,7 @@ static const struct address_space_operations shmem_aops = {
 #ifdef CONFIG_MIGRATION
 	.migratepage	= migrate_page,
 #endif
-	.error_remove_page = generic_error_remove_page,
+	.error_remove_page = shmem_error_remove_page,
 };
 
 static const struct file_operations shmem_file_operations = {
@@ -4316,9 +4350,14 @@ struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
 	error = shmem_getpage_gfp(inode, index, &page, SGP_CACHE,
 				  gfp, NULL, NULL, NULL);
 	if (error)
-		page = ERR_PTR(error);
-	else
-		unlock_page(page);
+		return ERR_PTR(error);
+
+	unlock_page(page);
+	if (PageHWPoison(page)) {
+		put_page(page);
+		return ERR_PTR(-EIO);
+	}
+
 	return page;
 #else
 	/*
From: James Houghton <jthoughton@google.com>
commit 8625147cafaa9ba74713d682f5185eb62cb2aedb upstream.
This change is very similar to the change that was made for shmem [1], and it solves the same problem but for HugeTLBFS instead.
Currently, when poison is found in a HugeTLB page, the page is removed from the page cache. That means that attempting to map or read that hugepage in the future will result in a new hugepage being allocated instead of notifying the user that the page was poisoned. As [1] states, this is effectively memory corruption.
The fix is to leave the page in the page cache. If the user attempts to use a poisoned HugeTLB page with a syscall, the syscall will fail with EIO, the same error code that shmem uses. For attempts to map the page, the thread will get a BUS_MCEERR_AR SIGBUS.
[1]: commit a76054266661 ("mm: shmem: don't truncate page if memory failure happens")
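As an editorial aside (not part of the upstream commit): a minimal sketch of how an application can tell this hwpoison notification apart from an ordinary SIGBUS, assuming a handler installed with SA_SIGINFO; the handler names and message text are illustrative.

#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Sketch only: BUS_MCEERR_AR identifies an "action required" memory
 * error; info->si_addr points into the poisoned (huge)page.
 */
static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)ucontext;
	if (info->si_code == BUS_MCEERR_AR) {
		static const char msg[] = "hwpoison: data lost at mapped address\n";
		write(STDERR_FILENO, msg, sizeof(msg) - 1);
	}
	_exit(1);
}

static void install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGBUS, &sa, NULL);
}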
Link: https://lkml.kernel.org/r/20221018200125.848471-1-jthoughton@google.com
Signed-off-by: James Houghton <jthoughton@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c | 13 ++++++-------
 mm/hugetlb.c         |  4 ++++
 mm/memory-failure.c  |  5 ++++-
 3 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a2f43f1a85f8..580fcf26a48f 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -361,6 +361,12 @@ static ssize_t hugetlbfs_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		} else {
 			unlock_page(page);
 
+			if (PageHWPoison(page)) {
+				put_page(page);
+				retval = -EIO;
+				break;
+			}
+
 			/*
 			 * We have the page, copy it to user space buffer.
 			 */
@@ -994,13 +1000,6 @@ static int hugetlbfs_migrate_page(struct address_space *mapping,
 static int hugetlbfs_error_remove_page(struct address_space *mapping,
 				struct page *page)
 {
-	struct inode *inode = mapping->host;
-	pgoff_t index = page->index;
-
-	remove_huge_page(page);
-	if (unlikely(hugetlb_unreserve_pages(inode, index, index + 1, 1)))
-		hugetlb_fix_reserve_counts(inode);
-
 	return 0;
 }
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d8c63d79af20..6c99b217a03e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4775,6 +4775,10 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	ptl = huge_pte_lockptr(h, dst_mm, dst_pte);
 	spin_lock(ptl);
 
+	ret = -EIO;
+	if (PageHWPoison(page))
+		goto out_release_unlock;
+
 	/*
 	 * Recheck the i_size after holding PT lock to make sure not
 	 * to leave any page mapped (as page_mapped()) beyond the end
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 7d96be8e93b7..f0cfd7d9c425 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -868,6 +868,7 @@ static int me_huge_page(struct page_state *ps, struct page *p)
 	int res = 0;
 	struct page *hpage = compound_head(p);
 	struct address_space *mapping;
+	bool extra_pins = false;
 
 	if (!PageHuge(hpage))
 		return MF_DELAYED;
@@ -875,6 +876,8 @@ static int me_huge_page(struct page_state *ps, struct page *p)
 	mapping = page_mapping(hpage);
 	if (mapping) {
 		res = truncate_error_page(hpage, page_to_pfn(p), mapping);
+		/* The page is kept in page cache. */
+		extra_pins = true;
 	} else {
 		unlock_page(hpage);
 		/*
@@ -889,7 +892,7 @@ static int me_huge_page(struct page_state *ps, struct page *p)
 		lock_page(hpage);
 	}
 
-	if (has_extra_refcount(ps, p, false))
+	if (has_extra_refcount(ps, p, extra_pins))
 		res = MF_FAILED;
 
 	return res;
On 2022/11/24 AM3:54, Mike Kravetz wrote:
> This is a request for adding the following patches to stable 5.10.y.
>
> Poisoned shmem and hugetlb pages are removed from the pagecache. Subsequent access to the offset in the file results in a NEW zero filled page. Application code does not get notified of the data loss, and the only 'clue' is a message in the system log. Data loss has been experienced by real users.
>
> This was addressed upstream. Most commits were marked for backports, but some were not. This was discussed here [1] and here [2].
>
> Patches apply cleanly to v5.4.224 and pass tests checking for this specific data loss issue. LTP mm tests show no regressions.
>
> All patches except patch 4, "mm: hwpoison: handle non-anonymous THP correctly", required a small amount of change to apply correctly, mostly for context.
>
> linux-mm Cc'ed as it would be great to get at least an ACK from others familiar with this issue.
>
> [1] https://lore.kernel.org/linux-mm/Y2UTUNBHVY5U9si2@monkey/
> [2] https://lore.kernel.org/stable/20221114131403.GA3807058@u2004/
>
> James Houghton (1):
>   hugetlbfs: don't delete error page from pagecache
>
> Yang Shi (5):
>   mm: hwpoison: remove the unnecessary THP check
>   mm: filemap: check if THP has hwpoisoned subpage for PMD page fault
>   mm: hwpoison: refactor refcount check handling
>   mm: hwpoison: handle non-anonymous THP correctly
>   mm: shmem: don't truncate page if memory failure happens
>
>  fs/hugetlbfs/inode.c       |  13 ++--
>  include/linux/page-flags.h |  23 ++++++
>  mm/huge_memory.c           |   2 +
>  mm/hugetlb.c               |   4 +
>  mm/memory-failure.c        | 153 ++++++++++++++++++++++++-------------
>  mm/memory.c                |   9 +++
>  mm/page_alloc.c            |   4 +-
>  mm/shmem.c                 |  51 +++++++++++--
>  8 files changed, 191 insertions(+), 68 deletions(-)
Hi, folks
Thank you for your effort. Data loss breaks data consistency for end users, so it is critical that users get notified of it.
I applied this patch set to the 5.10.168 stable release [1] and ran the mm_regression [2] test cases following the steps [3] provided by Naoya. All four cases passed.
#./run.sh project summary -p
Project Name: debug
PASS mm/hwpoison/shmem_link/link-hard.auto3
PASS mm/hwpoison/shmem_link/link-sym.auto3
PASS mm/hwpoison/shmem_rw/thp-always.auto3
PASS mm/hwpoison/shmem_rw/thp-never.auto3
Progress: 4 / 4 (100%)
Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Cheers,
Shuai
[1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tag/?h=v5.1...
[2] https://github.com/nhoriguchi/mm_regression
[3] https://lore.kernel.org/stable/20221116235842.GA62826@u2004/