As it stands, memory_failure() gets thoroughly confused by dev_pagemap backed mappings. The recovery code has specific enabling for several possible page states and needs new enabling to handle poison in dax mappings.

In order to support reliable reverse mapping of user space addresses, add new locking in the fsdax implementation to prevent races between page-address_space disassociation events and the rmap performed in the memory_failure() path. Additionally, since dev_pagemap pages are hidden from the page allocator, add a mechanism to determine the size of the mapping that encompasses a given poisoned pfn. Lastly, since pmem errors can be repaired, change mce_unmap_kpfn(), the protection against speculative access of poisoned pfns, to be reversible and otherwise allow ongoing access from the kernel.
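For reference, the rough shape of the {set,clear}_mce_nospec() interface this series introduces, as a minimal sketch (the bodies below are simplified illustrations built on the existing set_memory_uc()/set_memory_wb() helpers; the actual patch is authoritative):

	#include <asm/page.h>		/* pfn_to_kaddr() */
	#include <asm/set_memory.h>	/* set_memory_uc(), set_memory_wb() */

	/*
	 * Sketch: mark the kernel alias of a poisoned pfn uncacheable so
	 * speculative loads cannot consume poison...
	 */
	static inline int set_mce_nospec(unsigned long pfn)
	{
		return set_memory_uc((unsigned long)pfn_to_kaddr(pfn), 1);
	}

	/* ...and restore write-back caching once the error is repaired. */
	static inline int clear_mce_nospec(unsigned long pfn)
	{
		return set_memory_wb((unsigned long)pfn_to_kaddr(pfn), 1);
	}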
---
Dan Williams (11):
      device-dax: convert to vmf_insert_mixed and vm_fault_t
      device-dax: cleanup vm_fault de-reference chains
      device-dax: enable page_mapping()
      device-dax: set page->index
      filesystem-dax: set page->index
      filesystem-dax: perform __dax_invalidate_mapping_entry() under the page lock
      mm, madvise_inject_error: fix page count leak
      x86, memory_failure: introduce {set,clear}_mce_nospec()
      mm, memory_failure: pass page size to kill_proc()
      mm, memory_failure: teach memory_failure() about dev_pagemap pages
      libnvdimm, pmem: restore page attributes when clearing errors
 arch/x86/include/asm/set_memory.h         |   29 ++++++
 arch/x86/kernel/cpu/mcheck/mce-internal.h |   15 ---
 arch/x86/kernel/cpu/mcheck/mce.c          |   38 +-------
 drivers/dax/device.c                      |   91 ++++++++++++--------
 drivers/nvdimm/pmem.c                     |   26 ++++++
 drivers/nvdimm/pmem.h                     |   13 +++
 fs/dax.c                                  |  102 ++++++++++++++++++++--
 include/linux/huge_mm.h                   |    5 +
 include/linux/set_memory.h                |   14 +++
 mm/huge_memory.c                          |    4 -
 mm/madvise.c                              |   11 ++
 mm/memory-failure.c                       |  133 +++++++++++++++++++++++++++--
 12 files changed, 370 insertions(+), 111 deletions(-)
The madvise_inject_error() routine uses get_user_pages() to look up the pfn and other information for the injected error, but it fails to release that pin.
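The contract being violated, in miniature (an illustrative sketch, not part of the patch itself):

	struct page *page;
	int ret;

	/* get_user_pages_fast() takes a reference on each returned page */
	ret = get_user_pages_fast(start, 1, 0, &page);
	if (ret != 1)
		return ret;

	/*
	 * ...use the page...
	 *
	 * The caller must drop that reference with put_page() on every
	 * path; otherwise the page stays pinned forever and, for dax, a
	 * later truncate/unmount trips the dax_disassociate_entry()
	 * warning below.
	 */
	put_page(page);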
The dax-dma-vs-truncate warning catches this failure with the following signature:
 Injecting memory failure for pfn 0x208900 at process virtual address 0x7f3908d00000
 Memory failure: 0x208900: reserved kernel page still referenced by 1 users
 Memory failure: 0x208900: recovery action for reserved kernel page: Failed
 WARNING: CPU: 37 PID: 9566 at fs/dax.c:348 dax_disassociate_entry+0x4e/0x90
 CPU: 37 PID: 9566 Comm: umount Tainted: G        W  OE     4.17.0-rc6+ #1900
 [..]
 RIP: 0010:dax_disassociate_entry+0x4e/0x90
 RSP: 0018:ffffc9000a9b3b30 EFLAGS: 00010002
 RAX: ffffea0008224000 RBX: 0000000000208a00 RCX: 0000000000208900
 RDX: 0000000000000001 RSI: ffff8804058c6160 RDI: 0000000000000008
 RBP: 000000000822000a R08: 0000000000000002 R09: 0000000000208800
 R10: 0000000000000000 R11: 0000000000208801 R12: ffff8804058c6168
 R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000001
 FS:  00007f4548027fc0(0000) GS:ffff880431d40000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000056316d5f8988 CR3: 00000004298cc000 CR4: 00000000000406e0
 Call Trace:
  __dax_invalidate_mapping_entry+0xab/0xe0
  dax_delete_mapping_entry+0xf/0x20
  truncate_exceptional_pvec_entries.part.14+0x1d4/0x210
  truncate_inode_pages_range+0x291/0x920
  ? kmem_cache_free+0x1f8/0x300
  ? lock_acquire+0x9f/0x200
  ? truncate_inode_pages_final+0x31/0x50
  ext4_evict_inode+0x69/0x740
Cc: <stable@vger.kernel.org>
Fixes: bd1ce5f91f54 ("HWPOISON: avoid grabbing the page count...")
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/madvise.c |   11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 4d3c922ea1a1..246fa4d4eee2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -631,11 +631,13 @@ static int madvise_inject_error(int behavior,
 
 	for (; start < end; start += PAGE_SIZE << order) {
+		unsigned long pfn;
 		int ret;
 
 		ret = get_user_pages_fast(start, 1, 0, &page);
 		if (ret != 1)
 			return ret;
+		pfn = page_to_pfn(page);
 
 		/*
 		 * When soft offlining hugepages, after migrating the page
@@ -651,17 +653,20 @@ static int madvise_inject_error(int behavior,
 
 		if (behavior == MADV_SOFT_OFFLINE) {
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
-					page_to_pfn(page), start);
+					pfn, start);
 
 			ret = soft_offline_page(page, MF_COUNT_INCREASED);
+			put_page(page);
 			if (ret)
 				return ret;
 			continue;
 		}
+		put_page(page);
+
 		pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
-				page_to_pfn(page), start);
+				pfn, start);
 
-		ret = memory_failure(page_to_pfn(page), MF_COUNT_INCREASED);
+		ret = memory_failure(pfn, MF_COUNT_INCREASED);
 		if (ret)
 			return ret;
 	}
On Tue, May 22, 2018 at 07:40:09AM -0700, Dan Williams wrote:
> The madvise_inject_error() routine uses get_user_pages() to look up the
> pfn and other information for the injected error, but it fails to
> release that pin.
>
> [..]
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 4d3c922ea1a1..246fa4d4eee2 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -631,11 +631,13 @@ static int madvise_inject_error(int behavior,
>
>  	for (; start < end; start += PAGE_SIZE << order) {
> +		unsigned long pfn;
>  		int ret;
>
>  		ret = get_user_pages_fast(start, 1, 0, &page);
>  		if (ret != 1)
>  			return ret;
> +		pfn = page_to_pfn(page);
>
>  		/*
>  		 * When soft offlining hugepages, after migrating the page
> @@ -651,17 +653,20 @@ static int madvise_inject_error(int behavior,
>
>  		if (behavior == MADV_SOFT_OFFLINE) {
>  			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
> -					page_to_pfn(page), start);
> +					pfn, start);
>
>  			ret = soft_offline_page(page, MF_COUNT_INCREASED);
> +			put_page(page);
>  			if (ret)
>  				return ret;
>  			continue;
>  		}
> +		put_page(page);
> +
We keep the page count pinned after the isolation of the error page in order to make sure that the error page is disabled and never reused. This seems not explicit enough, so a comment would be helpful.
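For example, something along these lines next to the refcount handling (wording hypothetical, just restating the rationale above):

	/*
	 * Keep the page count pinned after isolating the error page so
	 * that the page stays disabled and is never reused.
	 */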
BTW, looking at the kernel message like "Memory failure: 0x208900: reserved kernel page still referenced by 1 users", memory_failure() considers dev_pagemap pages as "reserved kernel pages" (MF_MSG_KERNEL). If the memory error handler recovers a dev_pagemap page in its special way, we can define a new action_page_types entry like MF_MSG_DAX. Reporting like "Memory failure: 0xXXXXX: recovery action for dax page: Failed" might be helpful from the end user's perspective.
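Concretely, the suggestion would amount to something like this (a sketch; the name and message string are only proposals at this point):

	/* include/linux/mm.h: new entry in enum mf_action_page_type */
	MF_MSG_DAX,

	/* mm/memory-failure.c: matching entry in action_page_types[] */
	[MF_MSG_DAX] = "dax page",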
Thanks,
Naoya Horiguchi
- pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
page_to_pfn(page), start);
pfn, start);
ret = memory_failure(page_to_pfn(page), MF_COUNT_INCREASED);
if (ret) return ret; }ret = memory_failure(pfn, MF_COUNT_INCREASED);
On Tue, May 22, 2018 at 9:19 PM, Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> On Tue, May 22, 2018 at 07:40:09AM -0700, Dan Williams wrote:
> [..]
> > +		put_page(page);
> > +
> We keep the page count pinned after the isolation of the error page in
> order to make sure that the error page is disabled and never reused.
> This seems not explicit enough, so a comment would be helpful.
As far as I can see, this extra reference count to keep the page from being reused should be taken internally by memory_failure(), not assumed from the error injection path. I might be overlooking something, but I do not see who would be responsible for taking this extra reference in the case where memory_failure() is called by the machine check code rather than madvise_inject_error().
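In other words, madvise_inject_error() could end up looking something like this (hypothetical; it assumes memory_failure() grows the ability to take its own reference when MF_COUNT_INCREASED is not passed):

	/* drop the gup pin and let memory_failure() take its own reference */
	put_page(page);
	ret = memory_failure(pfn, 0);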
> BTW, looking at the kernel message like "Memory failure: 0x208900:
> reserved kernel page still referenced by 1 users", memory_failure()
> considers dev_pagemap pages as "reserved kernel pages" (MF_MSG_KERNEL).
> If the memory error handler recovers a dev_pagemap page in its special
> way, we can define a new action_page_types entry like MF_MSG_DAX.
> Reporting like "Memory failure: 0xXXXXX: recovery action for dax page:
> Failed" might be helpful from the end user's perspective.
Sounds good, I'll take a look at this.