On Mon, May 4, 2020 at 1:26 PM Andy Lutomirski <luto@kernel.org> wrote:
On Mon, May 4, 2020 at 1:05 PM Luck, Tony <tony.luck@intel.com> wrote:
When a copy function hits a bad page and the page is not yet known to be bad, what does it do? (I.e. the page was believed to be fine but the copy function gets #MC.) Does it unmap it right away? What does it return?
I suspect that we will only ever find a handful of situations where the kernel can recover from memory that has gone bad that are worth fixing (got to be some code path that touches a meaningful fraction of memory, otherwise we get code complexity without any meaningful payoff).
I don't think we'd want different actions for the cases of "we just found out now that this page is bad" and "we got a notification an hour ago that this page had gone bad". Currently we treat those the same for application errors ... SIGBUS either way[1].
Oh, I agree that the end result should be the same. I'm thinking more about the mechanism and the internal API. As a somewhat silly example of why there's a difference, the first time we try to read from bad memory, we can expect #MC (I assume, on a sensibly functioning platform). But, once we get the #MC, I imagine that the #MC handler will want to unmap the page to prevent a storm of additional #MC events on the same page -- given the awful x86 #MC design, too many all at once is fatal. So the next time we copy_mc_to_user() or whatever from the memory, we'll get #PF instead. Or maybe that #MC will defer the unmap?
After the consumption the PMEM driver arranges for the page to never be mapped again via its "badblocks" list.
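For reference, the return convention those machine-check-aware copy routines use (e.g. copy_mc_to_user()) is to stop at the first poisoned cacheline and report how many bytes were left uncopied. A userspace sketch of that convention, where the `poison_off` parameter merely stands in for the offset at which hardware would raise #MC (it is not a real kernel interface):

```c
#include <stddef.h>
#include <string.h>

/*
 * Model of the short-count convention: copy up to the first
 * "poisoned" byte, then return the number of bytes NOT copied
 * so the caller can report a short read. Returns 0 on full copy.
 */
static size_t copy_mc_model(char *dst, const char *src, size_t len,
                            size_t poison_off)
{
    size_t n = len < poison_off ? len : poison_off;

    memcpy(dst, src, n);    /* clean bytes before the poison */
    return len - n;         /* bytes left uncopied */
}
```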
So the point of my questions is that the overall design should be at least somewhat settled before anyone tries to review just the copy functions.
I would say that DAX / PMEM stretches the Linux memory error handling model beyond what it was originally designed for. The primary concepts that bend the assumptions of mm/memory-failure.c are:
1/ DAX pages cannot be offlined via the page allocator.
2/ DAX pages (well cachelines in those pages) can be asynchronously marked poisoned by a platform or device patrol scrub facility.
3/ DAX pages might be repaired by writes.
Currently 1/ and 2/ are managed by a per-block-device "badblocks" list that is populated by scrub results and also amended when #MC is raised (see nfit_handle_mce()). When fs/dax.c services faults it will decline to map the page if the physical file extent intersects a bad block.
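The intersection check the fault path performs can be sketched in plain C (a simplified userspace model; the struct and function names here are illustrative, not the kernel's actual badblocks API):

```c
#include <stdbool.h>
#include <stddef.h>

/* One known-bad sector range, as recorded by scrub results or #MC. */
struct bad_range {
    unsigned long long start;   /* first bad sector */
    unsigned long long len;     /* number of bad sectors */
};

/*
 * Return true if the physical file extent [start, start + len)
 * intersects any recorded bad range, in which case the fault
 * handler declines to establish the mapping.
 */
static bool extent_has_badblock(const struct bad_range *bb, size_t nr,
                                unsigned long long start,
                                unsigned long long len)
{
    for (size_t i = 0; i < nr; i++) {
        /* standard half-open interval overlap test */
        if (start < bb[i].start + bb[i].len &&
            bb[i].start < start + len)
            return true;
    }
    return false;   /* safe to map */
}
```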
There is also support for sending SIGBUS if userspace races the scrubber to consume the badblock. However, that uses the standard 'struct page' error model and assumes that a file backed page is 1:1 mapped to a file. This requirement prevents filesystems from enabling reflink. That collision and the desire to enable reflink are why we are now investigating supplanting the mm/memory-failure.c model: when the page is "owned" by a filesystem, invoke the filesystem to handle the memory error across all impacted files.
The presence of 3/ means that any action that error handling takes to disable access to the page needs to be capable of being undone, which runs counter to the mm/memory-failure.c assumption that offlining is a one-way trip.
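A minimal sketch of why 3/ forces reversibility: a write that overwrites a poisoned range repairs it, so the corresponding error record has to be removable, unlike the one-way offline in mm/memory-failure.c. This models clearing an exactly matching entry from a badblocks array (illustrative only; the real badblocks code also splits and merges partially cleared ranges):

```c
#include <stdbool.h>
#include <stddef.h>

struct bad_range {
    unsigned long long start;
    unsigned long long len;
};

/*
 * Undo a previously recorded bad range after a repairing write.
 * Returns true and removes the entry on an exact match; a real
 * implementation would also handle partial overlaps by splitting.
 */
static bool clear_badblock(struct bad_range *bb, size_t *nr,
                           unsigned long long start,
                           unsigned long long len)
{
    for (size_t i = 0; i < *nr; i++) {
        if (bb[i].start == start && bb[i].len == len) {
            bb[i] = bb[--(*nr)];   /* swap-remove the cleared entry */
            return true;           /* range is usable again */
        }
    }
    return false;
}
```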