Re: [PATCH 1/3] mm: userfaultfd: add new UFFDIO_SIGBUS ioctl

18 May 2023

      On Wed, May 17, 2023 at 3:29 PM Axel Rasmussen axelrasmussen@google.com wrote:
...
On Wed, May 17, 2023 at 3:20 PM Peter Xu peterx@redhat.com wrote:
...
On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
...
On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
...
On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen
axelrasmussen@google.com wrote:
...
So the basic way to use this new feature is:

On the new host, the guest's memory is registered with userfaultfd, in
either MISSING or MINOR mode (doesn't really matter for this purpose).
On any first access, we get a userfaultfd event. At this point we can
communicate with the old host to find out if the page was poisoned.
If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
so any future accesses will SIGBUS. Because the pte is now "present",
future accesses won't generate more userfaultfd events, they'll just
SIGBUS directly.
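
A minimal sketch of the first-access decision described above, assuming the old host's answer is available as a per-page bitmap. Every name here (poison_map, resolve_first_access, the enum) is hypothetical and is not part of the patch or of any uffd API; the sketch only models which kind of reply the fault handler would choose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: how the userfaultfd handler replies to a first access. */
enum fault_resolution {
    RESOLVE_WITH_CONTENTS, /* install the page, e.g. UFFDIO_COPY or UFFDIO_CONTINUE */
    RESOLVE_WITH_SIGBUS,   /* install the marker via the proposed UFFDIO_SIGBUS */
};

/* Assumption: the old host reports poison as one bit per guest page. */
struct poison_map {
    const uint8_t *bits;
    size_t nr_pages;
};

bool page_was_poisoned(const struct poison_map *map, size_t page_index)
{
    if (page_index >= map->nr_pages)
        return false; /* outside the reported range: treat as clean */
    return (map->bits[page_index / 8] >> (page_index % 8)) & 1;
}

enum fault_resolution resolve_first_access(const struct poison_map *map,
                                           uint64_t region_start,
                                           uint64_t page_size,
                                           uint64_t fault_addr)
{
    assert(page_size != 0 && fault_addr >= region_start);

    /* Any address inside the page maps to the same page index. */
    size_t page_index = (size_t)((fault_addr - region_start) / page_size);

    return page_was_poisoned(map, page_index) ? RESOLVE_WITH_SIGBUS
                                              : RESOLVE_WITH_CONTENTS;
}
```

Once the marker is installed, the handler is no longer consulted for that page, which is why the decision only has to be made on the first access.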

I want to clarify the SIGBUS mechanism here when KVM is involved,
keeping in mind that we need to be able to inject an MCE into the
guest for this to be useful.

1. vCPU gets an EPT violation --> KVM attempts GUP.
2. GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
3. KVM finds that GUP failed and returns -EFAULT.

This is different than if GUP found poison, in which case KVM will
actually queue up a SIGBUS *containing the address of the fault*, and
userspace can use it to inject an appropriate MCE into the guest. With
UFFDIO_SIGBUS, we are missing the address!
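
For comparison, a rough sketch of how a VMM might recover the address in the real-poison case, assuming the signal arrives as a SIGBUS with si_code BUS_MCEERR_AR (or BUS_MCEERR_AO) and the faulting address in si_addr. The helper name sigbus_has_mce_addr is hypothetical; a plain -EFAULT from KVM_RUN carries no such siginfo, which is the gap described above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns true if this siginfo describes a hardware
 * memory error, and if so stores the faulting host virtual address in
 * *addr_out. That address is what the VMM would translate to a guest
 * physical address before injecting an MCE.
 */
bool sigbus_has_mce_addr(const siginfo_t *info, void **addr_out)
{
    assert(info != NULL && addr_out != NULL);

    if (info->si_signo != SIGBUS)
        return false;
    if (info->si_code != BUS_MCEERR_AR && info->si_code != BUS_MCEERR_AO)
        return false; /* some other SIGBUS, not a memory error report */

    *addr_out = info->si_addr;
    return true;
}
```

A handler installed with sigaction() and SA_SIGINFO would pass its siginfo_t argument straight to this helper.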
I see three options:

1. Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
this is pointless.
2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a
UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS
instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated
accesses, just like how we get repeated signals for real poison.
3. Use this in conjunction with the additional KVM EFAULT info that
Anish proposed (the first part of [1]).
I think option 3 is fine. :)
Or... option 4) just use either MADV_HWPOISON or hwpoison-inject? :)
I just remember Axel mentioned this in the commit message, and just in case
this is why option 4) was ruled out:
    They expect that once poisoned, pages can never become
    "un-poisoned". So, when we live migrate the VM, we need to preserve
    the poisoned status of these pages.

Just to supplement on this point: we do have unpoison (echoing to
"debug/hwpoison/hwpoison_unpoison"), or am I wrong?
If I read unpoison_memory() correctly, once there is a real hardware
memory corruption (hw_memory_failure will be set), unpoison will stop
working and return EOPNOTSUPP.
I know some cloud providers evacuate VMs once a single memory error
happens, so not supporting unpoison is probably not a big deal for
them. BUT others do keep the VM running until more errors show up later,
which could be long after the 1st error.
...
...
...
Besides what James mentioned on "missing addr", I didn't quickly see what's
the major difference compared to the old hwpoison injection methods even
without the addr requirement. If we want the addr for MCE then it's more of
a question to ask.
I also didn't quickly see why, for whatever new way to inject a pte error, we
need to have it registered with uffd.  Could it be something like
MADV_PGERR (even if MADV_HWPOISON won't suffice) so you can inject even
without a userfault context (but still usable when uffd registered)?
And it'll always be nice to have a cover letter too (if there'll be a new
version) explaining the bits.
I do plan a v2, if for no other reason than to update the
documentation. Happy to add a cover letter with it as well.
+Jiaqi back to CC, this is one piece of a larger memory poisoning /
recovery design Jiaqi is working on, so he may have some ideas why
MADV_HWPOISON or MADV_PGERR will or won't work.
Per https://man7.org/linux/man-pages/man2/madvise.2.html,
MADV_HWPOISON "is available only for privileged (CAP_SYS_ADMIN)
processes." So for a non-root VMM, MADV_HWPOISON is not an option.
Another issue with MADV_HWPOISON is that it requires
get_user_pages_fast() to succeed first. I don't think it will work if
memory is not mapped yet.
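
To make the privilege point concrete, here is a small sketch of how a VMM could probe MADV_HWPOISON and interpret the failure. The wrapper and enum names are hypothetical. The errno mapping follows madvise(2): EPERM when the caller lacks CAP_SYS_ADMIN, EINVAL when the advice or range is not accepted (for example a kernel built without CONFIG_MEMORY_FAILURE). Do not call try_madv_hwpoison() casually as root: on success the page really is poisoned.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_HWPOISON
#define MADV_HWPOISON 100 /* Linux value; normally provided by <sys/mman.h> */
#endif

/* Hypothetical classification of a MADV_HWPOISON attempt. */
enum hwpoison_outcome {
    HWPOISON_OK,
    HWPOISON_NEEDS_CAP_SYS_ADMIN,      /* EPERM: the non-root VMM case */
    HWPOISON_UNSUPPORTED_OR_BAD_RANGE, /* EINVAL */
    HWPOISON_OTHER_FAILURE,
};

enum hwpoison_outcome classify_hwpoison_errno(int err)
{
    switch (err) {
    case 0:
        return HWPOISON_OK;
    case EPERM:
        return HWPOISON_NEEDS_CAP_SYS_ADMIN;
    case EINVAL:
        return HWPOISON_UNSUPPORTED_OR_BAD_RANGE;
    default:
        return HWPOISON_OTHER_FAILURE;
    }
}

/* Hypothetical wrapper: attempts the madvise() call and classifies the result. */
enum hwpoison_outcome try_madv_hwpoison(void *addr, size_t len)
{
    assert(len > 0);

    if (madvise(addr, len, MADV_HWPOISON) == 0)
        return HWPOISON_OK;
    return classify_hwpoison_errno(errno);
}
```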
With the UFFDIO_SIGBUS feature introduced in this patchset, it may
even be possible to free the emulated-hwpoison page back to the kernel
so we don't lose a 4K page.
I didn't find any ref/doc for MADV_PGERR. Is it something you suggest
to build, Peter?
...
One idea is, at least for our use case, we have to have the range be
userfaultfd registered, because we need to intercept the first access
and check at that point whether or not it should be poisoned. But, I
think in principle a scheme like this could work:

1. Intercept first access with UFFD
2. Issue MADV_HWPOISON or MADV_PGERR or etc to put a pte denoting the
poisoned page in place
3. UFFDIO_WAKE to have the faulting thread retry, see the new entry, and SIGBUS
It's arguably slightly weird, since normally UFFD events are resolved
with UFFDIO_* operations, but I don't see why it *couldn't* work.
Then again I am not super familiar with MADV_HWPOISON, I will have to
do a bit of reading to understand if its semantics are the same
(future accesses to this address get SIGBUS).
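
A rough sketch of the final wake step in the scheme above, using the existing UFFDIO_WAKE ioctl and struct uffdio_range from <linux/userfaultfd.h>. The helper names are hypothetical, and the marker-installing step is deliberately left as a comment because MADV_PGERR does not exist today.

```c
#include <assert.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper: the page-aligned range that covers fault_addr.
 * page_size must be a power of two.
 */
struct uffdio_range wake_range_for(uint64_t fault_addr, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    struct uffdio_range range = {
        .start = fault_addr & ~(page_size - 1),
        .len = page_size,
    };
    return range;
}

/*
 * Wake the thread blocked on the fault so it retries the access. The
 * caller is assumed to have already put the poison entry in place by
 * whatever mechanism ends up being used (MADV_HWPOISON, a future
 * MADV_PGERR, ...). Returns 0 on success, -1 with errno set on failure.
 */
int wake_faulting_thread(int uffd, uint64_t fault_addr, uint64_t page_size)
{
    struct uffdio_range range = wake_range_for(fault_addr, page_size);

    return ioctl(uffd, UFFDIO_WAKE, &range);
}
```

Whether the retried access then raises SIGBUS depends entirely on the semantics of the entry installed before the wake, which is the open question in this message.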
...
...
Thanks,
--
Peter Xu
