The basic idea here is to "simulate" memory poisoning for VMs. A VM running on some host might encounter a memory error, after which some page(s) are poisoned (i.e., future accesses SIGBUS). They expect that once poisoned, pages can never become "un-poisoned". So, when we live migrate the VM, we need to preserve the poisoned status of these pages.
When live migrating, we try to get the guest running on its new host as quickly as possible. So, we start it running before all memory has been copied, and before we're certain which pages should be poisoned or not.
So the basic way to use this new feature is:
- On the new host, the guest's memory is registered with userfaultfd, in either MISSING or MINOR mode (doesn't really matter for this purpose). - On any first access, we get a userfaultfd event. At this point we can communicate with the old host to find out if the page was poisoned. - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker so any future accesses will SIGBUS. Because the pte is now "present", future accesses won't generate more userfaultfd events, they'll just SIGBUS directly.
UFFDIO_SIGBUS does not handle unmapping previously-present PTEs. This isn't needed, because during live migration we want to intercept all accesses with userfaultfd (not just writes, so WP mode isn't useful for this). So whether minor or missing mode is being used (or both), the PTE won't be present in any case, so handling that case isn't needed.
Signed-off-by: Axel Rasmussen axelrasmussen@google.com --- fs/userfaultfd.c | 63 ++++++++++++++++++++++++++++++++ include/linux/swapops.h | 3 +- include/linux/userfaultfd_k.h | 4 ++ include/uapi/linux/userfaultfd.h | 25 +++++++++++-- mm/memory.c | 4 ++ mm/userfaultfd.c | 62 ++++++++++++++++++++++++++++++- 6 files changed, 156 insertions(+), 5 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 0fd96d6e39ce..edc2928dae2b 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1966,6 +1966,66 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg) return ret; }
+static inline int userfaultfd_sigbus(struct userfaultfd_ctx *ctx, unsigned long arg) +{ + __s64 ret; + struct uffdio_sigbus uffdio_sigbus; + struct uffdio_sigbus __user *user_uffdio_sigbus; + struct userfaultfd_wake_range range; + + user_uffdio_sigbus = (struct uffdio_sigbus __user *)arg; + + ret = -EAGAIN; + if (atomic_read(&ctx->mmap_changing)) + goto out; + + ret = -EFAULT; + if (copy_from_user(&uffdio_sigbus, user_uffdio_sigbus, + /* don't copy the output fields */ + sizeof(uffdio_sigbus) - (sizeof(__s64)))) + goto out; + + ret = validate_range(ctx->mm, uffdio_sigbus.range.start, + uffdio_sigbus.range.len); + if (ret) + goto out; + + ret = -EINVAL; + /* double check for wraparound just in case. */ + if (uffdio_sigbus.range.start + uffdio_sigbus.range.len <= + uffdio_sigbus.range.start) { + goto out; + } + if (uffdio_sigbus.mode & ~UFFDIO_SIGBUS_MODE_DONTWAKE) + goto out; + + if (mmget_not_zero(ctx->mm)) { + ret = mfill_atomic_sigbus(ctx->mm, uffdio_sigbus.range.start, + uffdio_sigbus.range.len, + &ctx->mmap_changing, 0); + mmput(ctx->mm); + } else { + return -ESRCH; + } + + if (unlikely(put_user(ret, &user_uffdio_sigbus->updated))) + return -EFAULT; + if (ret < 0) + goto out; + + /* len == 0 would wake all */ + BUG_ON(!ret); + range.len = ret; + if (!(uffdio_sigbus.mode & UFFDIO_SIGBUS_MODE_DONTWAKE)) { + range.start = uffdio_sigbus.range.start; + wake_userfault(ctx, &range); + } + ret = range.len == uffdio_sigbus.range.len ? 0 : -EAGAIN; + +out: + return ret; +} + static inline unsigned int uffd_ctx_features(__u64 user_features) { /* @@ -2067,6 +2127,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd, case UFFDIO_CONTINUE: ret = userfaultfd_continue(ctx, arg); break; + case UFFDIO_SIGBUS: + ret = userfaultfd_sigbus(ctx, arg); + break; } return ret; } diff --git a/include/linux/swapops.h b/include/linux/swapops.h index 3a451b7afcb3..fa778a0ae730 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -405,7 +405,8 @@ typedef unsigned long pte_marker;
#define PTE_MARKER_UFFD_WP BIT(0) #define PTE_MARKER_SWAPIN_ERROR BIT(1) -#define PTE_MARKER_MASK (BIT(2) - 1) +#define PTE_MARKER_UFFD_SIGBUS BIT(2) +#define PTE_MARKER_MASK (BIT(3) - 1)
static inline swp_entry_t make_pte_marker_entry(pte_marker marker) { diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index d78b01524349..6de1084939c5 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -46,6 +46,7 @@ enum mfill_atomic_mode { MFILL_ATOMIC_COPY, MFILL_ATOMIC_ZEROPAGE, MFILL_ATOMIC_CONTINUE, + MFILL_ATOMIC_SIGBUS, NR_MFILL_ATOMIC_MODES, };
@@ -83,6 +84,9 @@ extern ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm, extern ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long dst_start, unsigned long len, atomic_t *mmap_changing, uffd_flags_t flags); +extern ssize_t mfill_atomic_sigbus(struct mm_struct *dst_mm, unsigned long start, + unsigned long len, atomic_t *mmap_changing, + uffd_flags_t flags); extern int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start, unsigned long len, bool enable_wp, atomic_t *mmap_changing); diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h index 66dd4cd277bd..616e33d3db97 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -39,7 +39,8 @@ UFFD_FEATURE_MINOR_SHMEM | \ UFFD_FEATURE_EXACT_ADDRESS | \ UFFD_FEATURE_WP_HUGETLBFS_SHMEM | \ - UFFD_FEATURE_WP_UNPOPULATED) + UFFD_FEATURE_WP_UNPOPULATED | \ + UFFD_FEATURE_SIGBUS_IOCTL) #define UFFD_API_IOCTLS \ ((__u64)1 << _UFFDIO_REGISTER | \ (__u64)1 << _UFFDIO_UNREGISTER | \ @@ -49,12 +50,14 @@ (__u64)1 << _UFFDIO_COPY | \ (__u64)1 << _UFFDIO_ZEROPAGE | \ (__u64)1 << _UFFDIO_WRITEPROTECT | \ - (__u64)1 << _UFFDIO_CONTINUE) + (__u64)1 << _UFFDIO_CONTINUE | \ + (__u64)1 << _UFFDIO_SIGBUS) #define UFFD_API_RANGE_IOCTLS_BASIC \ ((__u64)1 << _UFFDIO_WAKE | \ (__u64)1 << _UFFDIO_COPY | \ + (__u64)1 << _UFFDIO_WRITEPROTECT | \ (__u64)1 << _UFFDIO_CONTINUE | \ - (__u64)1 << _UFFDIO_WRITEPROTECT) + (__u64)1 << _UFFDIO_SIGBUS)
/* * Valid ioctl command number range with this API is from 0x00 to @@ -71,6 +74,7 @@ #define _UFFDIO_ZEROPAGE (0x04) #define _UFFDIO_WRITEPROTECT (0x06) #define _UFFDIO_CONTINUE (0x07) +#define _UFFDIO_SIGBUS (0x08) #define _UFFDIO_API (0x3F)
/* userfaultfd ioctl ids */ @@ -91,6 +95,8 @@ struct uffdio_writeprotect) #define UFFDIO_CONTINUE _IOWR(UFFDIO, _UFFDIO_CONTINUE, \ struct uffdio_continue) +#define UFFDIO_SIGBUS _IOWR(UFFDIO, _UFFDIO_SIGBUS, \ + struct uffdio_sigbus)
/* read() structure */ struct uffd_msg { @@ -225,6 +231,7 @@ struct uffdio_api { #define UFFD_FEATURE_EXACT_ADDRESS (1<<11) #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM (1<<12) #define UFFD_FEATURE_WP_UNPOPULATED (1<<13) +#define UFFD_FEATURE_SIGBUS_IOCTL (1<<14) __u64 features;
__u64 ioctls; @@ -321,6 +328,18 @@ struct uffdio_continue { __s64 mapped; };
+struct uffdio_sigbus { + struct uffdio_range range; +#define UFFDIO_SIGBUS_MODE_DONTWAKE ((__u64)1<<0) + __u64 mode; + + /* + * Fields below here are written by the ioctl and must be at the end: + * the copy_from_user will not read past here. + */ + __s64 updated; +}; + /* * Flags for the userfaultfd(2) system call itself. */ diff --git a/mm/memory.c b/mm/memory.c index f69fbc251198..e4b4207c2590 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3675,6 +3675,10 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) if (WARN_ON_ONCE(!marker)) return VM_FAULT_SIGBUS;
+ /* SIGBUS explicitly requested for this PTE. */ + if (marker & PTE_MARKER_UFFD_SIGBUS) + return VM_FAULT_SIGBUS; + /* Higher priority than uffd-wp when data corrupted */ if (marker & PTE_MARKER_SWAPIN_ERROR) return VM_FAULT_SIGBUS; diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index e97a0b4889fc..933587eebd5d 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -278,6 +278,51 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd, goto out; }
+/* Handles UFFDIO_SIGBUS for all non-hugetlb VMAs. */ +static int mfill_atomic_pte_sigbus(pmd_t *dst_pmd, + struct vm_area_struct *dst_vma, + unsigned long dst_addr, + uffd_flags_t flags) +{ + int ret; + struct mm_struct *dst_mm = dst_vma->vm_mm; + pte_t _dst_pte, *dst_pte; + spinlock_t *ptl; + + _dst_pte = make_pte_marker(PTE_MARKER_UFFD_SIGBUS); + dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); + + if (vma_is_shmem(dst_vma)) { + struct inode *inode; + pgoff_t offset, max_off; + + /* serialize against truncate with the page table lock */ + inode = dst_vma->vm_file->f_inode; + offset = linear_page_index(dst_vma, dst_addr); + max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); + ret = -EFAULT; + if (unlikely(offset >= max_off)) + goto out_unlock; + } + + ret = -EEXIST; + /* + * For now, we don't handle unmapping pages, so only support filling in + * none PTEs, or replacing PTE markers. + */ + if (!pte_none_mostly(*dst_pte)) + goto out_unlock; + + set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); + + /* No need to invalidate - it was non-present before */ + update_mmu_cache(dst_vma, dst_addr, dst_pte); + ret = 0; +out_unlock: + pte_unmap_unlock(dst_pte, ptl); + return ret; +} + static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address) { pgd_t *pgd; @@ -328,8 +373,12 @@ static __always_inline ssize_t mfill_atomic_hugetlb( * supported by hugetlb. A PMD_SIZE huge pages may exist as used * by THP. Since we can not reliably insert a zero page, this * feature is not supported. + * + * PTE marker handling for hugetlb is a bit special, so for now + * UFFDIO_SIGBUS is not supported. */ - if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) { + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE) || + uffd_flags_mode_is(flags, MFILL_ATOMIC_SIGBUS)) { mmap_read_unlock(dst_mm); return -EINVAL; } @@ -473,6 +522,9 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd, if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) { return mfill_atomic_pte_continue(dst_pmd, dst_vma, dst_addr, flags); + } else if (uffd_flags_mode_is(flags, MFILL_ATOMIC_SIGBUS)) { + return mfill_atomic_pte_sigbus(dst_pmd, dst_vma, + dst_addr, flags); }
/* @@ -694,6 +746,14 @@ ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long start, uffd_flags_set_mode(flags, MFILL_ATOMIC_CONTINUE)); }
+ssize_t mfill_atomic_sigbus(struct mm_struct *dst_mm, unsigned long start, + unsigned long len, atomic_t *mmap_changing, + uffd_flags_t flags) +{ + return mfill_atomic(dst_mm, start, 0, len, mmap_changing, + uffd_flags_set_mode(flags, MFILL_ATOMIC_SIGBUS)); +} + long uffd_wp_range(struct vm_area_struct *dst_vma, unsigned long start, unsigned long len, bool enable_wp) {
Previously, we had "one fault handler to rule them all", which used several branches to deal with all of the scenarios required by all of the various tests.
In upcoming patches, I plan to add a new test, which has its own slightly different fault handling logic. Instead of continuing to add cruft to the existing fault handler, let's allow tests to define custom ones, separate from other tests.
Signed-off-by: Axel Rasmussen axelrasmussen@google.com --- tools/testing/selftests/mm/uffd-common.c | 5 ++++- tools/testing/selftests/mm/uffd-common.h | 3 +++ tools/testing/selftests/mm/uffd-stress.c | 6 ++---- 3 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/mm/uffd-common.c b/tools/testing/selftests/mm/uffd-common.c index 61c6250adf93..c9756dbffe7d 100644 --- a/tools/testing/selftests/mm/uffd-common.c +++ b/tools/testing/selftests/mm/uffd-common.c @@ -499,6 +499,9 @@ void *uffd_poll_thread(void *arg) int ret; char tmp_chr;
+ if (!args->handle_fault) + args->handle_fault = uffd_handle_page_fault; + pollfd[0].fd = uffd; pollfd[0].events = POLLIN; pollfd[1].fd = pipefd[cpu*2]; @@ -527,7 +530,7 @@ void *uffd_poll_thread(void *arg) err("unexpected msg event %u\n", msg.event); break; case UFFD_EVENT_PAGEFAULT: - uffd_handle_page_fault(&msg, args); + args->handle_fault(&msg, args); break; case UFFD_EVENT_FORK: close(uffd); diff --git a/tools/testing/selftests/mm/uffd-common.h b/tools/testing/selftests/mm/uffd-common.h index 6068f2346b86..b28d88b9937e 100644 --- a/tools/testing/selftests/mm/uffd-common.h +++ b/tools/testing/selftests/mm/uffd-common.h @@ -77,6 +77,9 @@ struct uffd_args { unsigned long missing_faults; unsigned long wp_faults; unsigned long minor_faults; + + /* A custom fault handler; defaults to uffd_handle_page_fault. */ + void (*handle_fault)(struct uffd_msg *msg, struct uffd_args *args); };
struct uffd_test_ops { diff --git a/tools/testing/selftests/mm/uffd-stress.c b/tools/testing/selftests/mm/uffd-stress.c index f1ad9eef1c3a..47e1464935a8 100644 --- a/tools/testing/selftests/mm/uffd-stress.c +++ b/tools/testing/selftests/mm/uffd-stress.c @@ -199,10 +199,8 @@ static int stress(struct uffd_args *args) locking_thread, (void *)cpu)) return 1; if (bounces & BOUNCE_POLL) { - if (pthread_create(&uffd_threads[cpu], &attr, - uffd_poll_thread, - (void *)&args[cpu])) - return 1; + if (pthread_create(&uffd_threads[cpu], &attr, uffd_poll_thread, &args[cpu])) + err("uffd_poll_thread create"); } else { if (pthread_create(&uffd_threads[cpu], &attr, uffd_read_thread,
The test is pretty basic, and exercises UFFDIO_SIGBUS straightforwardly. We register a region with userfaultfd, in missing fault mode. For each fault, we either issue UFFDIO_ZEROPAGE (odd pages) or UFFDIO_SIGBUS (even pages). We read each page in the region, and assert that the odd pages are zeroed as expected, and the even pages yield a SIGBUS as expected.
Signed-off-by: Axel Rasmussen axelrasmussen@google.com --- tools/testing/selftests/mm/uffd-unit-tests.c | 114 ++++++++++++++++++- 1 file changed, 110 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/mm/uffd-unit-tests.c b/tools/testing/selftests/mm/uffd-unit-tests.c index 269c86768a02..3eb5a6f9b51f 100644 --- a/tools/testing/selftests/mm/uffd-unit-tests.c +++ b/tools/testing/selftests/mm/uffd-unit-tests.c @@ -881,13 +881,13 @@ static void retry_uffdio_zeropage(int ufd, } }
-static bool do_uffdio_zeropage(int ufd, bool has_zeropage) +static bool do_uffdio_zeropage(int ufd, bool has_zeropage, bool test_retry, unsigned long offset) { struct uffdio_zeropage uffdio_zeropage = { 0 }; int ret; __s64 res;
- uffdio_zeropage.range.start = (unsigned long) area_dst; + uffdio_zeropage.range.start = (unsigned long) area_dst + offset; uffdio_zeropage.range.len = page_size; uffdio_zeropage.mode = 0; ret = ioctl(ufd, UFFDIO_ZEROPAGE, &uffdio_zeropage); @@ -901,7 +901,7 @@ static bool do_uffdio_zeropage(int ufd, bool has_zeropage) } else if (has_zeropage) { if (res != page_size) err("UFFDIO_ZEROPAGE unexpected size"); - else + else if (test_retry) retry_uffdio_zeropage(ufd, &uffdio_zeropage); return true; } else @@ -938,7 +938,7 @@ static void uffd_zeropage_test(uffd_test_args_t *args) /* Ignore the retval; we already have it */ uffd_register_detect_zeropage(uffd, area_dst_alias, page_size);
- if (do_uffdio_zeropage(uffd, has_zeropage)) + if (do_uffdio_zeropage(uffd, has_zeropage, true, 0)) for (i = 0; i < page_size; i++) if (area_dst[i] != 0) err("data non-zero at offset %d\n", i); @@ -952,6 +952,106 @@ static void uffd_zeropage_test(uffd_test_args_t *args) uffd_test_pass(); }
+static void do_uffdio_sigbus(int uffd, unsigned long offset) +{ + struct uffdio_sigbus uffdio_sigbus = { 0 }; + int ret; + __s64 res; + + uffdio_sigbus.range.start = (unsigned long) area_dst + offset; + uffdio_sigbus.range.len = page_size; + uffdio_sigbus.mode = 0; + ret = ioctl(uffd, UFFDIO_SIGBUS, &uffdio_sigbus); + res = uffdio_sigbus.updated; + + if (ret) + err("UFFDIO_SIGBUS error: %"PRId64, (int64_t)res); + else if (res != page_size) + err("UFFDIO_SIGBUS unexpected size: %"PRId64, (int64_t)res); +} + +static void uffd_sigbus_ioctl_handle_fault( + struct uffd_msg *msg, struct uffd_args *args) +{ + unsigned long offset; + + if (msg->event != UFFD_EVENT_PAGEFAULT) + err("unexpected msg event %u", msg->event); + + if (msg->arg.pagefault.flags & + (UFFD_PAGEFAULT_FLAG_WP | UFFD_PAGEFAULT_FLAG_MINOR)) + err("unexpected fault type %llu", msg->arg.pagefault.flags); + + offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst; + offset &= ~(page_size-1); + + /* Odd pages -> zeropage; even pages -> sigbus. */ + if (offset & page_size) { + if (!do_uffdio_zeropage(uffd, true, false, offset)) + err("UFFDIO_ZEROPAGE failed"); + } else { + do_uffdio_sigbus(uffd, offset); + } +} + +static void uffd_sigbus_ioctl_test(uffd_test_args_t *targs) +{ + pthread_t uffd_mon; + char c; + struct uffd_args args = { 0 }; + struct sigaction act = { 0 }; + unsigned long nr_sigbus = 0; + unsigned long nr; + + fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK); + + if (!uffd_register_detect_zeropage(uffd, area_dst, nr_pages * page_size)) + err("register failed: no zeropage support"); + + args.handle_fault = uffd_sigbus_ioctl_handle_fault; + if (pthread_create(&uffd_mon, NULL, uffd_poll_thread, &args)) + err("uffd_poll_thread create"); + + sigbuf = &jbuf; + act.sa_sigaction = sighndl; + act.sa_flags = SA_SIGINFO; + if (sigaction(SIGBUS, &act, 0)) + err("sigaction"); + + for (nr = 0; nr < nr_pages; ++nr) { + unsigned long offset = nr * page_size; + const char *bytes = (const char *) area_dst + offset; + const char *i; + + if (sigsetjmp(*sigbuf, 1)) { + /* + * Access below triggered a SIGBUS, which was caught by + * sighndl, which then jumped here. Count this SIGBUS, + * and move on to next page. + */ + ++nr_sigbus; + continue; + } + + for (i = bytes; i < bytes + page_size; ++i) { + if (*i) + err("nonzero byte in area_dst (%p) at %p: %u", + area_dst, i, *i); + } + } + + if (write(pipefd[1], &c, sizeof(c)) != sizeof(c)) + err("pipe write"); + if (pthread_join(uffd_mon, NULL)) + err("pthread_join()"); + + if (nr_sigbus != nr_pages / 2) + err("expected to receive %lu SIGBUS, actually received %lu", + nr_pages / 2, nr_sigbus); + + uffd_test_pass(); +} + /* * Test the returned uffdio_register.ioctls with different register modes. * Note that _UFFDIO_ZEROPAGE is tested separately in the zeropage test. @@ -1127,6 +1227,12 @@ uffd_test_case_t uffd_tests[] = { UFFD_FEATURE_PAGEFAULT_FLAG_WP | UFFD_FEATURE_WP_HUGETLBFS_SHMEM, }, + { + .name = "sigbus-ioctl", + .uffd_fn = uffd_sigbus_ioctl_test, + .mem_targets = MEM_ALL & ~(MEM_HUGETLB | MEM_HUGETLB_PRIVATE), + .uffd_feature_required = UFFD_FEATURE_SIGBUS_IOCTL, + }, };
static void usage(const char *prog)
On 05/11/23 11:24, Axel Rasmussen wrote:
The basic idea here is to "simulate" memory poisoning for VMs. A VM running on some host might encounter a memory error, after which some page(s) are poisoned (i.e., future accesses SIGBUS). They expect that once poisoned, pages can never become "un-poisoned". So, when we live migrate the VM, we need to preserve the poisoned status of these pages.
When live migrating, we try to get the guest running on its new host as quickly as possible. So, we start it running before all memory has been copied, and before we're certain which pages should be poisoned or not.
So the basic way to use this new feature is:
- On the new host, the guest's memory is registered with userfaultfd, in either MISSING or MINOR mode (doesn't really matter for this purpose).
- On any first access, we get a userfaultfd event. At this point we can communicate with the old host to find out if the page was poisoned.
Just curious, what is this communication channel with the old host?
On Thu, May 11, 2023 at 1:29 PM Mike Kravetz mike.kravetz@oracle.com wrote:
On 05/11/23 11:24, Axel Rasmussen wrote:
The basic idea here is to "simulate" memory poisoning for VMs. A VM running on some host might encounter a memory error, after which some page(s) are poisoned (i.e., future accesses SIGBUS). They expect that once poisoned, pages can never become "un-poisoned". So, when we live migrate the VM, we need to preserve the poisoned status of these pages.
When live migrating, we try to get the guest running on its new host as quickly as possible. So, we start it running before all memory has been copied, and before we're certain which pages should be poisoned or not.
So the basic way to use this new feature is:
- On the new host, the guest's memory is registered with userfaultfd, in either MISSING or MINOR mode (doesn't really matter for this purpose).
- On any first access, we get a userfaultfd event. At this point we can communicate with the old host to find out if the page was poisoned.
Just curious, what is this communication channel with the old host?
James can probably describe it in more detail / more correctly than I can. My (possibly wrong :) ) understanding is:
On the source machine we maintain a bitmap indicating which pages are clean or dirty (meaning, modified after the initial "precopy" of memory to the target machine) or poisoned. Eventually the entire bitmap is sent to the target machine, but this takes some time (maybe seconds on large machines). After this point though we have all the information we need, we no longer need to communicate with the source to find out the status of pages (although there may still be some memory contents to finish copying over).
In the meantime, I think the target machine can also ask the source machine about the status of individual pages (for quick on-demand paging).
As for the underlying mechanism, it's an internal protocol but the publicly-available thing it's most similar to is probably gRPC [1]. At a really basic level, we send binary serialized protocol buffers [2] over the network in a request / response fashion.
[1] https://grpc.io/ [2] https://protobuf.dev/
-- Mike Kravetz
- If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker so any future accesses will SIGBUS. Because the pte is now "present", future accesses won't generate more userfaultfd events, they'll just SIGBUS directly.
UFFDIO_SIGBUS does not handle unmapping previously-present PTEs. This isn't needed, because during live migration we want to intercept all accesses with userfaultfd (not just writes, so WP mode isn't useful for this). So whether minor or missing mode is being used (or both), the PTE won't be present in any case, so handling that case isn't needed.
On Thu, May 11, 2023 at 1:40 PM Axel Rasmussen axelrasmussen@google.com wrote:
On Thu, May 11, 2023 at 1:29 PM Mike Kravetz mike.kravetz@oracle.com wrote:
On 05/11/23 11:24, Axel Rasmussen wrote:
Apologies for the noise, I should have CC'ed +Jiaqi on this series too, since he is working on other parts of the memory poisoning / recovery stuff internally.
The basic idea here is to "simulate" memory poisoning for VMs. A VM running on some host might encounter a memory error, after which some page(s) are poisoned (i.e., future accesses SIGBUS). They expect that once poisoned, pages can never become "un-poisoned". So, when we live migrate the VM, we need to preserve the poisoned status of these pages.
When live migrating, we try to get the guest running on its new host as quickly as possible. So, we start it running before all memory has been copied, and before we're certain which pages should be poisoned or not.
So the basic way to use this new feature is:
- On the new host, the guest's memory is registered with userfaultfd, in either MISSING or MINOR mode (doesn't really matter for this purpose).
- On any first access, we get a userfaultfd event. At this point we can communicate with the old host to find out if the page was poisoned.
Just curious, what is this communication channel with the old host?
James can probably describe it in more detail / more correctly than I can. My (possibly wrong :) ) understanding is:
On the source machine we maintain a bitmap indicating which pages are clean or dirty (meaning, modified after the initial "precopy" of memory to the target machine) or poisoned. Eventually the entire bitmap is sent to the target machine, but this takes some time (maybe seconds on large machines). After this point though we have all the information we need, we no longer need to communicate with the source to find out the status of pages (although there may still be some memory contents to finish copying over).
In the meantime, I think the target machine can also ask the source machine about the status of individual pages (for quick on-demand paging).
As for the underlying mechanism, it's an internal protocol but the publicly-available thing it's most similar to is probably gRPC [1]. At a really basic level, we send binary serialized protocol buffers [2] over the network in a request / response fashion.
[1] https://grpc.io/ [2] https://protobuf.dev/
-- Mike Kravetz
- If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker so any future accesses will SIGBUS. Because the pte is now "present", future accesses won't generate more userfaultfd events, they'll just SIGBUS directly.
UFFDIO_SIGBUS does not handle unmapping previously-present PTEs. This isn't needed, because during live migration we want to intercept all accesses with userfaultfd (not just writes, so WP mode isn't useful for this). So whether minor or missing mode is being used (or both), the PTE won't be present in any case, so handling that case isn't needed.
On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen axelrasmussen@google.com wrote:
So the basic way to use this new feature is:
- On the new host, the guest's memory is registered with userfaultfd, in either MISSING or MINOR mode (doesn't really matter for this purpose).
- On any first access, we get a userfaultfd event. At this point we can communicate with the old host to find out if the page was poisoned.
- If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker so any future accesses will SIGBUS. Because the pte is now "present", future accesses won't generate more userfaultfd events, they'll just SIGBUS directly.
I want to clarify the SIGBUS mechanism here when KVM is involved, keeping in mind that we need to be able to inject an MCE into the guest for this to be useful.
1. vCPU gets an EPT violation --> KVM attempts GUP. 2. GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS. 3. KVM finds that GUP failed and returns -EFAULT.
This is different than if GUP found poison, in which case KVM will actually queue up a SIGBUS *containing the address of the fault*, and userspace can use it to inject an appropriate MCE into the guest. With UFFDIO_SIGBUS, we are missing the address!
I see three options: 1. Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think this is pointless. 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated accesses, just like how we get repeated signals for real poison. 3. Use this in conjunction with the additional KVM EFAULT info that Anish proposed (the first part of [1]).
I think option 3 is fine. :)
[1]: https://lore.kernel.org/kvm/20230412213510.1220557-1-amoorthy@google.com/
- James
On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen axelrasmussen@google.com wrote:
So the basic way to use this new feature is:
- On the new host, the guest's memory is registered with userfaultfd, in either MISSING or MINOR mode (doesn't really matter for this purpose).
- On any first access, we get a userfaultfd event. At this point we can communicate with the old host to find out if the page was poisoned.
- If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker so any future accesses will SIGBUS. Because the pte is now "present", future accesses won't generate more userfaultfd events, they'll just SIGBUS directly.
I want to clarify the SIGBUS mechanism here when KVM is involved, keeping in mind that we need to be able to inject an MCE into the guest for this to be useful.
- vCPU gets an EPT violation --> KVM attempts GUP.
- GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
- KVM finds that GUP failed and returns -EFAULT.
This is different than if GUP found poison, in which case KVM will actually queue up a SIGBUS *containing the address of the fault*, and userspace can use it to inject an appropriate MCE into the guest. With UFFDIO_SIGBUS, we are missing the address!
I see three options:
- Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
this is pointless. 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated accesses, just like how we get repeated signals for real poison. 3. Use this in conjunction with the additional KVM EFAULT info that Anish proposed (the first part of [1]).
I think option 3 is fine. :)
Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)
Besides what James mentioned on "missing addr", I didn't quickly see what's the major difference comparing to the old hwpoison injection methods even without the addr requirement. If we want the addr for MCE then it's more of a question to ask.
I also didn't quickly see why for whatever new way to inject a pte error we need to have it registered with uffd. Could it be something like MADV_PGERR (even if MADV_HWPOISON won't suffice) so you can inject even without an userfault context (but still usable when uffd registered)?
And it'll be alawys nice to have a cover letter too (if there'll be a new version) explaining the bits.
Thanks,
On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen axelrasmussen@google.com wrote:
So the basic way to use this new feature is:
- On the new host, the guest's memory is registered with userfaultfd, in either MISSING or MINOR mode (doesn't really matter for this purpose).
- On any first access, we get a userfaultfd event. At this point we can communicate with the old host to find out if the page was poisoned.
- If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker so any future accesses will SIGBUS. Because the pte is now "present", future accesses won't generate more userfaultfd events, they'll just SIGBUS directly.
I want to clarify the SIGBUS mechanism here when KVM is involved, keeping in mind that we need to be able to inject an MCE into the guest for this to be useful.
- vCPU gets an EPT violation --> KVM attempts GUP.
- GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
- KVM finds that GUP failed and returns -EFAULT.
This is different than if GUP found poison, in which case KVM will actually queue up a SIGBUS *containing the address of the fault*, and userspace can use it to inject an appropriate MCE into the guest. With UFFDIO_SIGBUS, we are missing the address!
I see three options:
- Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
this is pointless. 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated accesses, just like how we get repeated signals for real poison. 3. Use this in conjunction with the additional KVM EFAULT info that Anish proposed (the first part of [1]).
I think option 3 is fine. :)
Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)
I just remember Axel mentioned this in the commit message, and just in case this is why option 4) was ruled out:
They expect that once poisoned, pages can never become "un-poisoned". So, when we live migrate the VM, we need to preserve the poisoned status of these pages.
Just to supplement on this point: we do have unpoison (echoing to "debug/hwpoison/hwpoison_unpoison"), or am I wrong?
Besides what James mentioned on "missing addr", I didn't quickly see what's the major difference comparing to the old hwpoison injection methods even without the addr requirement. If we want the addr for MCE then it's more of a question to ask.
I also didn't quickly see why for whatever new way to inject a pte error we need to have it registered with uffd. Could it be something like MADV_PGERR (even if MADV_HWPOISON won't suffice) so you can inject even without an userfault context (but still usable when uffd registered)?
And it'll be alawys nice to have a cover letter too (if there'll be a new version) explaining the bits.
Thanks,
-- Peter Xu
On Wed, May 17, 2023 at 3:20 PM Peter Xu peterx@redhat.com wrote:
On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen axelrasmussen@google.com wrote:
So the basic way to use this new feature is:
- On the new host, the guest's memory is registered with userfaultfd, in either MISSING or MINOR mode (doesn't really matter for this purpose).
- On any first access, we get a userfaultfd event. At this point we can communicate with the old host to find out if the page was poisoned.
- If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker so any future accesses will SIGBUS. Because the pte is now "present", future accesses won't generate more userfaultfd events, they'll just SIGBUS directly.
I want to clarify the SIGBUS mechanism here when KVM is involved, keeping in mind that we need to be able to inject an MCE into the guest for this to be useful.
- vCPU gets an EPT violation --> KVM attempts GUP.
- GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
- KVM finds that GUP failed and returns -EFAULT.
This is different than if GUP found poison, in which case KVM will actually queue up a SIGBUS *containing the address of the fault*, and userspace can use it to inject an appropriate MCE into the guest. With UFFDIO_SIGBUS, we are missing the address!
I see three options:
- Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
this is pointless. 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated accesses, just like how we get repeated signals for real poison. 3. Use this in conjunction with the additional KVM EFAULT info that Anish proposed (the first part of [1]).
I think option 3 is fine. :)
Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)
I just remember Axel mentioned this in the commit message, and just in case this is why option 4) was ruled out:
They expect that once poisoned, pages can never become "un-poisoned". So, when we live migrate the VM, we need to preserve the poisoned status of these pages.
Just to supplement on this point: we do have unpoison (echoing to "debug/hwpoison/hwpoison_unpoison"), or am I wrong?
Besides what James mentioned on "missing addr", I didn't quickly see what's the major difference comparing to the old hwpoison injection methods even without the addr requirement. If we want the addr for MCE then it's more of a question to ask.
I also didn't quickly see why for whatever new way to inject a pte error we need to have it registered with uffd. Could it be something like MADV_PGERR (even if MADV_HWPOISON won't suffice) so you can inject even without an userfault context (but still usable when uffd registered)?
And it'll be alawys nice to have a cover letter too (if there'll be a new version) explaining the bits.
I do plan a v2, if for no other reason than to update the documentation. Happy to add a cover letter with it as well.
+Jiaqi back to CC, this is one piece of a larger memory poisoning / recovery design Jiaqi is working on, so he may have some ideas why MADV_HWPOISON or MADV_PGER will or won't work.
One idea is, at least for our use case, we have to have the range be userfaultfd registered, because we need to intercept the first access and check at that point whether or not it should be poisoned. But, I think in principle a scheme like this could work:
1. Intercept first access with UFFD 2. Issue MADV_HWPOISON or MADV_PGERR or etc to put a pte denoting the poisoned page in place 3. UFFDIO_WAKE to have the faulting thread retry, see the new entry, and SIGBUS
It's arguably slightly weird, since normally UFFD events are resolved with UFFDIO_* operations, but I don't see why it *couldn't* work.
Then again I am not super familiar with MADV_HWPOISON, I will have to do a bit of reading to understand if its semantics are the same (future accesses to this address get SIGBUS).
Thanks,
-- Peter Xu
-- Peter Xu
Hi, Axel,
On Wed, May 17, 2023 at 03:28:36PM -0700, Axel Rasmussen wrote:
I do plan a v2, if for no other reason than to update the documentation. Happy to add a cover letter with it as well.
+Jiaqi back to CC, this is one piece of a larger memory poisoning / recovery design Jiaqi is working on, so he may have some ideas why MADV_HWPOISON or MADV_PGER will or won't work.
One idea is, at least for our use case, we have to have the range be userfaultfd registered, because we need to intercept the first access and check at that point whether or not it should be poisoned. But, I think in principle a scheme like this could work:
- Intercept first access with UFFD
- Issue MADV_HWPOISON or MADV_PGERR or etc to put a pte denoting the
poisoned page in place 3. UFFDIO_WAKE to have the faulting thread retry, see the new entry, and SIGBUS
It's arguably slightly weird, since normally UFFD events are resolved with UFFDIO_* operations, but I don't see why it *couldn't* work.
Then again I am not super familiar with MADV_HWPOISON, I will have to do a bit of reading to understand if its semantics are the same (future accesses to this address get SIGBUS).
Yes, it'll be great if this can be checked up before sending v2. What you said match exactly what I was in mind. I hope it will already work, or we can always discuss what is missing.
On Wed, May 17, 2023 at 3:29 PM Axel Rasmussen axelrasmussen@google.com wrote:
On Wed, May 17, 2023 at 3:20 PM Peter Xu peterx@redhat.com wrote:
On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen axelrasmussen@google.com wrote:
So the basic way to use this new feature is:
- On the new host, the guest's memory is registered with userfaultfd, in either MISSING or MINOR mode (doesn't really matter for this purpose).
- On any first access, we get a userfaultfd event. At this point we can communicate with the old host to find out if the page was poisoned.
- If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker so any future accesses will SIGBUS. Because the pte is now "present", future accesses won't generate more userfaultfd events, they'll just SIGBUS directly.
I want to clarify the SIGBUS mechanism here when KVM is involved, keeping in mind that we need to be able to inject an MCE into the guest for this to be useful.
- vCPU gets an EPT violation --> KVM attempts GUP.
- GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
- KVM finds that GUP failed and returns -EFAULT.
This is different than if GUP found poison, in which case KVM will actually queue up a SIGBUS *containing the address of the fault*, and userspace can use it to inject an appropriate MCE into the guest. With UFFDIO_SIGBUS, we are missing the address!
I see three options:
- Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
this is pointless. 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated accesses, just like how we get repeated signals for real poison. 3. Use this in conjunction with the additional KVM EFAULT info that Anish proposed (the first part of [1]).
I think option 3 is fine. :)
Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)
I just remember Axel mentioned this in the commit message, and just in case this is why option 4) was ruled out:
They expect that once poisoned, pages can never become "un-poisoned". So, when we live migrate the VM, we need to preserve the poisoned status of these pages.
Just to supplement on this point: we do have unpoison (echoing to "debug/hwpoison/hwpoison_unpoison"), or am I wrong?
If I read unpoison_memory() correctly, once there is a real hardware memory corruption (hw_memory_failure will be set), unpoison will stop working and return EOPNOTSUPP.
I know some cloud providers evacuating VMs once a single memory error happens, so not supporting unpoison is probably not a big deal for them. BUT others do keep VM running until more errors show up later, which could be long after the 1st error.
Besides what James mentioned on "missing addr", I didn't quickly see what's the major difference comparing to the old hwpoison injection methods even without the addr requirement. If we want the addr for MCE then it's more of a question to ask.
I also didn't quickly see why for whatever new way to inject a pte error we need to have it registered with uffd. Could it be something like MADV_PGERR (even if MADV_HWPOISON won't suffice) so you can inject even without an userfault context (but still usable when uffd registered)?
And it'll be alawys nice to have a cover letter too (if there'll be a new version) explaining the bits.
I do plan a v2, if for no other reason than to update the documentation. Happy to add a cover letter with it as well.
+Jiaqi back to CC, this is one piece of a larger memory poisoning / recovery design Jiaqi is working on, so he may have some ideas why MADV_HWPOISON or MADV_PGER will or won't work.
Per https://man7.org/linux/man-pages/man2/madvise.2.html, MADV_HWPOISON "is available only for privileged (CAP_SYS_ADMIN) processes." So for a non-root VMM, MADV_HWPOISON is out of option.
Another issue with MADV_HWPOISON is, it requires to first successfully get_user_pages_fast(). I don't think it will work if memory is not mapped yet.
With the UFFDIO_SIGBUS feature introduced in this patchset, it may even be possible to free the emulated-hwpoison page back to the kernel so we don't lose a 4K page.
I didn't find any ref/doc for MADV_PGERR. Is it something you suggest to build, Peter?
One idea is, at least for our use case, we have to have the range be userfaultfd registered, because we need to intercept the first access and check at that point whether or not it should be poisoned. But, I think in principle a scheme like this could work:
- Intercept first access with UFFD
- Issue MADV_HWPOISON or MADV_PGERR or etc to put a pte denoting the
poisoned page in place 3. UFFDIO_WAKE to have the faulting thread retry, see the new entry, and SIGBUS
It's arguably slightly weird, since normally UFFD events are resolved with UFFDIO_* operations, but I don't see why it *couldn't* work.
Then again I am not super familiar with MADV_HWPOISON, I will have to do a bit of reading to understand if its semantics are the same (future accesses to this address get SIGBUS).
Thanks,
-- Peter Xu
-- Peter Xu
On Wed, May 17, 2023 at 05:43:53PM -0700, Jiaqi Yan wrote:
On Wed, May 17, 2023 at 3:29 PM Axel Rasmussen axelrasmussen@google.com wrote:
On Wed, May 17, 2023 at 3:20 PM Peter Xu peterx@redhat.com wrote:
On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen axelrasmussen@google.com wrote:
So the basic way to use this new feature is:
- On the new host, the guest's memory is registered with userfaultfd, in either MISSING or MINOR mode (doesn't really matter for this purpose).
- On any first access, we get a userfaultfd event. At this point we can communicate with the old host to find out if the page was poisoned.
- If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker so any future accesses will SIGBUS. Because the pte is now "present", future accesses won't generate more userfaultfd events, they'll just SIGBUS directly.
I want to clarify the SIGBUS mechanism here when KVM is involved, keeping in mind that we need to be able to inject an MCE into the guest for this to be useful.
- vCPU gets an EPT violation --> KVM attempts GUP.
- GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
- KVM finds that GUP failed and returns -EFAULT.
This is different than if GUP found poison, in which case KVM will actually queue up a SIGBUS *containing the address of the fault*, and userspace can use it to inject an appropriate MCE into the guest. With UFFDIO_SIGBUS, we are missing the address!
I see three options:
- Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
this is pointless. 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated accesses, just like how we get repeated signals for real poison. 3. Use this in conjunction with the additional KVM EFAULT info that Anish proposed (the first part of [1]).
I think option 3 is fine. :)
Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)
I just remember Axel mentioned this in the commit message, and just in case this is why option 4) was ruled out:
They expect that once poisoned, pages can never become "un-poisoned". So, when we live migrate the VM, we need to preserve the poisoned status of these pages.
Just to supplement on this point: we do have unpoison (echoing to "debug/hwpoison/hwpoison_unpoison"), or am I wrong?
If I read unpoison_memory() correctly, once there is a real hardware memory corruption (hw_memory_failure will be set), unpoison will stop working and return EOPNOTSUPP.
I know some cloud providers evacuating VMs once a single memory error happens, so not supporting unpoison is probably not a big deal for them. BUT others do keep VM running until more errors show up later, which could be long after the 1st error.
We're talking about postcopy migrating a VM has poisoned page on src, rather than on dst host, am I right? IOW, the dest hwpoison should be fake.
If so, then I would assume that's the case where all the pages on the dest host is still all good (so hw_memory_failure not yet set, or I doubt the judgement of being a migration target after all)?
The other thing is even if dest host has hw poisoned page, I'm not sure whether hw_memory_failure is the only way to solve this.
I saw that this is something got worked on before from Zhenwei, David used to have some reasoning on why it was suggested like using a global knob:
https://lore.kernel.org/all/d7927214-e433-c26d-7a9c-a291ced81887@redhat.com/
Two major issues here afaics:
- Zhenwei's approach only considered x86 hwpoison - it relies on kpte having !present in entries but that's x86 specific rather than generic to memory_failure.c.
- It is _assumed_ that hwpoison injection is for debugging only.
I'm not sure whether you can fix 1) by some other ways, e.g., what if the host just remember all the hardware poisoned pfns (or remember soft-poisoned ones, but then here we need to be careful on removing them from the list when it's hwpoisoned for real)? It sounds like there's opportunity on providing a generic solution rather than relying on !pte_present().
For 2) IMHO that's not a big issue, you can declare it'll be used in !debug but production systems so as to boost the feature importance with a real use case.
So far I'd say it'll be great to leverage what it's already there in linux and make it as generic as possible. The only issue is probably CAP_ADMIN... not sure whether we can have some way to provide !ADMIN somehow, or you can simply work around this issue.
Besides what James mentioned on "missing addr", I didn't quickly see what's the major difference comparing to the old hwpoison injection methods even without the addr requirement. If we want the addr for MCE then it's more of a question to ask.
I also didn't quickly see why for whatever new way to inject a pte error we need to have it registered with uffd. Could it be something like MADV_PGERR (even if MADV_HWPOISON won't suffice) so you can inject even without an userfault context (but still usable when uffd registered)?
And it'll be alawys nice to have a cover letter too (if there'll be a new version) explaining the bits.
I do plan a v2, if for no other reason than to update the documentation. Happy to add a cover letter with it as well.
+Jiaqi back to CC, this is one piece of a larger memory poisoning / recovery design Jiaqi is working on, so he may have some ideas why MADV_HWPOISON or MADV_PGER will or won't work.
Per https://man7.org/linux/man-pages/man2/madvise.2.html, MADV_HWPOISON "is available only for privileged (CAP_SYS_ADMIN) processes." So for a non-root VMM, MADV_HWPOISON is out of option.
It makes sense to me especially when the page can be shared with other tasks.
Another issue with MADV_HWPOISON is, it requires to first successfully get_user_pages_fast(). I don't think it will work if memory is not mapped yet.
Fair point, so probably current MADV_HWPOISON got ruled out. hwpoison-inject seems fine where only the PFN is needed rather than the pte. But same issue on CAP_ADMIN indeed.
With the UFFDIO_SIGBUS feature introduced in this patchset, it may even be possible to free the emulated-hwpoison page back to the kernel so we don't lose a 4K page.
I didn't find any ref/doc for MADV_PGERR. Is it something you suggest to build, Peter?
That's something I made up just to show my question on why such an interface (even if wanted) needs to be bound to userfaultfd, e.g. a madvise() seems working if someone sololy want to install a poisoned pte.
IIUC even with an madvise one may not need CAP_ADMIN since we can limit the op to current mm only, I assume it's safe.
Here you'd want to return VM_FAULT_HWPOISON for whatever swap pte you'd like to install (in do_swap_page) with whatever new interface (assuming still a new madvise). As James mentioned, I think KVM liked that to recognize -EHWPOISON from -EFAULT. I'd say we can even consider reusing PTE_MARKER_SWAPIN_ERROR to let it just return VM_FAULT_HWPOISON directly if so.
Thanks,
On Thu, May 18, 2023 at 9:05 AM Peter Xu peterx@redhat.com wrote:
On Wed, May 17, 2023 at 05:43:53PM -0700, Jiaqi Yan wrote:
On Wed, May 17, 2023 at 3:29 PM Axel Rasmussen axelrasmussen@google.com wrote:
On Wed, May 17, 2023 at 3:20 PM Peter Xu peterx@redhat.com wrote:
On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen axelrasmussen@google.com wrote: > > So the basic way to use this new feature is: > > - On the new host, the guest's memory is registered with userfaultfd, in > either MISSING or MINOR mode (doesn't really matter for this purpose). > - On any first access, we get a userfaultfd event. At this point we can > communicate with the old host to find out if the page was poisoned. > - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker > so any future accesses will SIGBUS. Because the pte is now "present", > future accesses won't generate more userfaultfd events, they'll just > SIGBUS directly.
I want to clarify the SIGBUS mechanism here when KVM is involved, keeping in mind that we need to be able to inject an MCE into the guest for this to be useful.
- vCPU gets an EPT violation --> KVM attempts GUP.
- GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
- KVM finds that GUP failed and returns -EFAULT.
This is different than if GUP found poison, in which case KVM will actually queue up a SIGBUS *containing the address of the fault*, and userspace can use it to inject an appropriate MCE into the guest. With UFFDIO_SIGBUS, we are missing the address!
I see three options:
- Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
this is pointless. 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated accesses, just like how we get repeated signals for real poison. 3. Use this in conjunction with the additional KVM EFAULT info that Anish proposed (the first part of [1]).
I think option 3 is fine. :)
Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)
I just remember Axel mentioned this in the commit message, and just in case this is why option 4) was ruled out:
They expect that once poisoned, pages can never become "un-poisoned". So, when we live migrate the VM, we need to preserve the poisoned status of these pages.
Just to supplement on this point: we do have unpoison (echoing to "debug/hwpoison/hwpoison_unpoison"), or am I wrong?
If I read unpoison_memory() correctly, once there is a real hardware memory corruption (hw_memory_failure will be set), unpoison will stop working and return EOPNOTSUPP.
I know some cloud providers evacuating VMs once a single memory error happens, so not supporting unpoison is probably not a big deal for them. BUT others do keep VM running until more errors show up later, which could be long after the 1st error.
We're talking about postcopy migrating a VM has poisoned page on src, rather than on dst host, am I right? IOW, the dest hwpoison should be fake.
If so, then I would assume that's the case where all the pages on the dest host is still all good (so hw_memory_failure not yet set, or I doubt the judgement of being a migration target after all)?
The other thing is even if dest host has hw poisoned page, I'm not sure whether hw_memory_failure is the only way to solve this.
I saw that this is something got worked on before from Zhenwei, David used to have some reasoning on why it was suggested like using a global knob:
https://lore.kernel.org/all/d7927214-e433-c26d-7a9c-a291ced81887@redhat.com/
Two major issues here afaics:
Zhenwei's approach only considered x86 hwpoison - it relies on kpte having !present in entries but that's x86 specific rather than generic to memory_failure.c.
It is _assumed_ that hwpoison injection is for debugging only.
I'm not sure whether you can fix 1) by some other ways, e.g., what if the host just remember all the hardware poisoned pfns (or remember soft-poisoned ones, but then here we need to be careful on removing them from the list when it's hwpoisoned for real)? It sounds like there's opportunity on providing a generic solution rather than relying on !pte_present().
For 2) IMHO that's not a big issue, you can declare it'll be used in !debug but production systems so as to boost the feature importance with a real use case.
So far I'd say it'll be great to leverage what it's already there in linux and make it as generic as possible. The only issue is probably CAP_ADMIN... not sure whether we can have some way to provide !ADMIN somehow, or you can simply work around this issue.
As you mention below I think the key distinction is the scope - I think MADV_HWPOISON affects the whole system, including other processes.
For our purposes, we really just want to "poison" this particular virtual address (the HVA, from the VM's perspective), not even other mappings of the same shared memory. I think that behavior is different from MADV_HWPOISON, at least.
Besides what James mentioned on "missing addr", I didn't quickly see what's the major difference comparing to the old hwpoison injection methods even without the addr requirement. If we want the addr for MCE then it's more of a question to ask.
I also didn't quickly see why for whatever new way to inject a pte error we need to have it registered with uffd. Could it be something like MADV_PGERR (even if MADV_HWPOISON won't suffice) so you can inject even without an userfault context (but still usable when uffd registered)?
And it'll be alawys nice to have a cover letter too (if there'll be a new version) explaining the bits.
I do plan a v2, if for no other reason than to update the documentation. Happy to add a cover letter with it as well.
+Jiaqi back to CC, this is one piece of a larger memory poisoning / recovery design Jiaqi is working on, so he may have some ideas why MADV_HWPOISON or MADV_PGER will or won't work.
Per https://man7.org/linux/man-pages/man2/madvise.2.html, MADV_HWPOISON "is available only for privileged (CAP_SYS_ADMIN) processes." So for a non-root VMM, MADV_HWPOISON is out of option.
It makes sense to me especially when the page can be shared with other tasks.
Another issue with MADV_HWPOISON is, it requires to first successfully get_user_pages_fast(). I don't think it will work if memory is not mapped yet.
Fair point, so probably current MADV_HWPOISON got ruled out. hwpoison-inject seems fine where only the PFN is needed rather than the pte. But same issue on CAP_ADMIN indeed.
With the UFFDIO_SIGBUS feature introduced in this patchset, it may even be possible to free the emulated-hwpoison page back to the kernel so we don't lose a 4K page.
I didn't find any ref/doc for MADV_PGERR. Is it something you suggest to build, Peter?
That's something I made up just to show my question on why such an interface (even if wanted) needs to be bound to userfaultfd, e.g. a madvise() seems working if someone sololy want to install a poisoned pte.
I look at it a bit differently...
Even existing UFFDIO_* operations could technically be separated from userfaultfd. You could imagine a MADV_MAP_PAGE instead of UFFDIO_CONTINUE. UFFDIO_COPY is a bit trickier since it takes an argument, but it could be done with process_madvise(). (Granted, I'm not sure this would be useful... But this is equally true for UFFDIO_SIGBUS; it seems non-live-migration use cases could use MADV_HWPOISON, and for live migration use cases we will be using UFFD.)
We've sort of setup a convention with userfaultfd where at a high level users are supposed to:
1. Receive events from the uffd 2. Resolve those events with UFFDIO_* ioctls 3. Wake up with UFFDIO_WAKE to retry the fault that generated the original event (can be combined with step 2 of course)
So for me, even if MADV_PGERR or similar existed, I would be tempted to add a UFFDIO_SIGBUS as well, even if it just calls the same underlying function to do the same thing, if only for consistency (with the idea "UFFD events are resolved by UFFD ioctls") from the user's perspective.
IIUC even with an madvise one may not need CAP_ADMIN since we can limit the op to current mm only, I assume it's safe.
Here you'd want to return VM_FAULT_HWPOISON for whatever swap pte you'd like to install (in do_swap_page) with whatever new interface (assuming still a new madvise). As James mentioned, I think KVM liked that to recognize -EHWPOISON from -EFAULT. I'd say we can even consider reusing PTE_MARKER_SWAPIN_ERROR to let it just return VM_FAULT_HWPOISON directly if so.
Thanks,
-- Peter Xu
On Thu, May 18, 2023 at 01:38:09PM -0700, Axel Rasmussen wrote:
On Thu, May 18, 2023 at 9:05 AM Peter Xu peterx@redhat.com wrote:
On Wed, May 17, 2023 at 05:43:53PM -0700, Jiaqi Yan wrote:
On Wed, May 17, 2023 at 3:29 PM Axel Rasmussen axelrasmussen@google.com wrote:
On Wed, May 17, 2023 at 3:20 PM Peter Xu peterx@redhat.com wrote:
On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote: > On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen > axelrasmussen@google.com wrote: > > > > So the basic way to use this new feature is: > > > > - On the new host, the guest's memory is registered with userfaultfd, in > > either MISSING or MINOR mode (doesn't really matter for this purpose). > > - On any first access, we get a userfaultfd event. At this point we can > > communicate with the old host to find out if the page was poisoned. > > - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker > > so any future accesses will SIGBUS. Because the pte is now "present", > > future accesses won't generate more userfaultfd events, they'll just > > SIGBUS directly. > > I want to clarify the SIGBUS mechanism here when KVM is involved, > keeping in mind that we need to be able to inject an MCE into the > guest for this to be useful. > > 1. vCPU gets an EPT violation --> KVM attempts GUP. > 2. GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS. > 3. KVM finds that GUP failed and returns -EFAULT. > > This is different than if GUP found poison, in which case KVM will > actually queue up a SIGBUS *containing the address of the fault*, and > userspace can use it to inject an appropriate MCE into the guest. With > UFFDIO_SIGBUS, we are missing the address! > > I see three options: > 1. Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think > this is pointless. > 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a > UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS > instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated > accesses, just like how we get repeated signals for real poison. > 3. Use this in conjunction with the additional KVM EFAULT info that > Anish proposed (the first part of [1]). > > I think option 3 is fine. :)
Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)
I just remember Axel mentioned this in the commit message, and just in case this is why option 4) was ruled out:
They expect that once poisoned, pages can never become "un-poisoned". So, when we live migrate the VM, we need to preserve the poisoned status of these pages.
Just to supplement on this point: we do have unpoison (echoing to "debug/hwpoison/hwpoison_unpoison"), or am I wrong?
If I read unpoison_memory() correctly, once there is a real hardware memory corruption (hw_memory_failure will be set), unpoison will stop working and return EOPNOTSUPP.
I know some cloud providers evacuating VMs once a single memory error happens, so not supporting unpoison is probably not a big deal for them. BUT others do keep VM running until more errors show up later, which could be long after the 1st error.
We're talking about postcopy migrating a VM has poisoned page on src, rather than on dst host, am I right? IOW, the dest hwpoison should be fake.
If so, then I would assume that's the case where all the pages on the dest host is still all good (so hw_memory_failure not yet set, or I doubt the judgement of being a migration target after all)?
The other thing is even if dest host has hw poisoned page, I'm not sure whether hw_memory_failure is the only way to solve this.
I saw that this is something got worked on before from Zhenwei, David used to have some reasoning on why it was suggested like using a global knob:
https://lore.kernel.org/all/d7927214-e433-c26d-7a9c-a291ced81887@redhat.com/
Two major issues here afaics:
Zhenwei's approach only considered x86 hwpoison - it relies on kpte having !present in entries but that's x86 specific rather than generic to memory_failure.c.
It is _assumed_ that hwpoison injection is for debugging only.
I'm not sure whether you can fix 1) by some other ways, e.g., what if the host just remember all the hardware poisoned pfns (or remember soft-poisoned ones, but then here we need to be careful on removing them from the list when it's hwpoisoned for real)? It sounds like there's opportunity on providing a generic solution rather than relying on !pte_present().
For 2) IMHO that's not a big issue, you can declare it'll be used in !debug but production systems so as to boost the feature importance with a real use case.
So far I'd say it'll be great to leverage what it's already there in linux and make it as generic as possible. The only issue is probably CAP_ADMIN... not sure whether we can have some way to provide !ADMIN somehow, or you can simply work around this issue.
As you mention below I think the key distinction is the scope - I think MADV_HWPOISON affects the whole system, including other processes.
For our purposes, we really just want to "poison" this particular virtual address (the HVA, from the VM's perspective), not even other mappings of the same shared memory. I think that behavior is different from MADV_HWPOISON, at least.
Besides what James mentioned on "missing addr", I didn't quickly see what's the major difference comparing to the old hwpoison injection methods even without the addr requirement. If we want the addr for MCE then it's more of a question to ask.
I also didn't quickly see why for whatever new way to inject a pte error we need to have it registered with uffd. Could it be something like MADV_PGERR (even if MADV_HWPOISON won't suffice) so you can inject even without an userfault context (but still usable when uffd registered)?
And it'll be alawys nice to have a cover letter too (if there'll be a new version) explaining the bits.
I do plan a v2, if for no other reason than to update the documentation. Happy to add a cover letter with it as well.
+Jiaqi back to CC, this is one piece of a larger memory poisoning / recovery design Jiaqi is working on, so he may have some ideas why MADV_HWPOISON or MADV_PGER will or won't work.
Per https://man7.org/linux/man-pages/man2/madvise.2.html, MADV_HWPOISON "is available only for privileged (CAP_SYS_ADMIN) processes." So for a non-root VMM, MADV_HWPOISON is out of option.
It makes sense to me especially when the page can be shared with other tasks.
Another issue with MADV_HWPOISON is, it requires to first successfully get_user_pages_fast(). I don't think it will work if memory is not mapped yet.
Fair point, so probably current MADV_HWPOISON got ruled out. hwpoison-inject seems fine where only the PFN is needed rather than the pte. But same issue on CAP_ADMIN indeed.
With the UFFDIO_SIGBUS feature introduced in this patchset, it may even be possible to free the emulated-hwpoison page back to the kernel so we don't lose a 4K page.
I didn't find any ref/doc for MADV_PGERR. Is it something you suggest to build, Peter?
That's something I made up just to show my question on why such an interface (even if wanted) needs to be bound to userfaultfd, e.g. a madvise() seems working if someone sololy want to install a poisoned pte.
I look at it a bit differently...
Even existing UFFDIO_* operations could technically be separated from userfaultfd. You could imagine a MADV_MAP_PAGE instead of UFFDIO_CONTINUE. UFFDIO_COPY is a bit trickier since it takes an argument, but it could be done with process_madvise(). (Granted, I'm not sure this would be useful... But this is equally true for UFFDIO_SIGBUS; it seems non-live-migration use cases could use MADV_HWPOISON, and for live migration use cases we will be using UFFD.)
We've sort of setup a convention with userfaultfd where at a high level users are supposed to:
- Receive events from the uffd
- Resolve those events with UFFDIO_* ioctls
- Wake up with UFFDIO_WAKE to retry the fault that generated the
original event (can be combined with step 2 of course)
So for me, even if MADV_PGERR or similar existed, I would be tempted to add a UFFDIO_SIGBUS as well, even if it just calls the same underlying function to do the same thing, if only for consistency (with the idea "UFFD events are resolved by UFFD ioctls") from the user's perspective.
I don't worry too much on "consistency", but I'm trying to understand whether it's more beneficial to combine it with uffd or being generic.
One thing I was thinking is if I have a library that manages some memory for the user, the library can use such madvise()/... to poison specific small pages (without registering uffd with sigbus mode, also no lose on page faults of other normal pages) so when illegal access it can trap it for current mm rather than silently happen (e.g. use after free). Unpoison is also easy there, we can simply DONTNEED it.
One defect of such general solution for your case is we need one more UFFDIO_WAKE, but since we're talking about real poisoned pages on src, so I guess it's not a concern (unlike most of the rest ioctls).
I've no strong opinion if you still want to do that with an userfault ioctl. After all, I can't provide a solid example but just some rough ideas. But I hope I explained why I think it's still different from other ioctls (e.g., an "atomic update a page" operation doesn't sound reasonable at all as generic operation for any !uffd context, so that definitely suites more as an uffd specific ioctl).
If with uffd, perhaps avoid calling it sigbus? As we have FEATURE_SIGBUS and I'm afraid it'll cause confusion. UFFDIO_HWPOISON may sound more suitable?
Thanks,
On 18.05.23 22:38, Axel Rasmussen wrote:
On Thu, May 18, 2023 at 9:05 AM Peter Xu peterx@redhat.com wrote:
On Wed, May 17, 2023 at 05:43:53PM -0700, Jiaqi Yan wrote:
On Wed, May 17, 2023 at 3:29 PM Axel Rasmussen axelrasmussen@google.com wrote:
On Wed, May 17, 2023 at 3:20 PM Peter Xu peterx@redhat.com wrote:
On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote: > On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen > axelrasmussen@google.com wrote: >> >> So the basic way to use this new feature is: >> >> - On the new host, the guest's memory is registered with userfaultfd, in >> either MISSING or MINOR mode (doesn't really matter for this purpose). >> - On any first access, we get a userfaultfd event. At this point we can >> communicate with the old host to find out if the page was poisoned. >> - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker >> so any future accesses will SIGBUS. Because the pte is now "present", >> future accesses won't generate more userfaultfd events, they'll just >> SIGBUS directly. > > I want to clarify the SIGBUS mechanism here when KVM is involved, > keeping in mind that we need to be able to inject an MCE into the > guest for this to be useful. > > 1. vCPU gets an EPT violation --> KVM attempts GUP. > 2. GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS. > 3. KVM finds that GUP failed and returns -EFAULT. > > This is different than if GUP found poison, in which case KVM will > actually queue up a SIGBUS *containing the address of the fault*, and > userspace can use it to inject an appropriate MCE into the guest. With > UFFDIO_SIGBUS, we are missing the address! > > I see three options: > 1. Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think > this is pointless. > 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a > UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS > instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated > accesses, just like how we get repeated signals for real poison. > 3. Use this in conjunction with the additional KVM EFAULT info that > Anish proposed (the first part of [1]). > > I think option 3 is fine. :)
Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)
I just remember Axel mentioned this in the commit message, and just in case this is why option 4) was ruled out:
They expect that once poisoned, pages can never become "un-poisoned". So, when we live migrate the VM, we need to preserve the poisoned status of these pages.
Just to supplement on this point: we do have unpoison (echoing to "debug/hwpoison/hwpoison_unpoison"), or am I wrong?
If I read unpoison_memory() correctly, once there is a real hardware memory corruption (hw_memory_failure will be set), unpoison will stop working and return EOPNOTSUPP.
I know some cloud providers evacuating VMs once a single memory error happens, so not supporting unpoison is probably not a big deal for them. BUT others do keep VM running until more errors show up later, which could be long after the 1st error.
We're talking about postcopy migrating a VM has poisoned page on src, rather than on dst host, am I right? IOW, the dest hwpoison should be fake.
If so, then I would assume that's the case where all the pages on the dest host is still all good (so hw_memory_failure not yet set, or I doubt the judgement of being a migration target after all)?
The other thing is even if dest host has hw poisoned page, I'm not sure whether hw_memory_failure is the only way to solve this.
I saw that this is something got worked on before from Zhenwei, David used to have some reasoning on why it was suggested like using a global knob:
https://lore.kernel.org/all/d7927214-e433-c26d-7a9c-a291ced81887@redhat.com/
Two major issues here afaics:
Zhenwei's approach only considered x86 hwpoison - it relies on kpte having !present in entries but that's x86 specific rather than generic to memory_failure.c.
It is _assumed_ that hwpoison injection is for debugging only.
I'm not sure whether you can fix 1) by some other ways, e.g., what if the host just remember all the hardware poisoned pfns (or remember soft-poisoned ones, but then here we need to be careful on removing them from the list when it's hwpoisoned for real)? It sounds like there's opportunity on providing a generic solution rather than relying on !pte_present().
For 2) IMHO that's not a big issue, you can declare it'll be used in !debug but production systems so as to boost the feature importance with a real use case.
So far I'd say it'll be great to leverage what it's already there in linux and make it as generic as possible. The only issue is probably CAP_ADMIN... not sure whether we can have some way to provide !ADMIN somehow, or you can simply work around this issue.
As you mention below I think the key distinction is the scope - I think MADV_HWPOISON affects the whole system, including other processes.
For our purposes, we really just want to "poison" this particular virtual address (the HVA, from the VM's perspective), not even other mappings of the same shared memory. I think that behavior is different from MADV_HWPOISON, at least.
MADV_HWPOISON really is the wrong interface to use. See "man madvise".
We don't want to allow arbitrary users to hwpoison+offline absolutely healthy physical memory, which is what MADV_HWPOISON is all about.
As you say, we want to turn an unpopulated (!present) virtual address to mimic like we had a MCE on a page that would have been previously mapped here: install a hwpoison marker without actually poisoning any present page. In fact, we'd even want to fail if there *is* something mapped.
Sure, one could teach MADV_HWPOISON to allow unprivileged users to do that for !present PTE entries, and fail for unprivileged users if there is a present PTE entry. I'm not sure if that's the cleanest approach, though, and a new MADV as suggested in this thread would eventually be cleaner.
On Fri, May 19, 2023 at 1:38 AM David Hildenbrand david@redhat.com wrote:
On 18.05.23 22:38, Axel Rasmussen wrote:
On Thu, May 18, 2023 at 9:05 AM Peter Xu peterx@redhat.com wrote:
On Wed, May 17, 2023 at 05:43:53PM -0700, Jiaqi Yan wrote:
On Wed, May 17, 2023 at 3:29 PM Axel Rasmussen axelrasmussen@google.com wrote:
On Wed, May 17, 2023 at 3:20 PM Peter Xu peterx@redhat.com wrote:
On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote: > On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote: >> On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen >> axelrasmussen@google.com wrote: >>> >>> So the basic way to use this new feature is: >>> >>> - On the new host, the guest's memory is registered with userfaultfd, in >>> either MISSING or MINOR mode (doesn't really matter for this purpose). >>> - On any first access, we get a userfaultfd event. At this point we can >>> communicate with the old host to find out if the page was poisoned. >>> - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker >>> so any future accesses will SIGBUS. Because the pte is now "present", >>> future accesses won't generate more userfaultfd events, they'll just >>> SIGBUS directly. >> >> I want to clarify the SIGBUS mechanism here when KVM is involved, >> keeping in mind that we need to be able to inject an MCE into the >> guest for this to be useful. >> >> 1. vCPU gets an EPT violation --> KVM attempts GUP. >> 2. GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS. >> 3. KVM finds that GUP failed and returns -EFAULT. >> >> This is different than if GUP found poison, in which case KVM will >> actually queue up a SIGBUS *containing the address of the fault*, and >> userspace can use it to inject an appropriate MCE into the guest. With >> UFFDIO_SIGBUS, we are missing the address! >> >> I see three options: >> 1. Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think >> this is pointless. >> 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a >> UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS >> instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated >> accesses, just like how we get repeated signals for real poison. >> 3. Use this in conjunction with the additional KVM EFAULT info that >> Anish proposed (the first part of [1]). >> >> I think option 3 is fine. :) > > Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)
I just remember Axel mentioned this in the commit message, and just in case this is why option 4) was ruled out:
They expect that once poisoned, pages can never become "un-poisoned". So, when we live migrate the VM, we need to preserve the poisoned status of these pages.
Just to supplement on this point: we do have unpoison (echoing to "debug/hwpoison/hwpoison_unpoison"), or am I wrong?
If I read unpoison_memory() correctly, once there is a real hardware memory corruption (hw_memory_failure will be set), unpoison will stop working and return EOPNOTSUPP.
I know some cloud providers evacuating VMs once a single memory error happens, so not supporting unpoison is probably not a big deal for them. BUT others do keep VM running until more errors show up later, which could be long after the 1st error.
We're talking about postcopy migrating a VM has poisoned page on src, rather than on dst host, am I right? IOW, the dest hwpoison should be fake.
Yes, for this we are on the same page. The scenario I want to describe is...
If so, then I would assume that's the case where all the pages on the dest host is still all good (so hw_memory_failure not yet set, or I doubt the
...target VM can get hw error anytime: before precopy (if cloud provider is not carefully monitoring the machine health), during precopy from src to target, during src blackout, during postcopy, after migration done, and keep running on host. Both MADV_HWPOISON[1] and hwpoison-inject[2] are subject to hw_memory_failure, so they just seems unreliable to me: if target is in memory error trouble before or in early phase of migration, we lose the unpoison feature in kernel.
[1] https://github.com/torvalds/linux/blob/2d1bcbc6cd703e64caf8df314e3669b4786e0... [2] https://github.com/torvalds/linux/blob/2d1bcbc6cd703e64caf8df314e3669b4786e0...
judgement of being a migration target after all)?
The other thing is even if dest host has hw poisoned page, I'm not sure whether hw_memory_failure is the only way to solve this.
I saw that this is something got worked on before from Zhenwei, David used to have some reasoning on why it was suggested like using a global knob:
https://lore.kernel.org/all/d7927214-e433-c26d-7a9c-a291ced81887@redhat.com/
Two major issues here afaics:
Zhenwei's approach only considered x86 hwpoison - it relies on kpte having !present in entries but that's x86 specific rather than generic to memory_failure.c.
It is _assumed_ that hwpoison injection is for debugging only.
I'm not sure whether you can fix 1) by some other ways, e.g., what if the host just remember all the hardware poisoned pfns (or remember soft-poisoned ones, but then here we need to be careful on removing them from the list when it's hwpoisoned for real)? It sounds like there's opportunity on providing a generic solution rather than relying on !pte_present().
For 2) IMHO that's not a big issue, you can declare it'll be used in !debug but production systems so as to boost the feature importance with a real use case.
So far I'd say it'll be great to leverage what it's already there in linux and make it as generic as possible. The only issue is probably CAP_ADMIN... not sure whether we can have some way to provide !ADMIN somehow, or you can simply work around this issue.
I don't think CAP_ADMIN is something we can work around: a VMM must be a good citizen to avoid introducing any vulnerability to the host or guest.
On the other hand, "Userfaults allow the implementation of on-demand paging from userland and more generally they allow userland to take control of various memory page faults, something otherwise only the kernel code could do." [3]. I am not familiar with the UFFD internals, but our use case seems to match what UFFD wants to provide: without affecting the whole world, give a specific userspace (without CAP_ADMIN) the ability to handle page faults (indirectly emulate a HWPOISON page (in my mind I treat it as SetHWPOISON(page) + TestHWPOISON(page) operation in kernel's PF code)). So is it fair to say what Axel provided here is "provide !ADMIN somehow"?
[3]https://docs.kernel.org/admin-guide/mm/userfaultfd.html
As you mention below I think the key distinction is the scope - I think MADV_HWPOISON affects the whole system, including other processes.
For our purposes, we really just want to "poison" this particular virtual address (the HVA, from the VM's perspective), not even other mappings of the same shared memory. I think that behavior is different from MADV_HWPOISON, at least.
MADV_HWPOISON really is the wrong interface to use. See "man madvise".
We don't want to allow arbitrary users to hwpoison+offline absolutely healthy physical memory, which is what MADV_HWPOISON is all about.
As you say, we want to turn an unpopulated (!present) virtual address to mimic like we had a MCE on a page that would have been previously mapped here: install a hwpoison marker without actually poisoning any present page. In fact, we'd even want to fail if there *is* something mapped.
Sure, one could teach MADV_HWPOISON to allow unprivileged users to do that for !present PTE entries, and fail for unprivileged users if there is a present PTE entry. I'm not sure if that's the cleanest approach, though, and a new MADV as suggested in this thread would eventually be cleaner.
-- Thanks,
David / dhildenb
Hi, Jiaqi,
On Fri, May 19, 2023 at 08:04:09AM -0700, Jiaqi Yan wrote:
I don't think CAP_ADMIN is something we can work around: a VMM must be a good citizen to avoid introducing any vulnerability to the host or guest.
On the other hand, "Userfaults allow the implementation of on-demand paging from userland and more generally they allow userland to take control of various memory page faults, something otherwise only the kernel code could do." [3]. I am not familiar with the UFFD internals, but our use case seems to match what UFFD wants to provide: without affecting the whole world, give a specific userspace (without CAP_ADMIN) the ability to handle page faults (indirectly emulate a HWPOISON page (in my mind I treat it as SetHWPOISON(page) + TestHWPOISON(page) operation in kernel's PF code)). So is it fair to say what Axel provided here is "provide !ADMIN somehow"?
Userfault keywords on "user", IMHO. We don't strictly need userfault to resolve anything regarding CAP_ADMIN problems. MADV_DONTNEED also dosn't need CAP_ADMIN, same to any new madvise() if we want to make it useful for injecting poisoned ptes with !ADMIN and limit it within current->mm.
But I think you're right that userfaultfd always tried to avoid having ADMIN and keep everything within its own scope of permissions.
So again, totally no objection on make it uffd specific for now if you guys are all happy with it, but just to be clear that it's (to me) mostly for avoiding another WAKE, and afaics that's not really for solving the ADMIN issue here.
Thanks,
On Fri, May 19, 2023 at 9:20 AM Peter Xu peterx@redhat.com wrote:
Hi, Jiaqi,
On Fri, May 19, 2023 at 08:04:09AM -0700, Jiaqi Yan wrote:
I don't think CAP_ADMIN is something we can work around: a VMM must be a good citizen to avoid introducing any vulnerability to the host or guest.
On the other hand, "Userfaults allow the implementation of on-demand paging from userland and more generally they allow userland to take control of various memory page faults, something otherwise only the kernel code could do." [3]. I am not familiar with the UFFD internals, but our use case seems to match what UFFD wants to provide: without affecting the whole world, give a specific userspace (without CAP_ADMIN) the ability to handle page faults (indirectly emulate a HWPOISON page (in my mind I treat it as SetHWPOISON(page) + TestHWPOISON(page) operation in kernel's PF code)). So is it fair to say what Axel provided here is "provide !ADMIN somehow"?
Userfault keywords on "user", IMHO. We don't strictly need userfault to resolve anything regarding CAP_ADMIN problems. MADV_DONTNEED also dosn't need CAP_ADMIN, same to any new madvise() if we want to make it useful for injecting poisoned ptes with !ADMIN and limit it within current->mm.
But I think you're right that userfaultfd always tried to avoid having ADMIN and keep everything within its own scope of permissions.
So again, totally no objection on make it uffd specific for now if you guys are all happy with it, but just to be clear that it's (to me) mostly for avoiding another WAKE, and afaics that's not really for solving the ADMIN issue here.
How about this plan:
Since the concrete use case we have (postcopy live migration) is UFFD-specific, let's leave it as a UFFDIO_* operation for now.
If in the future we come up with a non-UFFD use case, we can add a new MADV_* which does this operation at that point. From my perspective they could even share most of the same implementation code.
I don't think it's a big problem keeping the UFFDIO_* version too at that point, because it still provides some (perhaps small) value:
- Combines the operation + waking into one syscall - It allows us to support additional UFFD flags which modify / extend the operation in UFFD-specific ways, if we want to add those in the future
Seem reasonable?
If so, I'll send a v2 with documentation updates.
Thanks,
-- Peter Xu
On Fri, May 19, 2023 at 10:32:13AM -0700, Axel Rasmussen wrote:
On Fri, May 19, 2023 at 9:20 AM Peter Xu peterx@redhat.com wrote:
Hi, Jiaqi,
On Fri, May 19, 2023 at 08:04:09AM -0700, Jiaqi Yan wrote:
I don't think CAP_ADMIN is something we can work around: a VMM must be a good citizen to avoid introducing any vulnerability to the host or guest.
On the other hand, "Userfaults allow the implementation of on-demand paging from userland and more generally they allow userland to take control of various memory page faults, something otherwise only the kernel code could do." [3]. I am not familiar with the UFFD internals, but our use case seems to match what UFFD wants to provide: without affecting the whole world, give a specific userspace (without CAP_ADMIN) the ability to handle page faults (indirectly emulate a HWPOISON page (in my mind I treat it as SetHWPOISON(page) + TestHWPOISON(page) operation in kernel's PF code)). So is it fair to say what Axel provided here is "provide !ADMIN somehow"?
Userfault keywords on "user", IMHO. We don't strictly need userfault to resolve anything regarding CAP_ADMIN problems. MADV_DONTNEED also dosn't need CAP_ADMIN, same to any new madvise() if we want to make it useful for injecting poisoned ptes with !ADMIN and limit it within current->mm.
But I think you're right that userfaultfd always tried to avoid having ADMIN and keep everything within its own scope of permissions.
So again, totally no objection on make it uffd specific for now if you guys are all happy with it, but just to be clear that it's (to me) mostly for avoiding another WAKE, and afaics that's not really for solving the ADMIN issue here.
How about this plan:
Since the concrete use case we have (postcopy live migration) is UFFD-specific, let's leave it as a UFFDIO_* operation for now.
If in the future we come up with a non-UFFD use case, we can add a new MADV_* which does this operation at that point. From my perspective they could even share most of the same implementation code.
I don't think it's a big problem keeping the UFFDIO_* version too at that point, because it still provides some (perhaps small) value:
- Combines the operation + waking into one syscall
- It allows us to support additional UFFD flags which modify / extend
the operation in UFFD-specific ways, if we want to add those in the future
Seem reasonable?
Ok here.
If so, I'll send a v2 with documentation updates.
I've reviewed v1 in this case, please have a look, thanks.
On Thu, May 11, 2023 at 11:24:24AM -0700, Axel Rasmussen wrote:
The basic idea here is to "simulate" memory poisoning for VMs. A VM running on some host might encounter a memory error, after which some page(s) are poisoned (i.e., future accesses SIGBUS). They expect that once poisoned, pages can never become "un-poisoned". So, when we live migrate the VM, we need to preserve the poisoned status of these pages.
When live migrating, we try to get the guest running on its new host as quickly as possible. So, we start it running before all memory has been copied, and before we're certain which pages should be poisoned or not.
So the basic way to use this new feature is:
- On the new host, the guest's memory is registered with userfaultfd, in either MISSING or MINOR mode (doesn't really matter for this purpose).
- On any first access, we get a userfaultfd event. At this point we can communicate with the old host to find out if the page was poisoned.
- If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
[as used to suggest..] maybe UFFDIO_POISON sounds better.
so any future accesses will SIGBUS. Because the pte is now "present", future accesses won't generate more userfaultfd events, they'll just SIGBUS directly.
UFFDIO_SIGBUS does not handle unmapping previously-present PTEs. This isn't needed, because during live migration we want to intercept all accesses with userfaultfd (not just writes, so WP mode isn't useful for this). So whether minor or missing mode is being used (or both), the PTE won't be present in any case, so handling that case isn't needed.
Signed-off-by: Axel Rasmussen axelrasmussen@google.com
fs/userfaultfd.c | 63 ++++++++++++++++++++++++++++++++ include/linux/swapops.h | 3 +- include/linux/userfaultfd_k.h | 4 ++ include/uapi/linux/userfaultfd.h | 25 +++++++++++-- mm/memory.c | 4 ++ mm/userfaultfd.c | 62 ++++++++++++++++++++++++++++++- 6 files changed, 156 insertions(+), 5 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 0fd96d6e39ce..edc2928dae2b 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1966,6 +1966,66 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg) return ret; } +static inline int userfaultfd_sigbus(struct userfaultfd_ctx *ctx, unsigned long arg) +{
- __s64 ret;
- struct uffdio_sigbus uffdio_sigbus;
- struct uffdio_sigbus __user *user_uffdio_sigbus;
- struct userfaultfd_wake_range range;
- user_uffdio_sigbus = (struct uffdio_sigbus __user *)arg;
- ret = -EAGAIN;
- if (atomic_read(&ctx->mmap_changing))
goto out;
- ret = -EFAULT;
- if (copy_from_user(&uffdio_sigbus, user_uffdio_sigbus,
/* don't copy the output fields */
sizeof(uffdio_sigbus) - (sizeof(__s64))))
goto out;
- ret = validate_range(ctx->mm, uffdio_sigbus.range.start,
uffdio_sigbus.range.len);
- if (ret)
goto out;
- ret = -EINVAL;
- /* double check for wraparound just in case. */
- if (uffdio_sigbus.range.start + uffdio_sigbus.range.len <=
uffdio_sigbus.range.start) {
goto out;
- }
- if (uffdio_sigbus.mode & ~UFFDIO_SIGBUS_MODE_DONTWAKE)
goto out;
- if (mmget_not_zero(ctx->mm)) {
ret = mfill_atomic_sigbus(ctx->mm, uffdio_sigbus.range.start,
uffdio_sigbus.range.len,
&ctx->mmap_changing, 0);
mmput(ctx->mm);
- } else {
return -ESRCH;
- }
- if (unlikely(put_user(ret, &user_uffdio_sigbus->updated)))
return -EFAULT;
- if (ret < 0)
goto out;
- /* len == 0 would wake all */
- BUG_ON(!ret);
- range.len = ret;
- if (!(uffdio_sigbus.mode & UFFDIO_SIGBUS_MODE_DONTWAKE)) {
range.start = uffdio_sigbus.range.start;
wake_userfault(ctx, &range);
- }
- ret = range.len == uffdio_sigbus.range.len ? 0 : -EAGAIN;
+out:
- return ret;
+}
static inline unsigned int uffd_ctx_features(__u64 user_features) { /* @@ -2067,6 +2127,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd, case UFFDIO_CONTINUE: ret = userfaultfd_continue(ctx, arg); break;
- case UFFDIO_SIGBUS:
ret = userfaultfd_sigbus(ctx, arg);
} return ret;break;
} diff --git a/include/linux/swapops.h b/include/linux/swapops.h index 3a451b7afcb3..fa778a0ae730 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -405,7 +405,8 @@ typedef unsigned long pte_marker; #define PTE_MARKER_UFFD_WP BIT(0) #define PTE_MARKER_SWAPIN_ERROR BIT(1) -#define PTE_MARKER_MASK (BIT(2) - 1) +#define PTE_MARKER_UFFD_SIGBUS BIT(2) +#define PTE_MARKER_MASK (BIT(3) - 1)
[as used to suggest..] I'd consider reusing SWAPIN_ERROR directly.
Actually.. I think maybe we should have 1 patch changing SWAPIN_ERROR from VM_FAULT_SIGBUS to VM_FAULT_HWPOISON.
Let's imagine a VM having anonymous page backing and got a swapin error when faulted on one of the guest page. Instead of crashing the hypervisor with sigbus we should probably make it a MCE injected into the guest too, because there's no page corrupt in bare metal in this specific case, however to the guest it's the same as having one page corrupted just like a real MCE.
static inline swp_entry_t make_pte_marker_entry(pte_marker marker) { diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index d78b01524349..6de1084939c5 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -46,6 +46,7 @@ enum mfill_atomic_mode { MFILL_ATOMIC_COPY, MFILL_ATOMIC_ZEROPAGE, MFILL_ATOMIC_CONTINUE,
- MFILL_ATOMIC_SIGBUS, NR_MFILL_ATOMIC_MODES,
}; @@ -83,6 +84,9 @@ extern ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm, extern ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long dst_start, unsigned long len, atomic_t *mmap_changing, uffd_flags_t flags); +extern ssize_t mfill_atomic_sigbus(struct mm_struct *dst_mm, unsigned long start,
unsigned long len, atomic_t *mmap_changing,
uffd_flags_t flags);
extern int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start, unsigned long len, bool enable_wp, atomic_t *mmap_changing); diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h index 66dd4cd277bd..616e33d3db97 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -39,7 +39,8 @@ UFFD_FEATURE_MINOR_SHMEM | \ UFFD_FEATURE_EXACT_ADDRESS | \ UFFD_FEATURE_WP_HUGETLBFS_SHMEM | \
UFFD_FEATURE_WP_UNPOPULATED)
UFFD_FEATURE_WP_UNPOPULATED | \
UFFD_FEATURE_SIGBUS_IOCTL)
#define UFFD_API_IOCTLS \ ((__u64)1 << _UFFDIO_REGISTER | \ (__u64)1 << _UFFDIO_UNREGISTER | \ @@ -49,12 +50,14 @@ (__u64)1 << _UFFDIO_COPY | \ (__u64)1 << _UFFDIO_ZEROPAGE | \ (__u64)1 << _UFFDIO_WRITEPROTECT | \
(__u64)1 << _UFFDIO_CONTINUE)
(__u64)1 << _UFFDIO_CONTINUE | \
(__u64)1 << _UFFDIO_SIGBUS)
#define UFFD_API_RANGE_IOCTLS_BASIC \ ((__u64)1 << _UFFDIO_WAKE | \ (__u64)1 << _UFFDIO_COPY | \
(__u64)1 << _UFFDIO_CONTINUE | \(__u64)1 << _UFFDIO_WRITEPROTECT | \
(__u64)1 << _UFFDIO_WRITEPROTECT)
(__u64)1 << _UFFDIO_SIGBUS)
/*
- Valid ioctl command number range with this API is from 0x00 to
@@ -71,6 +74,7 @@ #define _UFFDIO_ZEROPAGE (0x04) #define _UFFDIO_WRITEPROTECT (0x06) #define _UFFDIO_CONTINUE (0x07) +#define _UFFDIO_SIGBUS (0x08) #define _UFFDIO_API (0x3F) /* userfaultfd ioctl ids */ @@ -91,6 +95,8 @@ struct uffdio_writeprotect) #define UFFDIO_CONTINUE _IOWR(UFFDIO, _UFFDIO_CONTINUE, \ struct uffdio_continue) +#define UFFDIO_SIGBUS _IOWR(UFFDIO, _UFFDIO_SIGBUS, \
struct uffdio_sigbus)
/* read() structure */ struct uffd_msg { @@ -225,6 +231,7 @@ struct uffdio_api { #define UFFD_FEATURE_EXACT_ADDRESS (1<<11) #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM (1<<12) #define UFFD_FEATURE_WP_UNPOPULATED (1<<13) +#define UFFD_FEATURE_SIGBUS_IOCTL (1<<14) __u64 features; __u64 ioctls; @@ -321,6 +328,18 @@ struct uffdio_continue { __s64 mapped; }; +struct uffdio_sigbus {
- struct uffdio_range range;
+#define UFFDIO_SIGBUS_MODE_DONTWAKE ((__u64)1<<0)
- __u64 mode;
- /*
* Fields below here are written by the ioctl and must be at the end:
* the copy_from_user will not read past here.
*/
- __s64 updated;
+};
/*
- Flags for the userfaultfd(2) system call itself.
*/ diff --git a/mm/memory.c b/mm/memory.c index f69fbc251198..e4b4207c2590 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3675,6 +3675,10 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) if (WARN_ON_ONCE(!marker)) return VM_FAULT_SIGBUS;
- /* SIGBUS explicitly requested for this PTE. */
- if (marker & PTE_MARKER_UFFD_SIGBUS)
return VM_FAULT_SIGBUS;
- /* Higher priority than uffd-wp when data corrupted */ if (marker & PTE_MARKER_SWAPIN_ERROR) return VM_FAULT_SIGBUS;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index e97a0b4889fc..933587eebd5d 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -278,6 +278,51 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd, goto out; } +/* Handles UFFDIO_SIGBUS for all non-hugetlb VMAs. */ +static int mfill_atomic_pte_sigbus(pmd_t *dst_pmd,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
uffd_flags_t flags)
+{
- int ret;
- struct mm_struct *dst_mm = dst_vma->vm_mm;
- pte_t _dst_pte, *dst_pte;
- spinlock_t *ptl;
- _dst_pte = make_pte_marker(PTE_MARKER_UFFD_SIGBUS);
- dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
- if (vma_is_shmem(dst_vma)) {
struct inode *inode;
pgoff_t offset, max_off;
/* serialize against truncate with the page table lock */
inode = dst_vma->vm_file->f_inode;
offset = linear_page_index(dst_vma, dst_addr);
max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
ret = -EFAULT;
if (unlikely(offset >= max_off))
goto out_unlock;
- }
- ret = -EEXIST;
- /*
* For now, we don't handle unmapping pages, so only support filling in
* none PTEs, or replacing PTE markers.
*/
- if (!pte_none_mostly(*dst_pte))
goto out_unlock;
- set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
- /* No need to invalidate - it was non-present before */
- update_mmu_cache(dst_vma, dst_addr, dst_pte);
- ret = 0;
+out_unlock:
- pte_unmap_unlock(dst_pte, ptl);
- return ret;
+}
static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address) { pgd_t *pgd; @@ -328,8 +373,12 @@ static __always_inline ssize_t mfill_atomic_hugetlb( * supported by hugetlb. A PMD_SIZE huge pages may exist as used * by THP. Since we can not reliably insert a zero page, this * feature is not supported.
*
* PTE marker handling for hugetlb is a bit special, so for now
* UFFDIO_SIGBUS is not supported.
Can you be more specific on this?
What's the plan when HGM will be merged? Is it possible that all memory just support this always so we only need 1 feature flag?
*/
- if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) {
- if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE) ||
mmap_read_unlock(dst_mm); return -EINVAL; }uffd_flags_mode_is(flags, MFILL_ATOMIC_SIGBUS)) {
@@ -473,6 +522,9 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd, if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) { return mfill_atomic_pte_continue(dst_pmd, dst_vma, dst_addr, flags);
- } else if (uffd_flags_mode_is(flags, MFILL_ATOMIC_SIGBUS)) {
return mfill_atomic_pte_sigbus(dst_pmd, dst_vma,
}dst_addr, flags);
/* @@ -694,6 +746,14 @@ ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long start, uffd_flags_set_mode(flags, MFILL_ATOMIC_CONTINUE)); } +ssize_t mfill_atomic_sigbus(struct mm_struct *dst_mm, unsigned long start,
unsigned long len, atomic_t *mmap_changing,
uffd_flags_t flags)
+{
- return mfill_atomic(dst_mm, start, 0, len, mmap_changing,
uffd_flags_set_mode(flags, MFILL_ATOMIC_SIGBUS));
+}
long uffd_wp_range(struct vm_area_struct *dst_vma, unsigned long start, unsigned long len, bool enable_wp) { -- 2.40.1.606.ga4b1b128d6-goog
On Tue, May 23, 2023 at 10:26 AM Peter Xu peterx@redhat.com wrote:
On Thu, May 11, 2023 at 11:24:24AM -0700, Axel Rasmussen wrote:
The basic idea here is to "simulate" memory poisoning for VMs. A VM running on some host might encounter a memory error, after which some page(s) are poisoned (i.e., future accesses SIGBUS). They expect that once poisoned, pages can never become "un-poisoned". So, when we live migrate the VM, we need to preserve the poisoned status of these pages.
When live migrating, we try to get the guest running on its new host as quickly as possible. So, we start it running before all memory has been copied, and before we're certain which pages should be poisoned or not.
So the basic way to use this new feature is:
- On the new host, the guest's memory is registered with userfaultfd, in either MISSING or MINOR mode (doesn't really matter for this purpose).
- On any first access, we get a userfaultfd event. At this point we can communicate with the old host to find out if the page was poisoned.
- If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
[as used to suggest..] maybe UFFDIO_POISON sounds better.
so any future accesses will SIGBUS. Because the pte is now "present", future accesses won't generate more userfaultfd events, they'll just SIGBUS directly.
UFFDIO_SIGBUS does not handle unmapping previously-present PTEs. This isn't needed, because during live migration we want to intercept all accesses with userfaultfd (not just writes, so WP mode isn't useful for this). So whether minor or missing mode is being used (or both), the PTE won't be present in any case, so handling that case isn't needed.
Signed-off-by: Axel Rasmussen axelrasmussen@google.com
fs/userfaultfd.c | 63 ++++++++++++++++++++++++++++++++ include/linux/swapops.h | 3 +- include/linux/userfaultfd_k.h | 4 ++ include/uapi/linux/userfaultfd.h | 25 +++++++++++-- mm/memory.c | 4 ++ mm/userfaultfd.c | 62 ++++++++++++++++++++++++++++++- 6 files changed, 156 insertions(+), 5 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 0fd96d6e39ce..edc2928dae2b 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1966,6 +1966,66 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg) return ret; }
+static inline int userfaultfd_sigbus(struct userfaultfd_ctx *ctx, unsigned long arg) +{
__s64 ret;
struct uffdio_sigbus uffdio_sigbus;
struct uffdio_sigbus __user *user_uffdio_sigbus;
struct userfaultfd_wake_range range;
user_uffdio_sigbus = (struct uffdio_sigbus __user *)arg;
ret = -EAGAIN;
if (atomic_read(&ctx->mmap_changing))
goto out;
ret = -EFAULT;
if (copy_from_user(&uffdio_sigbus, user_uffdio_sigbus,
/* don't copy the output fields */
sizeof(uffdio_sigbus) - (sizeof(__s64))))
goto out;
ret = validate_range(ctx->mm, uffdio_sigbus.range.start,
uffdio_sigbus.range.len);
if (ret)
goto out;
ret = -EINVAL;
/* double check for wraparound just in case. */
if (uffdio_sigbus.range.start + uffdio_sigbus.range.len <=
uffdio_sigbus.range.start) {
goto out;
}
if (uffdio_sigbus.mode & ~UFFDIO_SIGBUS_MODE_DONTWAKE)
goto out;
if (mmget_not_zero(ctx->mm)) {
ret = mfill_atomic_sigbus(ctx->mm, uffdio_sigbus.range.start,
uffdio_sigbus.range.len,
&ctx->mmap_changing, 0);
mmput(ctx->mm);
} else {
return -ESRCH;
}
if (unlikely(put_user(ret, &user_uffdio_sigbus->updated)))
return -EFAULT;
if (ret < 0)
goto out;
/* len == 0 would wake all */
BUG_ON(!ret);
range.len = ret;
if (!(uffdio_sigbus.mode & UFFDIO_SIGBUS_MODE_DONTWAKE)) {
range.start = uffdio_sigbus.range.start;
wake_userfault(ctx, &range);
}
ret = range.len == uffdio_sigbus.range.len ? 0 : -EAGAIN;
+out:
return ret;
+}
static inline unsigned int uffd_ctx_features(__u64 user_features) { /* @@ -2067,6 +2127,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd, case UFFDIO_CONTINUE: ret = userfaultfd_continue(ctx, arg); break;
case UFFDIO_SIGBUS:
ret = userfaultfd_sigbus(ctx, arg);
break; } return ret;
} diff --git a/include/linux/swapops.h b/include/linux/swapops.h index 3a451b7afcb3..fa778a0ae730 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -405,7 +405,8 @@ typedef unsigned long pte_marker;
#define PTE_MARKER_UFFD_WP BIT(0) #define PTE_MARKER_SWAPIN_ERROR BIT(1) -#define PTE_MARKER_MASK (BIT(2) - 1) +#define PTE_MARKER_UFFD_SIGBUS BIT(2) +#define PTE_MARKER_MASK (BIT(3) - 1)
[as used to suggest..] I'd consider reusing SWAPIN_ERROR directly.
Actually.. I think maybe we should have 1 patch changing SWAPIN_ERROR from VM_FAULT_SIGBUS to VM_FAULT_HWPOISON.
Let's imagine a VM having anonymous page backing and got a swapin error when faulted on one of the guest page. Instead of crashing the hypervisor with sigbus we should probably make it a MCE injected into the guest too, because there's no page corrupt in bare metal in this specific case, however to the guest it's the same as having one page corrupted just like a real MCE.
This is a great idea, you're right that injecting an MCE into the guest is exactly the end goal, and it seems like VM_FAULT_HWPOISON will "just work". Also the name UFFDIO_POISON resolves any confusion with UFFD_FEATURE_SIGBUS, so that's a nice side benefit.
I'll make this change in v2, thanks for the idea Peter!
static inline swp_entry_t make_pte_marker_entry(pte_marker marker) { diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index d78b01524349..6de1084939c5 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -46,6 +46,7 @@ enum mfill_atomic_mode { MFILL_ATOMIC_COPY, MFILL_ATOMIC_ZEROPAGE, MFILL_ATOMIC_CONTINUE,
MFILL_ATOMIC_SIGBUS, NR_MFILL_ATOMIC_MODES,
};
@@ -83,6 +84,9 @@ extern ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm, extern ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long dst_start, unsigned long len, atomic_t *mmap_changing, uffd_flags_t flags); +extern ssize_t mfill_atomic_sigbus(struct mm_struct *dst_mm, unsigned long start,
unsigned long len, atomic_t *mmap_changing,
uffd_flags_t flags);
extern int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start, unsigned long len, bool enable_wp, atomic_t *mmap_changing); diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h index 66dd4cd277bd..616e33d3db97 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -39,7 +39,8 @@ UFFD_FEATURE_MINOR_SHMEM | \ UFFD_FEATURE_EXACT_ADDRESS | \ UFFD_FEATURE_WP_HUGETLBFS_SHMEM | \
UFFD_FEATURE_WP_UNPOPULATED)
UFFD_FEATURE_WP_UNPOPULATED | \
UFFD_FEATURE_SIGBUS_IOCTL)
#define UFFD_API_IOCTLS \ ((__u64)1 << _UFFDIO_REGISTER | \ (__u64)1 << _UFFDIO_UNREGISTER | \ @@ -49,12 +50,14 @@ (__u64)1 << _UFFDIO_COPY | \ (__u64)1 << _UFFDIO_ZEROPAGE | \ (__u64)1 << _UFFDIO_WRITEPROTECT | \
(__u64)1 << _UFFDIO_CONTINUE)
(__u64)1 << _UFFDIO_CONTINUE | \
(__u64)1 << _UFFDIO_SIGBUS)
#define UFFD_API_RANGE_IOCTLS_BASIC \ ((__u64)1 << _UFFDIO_WAKE | \ (__u64)1 << _UFFDIO_COPY | \
(__u64)1 << _UFFDIO_WRITEPROTECT | \ (__u64)1 << _UFFDIO_CONTINUE | \
(__u64)1 << _UFFDIO_WRITEPROTECT)
(__u64)1 << _UFFDIO_SIGBUS)
/*
- Valid ioctl command number range with this API is from 0x00 to
@@ -71,6 +74,7 @@ #define _UFFDIO_ZEROPAGE (0x04) #define _UFFDIO_WRITEPROTECT (0x06) #define _UFFDIO_CONTINUE (0x07) +#define _UFFDIO_SIGBUS (0x08) #define _UFFDIO_API (0x3F)
/* userfaultfd ioctl ids */ @@ -91,6 +95,8 @@ struct uffdio_writeprotect) #define UFFDIO_CONTINUE _IOWR(UFFDIO, _UFFDIO_CONTINUE, \ struct uffdio_continue) +#define UFFDIO_SIGBUS _IOWR(UFFDIO, _UFFDIO_SIGBUS, \
struct uffdio_sigbus)
/* read() structure */ struct uffd_msg { @@ -225,6 +231,7 @@ struct uffdio_api { #define UFFD_FEATURE_EXACT_ADDRESS (1<<11) #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM (1<<12) #define UFFD_FEATURE_WP_UNPOPULATED (1<<13) +#define UFFD_FEATURE_SIGBUS_IOCTL (1<<14) __u64 features;
__u64 ioctls;
@@ -321,6 +328,18 @@ struct uffdio_continue { __s64 mapped; };
+struct uffdio_sigbus {
struct uffdio_range range;
+#define UFFDIO_SIGBUS_MODE_DONTWAKE ((__u64)1<<0)
__u64 mode;
/*
* Fields below here are written by the ioctl and must be at the end:
* the copy_from_user will not read past here.
*/
__s64 updated;
+};
/*
- Flags for the userfaultfd(2) system call itself.
*/ diff --git a/mm/memory.c b/mm/memory.c index f69fbc251198..e4b4207c2590 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3675,6 +3675,10 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) if (WARN_ON_ONCE(!marker)) return VM_FAULT_SIGBUS;
/* SIGBUS explicitly requested for this PTE. */
if (marker & PTE_MARKER_UFFD_SIGBUS)
return VM_FAULT_SIGBUS;
/* Higher priority than uffd-wp when data corrupted */ if (marker & PTE_MARKER_SWAPIN_ERROR) return VM_FAULT_SIGBUS;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index e97a0b4889fc..933587eebd5d 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -278,6 +278,51 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd, goto out; }
+/* Handles UFFDIO_SIGBUS for all non-hugetlb VMAs. */ +static int mfill_atomic_pte_sigbus(pmd_t *dst_pmd,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
uffd_flags_t flags)
+{
int ret;
struct mm_struct *dst_mm = dst_vma->vm_mm;
pte_t _dst_pte, *dst_pte;
spinlock_t *ptl;
_dst_pte = make_pte_marker(PTE_MARKER_UFFD_SIGBUS);
dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
if (vma_is_shmem(dst_vma)) {
struct inode *inode;
pgoff_t offset, max_off;
/* serialize against truncate with the page table lock */
inode = dst_vma->vm_file->f_inode;
offset = linear_page_index(dst_vma, dst_addr);
max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
ret = -EFAULT;
if (unlikely(offset >= max_off))
goto out_unlock;
}
ret = -EEXIST;
/*
* For now, we don't handle unmapping pages, so only support filling in
* none PTEs, or replacing PTE markers.
*/
if (!pte_none_mostly(*dst_pte))
goto out_unlock;
set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
/* No need to invalidate - it was non-present before */
update_mmu_cache(dst_vma, dst_addr, dst_pte);
ret = 0;
+out_unlock:
pte_unmap_unlock(dst_pte, ptl);
return ret;
+}
static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address) { pgd_t *pgd; @@ -328,8 +373,12 @@ static __always_inline ssize_t mfill_atomic_hugetlb( * supported by hugetlb. A PMD_SIZE huge pages may exist as used * by THP. Since we can not reliably insert a zero page, this * feature is not supported.
*
* PTE marker handling for hugetlb is a bit special, so for now
* UFFDIO_SIGBUS is not supported.
Can you be more specific on this?
What's the plan when HGM will be merged? Is it possible that all memory just support this always so we only need 1 feature flag?
We'll want hugetlbfs support for this operation too, but it's only really useful (at least for our use case) after HGM is merged. But, there's no strong reason not to just implement both all at once - I'll extend v2 to also work properly with hugetlbfs. Probably it isn't too hard, I just need to do a bit more reading of how swap markers are handled in hugetlbfs.
*/
if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) {
if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE) ||
uffd_flags_mode_is(flags, MFILL_ATOMIC_SIGBUS)) { mmap_read_unlock(dst_mm); return -EINVAL; }
@@ -473,6 +522,9 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd, if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) { return mfill_atomic_pte_continue(dst_pmd, dst_vma, dst_addr, flags);
} else if (uffd_flags_mode_is(flags, MFILL_ATOMIC_SIGBUS)) {
return mfill_atomic_pte_sigbus(dst_pmd, dst_vma,
dst_addr, flags); } /*
@@ -694,6 +746,14 @@ ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long start, uffd_flags_set_mode(flags, MFILL_ATOMIC_CONTINUE)); }
+ssize_t mfill_atomic_sigbus(struct mm_struct *dst_mm, unsigned long start,
unsigned long len, atomic_t *mmap_changing,
uffd_flags_t flags)
+{
return mfill_atomic(dst_mm, start, 0, len, mmap_changing,
uffd_flags_set_mode(flags, MFILL_ATOMIC_SIGBUS));
+}
long uffd_wp_range(struct vm_area_struct *dst_vma, unsigned long start, unsigned long len, bool enable_wp) { -- 2.40.1.606.ga4b1b128d6-goog
-- Peter Xu
On Tue, May 23, 2023 at 10:59:05AM -0700, Axel Rasmussen wrote:
Actually.. I think maybe we should have 1 patch changing SWAPIN_ERROR from VM_FAULT_SIGBUS to VM_FAULT_HWPOISON.
Let's imagine a VM having anonymous page backing and got a swapin error when faulted on one of the guest page. Instead of crashing the hypervisor with sigbus we should probably make it a MCE injected into the guest too, because there's no page corrupt in bare metal in this specific case, however to the guest it's the same as having one page corrupted just like a real MCE.
This is a great idea, you're right that injecting an MCE into the guest is exactly the end goal, and it seems like VM_FAULT_HWPOISON will "just work". Also the name UFFDIO_POISON resolves any confusion with UFFD_FEATURE_SIGBUS, so that's a nice side benefit.
Hopefully! Please double check it with KVM running altogether to make sure the patch works exactly as expected.
[...]
We'll want hugetlbfs support for this operation too, but it's only really useful (at least for our use case) after HGM is merged. But, there's no strong reason not to just implement both all at once - I'll extend v2 to also work properly with hugetlbfs. Probably it isn't too hard, I just need to do a bit more reading of how swap markers are handled in hugetlbfs.
Sure. We have too many flags separating different types of memory, so I think it'll be nice if when it can still trivially work for everything.
For this specific case, since your goal is definitely having hugetlb hgm supported so it'll be even more trickier if only hgm will be supported but not !hgm hugetlbs, so we'd better target it initially with all mem types.
Thanks,
linux-kselftest-mirror@lists.linaro.org