Changes since v1: - Rename memcpy_mcsafe() to copy_safe() since the x86-machine-check specifics have already been de-emphasized in a previous commit and are further de-emphasized by these changes. (Linus)
- Move copy_safe() out-of-line since it no longer reverts to plain memcpy (Linus)
- Move copy_safe() to its own stand-alone compilation unit where it no longer entangles with arch/x86/lib/memcpy_64.S. This also allows perf to stop tracking ongoing updates to that file due to copy_safe() updates. (Linus)
- Move the PowerPC implementation over to the new name.
[1]: http://lore.kernel.org/r/158654083112.1572482.8944305411228188871.stgit@dwil...
---
The primary motivation to go touch memcpy_mcsafe() is that the existing benefit of doing slow and careful copies is obviated on newer CPUs. That fact solves the problem of needing to detect machine-check recovery capability. Now the old "mcsafe_key" opt-in to careful copying can be made an opt-out from the default fast copy implementation.
The discussion with Linus further made clear that this facility had already lost its x86-machine-check specificity starting with commit 2c89130a56a ("x86/asm/memcpy_mcsafe: Add write-protection-fault handling"). The new changes to not require a "careful copy" further de-emphasizes the role that x86-MCA plays in the implementation to just one more source of recoverable trap during the operation.
With the above realizations the name "mcsafe" is no longer accurate and copy_safe() is proposed as its replacement. x86 grows a copy_safe_fast() implementation as a default implementation that is independent of detecting the presence of x86-MCA.
---
Dan Williams (2): copy_safe: Rename memcpy_mcsafe() to copy_safe() x86/copy_safe: Introduce copy_safe_fast()
arch/powerpc/Kconfig | 2 arch/powerpc/include/asm/string.h | 2 arch/powerpc/include/asm/uaccess.h | 4 arch/powerpc/lib/Makefile | 2 arch/powerpc/lib/copy_safe.S | 4 arch/x86/Kconfig | 2 arch/x86/Kconfig.debug | 2 arch/x86/include/asm/copy_safe.h | 18 ++ arch/x86/include/asm/copy_safe_test.h | 75 +++++++++ arch/x86/include/asm/mcsafe_test.h | 75 --------- arch/x86/include/asm/string_64.h | 32 ---- arch/x86/include/asm/uaccess_64.h | 21 --- arch/x86/kernel/cpu/mce/core.c | 9 - arch/x86/kernel/quirks.c | 10 - arch/x86/lib/Makefile | 1 arch/x86/lib/copy_safe.c | 66 ++++++++ arch/x86/lib/copy_safe_64.S | 163 ++++++++++++++++++++ arch/x86/lib/memcpy_64.S | 115 -------------- arch/x86/lib/usercopy_64.c | 21 --- drivers/md/dm-writecache.c | 12 + drivers/nvdimm/claim.c | 2 drivers/nvdimm/pmem.c | 6 - include/linux/string.h | 17 +- include/linux/uio.h | 10 + lib/Kconfig | 2 lib/iov_iter.c | 36 ++-- tools/arch/x86/include/asm/copy_safe_test.h | 13 ++ tools/arch/x86/include/asm/mcsafe_test.h | 13 -- tools/arch/x86/lib/memcpy_64.S | 115 -------------- tools/objtool/check.c | 5 - tools/perf/bench/Build | 1 tools/perf/bench/mem-memcpy-x86-64-lib.c | 24 --- tools/testing/nvdimm/test/nfit.c | 49 +++--- .../testing/selftests/powerpc/copyloops/.gitignore | 2 tools/testing/selftests/powerpc/copyloops/Makefile | 6 - .../selftests/powerpc/copyloops/copy_safe.S | 0 36 files changed, 429 insertions(+), 508 deletions(-) rename arch/powerpc/lib/{memcpy_mcsafe_64.S => copy_safe.S} (98%) create mode 100644 arch/x86/include/asm/copy_safe.h create mode 100644 arch/x86/include/asm/copy_safe_test.h delete mode 100644 arch/x86/include/asm/mcsafe_test.h create mode 100644 arch/x86/lib/copy_safe.c create mode 100644 arch/x86/lib/copy_safe_64.S create mode 100644 tools/arch/x86/include/asm/copy_safe_test.h delete mode 100644 tools/arch/x86/include/asm/mcsafe_test.h delete mode 100644 tools/perf/bench/mem-memcpy-x86-64-lib.c rename tools/testing/selftests/powerpc/copyloops/{memcpy_mcsafe_64.S => copy_safe.S} (100%)
base-commit: b8dcd632c06b8706d22934f9bf9bf16a42b1ecc7
The scope of memcpy_mcsafe() changed in commit 12c89130a56a ("x86/asm/memcpy_mcsafe: Add write-protection-fault handling") where it was no longer a facility focused on x86-machine-check exceptions, but a generic copy routine that handles any hardware fault, trap, or exception by aborting and reporting 'bytes successfully transferred'.
Now, the rename is motivated by this exchange with Linus and allows for additional cleanups that further disconnect a "copy_safe()" facility from the original x86-machine-check handling focused implementation:
> The writes can mmu-fault now that memcpy_mcsafe() is also used by > _copy_to_iter_mcsafe(). This allows a clean bypass of the block layer > in fs/dax.c in addition to the pmem driver access of poisoned memory. > Now that the fallback is a sane rep; movs; it can be considered for > plain copy_to_iter() for other user copies so you get exception > handling on kernel access of poison outside of persistent memory. To > Andy's point I think a recoverable copy (for exceptions or faults) is > generally useful.
I think that's completely independent.
If we have good reasons for having targets with exception handling, then that has absolutely nothing to do with machine checks or buggy hardware.
And it sure shouldn't be called "mcsafe", since it has nothing to do with that situation any more.
Note that the changes are purely a rename / code reorganization plus changing the polarity of mcsafe_slow_key from an opt-in to a blacklist in preparation for a future patch that introduces a common x86::copy_safe_fast() that is independent of platform machine-check-recovery capability. Otherwise, the current awkward situation of copy_safe() sometimes falling back to memcpy() (even on CONFIG_ARCH_HAS_COPY_SAFE=y x86) is maintained for now until it can be replaced with the new implementation that is always instrumented to handle the abort case.
One side-effect of this reorganization is that separating copy_safe_64.S to its own file means that perf no longer needs to track dependencies for its memcpy_64.S benchmarks.
Cc: x86@kernel.org Cc: stable@vger.kernel.org Cc: Ingo Molnar mingo@redhat.com Cc: Borislav Petkov bp@alien8.de Cc: "H. Peter Anvin" hpa@zytor.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Peter Zijlstra peterz@infradead.org Cc: Tony Luck tony.luck@intel.com Cc: Michael Ellerman mpe@ellerman.id.au Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: Paul Mackerras paulus@samba.org Cc: Arnaldo Carvalho de Melo acme@kernel.org Cc: Linus Torvalds torvalds@linux-foundation.org Link: http://lore.kernel.org/r/CAHk-=wjSqtXAqfUJxFtWNwmguFASTgB0dz1dT3V-78Quiezqbg... Signed-off-by: Dan Williams dan.j.williams@intel.com --- arch/powerpc/Kconfig | 2 arch/powerpc/include/asm/string.h | 2 arch/powerpc/include/asm/uaccess.h | 4 - arch/powerpc/lib/Makefile | 2 arch/powerpc/lib/copy_safe.S | 4 - arch/x86/Kconfig | 2 arch/x86/Kconfig.debug | 2 arch/x86/include/asm/copy_safe.h | 16 +++ arch/x86/include/asm/copy_safe_test.h | 75 ++++++++++++ arch/x86/include/asm/mcsafe_test.h | 75 ------------ arch/x86/include/asm/string_64.h | 32 ----- arch/x86/include/asm/uaccess_64.h | 21 --- arch/x86/kernel/cpu/mce/core.c | 9 - arch/x86/kernel/quirks.c | 10 -- arch/x86/lib/Makefile | 1 arch/x86/lib/copy_safe.c | 67 +++++++++++ arch/x86/lib/copy_safe_64.S | 123 ++++++++++++++++++++ arch/x86/lib/memcpy_64.S | 115 ------------------- arch/x86/lib/usercopy_64.c | 21 --- drivers/md/dm-writecache.c | 12 +- drivers/nvdimm/claim.c | 2 drivers/nvdimm/pmem.c | 6 - include/linux/string.h | 17 ++- include/linux/uio.h | 10 +- lib/Kconfig | 2 lib/iov_iter.c | 36 +++--- tools/arch/x86/include/asm/copy_safe_test.h | 13 ++ tools/arch/x86/include/asm/mcsafe_test.h | 13 -- tools/arch/x86/lib/memcpy_64.S | 115 ------------------- tools/objtool/check.c | 4 - tools/perf/bench/Build | 1 tools/perf/bench/mem-memcpy-x86-64-lib.c | 24 ---- tools/testing/nvdimm/test/nfit.c | 49 ++++---- .../testing/selftests/powerpc/copyloops/.gitignore | 2 tools/testing/selftests/powerpc/copyloops/Makefile | 6 - .../selftests/powerpc/copyloops/copy_safe.S | 0 36 files changed, 387 insertions(+), 508 deletions(-) rename arch/powerpc/lib/{memcpy_mcsafe_64.S => copy_safe.S} (98%) create mode 100644 arch/x86/include/asm/copy_safe.h create mode 100644 arch/x86/include/asm/copy_safe_test.h delete mode 100644 arch/x86/include/asm/mcsafe_test.h create mode 100644 arch/x86/lib/copy_safe.c create mode 100644 arch/x86/lib/copy_safe_64.S create mode 100644 tools/arch/x86/include/asm/copy_safe_test.h delete mode 100644 tools/arch/x86/include/asm/mcsafe_test.h delete mode 100644 tools/perf/bench/mem-memcpy-x86-64-lib.c rename tools/testing/selftests/powerpc/copyloops/{memcpy_mcsafe_64.S => copy_safe.S} (100%)
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 924c541a9260..ba4e7ff02d09 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -133,7 +133,7 @@ config PPC select ARCH_HAS_STRICT_KERNEL_RWX if ((PPC_BOOK3S_64 || PPC32) && !HIBERNATION) select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST select ARCH_HAS_UACCESS_FLUSHCACHE - select ARCH_HAS_UACCESS_MCSAFE if PPC64 + select ARCH_HAS_COPY_SAFE if PPC64 select ARCH_HAS_UBSAN_SANITIZE_ALL select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_KEEP_MEMBLOCK diff --git a/arch/powerpc/include/asm/string.h b/arch/powerpc/include/asm/string.h index b72692702f35..9bf6dffb4090 100644 --- a/arch/powerpc/include/asm/string.h +++ b/arch/powerpc/include/asm/string.h @@ -53,9 +53,7 @@ void *__memmove(void *to, const void *from, __kernel_size_t n); #ifndef CONFIG_KASAN #define __HAVE_ARCH_MEMSET32 #define __HAVE_ARCH_MEMSET64 -#define __HAVE_ARCH_MEMCPY_MCSAFE
-extern int memcpy_mcsafe(void *dst, const void *src, __kernel_size_t sz); extern void *__memset16(uint16_t *, uint16_t v, __kernel_size_t); extern void *__memset32(uint32_t *, uint32_t v, __kernel_size_t); extern void *__memset64(uint64_t *, uint64_t v, __kernel_size_t); diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h index 2f500debae21..0abca701fe38 100644 --- a/arch/powerpc/include/asm/uaccess.h +++ b/arch/powerpc/include/asm/uaccess.h @@ -416,12 +416,12 @@ raw_copy_to_user(void __user *to, const void *from, unsigned long n) }
static __always_inline unsigned long __must_check -copy_to_user_mcsafe(void __user *to, const void *from, unsigned long n) +copy_to_user_safe(void __user *to, const void *from, unsigned long n) { if (likely(check_copy_size(from, n, true))) { if (access_ok(to, n)) { allow_write_to_user(to, n); - n = memcpy_mcsafe((void *)to, from, n); + n = copy_safe((void *)to, from, n); prevent_write_to_user(to, n); } } diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile index b8de3be10eb4..40a72a19d317 100644 --- a/arch/powerpc/lib/Makefile +++ b/arch/powerpc/lib/Makefile @@ -39,7 +39,7 @@ obj-$(CONFIG_PPC_BOOK3S_64) += copyuser_power7.o copypage_power7.o \ memcpy_power7.o
obj64-y += copypage_64.o copyuser_64.o mem_64.o hweight_64.o \ - memcpy_64.o memcpy_mcsafe_64.o + memcpy_64.o copy_safe.o
obj64-$(CONFIG_SMP) += locks.o obj64-$(CONFIG_ALTIVEC) += vmx-helper.o diff --git a/arch/powerpc/lib/memcpy_mcsafe_64.S b/arch/powerpc/lib/copy_safe.S similarity index 98% rename from arch/powerpc/lib/memcpy_mcsafe_64.S rename to arch/powerpc/lib/copy_safe.S index cb882d9a6d8a..c0aef739531e 100644 --- a/arch/powerpc/lib/memcpy_mcsafe_64.S +++ b/arch/powerpc/lib/copy_safe.S @@ -50,7 +50,7 @@ err3; stb r0,0(r3) blr
-_GLOBAL(memcpy_mcsafe) +_GLOBAL(copy_safe) mr r7,r5 cmpldi r5,16 blt .Lshort_copy @@ -239,4 +239,4 @@ err1; stb r0,0(r3) 15: li r3,0 blr
-EXPORT_SYMBOL_GPL(memcpy_mcsafe); +EXPORT_SYMBOL_GPL(copy_safe); diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index c017c97a9c8e..2ac55243dc24 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -72,7 +72,7 @@ config X86 select ARCH_HAS_PTE_DEVMAP if X86_64 select ARCH_HAS_PTE_SPECIAL select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64 - select ARCH_HAS_UACCESS_MCSAFE if X86_64 && X86_MCE + select ARCH_HAS_COPY_SAFE if X86_64 select ARCH_HAS_SET_MEMORY select ARCH_HAS_SET_DIRECT_MAP select ARCH_HAS_STRICT_KERNEL_RWX diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug index f909d3ce36e6..fa83915ccd15 100644 --- a/arch/x86/Kconfig.debug +++ b/arch/x86/Kconfig.debug @@ -59,7 +59,7 @@ config EARLY_PRINTK_USB_XDBC You should normally say N here, unless you want to debug early crashes or need a very simple printk logging facility.
-config MCSAFE_TEST +config COPY_SAFE_TEST def_bool n
config EFI_PGT_DUMP diff --git a/arch/x86/include/asm/copy_safe.h b/arch/x86/include/asm/copy_safe.h new file mode 100644 index 000000000000..1c130364dd61 --- /dev/null +++ b/arch/x86/include/asm/copy_safe.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _COPY_SAFE_H_ +#define _COPY_SAFE_H_ + +#ifdef CONFIG_ARCH_HAS_COPY_SAFE +void enable_copy_safe_slow(void); +#else +static inline void enable_copy_safe_slow(void) +{ +} +#endif + +__must_check unsigned long +copy_safe_slow(void *dst, const void *src, size_t cnt); + +#endif /* _COPY_SAFE_H_ */ diff --git a/arch/x86/include/asm/copy_safe_test.h b/arch/x86/include/asm/copy_safe_test.h new file mode 100644 index 000000000000..6f81b1e5d3e8 --- /dev/null +++ b/arch/x86/include/asm/copy_safe_test.h @@ -0,0 +1,75 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _COPY_SAFE_TEST_H_ +#define _COPY_SAFE_TEST_H_ + +#ifndef __ASSEMBLY__ +#ifdef CONFIG_COPY_SAFE_TEST +extern unsigned long copy_safe_test_src; +extern unsigned long copy_safe_test_dst; + +static inline void copy_safe_inject_src(void *addr) +{ + if (addr) + copy_safe_test_src = (unsigned long) addr; + else + copy_safe_test_src = ~0UL; +} + +static inline void copy_safe_inject_dst(void *addr) +{ + if (addr) + copy_safe_test_dst = (unsigned long) addr; + else + copy_safe_test_dst = ~0UL; +} +#else /* CONFIG_COPY_SAFE_TEST */ +static inline void copy_safe_inject_src(void *addr) +{ +} + +static inline void copy_safe_inject_dst(void *addr) +{ +} +#endif /* CONFIG_COPY_SAFE_TEST */ + +#else /* __ASSEMBLY__ */ +#include <asm/export.h> + +#ifdef CONFIG_COPY_SAFE_TEST +.macro COPY_SAFE_TEST_CTL + .pushsection .data + .align 8 + .globl copy_safe_test_src + copy_safe_test_src: + .quad 0 + EXPORT_SYMBOL_GPL(copy_safe_test_src) + .globl copy_safe_test_dst + copy_safe_test_dst: + .quad 0 + EXPORT_SYMBOL_GPL(copy_safe_test_dst) + .popsection +.endm + +.macro COPY_SAFE_TEST_SRC reg count target + leaq \count(\reg), %r9 + cmp copy_safe_test_src, %r9 + ja \target +.endm + +.macro COPY_SAFE_TEST_DST reg count target + leaq \count(\reg), %r9 + cmp copy_safe_test_dst, %r9 + ja \target +.endm +#else +.macro COPY_SAFE_TEST_CTL +.endm + +.macro COPY_SAFE_TEST_SRC reg count target +.endm + +.macro COPY_SAFE_TEST_DST reg count target +.endm +#endif /* CONFIG_COPY_SAFE_TEST */ +#endif /* __ASSEMBLY__ */ +#endif /* _COPY_SAFE_TEST_H_ */ diff --git a/arch/x86/include/asm/mcsafe_test.h b/arch/x86/include/asm/mcsafe_test.h deleted file mode 100644 index eb59804b6201..000000000000 --- a/arch/x86/include/asm/mcsafe_test.h +++ /dev/null @@ -1,75 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -#ifndef _MCSAFE_TEST_H_ -#define _MCSAFE_TEST_H_ - -#ifndef __ASSEMBLY__ -#ifdef CONFIG_MCSAFE_TEST -extern unsigned long mcsafe_test_src; -extern unsigned long mcsafe_test_dst; - -static inline void mcsafe_inject_src(void *addr) -{ - if (addr) - mcsafe_test_src = (unsigned long) addr; - else - mcsafe_test_src = ~0UL; -} - -static inline void mcsafe_inject_dst(void *addr) -{ - if (addr) - mcsafe_test_dst = (unsigned long) addr; - else - mcsafe_test_dst = ~0UL; -} -#else /* CONFIG_MCSAFE_TEST */ -static inline void mcsafe_inject_src(void *addr) -{ -} - -static inline void mcsafe_inject_dst(void *addr) -{ -} -#endif /* CONFIG_MCSAFE_TEST */ - -#else /* __ASSEMBLY__ */ -#include <asm/export.h> - -#ifdef CONFIG_MCSAFE_TEST -.macro MCSAFE_TEST_CTL - .pushsection .data - .align 8 - .globl mcsafe_test_src - mcsafe_test_src: - .quad 0 - EXPORT_SYMBOL_GPL(mcsafe_test_src) - .globl mcsafe_test_dst - mcsafe_test_dst: - .quad 0 - EXPORT_SYMBOL_GPL(mcsafe_test_dst) - .popsection -.endm - -.macro MCSAFE_TEST_SRC reg count target - leaq \count(\reg), %r9 - cmp mcsafe_test_src, %r9 - ja \target -.endm - -.macro MCSAFE_TEST_DST reg count target - leaq \count(\reg), %r9 - cmp mcsafe_test_dst, %r9 - ja \target -.endm -#else -.macro MCSAFE_TEST_CTL -.endm - -.macro MCSAFE_TEST_SRC reg count target -.endm - -.macro MCSAFE_TEST_DST reg count target -.endm -#endif /* CONFIG_MCSAFE_TEST */ -#endif /* __ASSEMBLY__ */ -#endif /* _MCSAFE_TEST_H_ */ diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h index 75314c3dbe47..6e450827f677 100644 --- a/arch/x86/include/asm/string_64.h +++ b/arch/x86/include/asm/string_64.h @@ -82,38 +82,6 @@ int strcmp(const char *cs, const char *ct);
#endif
-#define __HAVE_ARCH_MEMCPY_MCSAFE 1 -__must_check unsigned long __memcpy_mcsafe(void *dst, const void *src, - size_t cnt); -DECLARE_STATIC_KEY_FALSE(mcsafe_key); - -/** - * memcpy_mcsafe - copy memory with indication if a machine check happened - * - * @dst: destination address - * @src: source address - * @cnt: number of bytes to copy - * - * Low level memory copy function that catches machine checks - * We only call into the "safe" function on systems that can - * actually do machine check recovery. Everyone else can just - * use memcpy(). - * - * Return 0 for success, or number of bytes not copied if there was an - * exception. - */ -static __always_inline __must_check unsigned long -memcpy_mcsafe(void *dst, const void *src, size_t cnt) -{ -#ifdef CONFIG_X86_MCE - if (static_branch_unlikely(&mcsafe_key)) - return __memcpy_mcsafe(dst, src, cnt); - else -#endif - memcpy(dst, src, cnt); - return 0; -} - #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE #define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1 void __memcpy_flushcache(void *dst, const void *src, size_t cnt); diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h index bc10e3dc64fe..42111e95a7ca 100644 --- a/arch/x86/include/asm/uaccess_64.h +++ b/arch/x86/include/asm/uaccess_64.h @@ -46,21 +46,8 @@ copy_user_generic(void *to, const void *from, unsigned len) return ret; }
-static __always_inline __must_check unsigned long -copy_to_user_mcsafe(void *to, const void *from, unsigned len) -{ - unsigned long ret; - - __uaccess_begin(); - /* - * Note, __memcpy_mcsafe() is explicitly used since it can - * handle exceptions / faults. memcpy_mcsafe() may fall back to - * memcpy() which lacks this handling. - */ - ret = __memcpy_mcsafe(to, from, len); - __uaccess_end(); - return ret; -} +extern __must_check unsigned long +copy_to_user_safe(void *to, const void *from, unsigned len);
static __always_inline __must_check unsigned long raw_copy_from_user(void *dst, const void __user *src, unsigned long size) @@ -102,8 +89,4 @@ __copy_from_user_flushcache(void *dst, const void __user *src, unsigned size) kasan_check_write(dst, size); return __copy_user_flushcache(dst, src, size); } - -unsigned long -mcsafe_handle_tail(char *to, char *from, unsigned len); - #endif /* _ASM_X86_UACCESS_64_H */ diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 02e1f165f148..0dc6d6adb0ea 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -40,12 +40,12 @@ #include <linux/debugfs.h> #include <linux/irq_work.h> #include <linux/export.h> -#include <linux/jump_label.h> #include <linux/set_memory.h>
#include <asm/intel-family.h> #include <asm/processor.h> #include <asm/traps.h> +#include <asm/copy_safe.h> #include <asm/tlbflush.h> #include <asm/mce.h> #include <asm/msr.h> @@ -1982,7 +1982,7 @@ void mce_disable_bank(int bank) and older. * mce=nobootlog Don't log MCEs from before booting. * mce=bios_cmci_threshold Don't program the CMCI threshold - * mce=recovery force enable memcpy_mcsafe() + * mce=recovery force enable copy_safe_slow() */ static int __init mcheck_enable(char *str) { @@ -2590,13 +2590,10 @@ static void __init mcheck_debugfs_init(void) static void __init mcheck_debugfs_init(void) { } #endif
-DEFINE_STATIC_KEY_FALSE(mcsafe_key); -EXPORT_SYMBOL_GPL(mcsafe_key); - static int __init mcheck_late_init(void) { if (mca_cfg.recovery) - static_branch_inc(&mcsafe_key); + enable_copy_safe_slow();
mcheck_debugfs_init();
diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c index 896d74cb5081..09ef9402b56e 100644 --- a/arch/x86/kernel/quirks.c +++ b/arch/x86/kernel/quirks.c @@ -8,6 +8,7 @@
#include <asm/hpet.h> #include <asm/setup.h> +#include <asm/copy_safe.h>
#if defined(CONFIG_X86_IO_APIC) && defined(CONFIG_SMP) && defined(CONFIG_PCI)
@@ -624,10 +625,6 @@ static void amd_disable_seq_and_redirect_scrub(struct pci_dev *dev) DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_16H_NB_F3, amd_disable_seq_and_redirect_scrub);
-#if defined(CONFIG_X86_64) && defined(CONFIG_X86_MCE) -#include <linux/jump_label.h> -#include <asm/string_64.h> - /* Ivy Bridge, Haswell, Broadwell */ static void quirk_intel_brickland_xeon_ras_cap(struct pci_dev *pdev) { @@ -636,7 +633,7 @@ static void quirk_intel_brickland_xeon_ras_cap(struct pci_dev *pdev) pci_read_config_dword(pdev, 0x84, &capid0);
if (capid0 & 0x10) - static_branch_inc(&mcsafe_key); + enable_copy_safe_slow(); }
/* Skylake */ @@ -653,7 +650,7 @@ static void quirk_intel_purley_xeon_ras_cap(struct pci_dev *pdev) * enabled, so memory machine check recovery is also enabled. */ if ((capid0 & 0xc0) == 0xc0 || (capid5 & 0x1e0)) - static_branch_inc(&mcsafe_key); + enable_copy_safe_slow();
} DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0ec3, quirk_intel_brickland_xeon_ras_cap); @@ -661,7 +658,6 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2fc0, quirk_intel_brickland_xeon_ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x6fc0, quirk_intel_brickland_xeon_ras_cap); DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2083, quirk_intel_purley_xeon_ras_cap); #endif -#endif
bool x86_apple_machine; EXPORT_SYMBOL(x86_apple_machine); diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile index 6110bce7237b..b93cf4f3b10b 100644 --- a/arch/x86/lib/Makefile +++ b/arch/x86/lib/Makefile @@ -44,6 +44,7 @@ obj-$(CONFIG_SMP) += msr-smp.o cache-smp.o lib-y := delay.o misc.o cmdline.o cpu.o lib-y += usercopy_$(BITS).o usercopy.o getuser.o putuser.o lib-y += memcpy_$(BITS).o +lib-$(CONFIG_ARCH_HAS_COPY_SAFE) += copy_safe.o copy_safe_64.o lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o insn-eval.o lib-$(CONFIG_RANDOMIZE_BASE) += kaslr.o lib-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o diff --git a/arch/x86/lib/copy_safe.c b/arch/x86/lib/copy_safe.c new file mode 100644 index 000000000000..9dd3a831ff94 --- /dev/null +++ b/arch/x86/lib/copy_safe.c @@ -0,0 +1,67 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright(c) 2016-2020 Intel Corporation. All rights reserved. */ + +#include <linux/jump_label.h> +#include <linux/uaccess.h> +#include <linux/export.h> +#include <linux/string.h> +#include <linux/types.h> + +#include <asm/copy_safe.h> + +static DEFINE_STATIC_KEY_FALSE(copy_safe_slow_key); + +void enable_copy_safe_slow(void) +{ + static_branch_inc(©_safe_slow_key); +} + +/** + * copy_safe - memory copy that aborts on faults or exceptions + * + * @dst: destination address + * @src: source address + * @cnt: number of bytes to copy + * + * We only call into the slow version on systems that have trouble + * actually do machine check recovery. Everyone else can just + * use memcpy(). + * + * Return 0 for success, or number of bytes not copied if there was an + * exception. + */ +__must_check unsigned long copy_safe(void *dst, const void *src, size_t cnt) +{ + if (static_branch_unlikely(©_safe_slow_key)) + return copy_safe_slow(dst, src, cnt); + memcpy(dst, src, cnt); + return 0; +} +EXPORT_SYMBOL_GPL(copy_safe); + +/* + * Similar to copy_user_handle_tail, probe for the write fault point, or + * source exception point. + */ +__visible notrace unsigned long +copy_safe_slow_handle_tail(char *to, char *from, unsigned len) +{ + for (; len; --len, to++, from++) + if (copy_safe_slow(to, from, 1)) + break; + return len; +} + +__must_check unsigned long +copy_to_user_safe(void *to, const void *from, unsigned len) +{ + unsigned long ret; + + if (!static_branch_unlikely(©_safe_slow_key)) + return copy_user_generic(to, from, len); + + __uaccess_begin(); + ret = copy_safe_slow(to, from, len); + __uaccess_end(); + return ret; +} diff --git a/arch/x86/lib/copy_safe_64.S b/arch/x86/lib/copy_safe_64.S new file mode 100644 index 000000000000..46dfdd832bde --- /dev/null +++ b/arch/x86/lib/copy_safe_64.S @@ -0,0 +1,123 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* Copyright(c) 2016-2020 Intel Corporation. All rights reserved. */ + +#include <linux/linkage.h> +#include <asm/copy_safe_test.h> +#include <asm/export.h> +#include <asm/asm.h> + +#ifndef CONFIG_UML + +COPY_SAFE_TEST_CTL + +/* + * copy_safe_slow - copy memory with indication if an exception / fault happened + * + * The slow version is opted into by platform quirks. The fast version + * is equivalent to memcpy() regardless of CPU machine-check-recovery + * capability. + */ +SYM_FUNC_START(copy_safe_slow) + cmpl $8, %edx + /* Less than 8 bytes? Go to byte copy loop */ + jb .L_no_whole_words + + /* Check for bad alignment of source */ + testl $7, %esi + /* Already aligned */ + jz .L_8byte_aligned + + /* Copy one byte at a time until source is 8-byte aligned */ + movl %esi, %ecx + andl $7, %ecx + subl $8, %ecx + negl %ecx + subl %ecx, %edx +.L_read_leading_bytes: + movb (%rsi), %al + COPY_SAFE_TEST_SRC %rsi 1 .E_leading_bytes + COPY_SAFE_TEST_DST %rdi 1 .E_leading_bytes +.L_write_leading_bytes: + movb %al, (%rdi) + incq %rsi + incq %rdi + decl %ecx + jnz .L_read_leading_bytes + +.L_8byte_aligned: + movl %edx, %ecx + andl $7, %edx + shrl $3, %ecx + jz .L_no_whole_words + +.L_read_words: + movq (%rsi), %r8 + COPY_SAFE_TEST_SRC %rsi 8 .E_read_words + COPY_SAFE_TEST_DST %rdi 8 .E_write_words +.L_write_words: + movq %r8, (%rdi) + addq $8, %rsi + addq $8, %rdi + decl %ecx + jnz .L_read_words + + /* Any trailing bytes? */ +.L_no_whole_words: + andl %edx, %edx + jz .L_done_memcpy_trap + + /* Copy trailing bytes */ + movl %edx, %ecx +.L_read_trailing_bytes: + movb (%rsi), %al + COPY_SAFE_TEST_SRC %rsi 1 .E_trailing_bytes + COPY_SAFE_TEST_DST %rdi 1 .E_trailing_bytes +.L_write_trailing_bytes: + movb %al, (%rdi) + incq %rsi + incq %rdi + decl %ecx + jnz .L_read_trailing_bytes + + /* Copy successful. Return zero */ +.L_done_memcpy_trap: + xorl %eax, %eax +.L_done: + ret +SYM_FUNC_END(copy_safe_slow) +EXPORT_SYMBOL_GPL(copy_safe_slow) + + .section .fixup, "ax" + /* + * Return number of bytes not copied for any failure. Note that + * there is no "tail" handling since the source buffer is 8-byte + * aligned and poison is cacheline aligned. + */ +.E_read_words: + shll $3, %ecx +.E_leading_bytes: + addl %edx, %ecx +.E_trailing_bytes: + mov %ecx, %eax + jmp .L_done + + /* + * For write fault handling, given the destination is unaligned, + * we handle faults on multi-byte writes with a byte-by-byte + * copy up to the write-protected page. + */ +.E_write_words: + shll $3, %ecx + addl %edx, %ecx + movl %ecx, %edx + jmp copy_safe_slow_handle_tail + + .previous + + _ASM_EXTABLE_FAULT(.L_read_leading_bytes, .E_leading_bytes) + _ASM_EXTABLE_FAULT(.L_read_words, .E_read_words) + _ASM_EXTABLE_FAULT(.L_read_trailing_bytes, .E_trailing_bytes) + _ASM_EXTABLE(.L_write_leading_bytes, .E_leading_bytes) + _ASM_EXTABLE(.L_write_words, .E_write_words) + _ASM_EXTABLE(.L_write_trailing_bytes, .E_trailing_bytes) +#endif diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S index 56b243b14c3a..4ed8d87b3b48 100644 --- a/arch/x86/lib/memcpy_64.S +++ b/arch/x86/lib/memcpy_64.S @@ -4,7 +4,6 @@ #include <linux/linkage.h> #include <asm/errno.h> #include <asm/cpufeatures.h> -#include <asm/mcsafe_test.h> #include <asm/alternative-asm.h> #include <asm/export.h>
@@ -183,117 +182,3 @@ SYM_FUNC_START_LOCAL(memcpy_orig) .Lend: retq SYM_FUNC_END(memcpy_orig) - -#ifndef CONFIG_UML - -MCSAFE_TEST_CTL - -/* - * __memcpy_mcsafe - memory copy with machine check exception handling - * Note that we only catch machine checks when reading the source addresses. - * Writes to target are posted and don't generate machine checks. - */ -SYM_FUNC_START(__memcpy_mcsafe) - cmpl $8, %edx - /* Less than 8 bytes? Go to byte copy loop */ - jb .L_no_whole_words - - /* Check for bad alignment of source */ - testl $7, %esi - /* Already aligned */ - jz .L_8byte_aligned - - /* Copy one byte at a time until source is 8-byte aligned */ - movl %esi, %ecx - andl $7, %ecx - subl $8, %ecx - negl %ecx - subl %ecx, %edx -.L_read_leading_bytes: - movb (%rsi), %al - MCSAFE_TEST_SRC %rsi 1 .E_leading_bytes - MCSAFE_TEST_DST %rdi 1 .E_leading_bytes -.L_write_leading_bytes: - movb %al, (%rdi) - incq %rsi - incq %rdi - decl %ecx - jnz .L_read_leading_bytes - -.L_8byte_aligned: - movl %edx, %ecx - andl $7, %edx - shrl $3, %ecx - jz .L_no_whole_words - -.L_read_words: - movq (%rsi), %r8 - MCSAFE_TEST_SRC %rsi 8 .E_read_words - MCSAFE_TEST_DST %rdi 8 .E_write_words -.L_write_words: - movq %r8, (%rdi) - addq $8, %rsi - addq $8, %rdi - decl %ecx - jnz .L_read_words - - /* Any trailing bytes? */ -.L_no_whole_words: - andl %edx, %edx - jz .L_done_memcpy_trap - - /* Copy trailing bytes */ - movl %edx, %ecx -.L_read_trailing_bytes: - movb (%rsi), %al - MCSAFE_TEST_SRC %rsi 1 .E_trailing_bytes - MCSAFE_TEST_DST %rdi 1 .E_trailing_bytes -.L_write_trailing_bytes: - movb %al, (%rdi) - incq %rsi - incq %rdi - decl %ecx - jnz .L_read_trailing_bytes - - /* Copy successful. Return zero */ -.L_done_memcpy_trap: - xorl %eax, %eax -.L_done: - ret -SYM_FUNC_END(__memcpy_mcsafe) -EXPORT_SYMBOL_GPL(__memcpy_mcsafe) - - .section .fixup, "ax" - /* - * Return number of bytes not copied for any failure. Note that - * there is no "tail" handling since the source buffer is 8-byte - * aligned and poison is cacheline aligned. - */ -.E_read_words: - shll $3, %ecx -.E_leading_bytes: - addl %edx, %ecx -.E_trailing_bytes: - mov %ecx, %eax - jmp .L_done - - /* - * For write fault handling, given the destination is unaligned, - * we handle faults on multi-byte writes with a byte-by-byte - * copy up to the write-protected page. - */ -.E_write_words: - shll $3, %ecx - addl %edx, %ecx - movl %ecx, %edx - jmp mcsafe_handle_tail - - .previous - - _ASM_EXTABLE_FAULT(.L_read_leading_bytes, .E_leading_bytes) - _ASM_EXTABLE_FAULT(.L_read_words, .E_read_words) - _ASM_EXTABLE_FAULT(.L_read_trailing_bytes, .E_trailing_bytes) - _ASM_EXTABLE(.L_write_leading_bytes, .E_leading_bytes) - _ASM_EXTABLE(.L_write_words, .E_write_words) - _ASM_EXTABLE(.L_write_trailing_bytes, .E_trailing_bytes) -#endif diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c index fff28c6f73a2..a3f9c93efc42 100644 --- a/arch/x86/lib/usercopy_64.c +++ b/arch/x86/lib/usercopy_64.c @@ -55,27 +55,6 @@ unsigned long clear_user(void __user *to, unsigned long n) } EXPORT_SYMBOL(clear_user);
-/* - * Similar to copy_user_handle_tail, probe for the write fault point, - * but reuse __memcpy_mcsafe in case a new read error is encountered. - * clac() is handled in _copy_to_iter_mcsafe(). - */ -__visible notrace unsigned long -mcsafe_handle_tail(char *to, char *from, unsigned len) -{ - for (; len; --len, to++, from++) { - /* - * Call the assembly routine back directly since - * memcpy_mcsafe() may silently fallback to memcpy. - */ - unsigned long rem = __memcpy_mcsafe(to, from, 1); - - if (rem) - break; - } - return len; -} - #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE /** * clean_cache_range - write back a cache range with CLWB diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c index 114927da9cc9..edd14472b740 100644 --- a/drivers/md/dm-writecache.c +++ b/drivers/md/dm-writecache.c @@ -49,7 +49,7 @@ do { \ #define pmem_assign(dest, src) ((dest) = (src)) #endif
-#if defined(__HAVE_ARCH_MEMCPY_MCSAFE) && defined(DM_WRITECACHE_HAS_PMEM) +#if defined(__HAVE_ARCH_COPY_SAFE) && defined(DM_WRITECACHE_HAS_PMEM) #define DM_WRITECACHE_HANDLE_HARDWARE_ERRORS #endif
@@ -954,7 +954,7 @@ static void writecache_resume(struct dm_target *ti) } wc->freelist_size = 0;
- r = memcpy_mcsafe(&sb_seq_count, &sb(wc)->seq_count, sizeof(uint64_t)); + r = copy_safe(&sb_seq_count, &sb(wc)->seq_count, sizeof(uint64_t)); if (r) { writecache_error(wc, r, "hardware memory error when reading superblock: %d", r); sb_seq_count = cpu_to_le64(0); @@ -970,7 +970,7 @@ static void writecache_resume(struct dm_target *ti) e->seq_count = -1; continue; } - r = memcpy_mcsafe(&wme, memory_entry(wc, e), sizeof(struct wc_memory_entry)); + r = copy_safe(&wme, memory_entry(wc, e), sizeof(struct wc_memory_entry)); if (r) { writecache_error(wc, r, "hardware memory error when reading metadata entry %lu: %d", (unsigned long)b, r); @@ -1132,7 +1132,7 @@ static void bio_copy_block(struct dm_writecache *wc, struct bio *bio, void *data
if (rw == READ) { int r; - r = memcpy_mcsafe(buf, data, size); + r = copy_safe(buf, data, size); flush_dcache_page(bio_page(bio)); if (unlikely(r)) { writecache_error(wc, r, "hardware memory error when reading data: %d", r); @@ -2275,7 +2275,7 @@ static int writecache_ctr(struct dm_target *ti, unsigned argc, char **argv) } }
- r = memcpy_mcsafe(&s, sb(wc), sizeof(struct wc_memory_superblock)); + r = copy_safe(&s, sb(wc), sizeof(struct wc_memory_superblock)); if (r) { ti->error = "Hardware memory error when reading superblock"; goto bad; @@ -2286,7 +2286,7 @@ static int writecache_ctr(struct dm_target *ti, unsigned argc, char **argv) ti->error = "Unable to initialize device"; goto bad; } - r = memcpy_mcsafe(&s, sb(wc), sizeof(struct wc_memory_superblock)); + r = copy_safe(&s, sb(wc), sizeof(struct wc_memory_superblock)); if (r) { ti->error = "Hardware memory error when reading superblock"; goto bad; diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c index 45964acba944..3f18c2119751 100644 --- a/drivers/nvdimm/claim.c +++ b/drivers/nvdimm/claim.c @@ -268,7 +268,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns, if (rw == READ) { if (unlikely(is_bad_pmem(&nsio->bb, sector, sz_align))) return -EIO; - if (memcpy_mcsafe(buf, nsio->addr + offset, size) != 0) + if (copy_safe(buf, nsio->addr + offset, size) != 0) return -EIO; return 0; } diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 2df6994acf83..197d76885316 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -124,7 +124,7 @@ static blk_status_t read_pmem(struct page *page, unsigned int off, while (len) { mem = kmap_atomic(page); chunk = min_t(unsigned int, len, PAGE_SIZE - off); - rem = memcpy_mcsafe(mem + off, pmem_addr, chunk); + rem = copy_safe(mem + off, pmem_addr, chunk); kunmap_atomic(mem); if (rem) return BLK_STS_IOERR; @@ -302,7 +302,7 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
/* * Use the 'no check' versions of copy_from_iter_flushcache() and - * copy_to_iter_mcsafe() to bypass HARDENED_USERCOPY overhead. Bounds + * copy_to_iter_safe() to bypass HARDENED_USERCOPY overhead. Bounds * checking, both file offset and device offset, is handled by * dax_iomap_actor() */ @@ -315,7 +315,7 @@ static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, static size_t pmem_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i) { - return _copy_to_iter_mcsafe(addr, bytes, i); + return _copy_to_iter_safe(addr, bytes, i); }
static const struct dax_operations pmem_dax_ops = { diff --git a/include/linux/string.h b/include/linux/string.h index 6dfbb2efa815..2c7de8317c43 100644 --- a/include/linux/string.h +++ b/include/linux/string.h @@ -161,20 +161,25 @@ extern int bcmp(const void *,const void *,__kernel_size_t); #ifndef __HAVE_ARCH_MEMCHR extern void * memchr(const void *,int,__kernel_size_t); #endif -#ifndef __HAVE_ARCH_MEMCPY_MCSAFE -static inline __must_check unsigned long memcpy_mcsafe(void *dst, - const void *src, size_t cnt) +#ifndef __HAVE_ARCH_MEMCPY_FLUSHCACHE +static inline void memcpy_flushcache(void *dst, const void *src, size_t cnt) { memcpy(dst, src, cnt); - return 0; } #endif -#ifndef __HAVE_ARCH_MEMCPY_FLUSHCACHE -static inline void memcpy_flushcache(void *dst, const void *src, size_t cnt) + +#ifdef CONFIG_ARCH_HAS_COPY_SAFE +extern __must_check unsigned long copy_safe(void *dst, const void *src, + size_t cnt); +#else /* CONFIG_ARCH_HAS_COPY_SAFE */ +static inline __must_check unsigned long copy_safe(void *dst, const void *src, + size_t cnt) { memcpy(dst, src, cnt); + return 0; } #endif + void *memchr_inv(const void *s, int c, size_t n); char *strreplace(char *s, char old, char new);
diff --git a/include/linux/uio.h b/include/linux/uio.h index 9576fd8158d7..069d590d3fcc 100644 --- a/include/linux/uio.h +++ b/include/linux/uio.h @@ -186,10 +186,10 @@ size_t _copy_from_iter_flushcache(void *addr, size_t bytes, struct iov_iter *i); #define _copy_from_iter_flushcache _copy_from_iter_nocache #endif
-#ifdef CONFIG_ARCH_HAS_UACCESS_MCSAFE -size_t _copy_to_iter_mcsafe(const void *addr, size_t bytes, struct iov_iter *i); +#ifdef CONFIG_ARCH_HAS_COPY_SAFE +size_t _copy_to_iter_safe(const void *addr, size_t bytes, struct iov_iter *i); #else -#define _copy_to_iter_mcsafe _copy_to_iter +#define _copy_to_iter_safe _copy_to_iter #endif
static __always_inline __must_check @@ -202,12 +202,12 @@ size_t copy_from_iter_flushcache(void *addr, size_t bytes, struct iov_iter *i) }
static __always_inline __must_check -size_t copy_to_iter_mcsafe(void *addr, size_t bytes, struct iov_iter *i) +size_t copy_to_iter_safe(void *addr, size_t bytes, struct iov_iter *i) { if (unlikely(!check_copy_size(addr, bytes, true))) return 0; else - return _copy_to_iter_mcsafe(addr, bytes, i); + return _copy_to_iter_safe(addr, bytes, i); }
size_t iov_iter_zero(size_t bytes, struct iov_iter *); diff --git a/lib/Kconfig b/lib/Kconfig index e831e1f01767..ed0e92828285 100644 --- a/lib/Kconfig +++ b/lib/Kconfig @@ -628,7 +628,7 @@ config UACCESS_MEMCPY config ARCH_HAS_UACCESS_FLUSHCACHE bool
-config ARCH_HAS_UACCESS_MCSAFE +config ARCH_HAS_COPY_SAFE bool
# Temporary. Goes away when all archs are cleaned up diff --git a/lib/iov_iter.c b/lib/iov_iter.c index bf538c2bec77..df7b475b5018 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -636,30 +636,30 @@ size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i) } EXPORT_SYMBOL(_copy_to_iter);
-#ifdef CONFIG_ARCH_HAS_UACCESS_MCSAFE -static int copyout_mcsafe(void __user *to, const void *from, size_t n) +#ifdef CONFIG_ARCH_HAS_COPY_SAFE +static int copyout_safe(void __user *to, const void *from, size_t n) { if (access_ok(to, n)) { instrument_copy_to_user(to, from, n); - n = copy_to_user_mcsafe((__force void *) to, from, n); + n = copy_to_user_safe((__force void *) to, from, n); } return n; }
-static unsigned long memcpy_mcsafe_to_page(struct page *page, size_t offset, +static unsigned long copy_safe_to_page(struct page *page, size_t offset, const char *from, size_t len) { unsigned long ret; char *to;
to = kmap_atomic(page); - ret = memcpy_mcsafe(to + offset, from, len); + ret = copy_safe(to + offset, from, len); kunmap_atomic(to);
return ret; }
-static size_t copy_pipe_to_iter_mcsafe(const void *addr, size_t bytes, +static size_t copy_pipe_to_iter_safe(const void *addr, size_t bytes, struct iov_iter *i) { struct pipe_inode_info *pipe = i->pipe; @@ -677,7 +677,7 @@ static size_t copy_pipe_to_iter_mcsafe(const void *addr, size_t bytes, size_t chunk = min_t(size_t, n, PAGE_SIZE - off); unsigned long rem;
- rem = memcpy_mcsafe_to_page(pipe->bufs[i_head & p_mask].page, + rem = copy_safe_to_page(pipe->bufs[i_head & p_mask].page, off, addr, chunk); i->head = i_head; i->iov_offset = off + chunk - rem; @@ -694,7 +694,7 @@ static size_t copy_pipe_to_iter_mcsafe(const void *addr, size_t bytes, }
/** - * _copy_to_iter_mcsafe - copy to user with source-read error exception handling + * _copy_to_iter_safe - copy to user with source-read error exception handling * @addr: source kernel address * @bytes: total transfer length * @iter: destination iterator @@ -702,8 +702,8 @@ static size_t copy_pipe_to_iter_mcsafe(const void *addr, size_t bytes, * The pmem driver arranges for filesystem-dax to use this facility via * dax_copy_to_iter() for protecting read/write to persistent memory. * Unless / until an architecture can guarantee identical performance - * between _copy_to_iter_mcsafe() and _copy_to_iter() it would be a - * performance regression to switch more users to the mcsafe version. + * between _copy_to_iter_safe() and _copy_to_iter() it would be a + * performance regression to switch more users to the safe version. * * Otherwise, the main differences between this and typical _copy_to_iter(). * @@ -717,21 +717,21 @@ static size_t copy_pipe_to_iter_mcsafe(const void *addr, size_t bytes, * Compare to copy_to_iter() where only ITER_IOVEC attempts might return * a short copy. * - * See MCSAFE_TEST for self-test. + * See COPY_SAFE_TEST for self-test. */ -size_t _copy_to_iter_mcsafe(const void *addr, size_t bytes, struct iov_iter *i) +size_t _copy_to_iter_safe(const void *addr, size_t bytes, struct iov_iter *i) { const char *from = addr; unsigned long rem, curr_addr, s_addr = (unsigned long) addr;
if (unlikely(iov_iter_is_pipe(i))) - return copy_pipe_to_iter_mcsafe(addr, bytes, i); + return copy_pipe_to_iter_safe(addr, bytes, i); if (iter_is_iovec(i)) might_fault(); iterate_and_advance(i, bytes, v, - copyout_mcsafe(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len), + copyout_safe(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len), ({ - rem = memcpy_mcsafe_to_page(v.bv_page, v.bv_offset, + rem = copy_safe_to_page(v.bv_page, v.bv_offset, (from += v.bv_len) - v.bv_len, v.bv_len); if (rem) { curr_addr = (unsigned long) from; @@ -740,7 +740,7 @@ size_t _copy_to_iter_mcsafe(const void *addr, size_t bytes, struct iov_iter *i) } }), ({ - rem = memcpy_mcsafe(v.iov_base, (from += v.iov_len) - v.iov_len, + rem = copy_safe(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len); if (rem) { curr_addr = (unsigned long) from; @@ -752,8 +752,8 @@ size_t _copy_to_iter_mcsafe(const void *addr, size_t bytes, struct iov_iter *i)
return bytes; } -EXPORT_SYMBOL_GPL(_copy_to_iter_mcsafe); -#endif /* CONFIG_ARCH_HAS_UACCESS_MCSAFE */ +EXPORT_SYMBOL_GPL(_copy_to_iter_safe); +#endif /* CONFIG_ARCH_HAS_COPY_SAFE */
size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i) { diff --git a/tools/arch/x86/include/asm/copy_safe_test.h b/tools/arch/x86/include/asm/copy_safe_test.h new file mode 100644 index 000000000000..1a70c7d424cf --- /dev/null +++ b/tools/arch/x86/include/asm/copy_safe_test.h @@ -0,0 +1,13 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _COPY_SAFE_TEST_H_ +#define _COPY_SAFE_TEST_H_ + +.macro COPY_SAFE_TEST_CTL +.endm + +.macro COPY_SAFE_TEST_SRC reg count target +.endm + +.macro COPY_SAFE_TEST_DST reg count target +.endm +#endif /* _COPY_SAFE_TEST_H_ */ diff --git a/tools/arch/x86/include/asm/mcsafe_test.h b/tools/arch/x86/include/asm/mcsafe_test.h deleted file mode 100644 index 2ccd588fbad4..000000000000 --- a/tools/arch/x86/include/asm/mcsafe_test.h +++ /dev/null @@ -1,13 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -#ifndef _MCSAFE_TEST_H_ -#define _MCSAFE_TEST_H_ - -.macro MCSAFE_TEST_CTL -.endm - -.macro MCSAFE_TEST_SRC reg count target -.endm - -.macro MCSAFE_TEST_DST reg count target -.endm -#endif /* _MCSAFE_TEST_H_ */ diff --git a/tools/arch/x86/lib/memcpy_64.S b/tools/arch/x86/lib/memcpy_64.S index df767afc690f..6e45d3d1a96e 100644 --- a/tools/arch/x86/lib/memcpy_64.S +++ b/tools/arch/x86/lib/memcpy_64.S @@ -4,7 +4,6 @@ #include <linux/linkage.h> #include <asm/errno.h> #include <asm/cpufeatures.h> -#include <asm/mcsafe_test.h> #include <asm/alternative-asm.h> #include <asm/export.h>
@@ -183,117 +182,3 @@ SYM_FUNC_START(memcpy_orig) .Lend: retq SYM_FUNC_END(memcpy_orig) - -#ifndef CONFIG_UML - -MCSAFE_TEST_CTL - -/* - * __memcpy_mcsafe - memory copy with machine check exception handling - * Note that we only catch machine checks when reading the source addresses. - * Writes to target are posted and don't generate machine checks. - */ -SYM_FUNC_START(__memcpy_mcsafe) - cmpl $8, %edx - /* Less than 8 bytes? Go to byte copy loop */ - jb .L_no_whole_words - - /* Check for bad alignment of source */ - testl $7, %esi - /* Already aligned */ - jz .L_8byte_aligned - - /* Copy one byte at a time until source is 8-byte aligned */ - movl %esi, %ecx - andl $7, %ecx - subl $8, %ecx - negl %ecx - subl %ecx, %edx -.L_read_leading_bytes: - movb (%rsi), %al - MCSAFE_TEST_SRC %rsi 1 .E_leading_bytes - MCSAFE_TEST_DST %rdi 1 .E_leading_bytes -.L_write_leading_bytes: - movb %al, (%rdi) - incq %rsi - incq %rdi - decl %ecx - jnz .L_read_leading_bytes - -.L_8byte_aligned: - movl %edx, %ecx - andl $7, %edx - shrl $3, %ecx - jz .L_no_whole_words - -.L_read_words: - movq (%rsi), %r8 - MCSAFE_TEST_SRC %rsi 8 .E_read_words - MCSAFE_TEST_DST %rdi 8 .E_write_words -.L_write_words: - movq %r8, (%rdi) - addq $8, %rsi - addq $8, %rdi - decl %ecx - jnz .L_read_words - - /* Any trailing bytes? */ -.L_no_whole_words: - andl %edx, %edx - jz .L_done_memcpy_trap - - /* Copy trailing bytes */ - movl %edx, %ecx -.L_read_trailing_bytes: - movb (%rsi), %al - MCSAFE_TEST_SRC %rsi 1 .E_trailing_bytes - MCSAFE_TEST_DST %rdi 1 .E_trailing_bytes -.L_write_trailing_bytes: - movb %al, (%rdi) - incq %rsi - incq %rdi - decl %ecx - jnz .L_read_trailing_bytes - - /* Copy successful. Return zero */ -.L_done_memcpy_trap: - xorl %eax, %eax -.L_done: - ret -SYM_FUNC_END(__memcpy_mcsafe) -EXPORT_SYMBOL_GPL(__memcpy_mcsafe) - - .section .fixup, "ax" - /* - * Return number of bytes not copied for any failure. Note that - * there is no "tail" handling since the source buffer is 8-byte - * aligned and poison is cacheline aligned. - */ -.E_read_words: - shll $3, %ecx -.E_leading_bytes: - addl %edx, %ecx -.E_trailing_bytes: - mov %ecx, %eax - jmp .L_done - - /* - * For write fault handling, given the destination is unaligned, - * we handle faults on multi-byte writes with a byte-by-byte - * copy up to the write-protected page. - */ -.E_write_words: - shll $3, %ecx - addl %edx, %ecx - movl %ecx, %edx - jmp mcsafe_handle_tail - - .previous - - _ASM_EXTABLE_FAULT(.L_read_leading_bytes, .E_leading_bytes) - _ASM_EXTABLE_FAULT(.L_read_words, .E_read_words) - _ASM_EXTABLE_FAULT(.L_read_trailing_bytes, .E_trailing_bytes) - _ASM_EXTABLE(.L_write_leading_bytes, .E_leading_bytes) - _ASM_EXTABLE(.L_write_words, .E_write_words) - _ASM_EXTABLE(.L_write_trailing_bytes, .E_trailing_bytes) -#endif diff --git a/tools/objtool/check.c b/tools/objtool/check.c index 0c732d586924..be20eff8f358 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -533,8 +533,8 @@ static const char *uaccess_safe_builtin[] = { "__ubsan_handle_shift_out_of_bounds", /* misc */ "csum_partial_copy_generic", - "__memcpy_mcsafe", - "mcsafe_handle_tail", + "copy_safe_slow", + "copy_safe_slow_handle_tail", "ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */ NULL }; diff --git a/tools/perf/bench/Build b/tools/perf/bench/Build index 042827385c87..5cc1ee5fd520 100644 --- a/tools/perf/bench/Build +++ b/tools/perf/bench/Build @@ -10,7 +10,6 @@ perf-y += epoll-wait.o perf-y += epoll-ctl.o perf-y += synthesize.o
-perf-$(CONFIG_X86_64) += mem-memcpy-x86-64-lib.o perf-$(CONFIG_X86_64) += mem-memcpy-x86-64-asm.o perf-$(CONFIG_X86_64) += mem-memset-x86-64-asm.o
diff --git a/tools/perf/bench/mem-memcpy-x86-64-lib.c b/tools/perf/bench/mem-memcpy-x86-64-lib.c deleted file mode 100644 index 4130734dde84..000000000000 --- a/tools/perf/bench/mem-memcpy-x86-64-lib.c +++ /dev/null @@ -1,24 +0,0 @@ -/* - * From code in arch/x86/lib/usercopy_64.c, copied to keep tools/ copy - * of the kernel's arch/x86/lib/memcpy_64.s used in 'perf bench mem memcpy' - * happy. - */ -#include <linux/types.h> - -unsigned long __memcpy_mcsafe(void *dst, const void *src, size_t cnt); -unsigned long mcsafe_handle_tail(char *to, char *from, unsigned len); - -unsigned long mcsafe_handle_tail(char *to, char *from, unsigned len) -{ - for (; len; --len, to++, from++) { - /* - * Call the assembly routine back directly since - * memcpy_mcsafe() may silently fallback to memcpy. - */ - unsigned long rem = __memcpy_mcsafe(to, from, 1); - - if (rem) - break; - } - return len; -} diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c index a8ee5c4d41eb..ac5c1fa796de 100644 --- a/tools/testing/nvdimm/test/nfit.c +++ b/tools/testing/nvdimm/test/nfit.c @@ -23,7 +23,8 @@ #include "nfit_test.h" #include "../watermark.h"
-#include <asm/mcsafe_test.h> +#include <asm/copy_safe_test.h> +#include <asm/copy_safe.h>
/* * Generate an NFIT table to describe the following topology: @@ -3052,7 +3053,7 @@ static struct platform_driver nfit_test_driver = { .id_table = nfit_test_id, };
-static char mcsafe_buf[PAGE_SIZE] __attribute__((__aligned__(PAGE_SIZE))); +static char copy_safe_buf[PAGE_SIZE] __attribute__((__aligned__(PAGE_SIZE)));
enum INJECT { INJECT_NONE, @@ -3060,7 +3061,7 @@ enum INJECT { INJECT_DST, };
-static void mcsafe_test_init(char *dst, char *src, size_t size) +static void copy_safe_test_init(char *dst, char *src, size_t size) { size_t i;
@@ -3069,7 +3070,7 @@ static void mcsafe_test_init(char *dst, char *src, size_t size) src[i] = (char) i; }
-static bool mcsafe_test_validate(unsigned char *dst, unsigned char *src, +static bool copy_safe_test_validate(unsigned char *dst, unsigned char *src, size_t size, unsigned long rem) { size_t i; @@ -3090,12 +3091,12 @@ static bool mcsafe_test_validate(unsigned char *dst, unsigned char *src, return true; }
-void mcsafe_test(void) +void copy_safe_test(void) { char *inject_desc[] = { "none", "source", "destination" }; enum INJECT inj;
- if (IS_ENABLED(CONFIG_MCSAFE_TEST)) { + if (IS_ENABLED(CONFIG_COPY_SAFE_TEST)) { pr_info("%s: run...\n", __func__); } else { pr_info("%s: disabled, skip.\n", __func__); @@ -3113,31 +3114,31 @@ void mcsafe_test(void)
switch (inj) { case INJECT_NONE: - mcsafe_inject_src(NULL); - mcsafe_inject_dst(NULL); - dst = &mcsafe_buf[2048]; - src = &mcsafe_buf[1024 - i]; + copy_safe_inject_src(NULL); + copy_safe_inject_dst(NULL); + dst = ©_safe_buf[2048]; + src = ©_safe_buf[1024 - i]; expect = 0; break; case INJECT_SRC: - mcsafe_inject_src(&mcsafe_buf[1024]); - mcsafe_inject_dst(NULL); - dst = &mcsafe_buf[2048]; - src = &mcsafe_buf[1024 - i]; + copy_safe_inject_src(©_safe_buf[1024]); + copy_safe_inject_dst(NULL); + dst = ©_safe_buf[2048]; + src = ©_safe_buf[1024 - i]; expect = 512 - i; break; case INJECT_DST: - mcsafe_inject_src(NULL); - mcsafe_inject_dst(&mcsafe_buf[2048]); - dst = &mcsafe_buf[2048 - i]; - src = &mcsafe_buf[1024]; + copy_safe_inject_src(NULL); + copy_safe_inject_dst(©_safe_buf[2048]); + dst = ©_safe_buf[2048 - i]; + src = ©_safe_buf[1024]; expect = 512 - i; break; }
- mcsafe_test_init(dst, src, 512); - rem = __memcpy_mcsafe(dst, src, 512); - valid = mcsafe_test_validate(dst, src, 512, expect); + copy_safe_test_init(dst, src, 512); + rem = copy_safe_slow(dst, src, 512); + valid = copy_safe_test_validate(dst, src, 512, expect); if (rem == expect && valid) continue; pr_info("%s: copy(%#lx, %#lx, %d) off: %d rem: %ld %s expect: %ld\n", @@ -3149,8 +3150,8 @@ void mcsafe_test(void) } }
- mcsafe_inject_src(NULL); - mcsafe_inject_dst(NULL); + copy_safe_inject_src(NULL); + copy_safe_inject_dst(NULL); }
static __init int nfit_test_init(void) @@ -3161,7 +3162,7 @@ static __init int nfit_test_init(void) libnvdimm_test(); acpi_nfit_test(); device_dax_test(); - mcsafe_test(); + copy_safe_test(); dax_pmem_test(); dax_pmem_core_test(); #ifdef CONFIG_DEV_DAX_PMEM_COMPAT diff --git a/tools/testing/selftests/powerpc/copyloops/.gitignore b/tools/testing/selftests/powerpc/copyloops/.gitignore index ddaf140b8255..5187f69e3789 100644 --- a/tools/testing/selftests/powerpc/copyloops/.gitignore +++ b/tools/testing/selftests/powerpc/copyloops/.gitignore @@ -12,4 +12,4 @@ memcpy_p7_t1 copyuser_64_exc_t0 copyuser_64_exc_t1 copyuser_64_exc_t2 -memcpy_mcsafe_64 +copy_safe diff --git a/tools/testing/selftests/powerpc/copyloops/Makefile b/tools/testing/selftests/powerpc/copyloops/Makefile index 0917983a1c78..2bec0eccdf7d 100644 --- a/tools/testing/selftests/powerpc/copyloops/Makefile +++ b/tools/testing/selftests/powerpc/copyloops/Makefile @@ -12,7 +12,7 @@ ASFLAGS = $(CFLAGS) -Wa,-mpower4 TEST_GEN_PROGS := copyuser_64_t0 copyuser_64_t1 copyuser_64_t2 \ copyuser_p7_t0 copyuser_p7_t1 \ memcpy_64_t0 memcpy_64_t1 memcpy_64_t2 \ - memcpy_p7_t0 memcpy_p7_t1 memcpy_mcsafe_64 \ + memcpy_p7_t0 memcpy_p7_t1 copy_safe \ copyuser_64_exc_t0 copyuser_64_exc_t1 copyuser_64_exc_t2
EXTRA_SOURCES := validate.c ../harness.c stubs.S @@ -45,9 +45,9 @@ $(OUTPUT)/memcpy_p7_t%: memcpy_power7.S $(EXTRA_SOURCES) -D SELFTEST_CASE=$(subst memcpy_p7_t,,$(notdir $@)) \ -o $@ $^
-$(OUTPUT)/memcpy_mcsafe_64: memcpy_mcsafe_64.S $(EXTRA_SOURCES) +$(OUTPUT)/copy_safe: copy_safe.S $(EXTRA_SOURCES) $(CC) $(CPPFLAGS) $(CFLAGS) \ - -D COPY_LOOP=test_memcpy_mcsafe \ + -D COPY_LOOP=test_copy_safe \ -o $@ $^
$(OUTPUT)/copyuser_64_exc_t%: copyuser_64.S exc_validate.c ../harness.c \ diff --git a/tools/testing/selftests/powerpc/copyloops/memcpy_mcsafe_64.S b/tools/testing/selftests/powerpc/copyloops/copy_safe.S similarity index 100% rename from tools/testing/selftests/powerpc/copyloops/memcpy_mcsafe_64.S rename to tools/testing/selftests/powerpc/copyloops/copy_safe.S
Hi
[This is an automated email]
This commit has been processed because it contains a -stable tag. The stable tag indicates that it's relevant for the following trees: all
The bot has tested the following trees: v5.6.7, v5.4.35, v4.19.118, v4.14.177, v4.9.220, v4.4.220.
v5.6.7: Failed to apply! Possible dependencies: d0ef4c360f7e ("iov_iter: Use generic instrumented.h")
v5.4.35: Failed to apply! Possible dependencies: 30a2441cae7b ("x86/asm: Make more symbols local") 6dcc5627f6ae ("x86/asm: Change all ENTRY+ENDPROC to SYM_FUNC_*") 74d8b90a8890 ("x86/asm/crypto: Annotate local functions") e9b9d020c487 ("x86/asm: Annotate aliases") ef1e03152cb0 ("x86/asm: Make some functions local")
v4.19.118: Failed to apply! Possible dependencies: 02c02bf12c5d ("xarray: Change definition of sibling entries") 175967318c30 ("mm: introduce ARCH_HAS_PTE_DEVMAP") 3159f943aafd ("xarray: Replace exceptional entries") 3a08cd52c37c ("radix tree: Remove multiorder support") 3d0186bb068e ("Update email address") 461cef2a676e ("powerpc/32: activate ARCH_HAS_PMEM_API and ARCH_HAS_UACCESS_FLUSHCACHE") 66ee620f06f9 ("idr: Permit any valid kernel pointer to be stored") 74d609585d8b ("page cache: Add and replace pages using the XArray")
v4.14.177: Failed to apply! Possible dependencies: 07037db5d479 ("RISC-V: Paging and MMU") 10dac04c79b1 ("mips: fix an off-by-one in dma_capable") 175967318c30 ("mm: introduce ARCH_HAS_PTE_DEVMAP") 3010a5ea665a ("mm: introduce ARCH_HAS_PTE_SPECIAL") 31cb1bc0dc94 ("sched/core: Rework and clarify prepare_lock_switch()") 32ce3862af3c ("powerpc/lib: Implement PMEM API") 3ccfebedd8cf ("powerpc, membarrier: Skip memory barrier in switch_mm()") 461cef2a676e ("powerpc/32: activate ARCH_HAS_PMEM_API and ARCH_HAS_UACCESS_FLUSHCACHE") 75387b92635e ("arm64: handle 52-bit physical addresses in page table entries") 76d2a0493a17 ("RISC-V: Init and Halt Code") 7db91e57a0ac ("RISC-V: Task implementation") 8c4cdce8b1ab ("mtd: nand: qcom: add command elements in BAM transaction") c7c527dd6e76 ("Revert "Documentation/features/vm: Remove arch support status file for 'pte_special'"") cc6c98485f8e ("RISC-V: Move to the new GENERIC_IRQ_MULTI_HANDLER handler") d1b069f5febc ("EXPERT Kconfig menu: fix broken EXPERT menu") e6d588a8e3da ("arm64: head.S: handle 52-bit PAs in PTEs in early page table setup") ea8c64ace866 ("dma-mapping: move swiotlb arch helpers to a new header") ee333554fed5 ("ARM: 8749/1: Kconfig: Add ARCH_HAS_FORTIFY_SOURCE") f1e3a12b6543 ("membarrier/arm64: Provide core serializing command") fbe934d69eb7 ("RISC-V: Build Infrastructure")
v4.9.220: Failed to apply! Possible dependencies: 15accb3cbbcd ("MAINTAINERS: extend mvebu SoC entry with pinctrl drivers") 1cb0b57fec06 ("MAINTAINERS: add irqchip related drivers to Marvell EBU maintainers") 20bb5505e96f ("MAINTAINERS: cpufreq: add bmips-cpufreq.c") 21dd0ece34c2 ("ARM: dts: at91: add devicetree for the Axentia TSE-850") 2e7d1098c00c ("MAINTAINERS: add entry for dma mapping helpers") 384fe7a4d732 ("drivers: net: xgene-v2: Add DMA descriptor") 3b3f9a75d931 ("drivers: net: xgene-v2: Add base driver") 3f0d80b6d228 ("MAINTAINERS: Add bnxt_en maintainer info.") 413dfbbfca54 ("MAINTAINERS: add entry for Aspeed I2C driver") 420a3879d694 ("MAINTAINERS: change email address from atmel to microchip") 461cef2a676e ("powerpc/32: activate ARCH_HAS_PMEM_API and ARCH_HAS_UACCESS_FLUSHCACHE") 51c5d8447bd7 ("MMC: meson: initial support for GX platforms") 52c468fb37b6 ("MAINTAINERS: oxnas: Add new files definitions") 70dbd9b258d5 ("MAINTAINERS: Add entry for APM X-Gene SoC Ethernet (v2) driver") 7683e9e52925 ("Properly alphabetize MAINTAINERS file") 81ccd0cab29b ("drivers: net: xgene-v2: Add mac configuration") aa43112445f0 ("ASoC: atmel: tse850: add ASoC driver for the Axentia TSE-850") b105bcdaaa0e ("drivers: net: xgene-v2: Add transmit and receive") c821d30148ca ("ASoC: tse850: document axentia,tse850-pcm5142 bindings") e7c1572f6565 ("MAINTAINERS: sort F entries for Marvell EBU maintainers") ea8c64ace866 ("dma-mapping: move swiotlb arch helpers to a new header") fd33f3eca6bf ("MAINTAINERS: Add maintainers for the meson clock driver")
v4.4.220: Failed to apply! Possible dependencies: 1f664ab7d9d4 ("MAINTAINERS: update entry for Marvell ARM platform maintainers") 27eb6622ab67 ("config: add android config fragments") 384fe7a4d732 ("drivers: net: xgene-v2: Add DMA descriptor") 3b3f9a75d931 ("drivers: net: xgene-v2: Add base driver") 413dfbbfca54 ("MAINTAINERS: add entry for Aspeed I2C driver") 461cef2a676e ("powerpc/32: activate ARCH_HAS_PMEM_API and ARCH_HAS_UACCESS_FLUSHCACHE") 51c5d8447bd7 ("MMC: meson: initial support for GX platforms") 5255034d1701 ("ARM: mach-artpec: add entry to MAINTAINERS") 54176cc68038 ("maintainers: update rmk's email address(es)") 5b7551db8688 ("ARM: dts: keystone: Add minimum support for K2G evm") 5edafc29829b ("ARM: dts: k2*: Rename the k2* files to keystone-k2* files") 6683d91cde25 ("MAINTAINERS: ARM/Amlogic: add co-maintainer, misc. updates") 70dbd9b258d5 ("MAINTAINERS: Add entry for APM X-Gene SoC Ethernet (v2) driver") 7683e9e52925 ("Properly alphabetize MAINTAINERS file") 79318452cb36 ("MAINTAINERS: Extend info, add wiki and ml for meson arch") 81ccd0cab29b ("drivers: net: xgene-v2: Add mac configuration") 8bb0bce92ec9 ("MAINTAINERS: add maintainer and reviewers for the etnaviv DRM driver") 8c2ed9bcfbeb ("arm: Add Aspeed machine") 8cb555b603f3 ("MAINTAINERS: add Chanho Min as ARM/LG1K maintainer") b105bcdaaa0e ("drivers: net: xgene-v2: Add transmit and receive") dcc3068a757e ("MAINTAINERS: Extend dts entry for ARM64 mvebu files") e68d7c143a62 ("MAINTAINERS: Add missing platform maintainers for dts files") e7c1572f6565 ("MAINTAINERS: sort F entries for Marvell EBU maintainers") ea8c64ace866 ("dma-mapping: move swiotlb arch helpers to a new header") fd33f3eca6bf ("MAINTAINERS: Add maintainers for the meson clock driver") fd3a628e3f2a ("MAINTAINERS: Add entry for APM X-Gene SoC PMU driver")
NOTE: The patch will not be queued to stable trees until it is upstream.
How should we proceed with this patch?
The existing copy_safe() implementation satisfies two primary concerns. It provides a copy routine that avoids known unrecoverable scenarios (poison consumption via fast-string instructions / accesses across cacheline boundaries), and it provides a fallback to fast plain memcpy if the platform does not indicate recovery capability. The fallback is broken for two reasons. It fails the expected copy_safe() semantics that allow it to be used in scenarios where reads / writes trigger memory protection faults and it fails to enable machine check recovery on systems that do not need the careful semantics of copy_safe_slow().
With copy_safe_fast() in place copy_safe() never falls back to plain memcpy(), and it is fast regardless in every other scenario outside of the copy_safe_slow() blacklist scenario.
Cc: x86@kernel.org Cc: stable@vger.kernel.org Cc: Ingo Molnar mingo@redhat.com Cc: Borislav Petkov bp@alien8.de Cc: "H. Peter Anvin" hpa@zytor.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Peter Zijlstra peterz@infradead.org Cc: Tony Luck tony.luck@intel.com Cc: Linus Torvalds torvalds@linux-foundation.org Reported-by: Erwin Tsaur erwin.tsaur@intel.com Tested-by: Erwin Tsaur erwin.tsaur@intel.com Fixes: 92b0729c34ca ("x86/mm, x86/mce: Add memcpy_mcsafe()") Signed-off-by: Dan Williams dan.j.williams@intel.com --- arch/x86/include/asm/copy_safe.h | 2 ++ arch/x86/lib/copy_safe.c | 5 ++--- arch/x86/lib/copy_safe_64.S | 40 ++++++++++++++++++++++++++++++++++++++ tools/objtool/check.c | 1 + 4 files changed, 45 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/copy_safe.h b/arch/x86/include/asm/copy_safe.h index 1c130364dd61..306248ca819f 100644 --- a/arch/x86/include/asm/copy_safe.h +++ b/arch/x86/include/asm/copy_safe.h @@ -12,5 +12,7 @@ static inline void enable_copy_safe_slow(void)
__must_check unsigned long copy_safe_slow(void *dst, const void *src, size_t cnt); +__must_check unsigned long +copy_safe_fast(void *dst, const void *src, size_t cnt);
#endif /* _COPY_SAFE_H_ */ diff --git a/arch/x86/lib/copy_safe.c b/arch/x86/lib/copy_safe.c index 9dd3a831ff94..b8472e6a44d3 100644 --- a/arch/x86/lib/copy_safe.c +++ b/arch/x86/lib/copy_safe.c @@ -25,7 +25,7 @@ void enable_copy_safe_slow(void) * * We only call into the slow version on systems that have trouble * actually do machine check recovery. Everyone else can just - * use memcpy(). + * use copy_safe_fast(). * * Return 0 for success, or number of bytes not copied if there was an * exception. @@ -34,8 +34,7 @@ __must_check unsigned long copy_safe(void *dst, const void *src, size_t cnt) { if (static_branch_unlikely(©_safe_slow_key)) return copy_safe_slow(dst, src, cnt); - memcpy(dst, src, cnt); - return 0; + return copy_safe_fast(dst, src, cnt); } EXPORT_SYMBOL_GPL(copy_safe);
diff --git a/arch/x86/lib/copy_safe_64.S b/arch/x86/lib/copy_safe_64.S index 46dfdd832bde..551c781ae9fd 100644 --- a/arch/x86/lib/copy_safe_64.S +++ b/arch/x86/lib/copy_safe_64.S @@ -2,7 +2,9 @@ /* Copyright(c) 2016-2020 Intel Corporation. All rights reserved. */
#include <linux/linkage.h> +#include <asm/alternative-asm.h> #include <asm/copy_safe_test.h> +#include <asm/cpufeatures.h> #include <asm/export.h> #include <asm/asm.h>
@@ -120,4 +122,42 @@ EXPORT_SYMBOL_GPL(copy_safe_slow) _ASM_EXTABLE(.L_write_leading_bytes, .E_leading_bytes) _ASM_EXTABLE(.L_write_words, .E_write_words) _ASM_EXTABLE(.L_write_trailing_bytes, .E_trailing_bytes) + +/* + * copy_safe_fast - memory copy with exception handling + * + * Fast string copy + fault / exception handling. If the CPU does + * support machine check exception recovery, but does not support + * recovering from fast-string exceptions then this CPU needs to be + * added to the copy_safe_slow_key set of quirks. Otherwise, absent any + * machine check recovery support this version should be no slower than + * standard memcpy. + */ +SYM_FUNC_START(copy_safe_fast) + ALTERNATIVE "jmp copy_safe_slow", "", X86_FEATURE_ERMS + movq %rdi, %rax + movq %rdx, %rcx +.L_copy: + rep movsb + /* Copy successful. Return zero */ + xorl %eax, %eax + ret +SYM_FUNC_END(copy_safe_fast) +EXPORT_SYMBOL_GPL(copy_safe_fast) + + .section .fixup, "ax" +.E_copy: + /* + * On fault %rcx is updated such that the copy instruction could + * optionally be restarted at the fault position, i.e. it + * contains 'bytes remaining'. A non-zero return indicates error + * to copy_safe() users, or indicate short transfers to + * user-copy routines. + */ + movq %rcx, %rax + ret + + .previous + + _ASM_EXTABLE_FAULT(.L_copy, .E_copy) #endif diff --git a/tools/objtool/check.c b/tools/objtool/check.c index be20eff8f358..e8b6e2438d31 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -533,6 +533,7 @@ static const char *uaccess_safe_builtin[] = { "__ubsan_handle_shift_out_of_bounds", /* misc */ "csum_partial_copy_generic", + "copy_safe_fast", "copy_safe_slow", "copy_safe_slow_handle_tail", "ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
Hi
[This is an automated email]
This commit has been processed because it contains a "Fixes:" tag fixing commit: 92b0729c34ca ("x86/mm, x86/mce: Add memcpy_mcsafe()").
The bot has tested the following trees: v5.6.7, v5.4.35, v4.19.118, v4.14.177, v4.9.220.
v5.6.7: Failed to apply! Possible dependencies: d0ef4c360f7e ("iov_iter: Use generic instrumented.h") d28c6860fe9f ("copy_safe: Rename memcpy_mcsafe() to copy_safe()")
v5.4.35: Failed to apply! Possible dependencies: 30a2441cae7b ("x86/asm: Make more symbols local") 6dcc5627f6ae ("x86/asm: Change all ENTRY+ENDPROC to SYM_FUNC_*") 74d8b90a8890 ("x86/asm/crypto: Annotate local functions") d28c6860fe9f ("copy_safe: Rename memcpy_mcsafe() to copy_safe()") e9b9d020c487 ("x86/asm: Annotate aliases") ef1e03152cb0 ("x86/asm: Make some functions local")
v4.19.118: Failed to apply! Possible dependencies: 02c02bf12c5d ("xarray: Change definition of sibling entries") 175967318c30 ("mm: introduce ARCH_HAS_PTE_DEVMAP") 3159f943aafd ("xarray: Replace exceptional entries") 3a08cd52c37c ("radix tree: Remove multiorder support") 3d0186bb068e ("Update email address") 461cef2a676e ("powerpc/32: activate ARCH_HAS_PMEM_API and ARCH_HAS_UACCESS_FLUSHCACHE") 66ee620f06f9 ("idr: Permit any valid kernel pointer to be stored") 74d609585d8b ("page cache: Add and replace pages using the XArray") d28c6860fe9f ("copy_safe: Rename memcpy_mcsafe() to copy_safe()")
v4.14.177: Failed to apply! Possible dependencies: 07037db5d479 ("RISC-V: Paging and MMU") 10dac04c79b1 ("mips: fix an off-by-one in dma_capable") 175967318c30 ("mm: introduce ARCH_HAS_PTE_DEVMAP") 3010a5ea665a ("mm: introduce ARCH_HAS_PTE_SPECIAL") 31cb1bc0dc94 ("sched/core: Rework and clarify prepare_lock_switch()") 32ce3862af3c ("powerpc/lib: Implement PMEM API") 3ccfebedd8cf ("powerpc, membarrier: Skip memory barrier in switch_mm()") 461cef2a676e ("powerpc/32: activate ARCH_HAS_PMEM_API and ARCH_HAS_UACCESS_FLUSHCACHE") 75387b92635e ("arm64: handle 52-bit physical addresses in page table entries") 76d2a0493a17 ("RISC-V: Init and Halt Code") 7db91e57a0ac ("RISC-V: Task implementation") 8c4cdce8b1ab ("mtd: nand: qcom: add command elements in BAM transaction") c7c527dd6e76 ("Revert "Documentation/features/vm: Remove arch support status file for 'pte_special'"") cc6c98485f8e ("RISC-V: Move to the new GENERIC_IRQ_MULTI_HANDLER handler") d1b069f5febc ("EXPERT Kconfig menu: fix broken EXPERT menu") d28c6860fe9f ("copy_safe: Rename memcpy_mcsafe() to copy_safe()") e6d588a8e3da ("arm64: head.S: handle 52-bit PAs in PTEs in early page table setup") ea8c64ace866 ("dma-mapping: move swiotlb arch helpers to a new header") ee333554fed5 ("ARM: 8749/1: Kconfig: Add ARCH_HAS_FORTIFY_SOURCE") f1e3a12b6543 ("membarrier/arm64: Provide core serializing command") fbe934d69eb7 ("RISC-V: Build Infrastructure")
v4.9.220: Failed to apply! Possible dependencies: 15accb3cbbcd ("MAINTAINERS: extend mvebu SoC entry with pinctrl drivers") 1cb0b57fec06 ("MAINTAINERS: add irqchip related drivers to Marvell EBU maintainers") 21dd0ece34c2 ("ARM: dts: at91: add devicetree for the Axentia TSE-850") 2e7d1098c00c ("MAINTAINERS: add entry for dma mapping helpers") 384fe7a4d732 ("drivers: net: xgene-v2: Add DMA descriptor") 3b3f9a75d931 ("drivers: net: xgene-v2: Add base driver") 3f0d80b6d228 ("MAINTAINERS: Add bnxt_en maintainer info.") 413dfbbfca54 ("MAINTAINERS: add entry for Aspeed I2C driver") 420a3879d694 ("MAINTAINERS: change email address from atmel to microchip") 461cef2a676e ("powerpc/32: activate ARCH_HAS_PMEM_API and ARCH_HAS_UACCESS_FLUSHCACHE") 51c5d8447bd7 ("MMC: meson: initial support for GX platforms") 52c468fb37b6 ("MAINTAINERS: oxnas: Add new files definitions") 70dbd9b258d5 ("MAINTAINERS: Add entry for APM X-Gene SoC Ethernet (v2) driver") 7683e9e52925 ("Properly alphabetize MAINTAINERS file") 81ccd0cab29b ("drivers: net: xgene-v2: Add mac configuration") aa43112445f0 ("ASoC: atmel: tse850: add ASoC driver for the Axentia TSE-850") b105bcdaaa0e ("drivers: net: xgene-v2: Add transmit and receive") c821d30148ca ("ASoC: tse850: document axentia,tse850-pcm5142 bindings") d28c6860fe9f ("copy_safe: Rename memcpy_mcsafe() to copy_safe()") e7c1572f6565 ("MAINTAINERS: sort F entries for Marvell EBU maintainers") ea8c64ace866 ("dma-mapping: move swiotlb arch helpers to a new header") fd33f3eca6bf ("MAINTAINERS: Add maintainers for the meson clock driver")
NOTE: The patch will not be queued to stable trees until it is upstream.
How should we proceed with this patch?
On Thu, Apr 30, 2020 at 1:41 AM Dan Williams dan.j.williams@intel.com wrote:
With the above realizations the name "mcsafe" is no longer accurate and copy_safe() is proposed as its replacement. x86 grows a copy_safe_fast() implementation as a default implementation that is independent of detecting the presence of x86-MCA.
How is this then different from "probe_kernel_read()" and "probe_kernel_write()"? Other than the obvious "it does it for both reads and writes"?
IOW, wouldn't it be sensible to try to match the naming and try to find some unified model for all these things?
"probe_kernel_copy()"?
Linus
On Thu, Apr 30, 2020 at 7:03 AM Linus Torvalds torvalds@linux-foundation.org wrote:
On Thu, Apr 30, 2020 at 1:41 AM Dan Williams dan.j.williams@intel.com wrote:
With the above realizations the name "mcsafe" is no longer accurate and copy_safe() is proposed as its replacement. x86 grows a copy_safe_fast() implementation as a default implementation that is independent of detecting the presence of x86-MCA.
How is this then different from "probe_kernel_read()" and "probe_kernel_write()"? Other than the obvious "it does it for both reads and writes"?
IOW, wouldn't it be sensible to try to match the naming and try to find some unified model for all these things?
"probe_kernel_copy()"?
I don't like this whole concept.
If I'm going to copy from memory that might be bad but is at least a valid pointer, I want a function to do this. If I'm going to copy from memory that might be entirely bogus, that's a different operation. In other words, if I'm writing e.g. filesystem that is touching get_user_pages()'d persistent memory, I don't want to panic if the memory fails, but I do want at least a very loud warning if I follow a wild pointer.
So I think that probe_kernel_copy() is not a valid replacement for memcpy_mcsafe().
--Andy
On Thu, Apr 30, 2020 at 9:52 AM Andy Lutomirski luto@kernel.org wrote:
If I'm going to copy from memory that might be bad but is at least a valid pointer, I want a function to do this. If I'm going to copy from memory that might be entirely bogus, that's a different operation. In other words, if I'm writing e.g. filesystem that is touching get_user_pages()'d persistent memory, I don't want to panic if the memory fails, but I do want at least a very loud warning if I follow a wild pointer.
So I think that probe_kernel_copy() is not a valid replacement for memcpy_mcsafe().
Fair enough.
That said, the part I do like about probe_kernel_read/write() is that it does indicate which part we think is possibly the one that needs more care.
Sure, it _might_ be both sides, but honestly, that's likely the much less common case. Kind of like "copy_{to,from}_user()" vs "copy_in_user()".
Yes, the "copy_in_user()" case exists, but it's the odd and unusual case.
Looking at the existing cases of "memcpy_mcsafe()", they do seem to generally have a very clearly defined direction, not "both sides can break".
I also find myself suspecting that one case people _do_ want to possibly do is to copy from nvdimm memory into user space. So then that needs yet another function.
And we have that copy_to_user_mcsafe() for that, and used in the disgustingly named "copyout_mcsafe()". Ugly incomprehensible BSD'ism.
But oddly we don't have the "from_user" case.
So this thing seems messy, the naming is odd and inconsistent, and I'd really like the whole "access with exception handling" to have some clear rules and clear names.
The whole "there are fifty different special cases" really drives me wild. It's why I think the hardware was so broken.
And now the special "writes can fault" rule still confuses me. _copy_to_iter_mcsafe() was mentioned, which makes me think that it's literally about that "copy from nvram to user space" issue.
But that can't just trap on the destination, that fundamentally needs special user space accesses anyway. Even on x86 you have the whole STAC/CLAC issue, on other architectures the stores may not be normal stores at all.
So a "copy_safe()" model doesn't actually work for that at all.
So I'm a bit (maybe a _lot_) confused about what the semantics should actually be. And I want the naming to reflect whatever those semantics are. And I don't think "copy_safe()" reflects any semantics at all.
Linus
On Thu, Apr 30, 2020 at 10:17 AM Linus Torvalds torvalds@linux-foundation.org wrote:
On Thu, Apr 30, 2020 at 9:52 AM Andy Lutomirski luto@kernel.org wrote:
If I'm going to copy from memory that might be bad but is at least a valid pointer, I want a function to do this. If I'm going to copy from memory that might be entirely bogus, that's a different operation. In other words, if I'm writing e.g. filesystem that is touching get_user_pages()'d persistent memory, I don't want to panic if the memory fails, but I do want at least a very loud warning if I follow a wild pointer.
So I think that probe_kernel_copy() is not a valid replacement for memcpy_mcsafe().
Fair enough.
That said, the part I do like about probe_kernel_read/write() is that it does indicate which part we think is possibly the one that needs more care.
Sure, it _might_ be both sides, but honestly, that's likely the much less common case. Kind of like "copy_{to,from}_user()" vs "copy_in_user()".
Yes, the "copy_in_user()" case exists, but it's the odd and unusual case.
I suppose there could be a consistent naming like this:
copy_from_user() copy_to_user()
copy_from_unchecked_kernel_address() [what probe_kernel_read() is] copy_to_unchecked_kernel_address() [what probe_kernel_write() is]
copy_from_fallible() [from a kernel address that can fail to a kernel address that can't fail] copy_to_fallible() [the opposite, but hopefully identical to memcpy() on x86]
copy_from_fallible_to_user() copy_from_user_to_fallible()
These names are fairly verbose and could probably be improved.
On Thu, Apr 30, 2020 at 11:42:20AM -0700, Andy Lutomirski wrote:
I suppose there could be a consistent naming like this:
copy_from_user() copy_to_user()
copy_from_unchecked_kernel_address() [what probe_kernel_read() is] copy_to_unchecked_kernel_address() [what probe_kernel_write() is]
copy_from_fallible() [from a kernel address that can fail to a kernel address that can't fail] copy_to_fallible() [the opposite, but hopefully identical to memcpy() on x86]
copy_from_fallible_to_user() copy_from_user_to_fallible()
These names are fairly verbose and could probably be improved.
How about
try_copy_catch(void *dst, void *src, size_t count, int *fault)
returns number of bytes not-copied (like copy_to_user etc).
if return is not zero, "fault" tells you what type of fault cause the early stop (#PF, #MC).
-Tony
On Thu, Apr 30, 2020 at 12:23 PM Luck, Tony tony.luck@intel.com wrote:
How about
try_copy_catch(void *dst, void *src, size_t count, int *fault)
returns number of bytes not-copied (like copy_to_user etc).
if return is not zero, "fault" tells you what type of fault cause the early stop (#PF, #MC).
That just makes things worse.
The problem isn't "what fault did I get?".
The problem is "what is the point of this function?".
In other words, I want the code to explain _why_ it uses a particular function.
So the question the function should answer is not "Why did it take a fault?", but "Why isn't this just a 'memcpy()'"?
When somebody writes
x = copy_to_user(a,b,c);
the "why is it not a memcpy" question never comes up, because the code is obvious. It's not a memory copy, because it's copying to user space, and user space is different!
In contrast, if you write
x = copy_safe(a,b,c);
then what is going on? There is no rhyme or reason to the call. Is the source unsafe? Wny? Is the destination faulting? Why? Both? How?
So "copy_to_user()" _answers_ a question. But "copy_safe()" just results in more questions. See the difference?
And notice that the "why did it fault" question is NOT about your "what kind of fault did it take" question. That question is generally pretty much uninteresting.
The question I want answered is "why is this function being called AT ALL".
Again, "copy_to_user()" can fail, and we know to check failure cases. But more importantly, the _reason_ it can fail is obvious from the name and from the use. There's no confusion about "why".
"copy_safe()"? or "try_copy_catch()"? No such thing. It doesn't answer that fundamental "why" question.
And yes, this also has practical consequences. If you know that the failure is due to the source being some special memory (and if we care about the MC case with unaligned accesses), then the function in question should probably strive to make those _source_ accesses be the aligned ones. But if it's the destination that is special memory, then it's the writes to the destination that should be aligned. If you need both, you may need to be either mutually aligned, or do byte accesses, or do special shifting copies. So it matters for any initial alignment code (if the thing has alignment issues to begin with).
I haven't even gotten an answer to the "why would the write fail". When I asked, Dan said
"writes can mmu-fault now that memcpy_mcsafe() is also used by _copy_to_iter_mcsafe()"
but as mentioned, the only reason I can find for _that_ is that the destination is user space.
In which case a "copy_safe()" absolutely could never work.
If I can't figure out the "why is this function used" or even figure out why it needs the semantics it claims it needs, then there's a problem.
Personally, I suspect that the *easiest* way to support the broken nvdimm semantics is to not have a "copy" function at all as the basic accessor model.
Why? Exactly because "copy" is not a fundamental well-defined action. You get nasty combinatorial explosions of different things, where you have three different kinds of sources (user, RAM, nvdimm) and three different kinds of destinations.
And that's ignoring the whole "maybe you don't want to do a plain copy, maybe you want to calculate a RAID checksum, or do a 'strcpy()' or whatever". If those are ever issues, you get another explosion of combinations.
The only *fundamental* access would likely be a single read/write operation, not a copy operation. Think "get_user()" instead of "copy_from_user()". Even there you get combinatorial explosions with access sizes, but you can often generate those automatically or with simple patterns, and then you can build up the copy functions from that if you really need to.
Linus
On Thu, Apr 30, 2020 at 12:50:40PM -0700, Linus Torvalds wrote:
I see your point about the namimg being important. I think Dan's case is indeed "copy from pmem to user" where only options for faulting are #MC on the source addresses, and #PF on the destination.
The only *fundamental* access would likely be a single read/write operation, not a copy operation. Think "get_user()" instead of "copy_from_user()". Even there you get combinatorial explosions with access sizes, but you can often generate those automatically or with simple patterns, and then you can build up the copy functions from that if you really need to.
That's maybe very clean. But it looks like it would be hard to build a high performance interface on top of that primitive. Remember that for Dan's copy 99.999999999367673%[1] of copies will not hit a machine check on the read from pmem.
Dan wants (whatever the function name) to get to a "REP MOVS" with an exception table entry to handle the cases where there is a fault.
-Tony
[1] Likely several more '9's in there
On Thu, Apr 30, 2020 at 12:56 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Thu, Apr 30, 2020 at 12:23 PM Luck, Tony tony.luck@intel.com wrote:
How about
try_copy_catch(void *dst, void *src, size_t count, int *fault)
returns number of bytes not-copied (like copy_to_user etc).
if return is not zero, "fault" tells you what type of fault cause the early stop (#PF, #MC).
That just makes things worse.
The problem isn't "what fault did I get?".
The problem is "what is the point of this function?".
In other words, I want the code to explain _why_ it uses a particular function.
So the question the function should answer is not "Why did it take a fault?", but "Why isn't this just a 'memcpy()'"?
When somebody writes
x = copy_to_user(a,b,c);
the "why is it not a memcpy" question never comes up, because the code is obvious. It's not a memory copy, because it's copying to user space, and user space is different!
In contrast, if you write
x = copy_safe(a,b,c);
then what is going on? There is no rhyme or reason to the call. Is the source unsafe? Wny? Is the destination faulting? Why? Both? How?
So "copy_to_user()" _answers_ a question. But "copy_safe()" just results in more questions. See the difference?
And notice that the "why did it fault" question is NOT about your "what kind of fault did it take" question. That question is generally pretty much uninteresting.
The question I want answered is "why is this function being called AT ALL".
Again, "copy_to_user()" can fail, and we know to check failure cases. But more importantly, the _reason_ it can fail is obvious from the name and from the use. There's no confusion about "why".
"copy_safe()"? or "try_copy_catch()"? No such thing. It doesn't answer that fundamental "why" question.
And yes, this also has practical consequences. If you know that the failure is due to the source being some special memory (and if we care about the MC case with unaligned accesses), then the function in question should probably strive to make those _source_ accesses be the aligned ones. But if it's the destination that is special memory, then it's the writes to the destination that should be aligned. If you need both, you may need to be either mutually aligned, or do byte accesses, or do special shifting copies. So it matters for any initial alignment code (if the thing has alignment issues to begin with).
I haven't even gotten an answer to the "why would the write fail". When I asked, Dan said
"writes can mmu-fault now that memcpy_mcsafe() is also used by _copy_to_iter_mcsafe()"
but as mentioned, the only reason I can find for _that_ is that the destination is user space.
In which case a "copy_safe()" absolutely could never work.
If I can't figure out the "why is this function used" or even figure out why it needs the semantics it claims it needs, then there's a problem.
Personally, I suspect that the *easiest* way to support the broken nvdimm semantics is to not have a "copy" function at all as the basic accessor model.
You had me until here. Up to this point I was grokking that Andy's "_fallible" suggestion does help explain better than "_safe", because the copy is doing extra safety checks. copy_to_user() and copy_to_user_fallible() mean *something* where copy_to_user_safe() does not.
However you lose me on this "broken nvdimm semantics" contention. There is nothing nvdimm-hardware specific about the copy_safe() implementation, zero, nada, nothing new to the error model that DRAM did not also inflict on the Linux implementation.
This is about Linux treating memory as a disk and performing bulk transfers with the CPU instead of a DMA engine. Whereas existing memory error handling has a high chance for it to trigger on userspace accesses, large copies in kernel mode now have a higher chance of tripping over the same errors in kernel space. Since there is an error model overlaid on top of the block-I/O, software is well prepared to handle the short transfer case.
Why? Exactly because "copy" is not a fundamental well-defined action. You get nasty combinatorial explosions of different things, where you have three different kinds of sources (user, RAM, nvdimm) and three different kinds of destinations.
True, copy is not a fundamentally well defined operation. It's unfortunate that the low-level of a direct-I/O read(2) eventually turns into a call to memcpy() inside a typical ramdisk driver. So you're right, this isn't a copy routine as much as it's the backend of read(2)/write(2) with short transfer semantics, but by the time it gets to pmem driver it's been disconnected from its original intent and all it sees is "move bytes from here to there if you can".
And that's ignoring the whole "maybe you don't want to do a plain copy, maybe you want to calculate a RAID checksum, or do a 'strcpy()' or whatever". If those are ever issues, you get another explosion of combinations.
The only *fundamental* access would likely be a single read/write operation, not a copy operation. Think "get_user()" instead of "copy_from_user()". Even there you get combinatorial explosions with access sizes, but you can often generate those automatically or with simple patterns, and then you can build up the copy functions from that if you really need to.
The CPU overhead of synchronous bulk block-I/O transfers with the CPU is already painful, a get_user() loop adds to that without a benefit that I can currently perceive. The vast majority of driver actions and DAX operations are in PAGE_SIZE units.
On Thu, Apr 30, 2020 at 4:52 PM Dan Williams dan.j.williams@intel.com wrote:
You had me until here. Up to this point I was grokking that Andy's "_fallible" suggestion does help explain better than "_safe", because the copy is doing extra safety checks. copy_to_user() and copy_to_user_fallible() mean *something* where copy_to_user_safe() does not.
It's a horrible word, btw. The word doesn't actually mean what Andy means it to mean. "fallible" means "can make mistakes", not "can fault".
So "fallible" is a horrible name.
But anyway, I don't hate something like "copy_to_user_fallible()" conceptually. The naming needs to be fixed, in that "user" can always take a fault, so it's the _source_ that can fault, not the "user" part.
It was the "copy_safe()" model that I find unacceptable, that uses _one_ name for what is at the very least *four* different operations:
- copy from faulting memory to user
- copy from faulting memory to kernel
- copy from kernel to faulting memory
- copy within faulting memory
No way can you do that with one single function. A kernel address and a user address may literally have the exact same bit representation. So the user vs kernel distinction _has_ to be in the name.
The "kernel vs faulting" doesn't necessarily have to be there from an implementation standpoint, but it *should* be there, because
- it might affect implemmentation
- but even if it DOESN'T affect implementation, it should be separate just from the standpoint of being self-documenting code.
However you lose me on this "broken nvdimm semantics" contention. There is nothing nvdimm-hardware specific about the copy_safe() implementation, zero, nada, nothing new to the error model that DRAM did not also inflict on the Linux implementation.
Ok, so good. Let's kill this all, and just use memcpy(), and copy_to_user().
Just make sure that the nvdimm code doesn't use invalid kernel addresses or other broken poisoning.
Problem solved.
You can't have it both ways. Either memcpy just works, or it doesn't.
So which way is it?
Linus
On Apr 30, 2020, at 5:10 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Thu, Apr 30, 2020 at 4:52 PM Dan Williams dan.j.williams@intel.com wrote:
You had me until here. Up to this point I was grokking that Andy's "_fallible" suggestion does help explain better than "_safe", because the copy is doing extra safety checks. copy_to_user() and copy_to_user_fallible() mean *something* where copy_to_user_safe() does not.
It's a horrible word, btw. The word doesn't actually mean what Andy means it to mean. "fallible" means "can make mistakes", not "can fault".
So "fallible" is a horrible name.
What I was trying to get at was not “can fault” but “can malfunction”. Maybe “unreliable”? Better words welcome.
But anyway, I don't hate something like "copy_to_user_fallible()" conceptually. The naming needs to be fixed, in that "user" can always take a fault, so it's the _source_ that can fault, not the "user" part.
I don’t like this. “user” already implied that basically anything can be wrong with the memory — it can be unmapped entirely, it can have the wrong permissions, it can have the wrong protection key, it can have an ECC error, etc. If the operation you want is “copy from unreliable kernel memory (but with a definitely valid pointer) to user memory”, you want copy_unreliable_to_user().
Now maybe copy_to_user() should *always* work this way, but I’m not convinced. Certainly put_user() shouldn’t — the result wouldn’t even be well defined. And I’m unconvinced that it makes much sense for the majority of copy_to_user() callers that are also directly accessing the source structure.
I also tend to think that the probe_kernel stuff should just stay separate. Those are really for two totally separate types of use: either the kernel is trying its best to print an errr message without exploding worse, or it’s involved in eBPF or trading hacks in which address is arbitrary and essentially untrusted.
On Thu, Apr 30, 2020 at 5:23 PM Andy Lutomirski luto@amacapital.net wrote:
But anyway, I don't hate something like "copy_to_user_fallible()" conceptually. The naming needs to be fixed, in that "user" can always take a fault, so it's the _source_ that can fault, not the "user" part.
I don’t like this. “user” already implied that basically anything can be wrong with the memory
Maybe I didn't explain.
"user" already implies faulting. We agree.
And since we by definition cannot know what the user has mapped into user space, *every* normal copy_to_user() has to be able to handle whatever faults that throws at us.
The reason I dislike "copy_to_user_fallible()" is that the user side already has that 'fallible".
If it's the _source_ being "fallible" (it really needs a better name - I will not call it just "f") then it should be "copy_f_to_user()".
That would be ok.
So "copy_f_to_user()" makes sense. But "copy_to_user_f()" does not. That puts the "f" on the "user", which we already know can fault.
See what I want in the name? I want the name to say which side can cause problems!
If you are copying memory from some f source, it must not be "copy_safe()". That doesn't say if the _source_ is f, or the destination is f.
So "copy_to_f()" makes sense (we don't say "copy_kernel_to_user()" - the "normal" case is silent), and "copy_from_f()" makes sense. "copy_in_f()" makes sense too.
But not this "randomly copy some randomly f memory area that I don't know if it's the source or the destination".
Sometimes you may then use a common implementation for the different directions - if that works on the architecture.
For example, "copy_to_user()" and "copy_from_user()" on x86 both end up internally using a shared "copy_user_generic()" implementation. I wish that wasn't the case (when I asked for what was to become STAC/CLAC, I asked for one that could determine which direction of a "rep movs" could touch user space), but it so happens that the implementations end up being symmetric on x86.
But that's a pure implementation issue, and it very much is not going to be true in general, and it shouldn't be exposed to users as such (and we obviously don't).
Linus
On Apr 30, 2020, at 5:40 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Thu, Apr 30, 2020 at 5:23 PM Andy Lutomirski luto@amacapital.net wrote:
But anyway, I don't hate something like "copy_to_user_fallible()" conceptually. The naming needs to be fixed, in that "user" can always take a fault, so it's the _source_ that can fault, not the "user" part.
I don’t like this. “user” already implied that basically anything can be wrong with the memory
Maybe I didn't explain.
"user" already implies faulting. We agree.
And since we by definition cannot know what the user has mapped into user space, *every* normal copy_to_user() has to be able to handle whatever faults that throws at us.
The reason I dislike "copy_to_user_fallible()" is that the user side already has that 'fallible".
If it's the _source_ being "fallible" (it really needs a better name - I will not call it just "f") then it should be "copy_f_to_user()".
That would be ok.
So "copy_f_to_user()" makes sense. But "copy_to_user_f()" does not. That puts the "f" on the "user", which we already know can fault.
See what I want in the name? I want the name to say which side can cause problems!
We are in violent agreement. I’m moderately confident that I never suggested copy_from_user_f(). We appear to agree completely.
Now maybe copy_to_user() should *always* work this way, but I’m not convinced. Certainly put_user() shouldn’t — the result wouldn’t even be well defined. And I’m unconvinced that it makes much sense for the majority of copy_to_user() callers that are also directly accessing the source structure.
One case that might work is copy_to_user() that's copying from the kernel page cache to the user in response to a read(2) system call. Action would be to check if we could re-read from the file system to a different page. If not, return -EIO. Either way ditch the poison page from the page cache.
-Tony
On Fri, May 1, 2020 at 7:09 AM Luck, Tony tony.luck@intel.com wrote:
Now maybe copy_to_user() should *always* work this way, but I’m not convinced. Certainly put_user() shouldn’t — the result wouldn’t even be well defined. And I’m unconvinced that it makes much sense for the majority of copy_to_user() callers that are also directly accessing the source structure.
One case that might work is copy_to_user() that's copying from the kernel page cache to the user in response to a read(2) system call. Action would be to check if we could re-read from the file system to a different page. If not, return -EIO. Either way ditch the poison page from the page cache.
I think that, before we do too much design of the semantics of just the copy function, we need a design for the whole system. Specifically:
When the kernel finds out that a kernel page is bad (via #MC or via any other mechanism), what does the kernel do? Does it unmap it? Does it replace it with a dummy page? Does it leave it there?
When a copy function hits a bad page and the page is not yet known to be bad, what does it do? (I.e. the page was believed to be fine but the copy function gets #MC.) Does it unmap it right away? What does it return?
When a copy function hits a page that is already known to be bad because the kernel got the "oh crap, bad page" notification earlier, what does it do? Return -EIO? Take some fancier action under the assumption that it's called in a preemptible, IRQs-on context, whereas the original #MC or other hardware notification may have come at a less opportune time?
When a copy function hits a bad page and the page is not yet known to be bad, what does it do? (I.e. the page was believed to be fine but the copy function gets #MC.) Does it unmap it right away? What does it return?
I suspect that we will only ever find a handful of situations where the kernel can recover from memory that has gone bad that are worth fixing (got to be some code path that touches a meaningful fraction of memory, otherwise we get code complexity without any meaningful payoff).
I don't think we'd want different actions for the cases of "we just found out now that this page is bad" and "we got a notification an hour ago that this page had gone bad". Currently we treat those the same for application errors ... SIGBUS either way[1].
-Tony
[1] well there are options both globally and at the per-process level to have the "early" notifications delivered right away.
On Mon, May 4, 2020 at 1:05 PM Luck, Tony tony.luck@intel.com wrote:
When a copy function hits a bad page and the page is not yet known to be bad, what does it do? (I.e. the page was believed to be fine but the copy function gets #MC.) Does it unmap it right away? What does it return?
I suspect that we will only ever find a handful of situations where the kernel can recover from memory that has gone bad that are worth fixing (got to be some code path that touches a meaningful fraction of memory, otherwise we get code complexity without any meaningful payoff).
I don't think we'd want different actions for the cases of "we just found out now that this page is bad" and "we got a notification an hour ago that this page had gone bad". Currently we treat those the same for application errors ... SIGBUS either way[1].
Oh, I agree that the end result should be the same. I'm thinking more about the mechanism and the internal API. As a somewhat silly example of why there's a difference, the first time we try to read from bad memory, we can expect #MC (I assume, on a sensibly functioning platform). But, once we get the #MC, I imagine that the #MC handler will want to unmap the page to prevent a storm of additional #MC events on the same page -- given the awful x86 #MC design, too many all at once is fatal. So the next time we copy_mc_to_user() or whatever from the memory, we'll get #PF instead. Or maybe that #MC will defer the unmap?
So the point of my questions is that the overall design should be at least somewhat settled before anyone tries to review just the copy functions.
On Mon, May 4, 2020 at 1:26 PM Andy Lutomirski luto@kernel.org wrote:
On Mon, May 4, 2020 at 1:05 PM Luck, Tony tony.luck@intel.com wrote:
When a copy function hits a bad page and the page is not yet known to be bad, what does it do? (I.e. the page was believed to be fine but the copy function gets #MC.) Does it unmap it right away? What does it return?
I suspect that we will only ever find a handful of situations where the kernel can recover from memory that has gone bad that are worth fixing (got to be some code path that touches a meaningful fraction of memory, otherwise we get code complexity without any meaningful payoff).
I don't think we'd want different actions for the cases of "we just found out now that this page is bad" and "we got a notification an hour ago that this page had gone bad". Currently we treat those the same for application errors ... SIGBUS either way[1].
Oh, I agree that the end result should be the same. I'm thinking more about the mechanism and the internal API. As a somewhat silly example of why there's a difference, the first time we try to read from bad memory, we can expect #MC (I assume, on a sensibly functioning platform). But, once we get the #MC, I imagine that the #MC handler will want to unmap the page to prevent a storm of additional #MC events on the same page -- given the awful x86 #MC design, too many all at once is fatal. So the next time we copy_mc_to_user() or whatever from the memory, we'll get #PF instead. Or maybe that #MC will defer the unmap?
After the consumption the PMEM driver arranges for the page to never be mapped again via its "badblocks" list.
So the point of my questions is that the overall design should be at least somewhat settled before anyone tries to review just the copy functions.
I would say that DAX / PMEM stretches the Linux memory error handling model beyond what it was originally designed. The primary concepts that bend the assumptions of mm/memory-failure.c are:
1/ DAX pages can not be offlined via the page allocator.
2/ DAX pages (well cachelines in those pages) can be asynchronously marked poisoned by a platform or device patrol scrub facility.
3/ DAX pages might be repaired by writes.
Currently 1/ and 2/ are managed by a per-block-device "badblocks" list that is populated by scrub results and also amended when #MC is raised (see nfit_handle_mce()). When fs/dax.c services faults it will decline to map the page if the physical file extent intersects a bad block.
There is also support for sending SIGBUS if userspace races the scrubber to consume the badblock. However, that uses the standard 'struct page' error model and assumes that a file backed page is 1:1 mapped to a file. This requirement prevents filesystems from enabling reflink. That collision and the desire to enable reflink is why we are now investigating supplanting the mm/memory-failure.c model. When the page is "owned" by a filesystem invoke the filesystem to handle the memory error across all impacted files.
The presence of 3/ means that any action error handling takes to disable access to the page needs to be capable of being undone, which runs counter to the mm/memory-failure.c assumption that offlining is a one-way trip.
On Thu, Apr 30, 2020 at 5:10 PM Linus Torvalds torvalds@linux-foundation.org wrote:
It's a horrible word, btw. The word doesn't actually mean what Andy means it to mean. "fallible" means "can make mistakes", not "can fault".
Btw, on naming: the name should be about _why_ it can fault, not about whether it faults.
Which hasn't been explained to me.
I know why user accesses can fault. I still don't know why these new accesses can fault. I know of the old name (mcs), but the newly suggested name (safe) is the _opposite_ of an explanation of why it faults.
Naming - like comments - shouldn't be about what some implementation is, but about the concept.
Again, let me use that "copy_to_user()" as an example of this. Yes, it can fault. Notice how the name doesn't say "copy_to_faulting()". That would be WRONG. It can fault _because_ it is user memory, so "copy_to_user()" not only describes what it does, but it also implicitly describes that it can fault.
THAT is the kind of explanation I'm looking for. The "memcpy_mcsafe()" at least had _some_ of that in it. It was wrong for all the _other_ reasons (not having a direction, and the hardware just being complete and utter garbage), but at least there was a reason in the name.
I am not interested in adding new infrastructure that cannot even be explained. Why would writes ever fault, considering they are posted (and again, "user space" is not a valid reason, we have that case already and have had it since day #1 even if the original naming was the same kind of bad implementation-specific name that "mcsafe" was).
If the ONLY reason for the fault is a machine check, then the name should say so, and "copy_mc_to_user()" would be a perfectly fine name (along with copy_to_mc(), copy_from_mc(), and copy_in_mc()).
It wasn't clear how "copy_to_mc()" could ever fault. Poisoning after-the-fact? Why would that be preferable to just mapping a dummy page? But even if it cannot fault, I can see that you might want to do it as a special kind of copy to avoid any read-mask-write artifacts (which can definitely happen on other architectures, and I could see a manual memcpy() implementation doing even on x86)
Linus
On Apr 30, 2020, at 5:25 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
It wasn't clear how "copy_to_mc()" could ever fault. Poisoning after-the-fact? Why would that be preferable to just mapping a dummy page?
If the kernel gets an async memory error and maps a dummy page, then subsequent reads will subsequently succeed and return garbage when they should fail. If x86 had write-only pages, we could map a dummy write-only page. But we don’t, so I think we’re stuck.
As for naming the kind of memory we’re taking about, ISTM there are two classes: DAX and monstrous memory-mapped non-persistent cache devices. Both could be Optane, I suppose.
But I also think it’s legitimate to use these accessors to increase the chance of surviving a failure of normal memory. If a normal page happens to be page cache when it fails and if page cache access use these fancy accessors, then we might actually survive a failure.
We could be ambitious: declare that all page cache and all get_user_page’d memory should use the new accessors. I doubt we’ll ever really succeed due to magical things like rseq and anything that thinks that users can set up their own memory as a kernel-accessed ring buffer, but I suppose we could try.
On Thu, Apr 30, 2020 at 5:10 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Thu, Apr 30, 2020 at 4:52 PM Dan Williams dan.j.williams@intel.com wrote:
You had me until here. Up to this point I was grokking that Andy's "_fallible" suggestion does help explain better than "_safe", because the copy is doing extra safety checks. copy_to_user() and copy_to_user_fallible() mean *something* where copy_to_user_safe() does not.
It's a horrible word, btw. The word doesn't actually mean what Andy means it to mean. "fallible" means "can make mistakes", not "can fault".
So "fallible" is a horrible name.
But anyway, I don't hate something like "copy_to_user_fallible()" conceptually. The naming needs to be fixed, in that "user" can always take a fault, so it's the _source_ that can fault, not the "user" part.
It was the "copy_safe()" model that I find unacceptable, that uses _one_ name for what is at the very least *four* different operations:
copy from faulting memory to user
copy from faulting memory to kernel
copy from kernel to faulting memory
copy within faulting memory
No way can you do that with one single function. A kernel address and a user address may literally have the exact same bit representation. So the user vs kernel distinction _has_ to be in the name.
The "kernel vs faulting" doesn't necessarily have to be there from an implementation standpoint, but it *should* be there, because
it might affect implemmentation
but even if it DOESN'T affect implementation, it should be separate
just from the standpoint of being self-documenting code.
However you lose me on this "broken nvdimm semantics" contention. There is nothing nvdimm-hardware specific about the copy_safe() implementation, zero, nada, nothing new to the error model that DRAM did not also inflict on the Linux implementation.
Ok, so good. Let's kill this all, and just use memcpy(), and copy_to_user().
Just make sure that the nvdimm code doesn't use invalid kernel addresses or other broken poisoning.
Problem solved.
You can't have it both ways. Either memcpy just works, or it doesn't.
It doesn't, but copy_to_user() is frustratingly close and you can see in the patch that I went ahead and used copy_user_generic() to implement the backend of the default "fast" implementation.
However now I see that copy_user_generic() works for the wrong reason. It works because the exception on the source address due to poison looks no different than a write fault on the user address to the caller, it's still just a short copy. So it makes copy_to_user() work for the wrong reason relative to the name.
How about, following your suggestion, introduce copy_mc_to_user() (can just use copy_user_generic() internally) and copy_mc_to_kernel() for the other the helpers that the copy_to_iter() implementation needs? That makes it clear that no mmu-faults are expected on reads, only exceptions, and no protection-faults are expected at all for copy_mc_to_kernel() even if it happens to accidentally handle it. Following Jann's ex_handler_uaccess() example I could arrange for copy_mc_to_kernel() to use a new _ASM_EXTABLE_MC() to validate that the only type of exception meant to be handled is MC and warn otherwise?
On Thu, Apr 30, 2020 at 6:21 PM Dan Williams dan.j.williams@intel.com wrote:
However now I see that copy_user_generic() works for the wrong reason. It works because the exception on the source address due to poison looks no different than a write fault on the user address to the caller, it's still just a short copy. So it makes copy_to_user() work for the wrong reason relative to the name.
Right.
And it won't work that way on other architectures. On x86, we have a generic function that can take faults on either side, and we use it for both cases (and for the "in_user" case too), but that's an artifact of the architecture oddity.
In fact, it's probably wrong even on x86 - because it can hide bugs - but writing those things is painful enough that everybody prefers having just one function.
That's particularly true since that "one function" is actually a whole family of functions - just for different optimizations.
Plus on x86 you can't reasonably even have different code sequences for that case, because CLAC/STAC don't have a "enable users read accesses" vs "write accesses" case. It's an all-or-nothing "enable user faults".
We _used_ to have a difference on x86, back when we did the whole "fs segment points to user space".
But on other architectures, there really is a difference between "copy_to_user()" and "copy_from_user()", and the functions won't do fault handling for the kernel side accesses.
How about, following your suggestion, introduce copy_mc_to_user() (can just use copy_user_generic() internally) and copy_mc_to_kernel() for the other the helpers that the copy_to_iter() implementation needs?
Yes. That at least solves my naming worries, and is conceptually something that works on other architectures.
Those other architectures may not have nvdimm support yet, but I think everybody is at least looking at it.
And I really do think it will make the users more readable too, when you see on a source level that "oh, this code is expecting that it could take a poison fault/machine check on the source/destination".
Following Jann's ex_handler_uaccess() example I could arrange for copy_mc_to_kernel() to use a new _ASM_EXTABLE_MC() to validate that the only type of exception meant to be handled is MC and warn otherwise?
That may be a good idea, but won't work for any shared case.
IOW, it would be lovely to have a "copy_mc_to_user()" check that if it's a write fault, it's because it's a user address (and if it's a read fault it's because it's a poisoned page or mc or whatever, but a valid kernel address).
But it's exactly the kind of thing that we currently don't do even for the bog-standard "copy_to_user()", because we share all the code because we're lazy.
And as DavidL pointed out - if you ever have "iomem" as a source or destination, you need yet another case. Not because they can take another kind of fault (although on some platforms you have the machine checks for that too), but because they have *very* different performance profiles (and the ERMS "rep movsb" sucks baby donkeys through a straw).
Linus
On 5/1/20 11:28 AM, Linus Torvalds wrote:
Plus on x86 you can't reasonably even have different code sequences for that case, because CLAC/STAC don't have a "enable users read accesses" vs "write accesses" case. It's an all-or-nothing "enable user faults".
We _used_ to have a difference on x86, back when we did the whole "fs segment points to user space".
Protection keys might give us _some_ of this back. If we're doing a copy_from_user(), we could (logically) do:
stac() save_pkru() pkru |= ~0x55555555 ... do userspace read restore_pkru() clac()
That *should* generate a fault if we try to write to userspace in there because PKRU affects all user *addresses* (PTEs with _PAGE_USER set) not user-mode *accesses*.
Properly stashing the value off and context switching it correctly would be fun, but probably not impossible to pull off. You actually wouldn't even technically need to restore PKRU in this path. It would just need to be restored before the thread runs userspace or hits a copy_to_user() equivalent.
I can't imagine this would all be worth the trouble, but there are crazier people out there than me.
From: Linus Torvalds
Sent: 01 May 2020 19:29
...
And as DavidL pointed out - if you ever have "iomem" as a source or destination, you need yet another case. Not because they can take another kind of fault (although on some platforms you have the machine checks for that too), but because they have *very* different performance profiles (and the ERMS "rep movsb" sucks baby donkeys through a straw).
I was actually thinking that the nvdimm accesses need to be treated much more like (cached) memory mapped io space than normal system memory. So treating them the same as "iomem" and then having access functions that report access failures (which the current readq() doesn't) might make sense.
If you are using memory that 'might fail' for kernel code or data you really get what you deserve.
OTOH system response to PCIe errors is currently rather problematic. Mostly reads time out and return ~0u. This can be checked for and, if possibly valid, a second location read.
However we have a x86 server box (I've forgotten whether it is HP or Dell) that generates an NMI whenever a PCIe link goes down. (The 'platform' takes the AER interrupt and uses an NMI to pass it to the kernel - whose bright idea was it to use an NMI???) This happens even after we've done an 'echo 1 >remove'. The system is supposed to be NEBS (I think that is the term) compliant which is supposed to be suitable for telephony work (including emergency calls), but any PCIe failure crashes the box!
I've another system here that sometimes fails to bring the PCIe link back up. I guess these code paths don't get regular testing. In my case the PCIe slave is an fpga, reloading the fpga image (either over JTAG or after rewriting eeprom) doesn't always work.
David
- Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Sun, May 3, 2020 at 5:57 AM David Laight David.Laight@aculab.com wrote:
From: Linus Torvalds
Sent: 01 May 2020 19:29
...
And as DavidL pointed out - if you ever have "iomem" as a source or destination, you need yet another case. Not because they can take another kind of fault (although on some platforms you have the machine checks for that too), but because they have *very* different performance profiles (and the ERMS "rep movsb" sucks baby donkeys through a straw).
I was actually thinking that the nvdimm accesses need to be treated much more like (cached) memory mapped io space than normal system memory. So treating them the same as "iomem" and then having access functions that report access failures (which the current readq() doesn't) might make sense.
While I agree that something like copy_mc_iomem_to_{user,kernel} could have users, nvdimm is not one of them.
If you are using memory that 'might fail' for kernel code or data you really get what you deserve.
nvdimms are no less "might fail" than DRAM, recall that some nvdimms are just DRAM with a platform promise that their contents are battery backed.
OTOH system response to PCIe errors is currently rather problematic. Mostly reads time out and return ~0u. This can be checked for and, if possibly valid, a second location read.
Yes, the ambiguous ~0u return needs careful handling.
On Thu, Apr 30, 2020 at 06:21:45PM -0700, Dan Williams wrote:
On Thu, Apr 30, 2020 at 5:10 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Thu, Apr 30, 2020 at 4:52 PM Dan Williams dan.j.williams@intel.com wrote:
You had me until here. Up to this point I was grokking that Andy's "_fallible" suggestion does help explain better than "_safe", because the copy is doing extra safety checks. copy_to_user() and copy_to_user_fallible() mean *something* where copy_to_user_safe() does not.
It's a horrible word, btw. The word doesn't actually mean what Andy means it to mean. "fallible" means "can make mistakes", not "can fault".
So "fallible" is a horrible name.
But anyway, I don't hate something like "copy_to_user_fallible()" conceptually. The naming needs to be fixed, in that "user" can always take a fault, so it's the _source_ that can fault, not the "user" part.
It was the "copy_safe()" model that I find unacceptable, that uses _one_ name for what is at the very least *four* different operations:
copy from faulting memory to user
copy from faulting memory to kernel
copy from kernel to faulting memory
copy within faulting memory
No way can you do that with one single function. A kernel address and a user address may literally have the exact same bit representation. So the user vs kernel distinction _has_ to be in the name.
The "kernel vs faulting" doesn't necessarily have to be there from an implementation standpoint, but it *should* be there, because
it might affect implemmentation
but even if it DOESN'T affect implementation, it should be separate
just from the standpoint of being self-documenting code.
However you lose me on this "broken nvdimm semantics" contention. There is nothing nvdimm-hardware specific about the copy_safe() implementation, zero, nada, nothing new to the error model that DRAM did not also inflict on the Linux implementation.
Ok, so good. Let's kill this all, and just use memcpy(), and copy_to_user().
Just make sure that the nvdimm code doesn't use invalid kernel addresses or other broken poisoning.
Problem solved.
You can't have it both ways. Either memcpy just works, or it doesn't.
It doesn't, but copy_to_user() is frustratingly close and you can see in the patch that I went ahead and used copy_user_generic() to implement the backend of the default "fast" implementation.
However now I see that copy_user_generic() works for the wrong reason. It works because the exception on the source address due to poison looks no different than a write fault on the user address to the caller, it's still just a short copy. So it makes copy_to_user() work for the wrong reason relative to the name.
How about, following your suggestion, introduce copy_mc_to_user() (can just use copy_user_generic() internally) and copy_mc_to_kernel() for the other the helpers that the copy_to_iter() implementation needs? That makes it clear that no mmu-faults are expected on reads, only exceptions, and no protection-faults are expected at all for copy_mc_to_kernel() even if it happens to accidentally handle it. Following Jann's ex_handler_uaccess() example I could arrange for copy_mc_to_kernel() to use a new _ASM_EXTABLE_MC() to validate that the only type of exception meant to be handled is MC and warn otherwise?
While we are discussing this, I wanted to mention another use case I am looking at. That is using DAX for virtiofs. virtiofs device exports a shared memory region which guest maps using DAX. virtiofs driver dax ops ->copy_to_iter() and ->copy_from_iter() needs to now copy contents from/to this shared memmory region to user space.
So far we are focussed only on nvdimm and expecting only a machine check on read side, IIUC. But this virtual device will probably need something more.
- A page can go missing on host (because file got truncated). So error can happen both in read and write path.
- It might not be a machine check to report this kind of error. KVM as of now considering using an interrupt to report errors or possibly using #VE to report errors.
IOW, tying these new helpers to only machine check will work well for nvdimm use case but not for virtual devices like virtiofs and we will end up defining more helpers. Something more generic then machine check might be able to address both.
Thanks Vivek
On Thu, Apr 30, 2020 at 12:23 PM Luck, Tony tony.luck@intel.com wrote:
On Thu, Apr 30, 2020 at 11:42:20AM -0700, Andy Lutomirski wrote:
I suppose there could be a consistent naming like this:
copy_from_user() copy_to_user()
copy_from_unchecked_kernel_address() [what probe_kernel_read() is] copy_to_unchecked_kernel_address() [what probe_kernel_write() is]
copy_from_fallible() [from a kernel address that can fail to a kernel address that can't fail] copy_to_fallible() [the opposite, but hopefully identical to memcpy() on x86]
copy_from_fallible_to_user() copy_from_user_to_fallible()
These names are fairly verbose and could probably be improved.
How about
try_copy_catch(void *dst, void *src, size_t count, int *fault)
returns number of bytes not-copied (like copy_to_user etc).
if return is not zero, "fault" tells you what type of fault cause the early stop (#PF, #MC).
I do like try_copy_catch() for the case when neither of the buffers are __user (like in the pmem driver) and _copy_to_iter_fallible() (plus all the helpers it implies) for the other cases.
copy_to_user_fallible copy_fallible_to_page copy_pipe_to_iter_fallible
...because the mmu-fault handling is an aspect of "_user" and fallible implies other source fault reasons. It does leave a gap if an architecture has a concept of a fallible write, but that seems like something that will be handled asynchronously and not subject to this interface.
On Apr 30, 2020, at 12:51 PM, Dan Williams dan.j.williams@intel.com wrote:
On Thu, Apr 30, 2020 at 12:23 PM Luck, Tony tony.luck@intel.com wrote:
On Thu, Apr 30, 2020 at 11:42:20AM -0700, Andy Lutomirski wrote: I suppose there could be a consistent naming like this:
copy_from_user() copy_to_user()
copy_from_unchecked_kernel_address() [what probe_kernel_read() is] copy_to_unchecked_kernel_address() [what probe_kernel_write() is]
copy_from_fallible() [from a kernel address that can fail to a kernel address that can't fail] copy_to_fallible() [the opposite, but hopefully identical to memcpy() on x86]
copy_from_fallible_to_user() copy_from_user_to_fallible()
These names are fairly verbose and could probably be improved.
How about
try_copy_catch(void *dst, void *src, size_t count, int *fault)
returns number of bytes not-copied (like copy_to_user etc).
if return is not zero, "fault" tells you what type of fault cause the early stop (#PF, #MC).
I do like try_copy_catch() for the case when neither of the buffers are __user (like in the pmem driver) and _copy_to_iter_fallible() (plus all the helpers it implies) for the other cases.
copy_to_user_fallible copy_fallible_to_page copy_pipe_to_iter_fallible
...because the mmu-fault handling is an aspect of "_user" and fallible implies other source fault reasons. It does leave a gap if an architecture has a concept of a fallible write, but that seems like something that will be handled asynchronously and not subject to this interface.
I’m suspicious that, as a practical matter, x86 does have a fallible write. In particular, if a page fails and the memory failure code is notified, the page will be unmapped. At that point, an attempt to write to the failed fallible page will get #PF, and code that writes to it needs to be prepared to handle #PF. Perhaps copy_to_fallible(), etc can still return void, but I’m unconvinced the implementation can be the same as memcpy.
From: Andy Lutomirski
Sent: 30 April 2020 19:42
...
I suppose there could be a consistent naming like this:
copy_from_user() copy_to_user()
copy_from_unchecked_kernel_address() [what probe_kernel_read() is] copy_to_unchecked_kernel_address() [what probe_kernel_write() is]
copy_from_fallible() [from a kernel address that can fail to a kernel address that can't fail] copy_to_fallible() [the opposite, but hopefully identical to memcpy() on x86]
copy_from_fallible_to_user() copy_from_user_to_fallible()
You missed out: copy_to/from_io() copy_to_io_from_user() copy_from_io_to_user() All of which want aligned addresses on the 'io' side.
It might even be worth saying that the copy_to/from_io() can fail due to bad IO accesses (rather than bad addresses). This is not entirely unexpected since all PCIe accesses can fail unexpectedly (usually without a trap and returning -1). But a system could arrange to generate a synchronous fault.
If you are copying directly from io to user you need to differentiate between a user page fault and an io access error. The latter shouldn't generate SIGSEGV. Possibly return -EFAULT on user page fault and 'transfer length remaining' on io access error. Although filling the rest of the buffer with 0xff might be appropriate.
David
- Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
linux-stable-mirror@lists.linaro.org