hi,
we see broken access to user space with the bpf_probe_read/bpf_probe_read_str helpers on arm64 with the 5.4 kernel. The problem is that both helpers try to read user memory by calling probe_kernel_read, which seems to work on x86 but fails on arm64.
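To illustrate, this is the kind of program that trips over it (a minimal sketch, not taken from our code; the traced function, section name and includes are illustrative):

#include <linux/ptrace.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("kprobe/do_sys_open")
int trace_open(struct pt_regs *ctx)
{
	char fname[256];
	/* second argument of do_sys_open() is a user-space pointer */
	const char *filename = (const char *)PT_REGS_PARM2(ctx);

	/* on 5.4 this lands in probe_kernel_read(); the user pointer
	 * happens to resolve on x86 but faults on arm64 */
	bpf_probe_read_str(fname, sizeof(fname), filename);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";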
There are several fixes after v5.4 that address this issue.
There was an attempt to fix this in the past [1], but it deviated too far from the upstream changes.
This patchset tries to follow the upstream changes, with two notable exceptions:
1) bpf: Add probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers
   - this upstream patch adds new helpers, which we don't need to do; we only
     need the following functions (and the related helpers' glue):
       bpf_probe_read_kernel_common
       bpf_probe_read_kernel_str_common
     which implement the bpf_probe_read* helpers and receive the fix in the
     next patch below; all other hunks are omitted
2) bpf: rework the compat kernel probe handling
   - taking only the fixes for the functions (and the related helpers' glue)
     that we took from the patch above, omitting all other hunks
It's possible to add the new helpers and keep the patches closer to upstream, but I wanted to try this way first as an RFC, without adding any new helpers to the stable kernel, which might later end up needing additional fixes on top.
I'm also sending this as an RFC because I might be missing some mm-related dependency change that I'm not aware of.
We tested the new kernel with our use case on both arm64 and x86.
thanks,
jirka
[1] https://yhbt.net/lore/all/YHGMFzQlHomDtZYG@kroah.com/t/
---
Andrii Nakryiko (1):
      bpf: bpf_probe_read_kernel_str() has to return amount of data read on success

Christoph Hellwig (4):
      maccess: clarify kerneldoc comments
      maccess: rename strncpy_from_unsafe_user to strncpy_from_user_nofault
      maccess: rename strncpy_from_unsafe_strict to strncpy_from_kernel_nofault
      bpf: rework the compat kernel probe handling

Daniel Borkmann (3):
      uaccess: Add strict non-pagefault kernel-space read function
      bpf: Add probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers
      bpf: Restrict bpf_probe_read{, str}() only to archs where they work

 arch/arm/Kconfig            |   1 +
 arch/arm64/Kconfig          |   1 +
 arch/x86/Kconfig            |   1 +
 arch/x86/mm/Makefile        |   2 +-
 arch/x86/mm/maccess.c       |  43 ++++++++++++++++++++++++++++++++++++++
 include/linux/uaccess.h     |   8 +++++--
 init/Kconfig                |   3 +++
 kernel/trace/bpf_trace.c    | 140 +++++++++++++++++++++++++++-----------------
 kernel/trace/trace_kprobe.c |   2 +-
 mm/maccess.c                |  63 +++++++++++++++++++++++++++++-------
 10 files changed, 207 insertions(+), 57 deletions(-)
 create mode 100644 arch/x86/mm/maccess.c
From: Daniel Borkmann <daniel@iogearbox.net>
commit 75a1a607bb7e6d918be3aca11ec2214a275392f4 upstream.
Add two new probe_kernel_read_strict() and strncpy_from_unsafe_strict() helpers which by default alias to the __probe_kernel_read() and the __strncpy_from_unsafe(), respectively, but can be overridden by archs which have non-overlapping address ranges for kernel space and user space in order to bail out with -EFAULT when attempting to probe user memory including non-canonical user access addresses [0]:
    4-level page tables:
      user-space mem: 0x0000000000000000 - 0x00007fffffffffff
      non-canonical:  0x0000800000000000 - 0xffff7fffffffffff

    5-level page tables:
      user-space mem: 0x0000000000000000 - 0x00ffffffffffffff
      non-canonical:  0x0100000000000000 - 0xfeffffffffffffff
The idea is that these helpers are complementary to the probe_user_read() and strncpy_from_unsafe_user() which probe user-only memory. Both added helpers here do the same, but for kernel-only addresses.
Both sets of helpers are going to be used for BPF tracing. They also explicitly avoid throwing the splat for non-canonical user addresses from 00c42373d397 ("x86-64: add warning for non-canonical user access address dereferences").
For compat, the current probe_kernel_read() and strncpy_from_unsafe() are left as-is.
[0] Documentation/x86/x86_64/mm.txt
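As a worked example of the check (using the canonical_address() helper added below; 4-level paging, so vaddr_bits = 48, and a canonical address must survive sign-extending bit 47):

  canonical_address(0x00007fffffffffff, 48) == 0x00007fffffffffff  /* canonical user address      */
  canonical_address(0x0000800000000000, 48) == 0xffff800000000000  /* != input, non-canonical, so
                                                                      invalid_probe_range() -> -EFAULT */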
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: x86@kernel.org
Link: https://lore.kernel.org/bpf/eefeefd769aa5a013531f491a71f0936779e916b.1572649...
---
 arch/x86/mm/Makefile    |  2 +-
 arch/x86/mm/maccess.c   | 43 +++++++++++++++++++++++++++++++++++++++++
 include/linux/uaccess.h |  4 ++++
 mm/maccess.c            | 25 +++++++++++++++++++++++-
 4 files changed, 72 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/mm/maccess.c

diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 84373dc9b341..bbc68a54795e 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -13,7 +13,7 @@ CFLAGS_REMOVE_mem_encrypt_identity.o = -pg
 endif

 obj-y	:=	init.o init_$(BITS).o fault.o ioremap.o extable.o pageattr.o mmap.o \
-		pat.o pgtable.o physaddr.o setup_nx.o tlb.o cpu_entry_area.o
+		pat.o pgtable.o physaddr.o setup_nx.o tlb.o cpu_entry_area.o maccess.o

 # Make sure __phys_addr has no stackprotector
 nostackp := $(call cc-option, -fno-stack-protector)
diff --git a/arch/x86/mm/maccess.c b/arch/x86/mm/maccess.c
new file mode 100644
index 000000000000..f5b85bdc0535
--- /dev/null
+++ b/arch/x86/mm/maccess.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/uaccess.h>
+#include <linux/kernel.h>
+
+#ifdef CONFIG_X86_64
+static __always_inline u64 canonical_address(u64 vaddr, u8 vaddr_bits)
+{
+	return ((s64)vaddr << (64 - vaddr_bits)) >> (64 - vaddr_bits);
+}
+
+static __always_inline bool invalid_probe_range(u64 vaddr)
+{
+	/*
+	 * Range covering the highest possible canonical userspace address
+	 * as well as non-canonical address range. For the canonical range
+	 * we also need to include the userspace guard page.
+	 */
+	return vaddr < TASK_SIZE_MAX + PAGE_SIZE ||
+	       canonical_address(vaddr, boot_cpu_data.x86_virt_bits) != vaddr;
+}
+#else
+static __always_inline bool invalid_probe_range(u64 vaddr)
+{
+	return vaddr < TASK_SIZE_MAX;
+}
+#endif
+
+long probe_kernel_read_strict(void *dst, const void *src, size_t size)
+{
+	if (unlikely(invalid_probe_range((unsigned long)src)))
+		return -EFAULT;
+
+	return __probe_kernel_read(dst, src, size);
+}
+
+long strncpy_from_unsafe_strict(char *dst, const void *unsafe_addr, long count)
+{
+	if (unlikely(invalid_probe_range((unsigned long)unsafe_addr)))
+		return -EFAULT;
+
+	return __strncpy_from_unsafe(dst, unsafe_addr, count);
+}
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 70941f49d66e..25ae650dcb1a 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -315,6 +315,7 @@ copy_struct_from_user(void *dst, size_t ksize, const void __user *src,
  * happens, handle that and return -EFAULT.
  */
 extern long probe_kernel_read(void *dst, const void *src, size_t size);
+extern long probe_kernel_read_strict(void *dst, const void *src, size_t size);
 extern long __probe_kernel_read(void *dst, const void *src, size_t size);

 /*
@@ -354,6 +355,9 @@ extern long notrace probe_user_write(void __user *dst, const void *src, size_t s
 extern long notrace __probe_user_write(void __user *dst, const void *src, size_t size);

 extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count);
+extern long strncpy_from_unsafe_strict(char *dst, const void *unsafe_addr,
+				       long count);
+extern long __strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count);
 extern long strncpy_from_unsafe_user(char *dst, const void __user *unsafe_addr,
				     long count);
 extern long strnlen_unsafe_user(const void __user *unsafe_addr, long count);
diff --git a/mm/maccess.c b/mm/maccess.c
index 2d3c3d01064c..3ca8d97e5010 100644
--- a/mm/maccess.c
+++ b/mm/maccess.c
@@ -43,11 +43,20 @@ probe_write_common(void __user *dst, const void *src, size_t size)
  * do_page_fault() doesn't attempt to take mmap_sem. This makes
  * probe_kernel_read() suitable for use within regions where the caller
  * already holds mmap_sem, or other locks which nest inside mmap_sem.
+ *
+ * probe_kernel_read_strict() is the same as probe_kernel_read() except for
+ * the case where architectures have non-overlapping user and kernel address
+ * ranges: probe_kernel_read_strict() will additionally return -EFAULT for
+ * probing memory on a user address range where probe_user_read() is supposed
+ * to be used instead.
  */

 long __weak probe_kernel_read(void *dst, const void *src, size_t size)
     __attribute__((alias("__probe_kernel_read")));

+long __weak probe_kernel_read_strict(void *dst, const void *src, size_t size)
+    __attribute__((alias("__probe_kernel_read")));
+
 long __probe_kernel_read(void *dst, const void *src, size_t size)
 {
	long ret;
@@ -157,8 +166,22 @@ EXPORT_SYMBOL_GPL(probe_user_write);
  *
  * If @count is smaller than the length of the string, copies @count-1 bytes,
  * sets the last byte of @dst buffer to NUL and returns @count.
+ *
+ * strncpy_from_unsafe_strict() is the same as strncpy_from_unsafe() except
+ * for the case where architectures have non-overlapping user and kernel address
+ * ranges: strncpy_from_unsafe_strict() will additionally return -EFAULT for
+ * probing memory on a user address range where strncpy_from_unsafe_user() is
+ * supposed to be used instead.
  */
-long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count)
+
+long __weak strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count)
+    __attribute__((alias("__strncpy_from_unsafe")));
+
+long __weak strncpy_from_unsafe_strict(char *dst, const void *unsafe_addr,
+				       long count)
+    __attribute__((alias("__strncpy_from_unsafe")));
+
+long __strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count)
 {
	mm_segment_t old_fs = get_fs();
	const void *src = unsafe_addr;
From: Daniel Borkmann <daniel@iogearbox.net>
commit 6ae08ae3dea2cfa03dd3665a3c8475c2d429ef47 upstream.
[Taking only hunks that are related to the probe_read and probe_read_str helpers, which we want to fix. Taking this patch/hunks as a base for the following changes and omitting the new helpers and uapi changes.]
The current bpf_probe_read() and bpf_probe_read_str() helpers are broken in that they assume they can be used for probing memory access for kernel space addresses /as well as/ user space addresses.
However, plain use of probe_kernel_read() for both cases will attempt to always access kernel space address space given access is performed under KERNEL_DS and some archs in-fact have overlapping address spaces where a kernel pointer and user pointer would have the /same/ address value and therefore accessing application memory via bpf_probe_read{,_str}() would read garbage values.
Let's fix the BPF side by making use of the recently added 3d7081822f7f ("uaccess: Add non-pagefault user-space read functions"). Unfortunately, the only way to fix this status quo is to add dedicated bpf_probe_read_{user,kernel}() and bpf_probe_read_{user,kernel}_str() helpers. The bpf_probe_read{,_str}() helpers are kept as-is to retain their current behavior.
The two *_user() variants always attempt the access under USER_DS; the two *_kernel() variants will return -EFAULT when accessing user memory if the underlying architecture has non-overlapping address ranges, also avoiding the kernel warning from 00c42373d397 ("x86-64: add warning for non-canonical user access address dereferences").
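From the BPF program side the upstream split looks like this (purely illustrative; note that this RFC backport deliberately does not add these helpers, as explained in the cover letter):

	char buf[64];

	bpf_probe_read_user(buf, sizeof(buf), user_ptr);   /* always probes under USER_DS */
	bpf_probe_read_kernel(buf, sizeof(buf), kern_ptr); /* -EFAULT for user addresses on
							      archs with split address spaces */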
Fixes: a5e8c07059d0 ("bpf: add bpf_probe_read_str helper")
Fixes: 2541517c32be ("tracing, perf: Implement BPF programs attached to kprobes")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/796ee46e948bc808d54891a1108435f8652c6ca4.1572649...
---
 kernel/trace/bpf_trace.c | 103 ++++++++++++++++++++++-----------------
 1 file changed, 57 insertions(+), 46 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 1e1345cd21b4..9ac27d48cc8e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -138,24 +138,70 @@ static const struct bpf_func_proto bpf_override_return_proto = {
 };
 #endif

-BPF_CALL_3(bpf_probe_read, void *, dst, u32, size, const void *, unsafe_ptr)
+static __always_inline int
+bpf_probe_read_kernel_common(void *dst, u32 size, const void *unsafe_ptr,
+			     const bool compat)
 {
-	int ret;
+	int ret = security_locked_down(LOCKDOWN_BPF_READ);

-	ret = security_locked_down(LOCKDOWN_BPF_READ);
-	if (ret < 0)
+	if (unlikely(ret < 0))
 		goto out;
-
-	ret = probe_kernel_read(dst, unsafe_ptr, size);
+	ret = compat ? probe_kernel_read(dst, unsafe_ptr, size) :
+	      probe_kernel_read_strict(dst, unsafe_ptr, size);
 	if (unlikely(ret < 0))
 out:
 		memset(dst, 0, size);
+	return ret;
+}
+
+BPF_CALL_3(bpf_probe_read_compat, void *, dst, u32, size,
+	   const void *, unsafe_ptr)
+{
+	return bpf_probe_read_kernel_common(dst, size, unsafe_ptr, true);
+}
+
+static const struct bpf_func_proto bpf_probe_read_compat_proto = {
+	.func		= bpf_probe_read_compat,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg3_type	= ARG_ANYTHING,
+};

+static __always_inline int
+bpf_probe_read_kernel_str_common(void *dst, u32 size, const void *unsafe_ptr,
+				 const bool compat)
+{
+	int ret = security_locked_down(LOCKDOWN_BPF_READ);
+
+	if (unlikely(ret < 0))
+		goto out;
+	/*
+	 * The strncpy_from_unsafe_*() call will likely not fill the entire
+	 * buffer, but that's okay in this circumstance as we're probing
+	 * arbitrary memory anyway similar to bpf_probe_read_*() and might
+	 * as well probe the stack. Thus, memory is explicitly cleared
+	 * only in error case, so that improper users ignoring return
+	 * code altogether don't copy garbage; otherwise length of string
+	 * is returned that can be used for bpf_perf_event_output() et al.
+	 */
+	ret = compat ? strncpy_from_unsafe(dst, unsafe_ptr, size) :
+	      strncpy_from_unsafe_strict(dst, unsafe_ptr, size);
+	if (unlikely(ret < 0))
+out:
+		memset(dst, 0, size);
 	return ret;
 }

-static const struct bpf_func_proto bpf_probe_read_proto = {
-	.func		= bpf_probe_read,
+BPF_CALL_3(bpf_probe_read_compat_str, void *, dst, u32, size,
+	   const void *, unsafe_ptr)
+{
+	return bpf_probe_read_kernel_str_common(dst, size, unsafe_ptr, true);
+}
+
+static const struct bpf_func_proto bpf_probe_read_compat_str_proto = {
+	.func		= bpf_probe_read_compat_str,
 	.gpl_only	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
@@ -583,41 +629,6 @@ static const struct bpf_func_proto bpf_current_task_under_cgroup_proto = {
 	.arg2_type      = ARG_ANYTHING,
 };

-BPF_CALL_3(bpf_probe_read_str, void *, dst, u32, size,
-	   const void *, unsafe_ptr)
-{
-	int ret;
-
-	ret = security_locked_down(LOCKDOWN_BPF_READ);
-	if (ret < 0)
-		goto out;
-
-	/*
-	 * The strncpy_from_unsafe() call will likely not fill the entire
-	 * buffer, but that's okay in this circumstance as we're probing
-	 * arbitrary memory anyway similar to bpf_probe_read() and might
-	 * as well probe the stack. Thus, memory is explicitly cleared
-	 * only in error case, so that improper users ignoring return
-	 * code altogether don't copy garbage; otherwise length of string
-	 * is returned that can be used for bpf_perf_event_output() et al.
-	 */
-	ret = strncpy_from_unsafe(dst, unsafe_ptr, size);
-	if (unlikely(ret < 0))
-out:
-		memset(dst, 0, size);
-
-	return ret;
-}
-
-static const struct bpf_func_proto bpf_probe_read_str_proto = {
-	.func		= bpf_probe_read_str,
-	.gpl_only	= true,
-	.ret_type	= RET_INTEGER,
-	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
-	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
-	.arg3_type	= ARG_ANYTHING,
-};
-
 struct send_signal_irq_work {
 	struct irq_work irq_work;
 	struct task_struct *task;
@@ -700,8 +711,6 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_map_pop_elem_proto;
 	case BPF_FUNC_map_peek_elem:
 		return &bpf_map_peek_elem_proto;
-	case BPF_FUNC_probe_read:
-		return &bpf_probe_read_proto;
 	case BPF_FUNC_ktime_get_ns:
 		return &bpf_ktime_get_ns_proto;
 	case BPF_FUNC_tail_call:
@@ -728,8 +737,10 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_current_task_under_cgroup_proto;
 	case BPF_FUNC_get_prandom_u32:
 		return &bpf_get_prandom_u32_proto;
+	case BPF_FUNC_probe_read:
+		return &bpf_probe_read_compat_proto;
 	case BPF_FUNC_probe_read_str:
-		return &bpf_probe_read_str_proto;
+		return &bpf_probe_read_compat_str_proto;
 #ifdef CONFIG_CGROUPS
 	case BPF_FUNC_get_current_cgroup_id:
 		return &bpf_get_current_cgroup_id_proto;
From: Daniel Borkmann <daniel@iogearbox.net>
commit 0ebeea8ca8a4d1d453ad299aef0507dab04f6e8d upstream.
[Small context conflicts due to changes not backported in the previous patch]
Given the legacy bpf_probe_read{,str}() BPF helpers are broken on archs with overlapping address ranges, we should really take the next step to disable them from BPF use there.
To generally fix the situation, we've recently added new helper variants bpf_probe_read_{user,kernel}() and bpf_probe_read_{user,kernel}_str(). For details on them, see 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user,kernel}_str helpers").
Given bpf_probe_read{,str}() have been around for ~5 years by now, there are plenty of users at least on x86 still relying on them today, so we cannot remove them entirely w/o breaking the BPF tracing ecosystem.
However, their use should be restricted to archs with non-overlapping address ranges where they are working in their current form. Therefore, move this behind a CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE and have x86, arm64, arm select it (other archs supporting it can follow-up on it as well).
For the remaining archs, they can work around this easily by relying on the feature probe from bpftool, which spills out defines that can be used from BPF C code to implement a drop-in replacement for old/new kernels via:

  bpftool feature probe macro
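A drop-in replacement in BPF C could then look like this (a sketch; HAVE_PROBE_READ_KERNEL is a hand-rolled guard you would set based on bpftool's feature output, not a define bpftool itself is guaranteed to emit):

	/* pick the right helper for the kernel this program runs on */
	#if HAVE_PROBE_READ_KERNEL
	# define read_kmem(dst, sz, ptr) bpf_probe_read_kernel(dst, sz, ptr)
	#else
	# define read_kmem(dst, sz, ptr) bpf_probe_read(dst, sz, ptr)
	#endif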
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/bpf/20200515101118.6508-2-daniel@iogearbox.net
---
 arch/arm/Kconfig         | 1 +
 arch/arm64/Kconfig       | 1 +
 arch/x86/Kconfig         | 1 +
 init/Kconfig             | 3 +++
 kernel/trace/bpf_trace.c | 2 ++
 5 files changed, 8 insertions(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index a70696a95b79..7c1cb0ebdb18 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -14,6 +14,7 @@ config ARM
 	select ARCH_HAS_KEEPINITRD
 	select ARCH_HAS_KCOV
 	select ARCH_HAS_MEMBARRIER_SYNC_CORE
+	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PTE_SPECIAL if ARM_LPAE
 	select ARCH_HAS_PHYS_TO_DMA
 	select ARCH_HAS_SETUP_DMA_OPS
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 384b1bf56667..0d96acb2ca3e 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -22,6 +22,7 @@ config ARM64
 	select ARCH_HAS_KCOV
 	select ARCH_HAS_KEEPINITRD
 	select ARCH_HAS_MEMBARRIER_SYNC_CORE
+	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PTE_DEVMAP
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_SETUP_DMA_OPS
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6002252692af..7be388116732 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -70,6 +70,7 @@ config X86
 	select ARCH_HAS_KCOV			if X86_64
 	select ARCH_HAS_MEM_ENCRYPT
 	select ARCH_HAS_MEMBARRIER_SYNC_CORE
+	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PMEM_API		if X86_64
 	select ARCH_HAS_PTE_DEVMAP		if X86_64
 	select ARCH_HAS_PTE_SPECIAL
diff --git a/init/Kconfig b/init/Kconfig
index f641518f4ac5..2297b7ce6665 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2231,6 +2231,9 @@ config ASN1

 source "kernel/Kconfig.locks"

+config ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
+	bool
+
 config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
 	bool

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 9ac27d48cc8e..61c81c38202b 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -737,10 +737,12 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_current_task_under_cgroup_proto;
 	case BPF_FUNC_get_prandom_u32:
 		return &bpf_get_prandom_u32_proto;
+#ifdef CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	case BPF_FUNC_probe_read:
 		return &bpf_probe_read_compat_proto;
 	case BPF_FUNC_probe_read_str:
 		return &bpf_probe_read_compat_str_proto;
+#endif
 #ifdef CONFIG_CGROUPS
 	case BPF_FUNC_get_current_cgroup_id:
 		return &bpf_get_current_cgroup_id_proto;
From: Christoph Hellwig <hch@lst.de>
commit 4f6de12b375c37ba51f9412be7ed6ab44a7f71d8 upstream.
Add proper kerneldoc comments for probe_kernel_read, probe_kernel_read_strict and strncpy_from_unsafe_strict, and explain the difference versus the non-strict versions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200521152301.2587579-5-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/maccess.c | 60 +++++++++++++++++++++++++++++++++++---------------
 1 file changed, 43 insertions(+), 17 deletions(-)

diff --git a/mm/maccess.c b/mm/maccess.c
index 3ca8d97e5010..d263c7b5e4eb 100644
--- a/mm/maccess.c
+++ b/mm/maccess.c
@@ -31,29 +31,36 @@ probe_write_common(void __user *dst, const void *src, size_t size)
 }

 /**
- * probe_kernel_read(): safely attempt to read from a kernel-space location
+ * probe_kernel_read(): safely attempt to read from any location
  * @dst: pointer to the buffer that shall take the data
  * @src: address to read from
  * @size: size of the data chunk
  *
- * Safely read from address @src to the buffer at @dst. If a kernel fault
- * happens, handle that and return -EFAULT.
+ * Same as probe_kernel_read_strict() except that for architectures with
+ * not fully separated user and kernel address spaces this function also works
+ * for user address tanges.
+ *
+ * DO NOT USE THIS FUNCTION - it is broken on architectures with entirely
+ * separate kernel and user address spaces, and also a bad idea otherwise.
+ */
+long __weak probe_kernel_read(void *dst, const void *src, size_t size)
+    __attribute__((alias("__probe_kernel_read")));
+
+/**
+ * probe_kernel_read_strict(): safely attempt to read from kernel-space
+ * @dst: pointer to the buffer that shall take the data
+ * @src: address to read from
+ * @size: size of the data chunk
+ *
+ * Safely read from kernel address @src to the buffer at @dst. If a kernel
+ * fault happens, handle that and return -EFAULT.
  *
  * We ensure that the copy_from_user is executed in atomic context so that
  * do_page_fault() doesn't attempt to take mmap_sem. This makes
  * probe_kernel_read() suitable for use within regions where the caller
  * already holds mmap_sem, or other locks which nest inside mmap_sem.
- *
- * probe_kernel_read_strict() is the same as probe_kernel_read() except for
- * the case where architectures have non-overlapping user and kernel address
- * ranges: probe_kernel_read_strict() will additionally return -EFAULT for
- * probing memory on a user address range where probe_user_read() is supposed
- * to be used instead.
  */
-long __weak probe_kernel_read(void *dst, const void *src, size_t size)
-    __attribute__((alias("__probe_kernel_read")));
-
 long __weak probe_kernel_read_strict(void *dst, const void *src, size_t size)
     __attribute__((alias("__probe_kernel_read")));

@@ -167,16 +174,35 @@ EXPORT_SYMBOL_GPL(probe_user_write);
  * If @count is smaller than the length of the string, copies @count-1 bytes,
  * sets the last byte of @dst buffer to NUL and returns @count.
  *
- * strncpy_from_unsafe_strict() is the same as strncpy_from_unsafe() except
- * for the case where architectures have non-overlapping user and kernel address
- * ranges: strncpy_from_unsafe_strict() will additionally return -EFAULT for
- * probing memory on a user address range where strncpy_from_unsafe_user() is
- * supposed to be used instead.
+ * Same as strncpy_from_unsafe_strict() except that for architectures with
+ * not fully separated user and kernel address spaces this function also works
+ * for user address tanges.
+ *
+ * DO NOT USE THIS FUNCTION - it is broken on architectures with entirely
+ * separate kernel and user address spaces, and also a bad idea otherwise.
  */
 long __weak strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count)
     __attribute__((alias("__strncpy_from_unsafe")));

+/**
+ * strncpy_from_unsafe_strict: - Copy a NUL terminated string from unsafe
+ *				 address.
+ * @dst:   Destination address, in kernel space.  This buffer must be at
+ *         least @count bytes long.
+ * @unsafe_addr: Unsafe address.
+ * @count: Maximum number of bytes to copy, including the trailing NUL.
+ *
+ * Copies a NUL-terminated string from unsafe address to kernel buffer.
+ *
+ * On success, returns the length of the string INCLUDING the trailing NUL.
+ *
+ * If access fails, returns -EFAULT (some data may have been copied
+ * and the trailing NUL added).
+ *
+ * If @count is smaller than the length of the string, copies @count-1 bytes,
+ * sets the last byte of @dst buffer to NUL and returns @count.
+ */
 long __weak strncpy_from_unsafe_strict(char *dst, const void *unsafe_addr,
				       long count)
     __attribute__((alias("__strncpy_from_unsafe")));
From: Christoph Hellwig <hch@lst.de>
commit bd88bb5d4007949be7154deae7cef7173c751a95 upstream.
[Missing bpf_trace.c hunk due to changes that were not backported]
This matches the naming of strncpy_from_user, and also makes it more clear what the function is supposed to do.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200521152301.2587579-7-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/linux/uaccess.h     | 4 ++--
 kernel/trace/trace_kprobe.c | 2 +-
 mm/maccess.c                | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 25ae650dcb1a..23bda5df4c08 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -358,8 +358,8 @@ extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count);
 extern long strncpy_from_unsafe_strict(char *dst, const void *unsafe_addr,
				       long count);
 extern long __strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count);
-extern long strncpy_from_unsafe_user(char *dst, const void __user *unsafe_addr,
-				     long count);
+long strncpy_from_user_nofault(char *dst, const void __user *unsafe_addr,
+		long count);
 extern long strnlen_unsafe_user(const void __user *unsafe_addr, long count);

 /**
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index a422cf6a0358..d6ba4f6bed73 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1111,7 +1111,7 @@ fetch_store_string_user(unsigned long addr, void *dest, void *base)

 	__dest = get_loc_data(dest, base);

-	ret = strncpy_from_unsafe_user(__dest, uaddr, maxlen);
+	ret = strncpy_from_user_nofault(__dest, uaddr, maxlen);
 	if (ret >= 0)
 		*(u32 *)dest = make_data_loc(ret, __dest - base);

diff --git a/mm/maccess.c b/mm/maccess.c
index d263c7b5e4eb..8e4d564b6c25 100644
--- a/mm/maccess.c
+++ b/mm/maccess.c
@@ -231,7 +231,7 @@ long __strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count)
 }

 /**
- * strncpy_from_unsafe_user: - Copy a NUL terminated string from unsafe user
+ * strncpy_from_user_nofault: - Copy a NUL terminated string from unsafe user
  *				address.
  * @dst:   Destination address, in kernel space.  This buffer must be at
  *         least @count bytes long.
@@ -248,7 +248,7 @@ long __strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count)
  * If @count is smaller than the length of the string, copies @count-1 bytes,
  * sets the last byte of @dst buffer to NUL and returns @count.
  */
-long strncpy_from_unsafe_user(char *dst, const void __user *unsafe_addr,
+long strncpy_from_user_nofault(char *dst, const void __user *unsafe_addr,
			      long count)
 {
 	mm_segment_t old_fs = get_fs();
From: Christoph Hellwig <hch@lst.de>
commit c4cb164426aebd635baa53685b0ebf1a127e9803 upstream.
[Missing bpf_trace_printk hunk due to changes that were not backported]
This matches the naming of strncpy_from_user_nofault, and also makes it more clear what the function is supposed to do.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200521152301.2587579-8-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 arch/x86/mm/maccess.c    | 2 +-
 include/linux/uaccess.h  | 4 ++--
 kernel/trace/bpf_trace.c | 2 +-
 mm/maccess.c             | 6 +++---
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/maccess.c b/arch/x86/mm/maccess.c
index f5b85bdc0535..62c4017a2473 100644
--- a/arch/x86/mm/maccess.c
+++ b/arch/x86/mm/maccess.c
@@ -34,7 +34,7 @@ long probe_kernel_read_strict(void *dst, const void *src, size_t size)
 	return __probe_kernel_read(dst, src, size);
 }

-long strncpy_from_unsafe_strict(char *dst, const void *unsafe_addr, long count)
+long strncpy_from_kernel_nofault(char *dst, const void *unsafe_addr, long count)
 {
 	if (unlikely(invalid_probe_range((unsigned long)unsafe_addr)))
 		return -EFAULT;
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 23bda5df4c08..7a926c5b77ce 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -355,8 +355,8 @@ extern long notrace probe_user_write(void __user *dst, const void *src, size_t s
 extern long notrace __probe_user_write(void __user *dst, const void *src, size_t size);

 extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count);
-extern long strncpy_from_unsafe_strict(char *dst, const void *unsafe_addr,
-				       long count);
+long strncpy_from_kernel_nofault(char *dst, const void *unsafe_addr,
+		long count);
 extern long __strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count);
 long strncpy_from_user_nofault(char *dst, const void __user *unsafe_addr,
		long count);
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 61c81c38202b..d1fd13a47bdf 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -187,7 +187,7 @@ bpf_probe_read_kernel_str_common(void *dst, u32 size, const void *unsafe_ptr,
 	 * is returned that can be used for bpf_perf_event_output() et al.
 	 */
 	ret = compat ? strncpy_from_unsafe(dst, unsafe_ptr, size) :
-	      strncpy_from_unsafe_strict(dst, unsafe_ptr, size);
+	      strncpy_from_kernel_nofault(dst, unsafe_ptr, size);
 	if (unlikely(ret < 0))
 out:
 		memset(dst, 0, size);
diff --git a/mm/maccess.c b/mm/maccess.c
index 8e4d564b6c25..8cfe21dfc953 100644
--- a/mm/maccess.c
+++ b/mm/maccess.c
@@ -174,7 +174,7 @@ EXPORT_SYMBOL_GPL(probe_user_write);
  * If @count is smaller than the length of the string, copies @count-1 bytes,
  * sets the last byte of @dst buffer to NUL and returns @count.
  *
- * Same as strncpy_from_unsafe_strict() except that for architectures with
+ * Same as strncpy_from_kernel_nofault() except that for architectures with
  * not fully separated user and kernel address spaces this function also works
  * for user address tanges.
  *
@@ -186,7 +186,7 @@ long __weak strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count)
     __attribute__((alias("__strncpy_from_unsafe")));

 /**
- * strncpy_from_unsafe_strict: - Copy a NUL terminated string from unsafe
+ * strncpy_from_kernel_nofault: - Copy a NUL terminated string from unsafe
  *				 address.
  * @dst:   Destination address, in kernel space.  This buffer must be at
  *         least @count bytes long.
@@ -203,7 +203,7 @@ long __weak strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count)
  * If @count is smaller than the length of the string, copies @count-1 bytes,
  * sets the last byte of @dst buffer to NUL and returns @count.
  */
-long __weak strncpy_from_unsafe_strict(char *dst, const void *unsafe_addr,
+long __weak strncpy_from_kernel_nofault(char *dst, const void *unsafe_addr,
				       long count)
     __attribute__((alias("__strncpy_from_unsafe")));
From: Christoph Hellwig <hch@lst.de>
commit 8d92db5c04d10381f4db70ed99b1b576f5db18a7 upstream.
[Conflicts due to applying hunks only to the functions that were taken in the earlier backport of upstream commit 6ae08ae3dea2]
Instead of using the dangerous probe_kernel_read and strncpy_from_unsafe helpers, rework the compat probes to check if an address is a kernel or userspace one, and then use the low-level kernel or user probe helper shared by the proper kernel and user probe helpers. This slightly changes behavior as the compat probe on a user address doesn't check the lockdown flags, just as the pure user probes do.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200521152301.2587579-14-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 kernel/trace/bpf_trace.c | 93 +++++++++++++++++++++++++++-------------
 1 file changed, 64 insertions(+), 29 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d1fd13a47bdf..a46256f99229 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -139,65 +139,99 @@ static const struct bpf_func_proto bpf_override_return_proto = {
 #endif

 static __always_inline int
-bpf_probe_read_kernel_common(void *dst, u32 size, const void *unsafe_ptr,
-			     const bool compat)
+bpf_probe_read_user_common(void *dst, u32 size, const void __user *unsafe_ptr)
 {
-	int ret = security_locked_down(LOCKDOWN_BPF_READ);
+	int ret;

+	ret = probe_user_read(dst, unsafe_ptr, size);
 	if (unlikely(ret < 0))
-		goto out;
-	ret = compat ? probe_kernel_read(dst, unsafe_ptr, size) :
-	      probe_kernel_read_strict(dst, unsafe_ptr, size);
-	if (unlikely(ret < 0))
-out:
 		memset(dst, 0, size);
 	return ret;
 }

-BPF_CALL_3(bpf_probe_read_compat, void *, dst, u32, size,
-	   const void *, unsafe_ptr)
+static __always_inline int
+bpf_probe_read_user_str_common(void *dst, u32 size,
+			       const void __user *unsafe_ptr)
 {
-	return bpf_probe_read_kernel_common(dst, size, unsafe_ptr, true);
+	int ret;
+
+	ret = strncpy_from_user_nofault(dst, unsafe_ptr, size);
+	if (unlikely(ret < 0))
+		memset(dst, 0, size);
+	return ret;
 }

-static const struct bpf_func_proto bpf_probe_read_compat_proto = {
-	.func		= bpf_probe_read_compat,
-	.gpl_only	= true,
-	.ret_type	= RET_INTEGER,
-	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
-	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
-	.arg3_type	= ARG_ANYTHING,
-};
+static __always_inline int
+bpf_probe_read_kernel_common(void *dst, u32 size, const void *unsafe_ptr)
+{
+	int ret = security_locked_down(LOCKDOWN_BPF_READ);
+
+	if (unlikely(ret < 0))
+		goto fail;
+	ret = probe_kernel_read_strict(dst, unsafe_ptr, size);
+	if (unlikely(ret < 0))
+		goto fail;
+	return ret;
+fail:
+	memset(dst, 0, size);
+	return ret;
+}

 static __always_inline int
-bpf_probe_read_kernel_str_common(void *dst, u32 size, const void *unsafe_ptr,
-				 const bool compat)
+bpf_probe_read_kernel_str_common(void *dst, u32 size, const void *unsafe_ptr)
 {
 	int ret = security_locked_down(LOCKDOWN_BPF_READ);

 	if (unlikely(ret < 0))
-		goto out;
+		goto fail;
+
 	/*
-	 * The strncpy_from_unsafe_*() call will likely not fill the entire
-	 * buffer, but that's okay in this circumstance as we're probing
+	 * The strncpy_from_kernel_nofault() call will likely not fill the
+	 * entire buffer, but that's okay in this circumstance as we're probing
	 * arbitrary memory anyway similar to bpf_probe_read_*() and might
	 * as well probe the stack. Thus, memory is explicitly cleared
	 * only in error case, so that improper users ignoring return
	 * code altogether don't copy garbage; otherwise length of string
	 * is returned that can be used for bpf_perf_event_output() et al.
	 */
-	ret = compat ? strncpy_from_unsafe(dst, unsafe_ptr, size) :
-	      strncpy_from_kernel_nofault(dst, unsafe_ptr, size);
+	ret = strncpy_from_kernel_nofault(dst, unsafe_ptr, size);
 	if (unlikely(ret < 0))
-out:
-		memset(dst, 0, size);
+		goto fail;
+
+	return 0;
+fail:
+	memset(dst, 0, size);
 	return ret;
 }

+#ifdef CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
+BPF_CALL_3(bpf_probe_read_compat, void *, dst, u32, size,
+	   const void *, unsafe_ptr)
+{
+	if ((unsigned long)unsafe_ptr < TASK_SIZE) {
+		return bpf_probe_read_user_common(dst, size,
+				(__force void __user *)unsafe_ptr);
+	}
+	return bpf_probe_read_kernel_common(dst, size, unsafe_ptr);
+}
+
+static const struct bpf_func_proto bpf_probe_read_compat_proto = {
+	.func		= bpf_probe_read_compat,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 BPF_CALL_3(bpf_probe_read_compat_str, void *, dst, u32, size,
 	   const void *, unsafe_ptr)
 {
-	return bpf_probe_read_kernel_str_common(dst, size, unsafe_ptr, true);
+	if ((unsigned long)unsafe_ptr < TASK_SIZE) {
+		return bpf_probe_read_user_str_common(dst, size,
+				(__force void __user *)unsafe_ptr);
+	}
+	return bpf_probe_read_kernel_str_common(dst, size, unsafe_ptr);
 }

 static const struct bpf_func_proto bpf_probe_read_compat_str_proto = {
@@ -208,6 +242,7 @@ static const struct bpf_func_proto bpf_probe_read_compat_str_proto = {
 	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
 	.arg3_type	= ARG_ANYTHING,
 };
+#endif /* CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE */

 BPF_CALL_3(bpf_probe_write_user, void __user *, unsafe_ptr, const void *, src,
 	   u32, size)
From: Andrii Nakryiko <andriin@fb.com>
commit 02553b91da5deb63c8562b47529b09b734659af0 upstream.
During recent refactorings, bpf_probe_read_kernel_str() started returning 0 on success, instead of amount of data successfully read. This majorly breaks applications relying on bpf_probe_read_kernel_str() and bpf_probe_read_str() and their results. Fix this by returning actual number of bytes read.
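To illustrate the breakage (a sketch, not from the commit; the events perf event array map is assumed to be declared elsewhere):

	char buf[64];
	int len;

	len = bpf_probe_read_str(buf, sizeof(buf), unsafe_ptr);
	if (len > 0)
		/* len (string length including NUL) sizes the sample; with
		 * the broken constant 0 return this path submits nothing */
		bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, buf, len);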
Fixes: 8d92db5c04d1 ("bpf: rework the compat kernel probe handling")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200616050432.1902042-1-andriin@fb.com
---
 kernel/trace/bpf_trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index a46256f99229..c4c825dcdef8 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -198,7 +198,7 @@ bpf_probe_read_kernel_str_common(void *dst, u32 size, const void *unsafe_ptr)
 	if (unlikely(ret < 0))
 		goto fail;

-	return 0;
+	return ret;
 fail:
 	memset(dst, 0, size);
 	return ret;
On Mon, May 22, 2023 at 10:33:44PM +0200, Jiri Olsa wrote:
> hi,
> we see broken access to user space with the bpf_probe_read/bpf_probe_read_str
> helpers on arm64 with the 5.4 kernel. The problem is that both helpers try to
> read user memory by calling probe_kernel_read, which seems to work on x86 but
> fails on arm64.
Has this ever worked on arm64 for the 5.4 kernel tree? If not, it's not really a regression, and so, why not use a newer kernel that has this new feature added to it there?
In other words, what requires you to use the 5.4.y tree and requires feature parity across architectures?
thanks,
greg k-h
On Fri, May 26, 2023 at 07:54:17PM +0100, Greg KH wrote:
> On Mon, May 22, 2023 at 10:33:44PM +0200, Jiri Olsa wrote:
> > hi,
> > we see broken access to user space with the bpf_probe_read/bpf_probe_read_str
> > helpers on arm64 with the 5.4 kernel. The problem is that both helpers try to
> > read user memory by calling probe_kernel_read, which seems to work on x86 but
> > fails on arm64.
>
> Has this ever worked on arm64 for the 5.4 kernel tree? If not, it's not
> really a regression, and so, why not use a newer kernel that has this new
> feature added to it there?
>
> In other words, what requires you to use the 5.4.y tree and requires
> feature parity across architectures?
we have a customer running ok on x86 v5.4, but arm64 is broken with the same bpf/user space code
upgrade is an option of course, but it's not a big change and we can have 5.4 working on arm64 as well
I can send out a version that stays closer to the upstream changes, if that's a concern, adding the new probe helpers as well; I guess that's not a problem, because it does not change the current API
jirka
On Sun, May 28, 2023 at 10:02:49PM +0200, Jiri Olsa wrote:
> On Fri, May 26, 2023 at 07:54:17PM +0100, Greg KH wrote:
> > On Mon, May 22, 2023 at 10:33:44PM +0200, Jiri Olsa wrote:
> > > hi,
> > > we see broken access to user space with the bpf_probe_read/bpf_probe_read_str
> > > helpers on arm64 with the 5.4 kernel. The problem is that both helpers try to
> > > read user memory by calling probe_kernel_read, which seems to work on x86 but
> > > fails on arm64.
> >
> > Has this ever worked on arm64 for the 5.4 kernel tree? If not, it's not
> > really a regression, and so, why not use a newer kernel that has this new
> > feature added to it there?
> >
> > In other words, what requires you to use the 5.4.y tree and requires
> > feature parity across architectures?
>
> we have a customer running ok on x86 v5.4, but arm64 is broken with the
> same bpf/user space code
Again, why can they not use a newer kernel version? What forces this customer to be stuck with a specific kernel version that spans different processor types?
> upgrade is an option of course, but it's not a big change and we can have
> 5.4 working on arm64 as well
For loads of other reasons, I'd recommend 5.15 or newer for arm64, so why not use that?
> I can send out a version that stays closer to the upstream changes, if
> that's a concern, adding the new probe helpers as well; I guess that's
> not a problem, because it does not change the current API
You are trying to add features to a stable kernel that are not regression fixes, which is something that we generally do not accept into stable kernels.
thanks,
greg k-h