Some applications, especially tracing ones, benefit from avoiding the syscall overhead for getcpu() so it is common for architectures to have vDSO implementations. Add one for arm64, using TPIDRRO_EL0 to pass a pointer to per-CPU data rather than just storing the immediate value, in order to allow for future extensibility.
It is questionable whether something TPIDRRO_EL0-based is worthwhile at all on current kernels; since v4.18 we have had support for restartable sequences, which can be used to provide a sched_getcpu() implementation with generally better performance than the vDSO approach on architectures that have them[1]. Work is ongoing to implement this for glibc:
https://lore.kernel.org/lkml/20200527185130.5604-3-mathieu.desnoyers@efficios.com/
but this is not yet merged and will need similar work for other userspaces. The main advantages of the vDSO implementation are the node parameter (though this is a static mapping to CPU number, so it could be looked up separately when processing data if needed; it shouldn't have to be in the hot path) and ease of implementation for users.
This is currently not compatible with KPTI due to the use of TPIDRRO_EL0 by the KPTI trampoline. That could be addressed by reinitialising the system register in the return path, but I have found it hard to justify adding that overhead for all users for something that is essentially a profiling optimisation which is likely to be superseded by a more modern implementation; if there are other uses for the per-CPU data then the balance might change here.
This builds on work done by Kristina Martsenko some time ago but is a new implementation.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
v3:
 - Rebase on v5.9-rc1.
 - Drop in progress portions of the series.
v2:
 - Rebase on v5.8-rc3.
 - Add further cleanup patches & a first draft of multi-page support.
Mark Brown (5):
  arm64: vdso: Provide a define when building the vDSO
  arm64: vdso: Add per-CPU data
  arm64: vdso: Initialise the per-CPU vDSO data
  arm64: vdso: Add getcpu() implementation
  selftests: vdso: Support arm64 in getcpu() test
 arch/arm64/include/asm/processor.h            | 12 +----
 arch/arm64/include/asm/vdso/datapage.h        | 54 +++++++++++++++++++
 arch/arm64/kernel/process.c                   | 26 ++++++++-
 arch/arm64/kernel/vdso.c                      | 33 +++++++++++-
 arch/arm64/kernel/vdso/Makefile               |  4 +-
 arch/arm64/kernel/vdso/vdso.lds.S             |  1 +
 arch/arm64/kernel/vdso/vgetcpu.c              | 48 +++++++++++++++++
 .../testing/selftests/vDSO/vdso_test_getcpu.c | 10 ++++
 8 files changed, 172 insertions(+), 16 deletions(-)
 create mode 100644 arch/arm64/include/asm/vdso/datapage.h
 create mode 100644 arch/arm64/kernel/vdso/vgetcpu.c
Provide a define identifying if code is being built for the vDSO to help with writing headers that are shared between the kernel and the vDSO.
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 arch/arm64/kernel/vdso/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/kernel/vdso/Makefile b/arch/arm64/kernel/vdso/Makefile
index 45d5cfe46429..88cf0f0b91ed 100644
--- a/arch/arm64/kernel/vdso/Makefile
+++ b/arch/arm64/kernel/vdso/Makefile
@@ -28,7 +28,7 @@ ldflags-y := -shared -nostdlib -soname=linux-vdso.so.1 --hash-style=sysv \
 	     $(btildflags-y) -T

 ccflags-y := -fno-common -fno-builtin -fno-stack-protector -ffixed-x18
-ccflags-y += -DDISABLE_BRANCH_PROFILING
+ccflags-y += -DDISABLE_BRANCH_PROFILING -D__VDSO__

 CFLAGS_REMOVE_vgettimeofday.o = $(CC_FLAGS_FTRACE) -Os $(CC_FLAGS_SCS) $(GCC_PLUGINS_CFLAGS)
 KBUILD_CFLAGS += $(DISABLE_LTO)
In order to support a vDSO getcpu() implementation, add per-CPU data to the vDSO data page. Do this by wrapping the generic vdso_data struct in an arm64-specific one with an array of per-CPU data. The offset of the per-CPU data applying to a CPU will be stored in TPIDRRO_EL0; this allows us to get at the per-CPU data without doing any multiplications.
Since we currently map only a single data page for the vDSO but support very large numbers of CPUs, TPIDRRO_EL0 may be set to zero for CPUs whose data does not fit in the data page. This will also happen when KPTI is active, since kernel_ventry uses TPIDRRO_EL0 as a scratch register in that case; a comment in the code explains this.
Accessors for the data are provided in the header since they will be needed in multiple files, and it seems neater to keep things together.
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 arch/arm64/include/asm/processor.h     | 12 +-----
 arch/arm64/include/asm/vdso/datapage.h | 54 ++++++++++++++++++++++++++
 arch/arm64/kernel/process.c            | 26 ++++++++++++-
 arch/arm64/kernel/vdso.c               |  5 ++-
 4 files changed, 83 insertions(+), 14 deletions(-)
 create mode 100644 arch/arm64/include/asm/vdso/datapage.h
diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index 240fe5e5b720..db7a804030b3 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -207,17 +207,7 @@ static inline void set_compat_ssbs_bit(struct pt_regs *regs)
 	regs->pstate |= PSR_AA32_SSBS_BIT;
 }

-static inline void start_thread(struct pt_regs *regs, unsigned long pc,
-				unsigned long sp)
-{
-	start_thread_common(regs, pc);
-	regs->pstate = PSR_MODE_EL0t;
-
-	if (arm64_get_ssbd_state() != ARM64_SSBD_FORCE_ENABLE)
-		set_ssbs_bit(regs);
-
-	regs->sp = sp;
-}
+void start_thread(struct pt_regs *regs, unsigned long pc, unsigned long sp);

 static inline bool is_ttbr0_addr(unsigned long addr)
 {
diff --git a/arch/arm64/include/asm/vdso/datapage.h b/arch/arm64/include/asm/vdso/datapage.h
new file mode 100644
index 000000000000..e88d97238c52
--- /dev/null
+++ b/arch/arm64/include/asm/vdso/datapage.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2020 ARM Limited
+ */
+#ifndef __ASM_VDSO_DATAPAGE_H
+#define __ASM_VDSO_DATAPAGE_H
+
+#include <vdso/datapage.h>
+
+struct vdso_cpu_data {
+	unsigned int cpu;
+	unsigned int node;
+};
+
+struct arm64_vdso_data {
+	/* Must be first in struct, we cast to vdso_data */
+	struct vdso_data data[CS_BASES];
+	struct vdso_cpu_data cpu_data[];
+};
+
+#ifdef __VDSO__
+static inline struct vdso_cpu_data *__vdso_cpu_data(void)
+{
+	unsigned long offset;
+
+	asm volatile(
+	"	mrs %0, tpidrro_el0\n"
+	: "=r" (offset)
+	:
+	: "cc");
+
+	if (offset)
+		return (void *)(_vdso_data) + offset;
+
+	return NULL;
+}
+#else
+static inline size_t vdso_cpu_offset(void)
+{
+	size_t offset, data_end;
+
+	offset = offsetof(struct arm64_vdso_data, cpu_data) +
+		smp_processor_id() * sizeof(struct vdso_cpu_data);
+	data_end = offset + sizeof(struct vdso_cpu_data) + 1;
+
+	/* We only map a single page for vDSO data currently */
+	if (data_end > PAGE_SIZE)
+		return 0;
+
+	return offset;
+}
+#endif
+
+#endif
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 84ec630b8ab5..89b400f9397d 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -55,6 +55,7 @@
 #include <asm/processor.h>
 #include <asm/pointer_auth.h>
 #include <asm/stacktrace.h>
+#include <asm/vdso/datapage.h>

 #if defined(CONFIG_STACKPROTECTOR) && !defined(CONFIG_STACKPROTECTOR_PER_TASK)
 #include <linux/stackprotector.h>
@@ -309,6 +310,28 @@ void show_regs(struct pt_regs * regs)
 	dump_backtrace(regs, NULL, KERN_DEFAULT);
 }

+void start_thread(struct pt_regs *regs, unsigned long pc, unsigned long sp)
+{
+	start_thread_common(regs, pc);
+	regs->pstate = PSR_MODE_EL0t;
+
+	if (arm64_get_ssbd_state() != ARM64_SSBD_FORCE_ENABLE)
+		set_ssbs_bit(regs);
+
+	regs->sp = sp;
+
+	/*
+	 * Store the vDSO per-CPU offset if supported. Disable
+	 * preemption to make sure we read the CPU offset on the CPU
+	 * we write it on.
+	 */
+	if (!arm64_kernel_unmapped_at_el0()) {
+		preempt_disable();
+		write_sysreg(vdso_cpu_offset(), tpidrro_el0);
+		preempt_enable();
+	}
+}
+
 static void tls_thread_flush(void)
 {
 	write_sysreg(0, tpidr_el0);
@@ -452,7 +475,8 @@ static void tls_thread_switch(struct task_struct *next)
 	if (is_compat_thread(task_thread_info(next)))
 		write_sysreg(next->thread.uw.tp_value, tpidrro_el0);
 	else if (!arm64_kernel_unmapped_at_el0())
-		write_sysreg(0, tpidrro_el0);
+		/* Used as scratch in KPTI trampoline so don't set here. */
+		write_sysreg(vdso_cpu_offset(), tpidrro_el0);

 	write_sysreg(*task_user_tls(next), tpidr_el0);
 }
diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index d4202a32abc9..2a8d7ab76bee 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -28,6 +28,7 @@
 #include <asm/cacheflush.h>
 #include <asm/signal32.h>
 #include <asm/vdso.h>
+#include <asm/vdso/datapage.h>

 extern char vdso_start[], vdso_end[];

 #ifdef CONFIG_COMPAT_VDSO
@@ -77,10 +78,10 @@ static struct vdso_abi_info vdso_info[] __ro_after_init = {
  * The vDSO data page.
  */
 static union {
-	struct vdso_data	data[CS_BASES];
+	struct arm64_vdso_data	data;
 	u8			page[PAGE_SIZE];
 } vdso_data_store __page_aligned_data;
-struct vdso_data *vdso_data = vdso_data_store.data;
+struct vdso_data *vdso_data = vdso_data_store.data.data;

 static int __vdso_remap(enum vdso_abi abi,
			const struct vm_special_mapping *sm,
Register with the CPU hotplug system to initialise the per-CPU data for getcpu().
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 arch/arm64/kernel/vdso.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)
diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 2a8d7ab76bee..d9743c659341 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -9,6 +9,7 @@

 #include <linux/cache.h>
 #include <linux/clocksource.h>
+#include <linux/cpuhotplug.h>
 #include <linux/elf.h>
 #include <linux/err.h>
 #include <linux/errno.h>
@@ -18,6 +19,7 @@
 #include <linux/sched.h>
 #include <linux/signal.h>
 #include <linux/slab.h>
+#include <linux/smp.h>
 #include <linux/time_namespace.h>
 #include <linux/timekeeper_internal.h>
 #include <linux/vmalloc.h>
@@ -466,6 +468,26 @@ int aarch32_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 }
 #endif /* CONFIG_COMPAT */

+static void vdso_cpu_init(void *p)
+{
+	struct arm64_vdso_data *data = (struct arm64_vdso_data *)vdso_data;
+	unsigned int cpu;
+
+	if (vdso_cpu_offset()) {
+		cpu = smp_processor_id();
+
+		data->cpu_data[cpu].cpu = cpu;
+		data->cpu_data[cpu].node = cpu_to_node(cpu);
+	}
+}
+
+static int vdso_cpu_online(unsigned int cpu)
+{
+	smp_call_function_single(cpu, vdso_cpu_init, NULL, 1);
+
+	return 0;
+}
+
 static int vdso_mremap(const struct vm_special_mapping *sm,
		struct vm_area_struct *new_vma)
 {
@@ -494,6 +516,12 @@ static int __init vdso_init(void)
	vdso_info[VDSO_ABI_AA64].dm = &aarch64_vdso_maps[AA64_MAP_VVAR];
	vdso_info[VDSO_ABI_AA64].cm = &aarch64_vdso_maps[AA64_MAP_VDSO];

+	/*
+	 * Initialize per-CPU data, callback runs for all current and
+	 * future CPUs.
+	 */
+	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "vdso", vdso_cpu_online, NULL);
+
	return __vdso_init(VDSO_ABI_AA64);
 }

 arch_initcall(vdso_init);
Some applications, especially tracing ones, benefit from avoiding the syscall overhead of getcpu() calls, so provide a vDSO implementation of it.
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 arch/arm64/kernel/vdso/Makefile   |  2 +-
 arch/arm64/kernel/vdso/vdso.lds.S |  1 +
 arch/arm64/kernel/vdso/vgetcpu.c  | 48 +++++++++++++++++++++++++++++++
 3 files changed, 50 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm64/kernel/vdso/vgetcpu.c
diff --git a/arch/arm64/kernel/vdso/Makefile b/arch/arm64/kernel/vdso/Makefile
index 88cf0f0b91ed..ff350e69b8b6 100644
--- a/arch/arm64/kernel/vdso/Makefile
+++ b/arch/arm64/kernel/vdso/Makefile
@@ -11,7 +11,7 @@ ARCH_REL_TYPE_ABS := R_AARCH64_JUMP_SLOT|R_AARCH64_GLOB_DAT|R_AARCH64_ABS64

 include $(srctree)/lib/vdso/Makefile

-obj-vdso := vgettimeofday.o note.o sigreturn.o
+obj-vdso := vgettimeofday.o note.o sigreturn.o vgetcpu.o

 # Build rules
 targets := $(obj-vdso) vdso.so vdso.so.dbg
diff --git a/arch/arm64/kernel/vdso/vdso.lds.S b/arch/arm64/kernel/vdso/vdso.lds.S
index d808ad31e01f..ef3fb80e0349 100644
--- a/arch/arm64/kernel/vdso/vdso.lds.S
+++ b/arch/arm64/kernel/vdso/vdso.lds.S
@@ -80,6 +80,7 @@ VERSION
		__kernel_gettimeofday;
		__kernel_clock_gettime;
		__kernel_clock_getres;
+		__kernel_getcpu;
	local: *;
	};
 }
diff --git a/arch/arm64/kernel/vdso/vgetcpu.c b/arch/arm64/kernel/vdso/vgetcpu.c
new file mode 100644
index 000000000000..e8972e561e08
--- /dev/null
+++ b/arch/arm64/kernel/vdso/vgetcpu.c
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * ARM64 userspace implementations of getcpu()
+ *
+ * Copyright (C) 2020 ARM Limited
+ *
+ */
+
+#include <asm/unistd.h>
+#include <asm/vdso/datapage.h>
+
+struct getcpucache;
+
+static __always_inline
+int getcpu_fallback(unsigned int *_cpu, unsigned int *_node,
+		    struct getcpucache *_c)
+{
+	register unsigned int *cpu asm("x0") = _cpu;
+	register unsigned int *node asm("x1") = _node;
+	register struct getcpucache *c asm("x2") = _c;
+	register long ret asm ("x0");
+	register long nr asm("x8") = __NR_getcpu;
+
+	asm volatile(
+	"	svc #0\n"
+	: "=r" (ret)
+	: "r" (cpu), "r" (node), "r" (c), "r" (nr)
+	: "memory");
+
+	return ret;
+}
+
+int __kernel_getcpu(unsigned int *cpu, unsigned int *node,
+		    struct getcpucache *c)
+{
+	struct vdso_cpu_data *cpu_data = __vdso_cpu_data();
+
+	if (cpu_data) {
+		if (cpu)
+			*cpu = cpu_data->cpu;
+		if (node)
+			*node = cpu_data->node;
+
+		return 0;
+	}
+
+	return getcpu_fallback(cpu, node, c);
+}
arm64 exports the vDSO ABI with a version of LINUX_2.6.39 and symbols prefixed with __kernel rather than __vdso. Update the getcpu() test to handle this.
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 tools/testing/selftests/vDSO/vdso_test_getcpu.c | 10 ++++++++++
 1 file changed, 10 insertions(+)
diff --git a/tools/testing/selftests/vDSO/vdso_test_getcpu.c b/tools/testing/selftests/vDSO/vdso_test_getcpu.c
index fc25ede131b8..4aeb65012b81 100644
--- a/tools/testing/selftests/vDSO/vdso_test_getcpu.c
+++ b/tools/testing/selftests/vDSO/vdso_test_getcpu.c
@@ -14,8 +14,18 @@
 #include "../kselftest.h"
 #include "parse_vdso.h"

+/*
+ * ARM64's vDSO exports its getcpu() implementation with a different
+ * name and version from other architectures, so we need to handle it
+ * as a special case.
+ */
+#if defined(__aarch64__)
+const char *version = "LINUX_2.6.39";
+const char *name = "__kernel_getcpu";
+#else
 const char *version = "LINUX_2.6";
 const char *name = "__vdso_getcpu";
+#endif

 struct getcpu_cache;
 typedef long (*getcpu_t)(unsigned int *, unsigned int *,
On 8/19/20 6:13 AM, Mark Brown wrote:
[...]
Patches look good to me from a selftests perspective. My Acked-by is for these patches to go through arm64.
Acked-by: Shuah Khan <skhan@linuxfoundation.org>
If you would like me to take these through the kselftest tree, give me your Acks and I can queue these up for 5.10-rc1.
thanks,
-- Shuah
On Mon, Aug 31, 2020 at 03:47:17PM -0600, Shuah Khan wrote:
[...]
Patches look good to me from selftests perspective. My acked by for these patches to go through arm64.
Acked-by: Shuah Khan skhan@linuxfoundation.org
If you would like me to take these through kselftest tree, give me your Acks. I can queue these up for 5.10-rc1
Thanks Shuah for the ack. We are still pondering whether to merge these patches, as they have some limitations (the per-CPU data structures may not fit in the sole vDSO data page).
On Tue, Sep 01, 2020 at 10:25:52AM +0100, Catalin Marinas wrote:
Thanks Shuah for the ack. We are still pondering whether to merge these patches, as they have some limitations (the per-CPU data structures may not fit in the sole vDSO data page).
They definitely don't fit. I did post some half-written proof-of-concept patches that extend this, but I was waiting to see if there was any interest in a vDSO getcpu() at all before taking it further. Vincenzo's work on the multi-page user data that he announced at Plumbers would cover it as well; I hadn't been aware of that.