Hi,
This is v8 of syscall user dispatch. Last version got some acks but there was one small documentation fix I wanted to do, as requested by Florian. This also addresses the commit message fixup Peter requested.
The only actual code change from v7 is solving a trivial merge conflict I myself created with the entry code fixup I made last week and with something else in the TIP tree.
I also shared this with glibc and there weren't any complaints other than the matter about user-notif vs. siginfo, which was discussed in v7; the understanding is that it is not necessary now and can be added later, if needed, on the same infrastructure without a new API.
I'm not sure about the TIP rules, but is it too late to be queued for the next merge window? I'd love to have this in 5.11 if possible, since it has been flying for quite a while.
This is based on tip/master.
As usual, a working tree with this patchset is available at:
https://gitlab.collabora.com/krisman/linux -b syscall-user-dispatch-v8
Previous submissions are archived at:
  RFC/v1: https://lkml.org/lkml/2020/7/8/96
  v2: https://lkml.org/lkml/2020/7/9/17
  v3: https://lkml.org/lkml/2020/7/12/4
  v4: https://www.spinics.net/lists/linux-kselftest/msg16377.html
  v5: https://lkml.org/lkml/2020/8/10/1320
  v6: https://lkml.org/lkml/2020/9/4/1122
  v7: https://lwn.net/Articles/837598/
Gabriel Krisman Bertazi (7):
  x86: vdso: Expose sigreturn address on vdso to the kernel
  signal: Expose SYS_USER_DISPATCH si_code type
  kernel: Implement selective syscall userspace redirection
  entry: Support Syscall User Dispatch on common syscall entry
  selftests: Add kselftest for syscall user dispatch
  selftests: Add benchmark for syscall user dispatch
  docs: Document Syscall User Dispatch
 .../admin-guide/syscall-user-dispatch.rst     |  87 +++++
 arch/x86/entry/vdso/vdso2c.c                  |   2 +
 arch/x86/entry/vdso/vdso32/sigreturn.S        |   2 +
 arch/x86/entry/vdso/vma.c                     |  15 +
 arch/x86/include/asm/elf.h                    |   2 +
 arch/x86/include/asm/vdso.h                   |   2 +
 arch/x86/kernel/signal_compat.c               |   2 +-
 fs/exec.c                                     |   3 +
 include/linux/entry-common.h                  |   2 +
 include/linux/sched.h                         |   2 +
 include/linux/syscall_user_dispatch.h         |  40 +++
 include/linux/thread_info.h                   |   2 +
 include/uapi/asm-generic/siginfo.h            |   3 +-
 include/uapi/linux/prctl.h                    |   5 +
 kernel/entry/Makefile                         |   2 +-
 kernel/entry/common.c                         |  17 +
 kernel/entry/common.h                         |  16 +
 kernel/entry/syscall_user_dispatch.c          | 102 ++++++
 kernel/fork.c                                 |   1 +
 kernel/sys.c                                  |   5 +
 tools/testing/selftests/Makefile              |   1 +
 .../syscall_user_dispatch/.gitignore          |   3 +
 .../selftests/syscall_user_dispatch/Makefile  |   9 +
 .../selftests/syscall_user_dispatch/config    |   1 +
 .../syscall_user_dispatch/sud_benchmark.c     | 200 +++++++++++
 .../syscall_user_dispatch/sud_test.c          | 310 ++++++++++++++++++
 26 files changed, 833 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/admin-guide/syscall-user-dispatch.rst
 create mode 100644 include/linux/syscall_user_dispatch.h
 create mode 100644 kernel/entry/common.h
 create mode 100644 kernel/entry/syscall_user_dispatch.c
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/.gitignore
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/Makefile
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/config
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/sud_benchmark.c
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/sud_test.c
Syscall user redirection requires the signal trampoline code to not be captured, in order to support returning with a locked selector while avoiding recursion back into the signal handler. For ia-32, which has the trampoline in the vDSO, expose the entry points to the kernel, such that it can avoid dispatching syscalls from that region to userspace.
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
Changes since V5
  - Change return address to bool (Andy)
---
 arch/x86/entry/vdso/vdso2c.c           |  2 ++
 arch/x86/entry/vdso/vdso32/sigreturn.S |  2 ++
 arch/x86/entry/vdso/vma.c              | 15 +++++++++++++++
 arch/x86/include/asm/elf.h             |  2 ++
 arch/x86/include/asm/vdso.h            |  2 ++
 5 files changed, 23 insertions(+)
diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c
index 7380908045c7..2d0f3d8bcc25 100644
--- a/arch/x86/entry/vdso/vdso2c.c
+++ b/arch/x86/entry/vdso/vdso2c.c
@@ -101,6 +101,8 @@ struct vdso_sym required_syms[] = {
 	{"__kernel_sigreturn", true},
 	{"__kernel_rt_sigreturn", true},
 	{"int80_landing_pad", true},
+	{"vdso32_rt_sigreturn_landing_pad", true},
+	{"vdso32_sigreturn_landing_pad", true},
 };

 __attribute__((format(printf, 1, 2))) __attribute__((noreturn))
diff --git a/arch/x86/entry/vdso/vdso32/sigreturn.S b/arch/x86/entry/vdso/vdso32/sigreturn.S
index c3233ee98a6b..1bd068f72d4c 100644
--- a/arch/x86/entry/vdso/vdso32/sigreturn.S
+++ b/arch/x86/entry/vdso/vdso32/sigreturn.S
@@ -18,6 +18,7 @@ __kernel_sigreturn:
 	movl $__NR_sigreturn, %eax
 	SYSCALL_ENTER_KERNEL
 .LEND_sigreturn:
+SYM_INNER_LABEL(vdso32_sigreturn_landing_pad, SYM_L_GLOBAL)
 	nop
 	.size __kernel_sigreturn,.-.LSTART_sigreturn

@@ -29,6 +30,7 @@ __kernel_rt_sigreturn:
 	movl $__NR_rt_sigreturn, %eax
 	SYSCALL_ENTER_KERNEL
 .LEND_rt_sigreturn:
+SYM_INNER_LABEL(vdso32_rt_sigreturn_landing_pad, SYM_L_GLOBAL)
 	nop
 	.size __kernel_rt_sigreturn,.-.LSTART_rt_sigreturn
 	.previous
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 50e5d3a2e70a..de60cd37070b 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -436,6 +436,21 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 }
 #endif

+bool arch_syscall_is_vdso_sigreturn(struct pt_regs *regs)
+{
+#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
+	const struct vdso_image *image = current->mm->context.vdso_image;
+	unsigned long vdso = (unsigned long) current->mm->context.vdso;
+
+	if (in_ia32_syscall() && image == &vdso_image_32) {
+		if (regs->ip == vdso + image->sym_vdso32_sigreturn_landing_pad ||
+		    regs->ip == vdso + image->sym_vdso32_rt_sigreturn_landing_pad)
+			return true;
+	}
+#endif
+	return false;
+}
+
 #ifdef CONFIG_X86_64
 static __init int vdso_setup(char *s)
 {
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 44a9b9940535..66bdfe838d61 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -388,6 +388,8 @@ extern int compat_arch_setup_additional_pages(struct linux_binprm *bprm,
 	compat_arch_setup_additional_pages(bprm, interpreter,	\
 					   (ex->e_machine == EM_X86_64))

+extern bool arch_syscall_is_vdso_sigreturn(struct pt_regs *regs);
+
 /* Do not change the values. See get_align_mask() */
 enum align_flags {
 	ALIGN_VA_32 = BIT(0),
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index b5d23470f56b..98aa103eb4ab 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -29,6 +29,8 @@ struct vdso_image {
 	long sym___kernel_rt_sigreturn;
 	long sym___kernel_vsyscall;
 	long sym_int80_landing_pad;
+	long sym_vdso32_sigreturn_landing_pad;
+	long sym_vdso32_rt_sigreturn_landing_pad;
 };

 #ifdef CONFIG_X86_64
SYS_USER_DISPATCH will be triggered when a syscall is sent to userspace by the Syscall User Dispatch mechanism. This adjusts eventual BUILD_BUG_ON around the tree.
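For illustration only, a SIGSYS handler could tell the two si_codes apart along these lines. This is a rough sketch, not part of the patch: `sigsys_origin` is a hypothetical helper name, and the fallback defines are an assumption for userspace headers that predate this series (they mirror the values in the diff below).

```c
#include <signal.h>

/* Older userspace headers may lack these; the values follow this patch. */
#ifndef SYS_SECCOMP
# define SYS_SECCOMP 1
#endif
#ifndef SYS_USER_DISPATCH
# define SYS_USER_DISPATCH 2
#endif

/* Hypothetical helper: name the source of a SIGSYS from its si_code. */
static const char *sigsys_origin(int si_code)
{
	switch (si_code) {
	case SYS_SECCOMP:
		return "seccomp";
	case SYS_USER_DISPATCH:
		return "syscall user dispatch";
	default:
		return "unknown";
	}
}
```

A handler installed with SA_SIGINFO would pass `info->si_code` to such a helper.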
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/signal_compat.c    | 2 +-
 include/uapi/asm-generic/siginfo.h | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index a7f3e12cfbdb..d7b51870f16b 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -31,7 +31,7 @@ static inline void signal_compat_build_tests(void)
 	BUILD_BUG_ON(NSIGBUS != 5);
 	BUILD_BUG_ON(NSIGTRAP != 5);
 	BUILD_BUG_ON(NSIGCHLD != 6);
-	BUILD_BUG_ON(NSIGSYS != 1);
+	BUILD_BUG_ON(NSIGSYS != 2);

 	/* This is part of the ABI and can never change in size: */
 	BUILD_BUG_ON(sizeof(compat_siginfo_t) != 128);
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index 7aacf9389010..d2597000407a 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -286,7 +286,8 @@ typedef struct siginfo {
  * SIGSYS si_codes
  */
 #define SYS_SECCOMP 1 /* seccomp triggered */
-#define NSIGSYS 1
+#define SYS_USER_DISPATCH 2 /* syscall user dispatch triggered */
+#define NSIGSYS 2

 /*
  * SIGEMT si_codes
Introduce a mechanism to quickly disable/enable syscall handling for a specific process and redirect to userspace via SIGSYS. This is useful for processes with parts that require syscall redirection and parts that don't, but that need to perform this boundary crossing really fast, without paying the cost of a system call to reconfigure syscall handling on each boundary transition. This is particularly important for Windows games running over Wine.
The proposed interface looks like this:
prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
The range [<offset>,<offset>+<length>) is a part of the process memory map that is allowed to by-pass the redirection code and dispatch syscalls directly, such that in fast paths a process doesn't need to disable the trap, nor does the kernel have to check the selector. This is essential to return from SIGSYS to a blocked area without triggering another SIGSYS from rt_sigreturn.
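The half-open range test can be modelled in userspace as below. This is a sketch of the comparison used in do_syscall_user_dispatch() later in this patch; the function name `in_direct_dispatch_range` is made up for the example.

```c
#include <stdbool.h>

/*
 * Model of the kernel-side check: true when ip lies in
 * [offset, offset + len). If ip < offset, the unsigned subtraction
 * wraps around to a huge value, so one comparison covers both bounds.
 */
static bool in_direct_dispatch_range(unsigned long ip, unsigned long offset,
				     unsigned long len)
{
	return ip - offset < len;
}
```

With offset 0x1000 and len 0x100, the addresses 0x1000 through 0x10ff are allowed, while 0x0fff and 0x1100 are not. A zero len allows nothing.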
selector is an optional pointer to a char-sized userspace memory region that has a key switch for the mechanism. This key switch is set to either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF to enable or disable the redirection without calling the kernel.
The feature is meant to be set per-thread and it is disabled on fork/clone/execv.
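A minimal userspace sketch of the interface described above might look like the following. The wrapper names (`sud_enable`, `sud_disable`, `sud_block`, `sud_unblock`) and the global `sud_selector` are hypothetical, and the fallback defines are only there for headers that do not carry the new constants yet.

```c
#include <errno.h>
#include <sys/prctl.h>

#ifndef PR_SET_SYSCALL_USER_DISPATCH
# define PR_SET_SYSCALL_USER_DISPATCH 59
# define PR_SYS_DISPATCH_OFF 0
# define PR_SYS_DISPATCH_ON 1
#endif

/* Selector byte the kernel reads for syscalls outside the allowed range. */
static volatile char sud_selector = PR_SYS_DISPATCH_OFF;

/*
 * Hypothetical wrapper: enable dispatch for the calling thread while
 * letting syscalls issued from [offset, offset + len) go through.
 * Returns 0 on success, -1 with errno set otherwise (EINVAL on kernels
 * that do not know the prctl option).
 */
static int sud_enable(unsigned long offset, unsigned long len)
{
	return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
		     offset, len, (unsigned long)&sud_selector);
}

/* Hypothetical wrapper: turn the mechanism off; all other args must be 0. */
static int sud_disable(void)
{
	return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF,
		     0UL, 0UL, 0UL);
}

/* Flip redirection on or off without entering the kernel. */
static inline void sud_block(void)   { sud_selector = PR_SYS_DISPATCH_ON; }
static inline void sud_unblock(void) { sud_selector = PR_SYS_DISPATCH_OFF; }
```

A caller would normally install a SIGSYS handler before calling sud_block(), and that handler would either run from the allowed range or call sud_unblock() before returning, so that the sigreturn itself is not redirected.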
Internally, this doesn't add overhead to the syscall hot path, and it requires very little per-architecture support. I avoided using seccomp, even though it duplicates some functionality, due to previous feedback that maybe it shouldn't mix with seccomp since it is not a security mechanism. And obviously, this should never be considered a security mechanism, since any part of the program can by-pass it by using the syscall dispatcher.
For the sysinfo benchmark, which measures the overhead added to executing a native syscall that doesn't require interception, the overhead using only the direct dispatcher region to issue syscalls is pretty much irrelevant. The overhead of using the selector is around 40ns for a native (unredirected) syscall on my system, and it is (as expected) dominated by the supervisor-mode user-address access. In fact, with SMAP off, the overhead is consistently less than 5ns on my test box.
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Paul Gofman <gofmanp@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-api@vger.kernel.org
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

---
Changes since v7:
  - Correct half-open interval in commit message (PeterZ)
  - Solve rebase conflicts

Changes since v6:
  (Matthew Wilcox)
  - Use unsigned long for mode
  (peterZ)
  - Change interface to {offset,len}
  - Use SYSCALL_WORK interface instead of TIF flags

Changes since v4:
  (Andy Lutomirski)
  - Allow sigreturn coming from vDSO
  - Exit with SIGSYS instead of SIGSEGV on bad selector
  (Thomas Gleixner)
  - Use sizeof selector in access_ok
  - Document usage of __get_user
  - Use constant for state value
  - Split out x86 parts
  - Rebase on top of Gleixner's common entry code
  - Don't expose do_syscall_user_dispatch

Changes since v3:
  - NTR.

Changes since v2:
  (Matthew Wilcox suggestions)
  - Drop __user on non-ptr type.
  - Move #define closer to similar defs
  - Allow a memory region that can dispatch directly
  (Kees Cook suggestions)
  - Improve kconfig summary line
  - Move flag cleanup on execve to begin_new_exec
  - Hint branch predictor in the syscall path
  (Me)
  - Convert selector to char

Changes since RFC:
  (Kees Cook suggestions)
  - Don't mention personality while explaining the feature
  - Use syscall_get_nr
  - Remove header guard on several places
  - Convert WARN_ON to WARN_ON_ONCE
  - Explicit check for state values
  - Rename to syscall user dispatcher
---
 fs/exec.c                             |   3 +
 include/linux/sched.h                 |   2 +
 include/linux/syscall_user_dispatch.h |  40 ++++++++++
 include/linux/thread_info.h           |   2 +
 include/uapi/linux/prctl.h            |   5 ++
 kernel/entry/Makefile                 |   2 +-
 kernel/entry/common.h                 |  16 ++++
 kernel/entry/syscall_user_dispatch.c  | 102 ++++++++++++++++++++++++++
 kernel/fork.c                         |   1 +
 kernel/sys.c                          |   5 ++
 10 files changed, 177 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/syscall_user_dispatch.h
 create mode 100644 kernel/entry/common.h
 create mode 100644 kernel/entry/syscall_user_dispatch.c
diff --git a/fs/exec.c b/fs/exec.c
index 547a2390baf5..aee36e5733ce 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -64,6 +64,7 @@
 #include <linux/compat.h>
 #include <linux/vmalloc.h>
 #include <linux/io_uring.h>
+#include <linux/syscall_user_dispatch.h>

 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1302,6 +1303,8 @@ int begin_new_exec(struct linux_binprm * bprm)
 	flush_thread();
 	me->personality &= ~bprm->per_clear;

+	clear_syscall_work_syscall_user_dispatch(me);
+
 	/*
 	 * We have to apply CLOEXEC before we change whether the process is
 	 * dumpable (in setup_new_exec) to avoid a race with a process in userspace
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ee2fdf34095b..4b719dc2eba2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -34,6 +34,7 @@
 #include <linux/rseq.h>
 #include <linux/seqlock.h>
 #include <linux/kcsan.h>
+#include <linux/syscall_user_dispatch.h>
 #include <asm/kmap_size.h>

 /* task_struct member predeclarations (sorted alphabetically): */
@@ -1000,6 +1001,7 @@ struct task_struct {
 	unsigned int sessionid;
 #endif
 	struct seccomp seccomp;
+	struct syscall_user_dispatch syscall_dispatch;

 	/* Thread group tracking: */
 	u64 parent_exec_id;
diff --git a/include/linux/syscall_user_dispatch.h b/include/linux/syscall_user_dispatch.h
new file mode 100644
index 000000000000..9517ea16f090
--- /dev/null
+++ b/include/linux/syscall_user_dispatch.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2020 Collabora Ltd.
+ */
+#ifndef _SYSCALL_USER_DISPATCH_H
+#define _SYSCALL_USER_DISPATCH_H
+
+#include <linux/thread_info.h>
+
+#ifdef CONFIG_GENERIC_ENTRY
+
+struct syscall_user_dispatch {
+	char __user *selector;
+	unsigned long offset;
+	unsigned long len;
+	bool on_dispatch;
+};
+
+int set_syscall_user_dispatch(unsigned long mode, unsigned long offset,
+			      unsigned long len, char __user *selector);
+
+#define clear_syscall_work_syscall_user_dispatch(tsk) \
+	clear_task_syscall_work(tsk, SYSCALL_USER_DISPATCH)
+
+#else
+struct syscall_user_dispatch {};
+
+static inline int set_syscall_user_dispatch(unsigned long mode, unsigned long offset,
+					    unsigned long len, char __user *selector)
+{
+	return -EINVAL;
+}
+
+static inline void clear_syscall_work_syscall_user_dispatch(struct task_struct *tsk)
+{
+}
+
+#endif /* CONFIG_GENERIC_ENTRY */
+
+#endif /* _SYSCALL_USER_DISPATCH_H */
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index ca80a214df09..c8a974cead73 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -42,6 +42,7 @@ enum syscall_work_bit {
 	SYSCALL_WORK_BIT_SYSCALL_TRACE,
 	SYSCALL_WORK_BIT_SYSCALL_EMU,
 	SYSCALL_WORK_BIT_SYSCALL_AUDIT,
+	SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
 };

 #define SYSCALL_WORK_SECCOMP BIT(SYSCALL_WORK_BIT_SECCOMP)
@@ -49,6 +50,7 @@ enum syscall_work_bit {
 #define SYSCALL_WORK_SYSCALL_TRACE BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE)
 #define SYSCALL_WORK_SYSCALL_EMU BIT(SYSCALL_WORK_BIT_SYSCALL_EMU)
 #define SYSCALL_WORK_SYSCALL_AUDIT BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
+#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
 #endif

 #include <asm/thread_info.h>
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 7f0827705c9a..90deb41c8a34 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -247,4 +247,9 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER 57
 #define PR_GET_IO_FLUSHER 58

+/* Dispatch syscalls to a userspace handler */
+#define PR_SET_SYSCALL_USER_DISPATCH 59
+# define PR_SYS_DISPATCH_OFF 0
+# define PR_SYS_DISPATCH_ON 1
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/entry/Makefile b/kernel/entry/Makefile
index 34c8a3f1c735..095c775e001e 100644
--- a/kernel/entry/Makefile
+++ b/kernel/entry/Makefile
@@ -9,5 +9,5 @@ KCOV_INSTRUMENT := n
 CFLAGS_REMOVE_common.o = -fstack-protector -fstack-protector-strong
 CFLAGS_common.o += -fno-stack-protector
-obj-$(CONFIG_GENERIC_ENTRY) += common.o
+obj-$(CONFIG_GENERIC_ENTRY) += common.o syscall_user_dispatch.o
 obj-$(CONFIG_KVM_XFER_TO_GUEST_WORK) += kvm.o
diff --git a/kernel/entry/common.h b/kernel/entry/common.h
new file mode 100644
index 000000000000..cd0c4e5f143e
--- /dev/null
+++ b/kernel/entry/common.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _COMMON_H
+#define _COMMON_H
+
+bool do_syscall_user_dispatch(struct pt_regs *regs);
+
+static inline bool on_syscall_dispatch(void)
+{
+	if (unlikely(current->syscall_dispatch.on_dispatch)) {
+		current->syscall_dispatch.on_dispatch = false;
+		return true;
+	}
+	return false;
+}
+
+#endif
diff --git a/kernel/entry/syscall_user_dispatch.c b/kernel/entry/syscall_user_dispatch.c
new file mode 100644
index 000000000000..131c38a0b628
--- /dev/null
+++ b/kernel/entry/syscall_user_dispatch.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2020 Collabora Ltd.
+ */
+#include <linux/sched.h>
+#include <linux/prctl.h>
+#include <linux/syscall_user_dispatch.h>
+#include <linux/uaccess.h>
+#include <linux/signal.h>
+#include <linux/elf.h>
+
+#include <asm/syscall.h>
+
+#include <linux/sched/signal.h>
+#include <linux/sched/task_stack.h>
+
+static void trigger_sigsys(struct pt_regs *regs)
+{
+	struct kernel_siginfo info;
+
+	clear_siginfo(&info);
+	info.si_signo = SIGSYS;
+	info.si_code = SYS_USER_DISPATCH;
+	info.si_call_addr = (void __user *)KSTK_EIP(current);
+	info.si_errno = 0;
+	info.si_arch = syscall_get_arch(current);
+	info.si_syscall = syscall_get_nr(current, regs);
+
+	force_sig_info(&info);
+}
+
+bool do_syscall_user_dispatch(struct pt_regs *regs)
+{
+	struct syscall_user_dispatch *sd = &current->syscall_dispatch;
+	char state;
+
+	if (likely(instruction_pointer(regs) - sd->offset < sd->len))
+		return false;
+
+	if (unlikely(arch_syscall_is_vdso_sigreturn(regs)))
+		return false;
+
+	if (likely(sd->selector)) {
+		/*
+		 * access_ok() is performed once, at prctl time, when
+		 * the selector is loaded by userspace.
+		 */
+		if (unlikely(__get_user(state, sd->selector)))
+			do_exit(SIGSEGV);
+
+		if (likely(state == PR_SYS_DISPATCH_OFF))
+			return false;
+
+		if (state != PR_SYS_DISPATCH_ON)
+			do_exit(SIGSYS);
+	}
+
+	sd->on_dispatch = true;
+	syscall_rollback(current, regs);
+	trigger_sigsys(regs);
+
+	return true;
+}
+
+int set_syscall_user_dispatch(unsigned long mode, unsigned long offset,
+			      unsigned long len, char __user *selector)
+{
+	switch (mode) {
+	case PR_SYS_DISPATCH_OFF:
+		if (offset || len || selector)
+			return -EINVAL;
+		break;
+	case PR_SYS_DISPATCH_ON:
+		/*
+		 * Validate the direct dispatcher region just for basic
+		 * sanity against overflow and a 0-sized dispatcher
+		 * region. If the user is able to submit a syscall from
+		 * an address, that address is obviously valid.
+		 */
+		if (offset && offset + len <= offset)
+			return -EINVAL;
+
+		if (selector && !access_ok(selector, sizeof(*selector)))
+			return -EFAULT;
+
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	current->syscall_dispatch.selector = selector;
+	current->syscall_dispatch.offset = offset;
+	current->syscall_dispatch.len = len;
+	current->syscall_dispatch.on_dispatch = false;
+
+	if (mode == PR_SYS_DISPATCH_ON)
+		set_syscall_work(SYSCALL_USER_DISPATCH);
+	else
+		clear_syscall_work(SYSCALL_USER_DISPATCH);
+
+	return 0;
+}
diff --git a/kernel/fork.c b/kernel/fork.c
index 9a01b89ed05c..99c76dab31c1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -906,6 +906,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	clear_user_return_notifier(tsk);
 	clear_tsk_need_resched(tsk);
 	set_task_stack_end_magic(tsk);
+	clear_syscall_work_syscall_user_dispatch(tsk);
 #ifdef CONFIG_STACKPROTECTOR
 	tsk->stack_canary = get_random_canary();
diff --git a/kernel/sys.c b/kernel/sys.c
index a730c03ee607..51f00fe20e4d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -42,6 +42,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/syscall_user_dispatch.h>

 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2530,6 +2531,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,

 		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
 		break;
+	case PR_SET_SYSCALL_USER_DISPATCH:
+		error = set_syscall_user_dispatch(arg2, arg3, arg4,
+						  (char __user *) arg5);
+		break;
 	default:
 		error = -EINVAL;
 		break;
Syscall User Dispatch (SUD) must take precedence over seccomp and ptrace, since the use case is emulation (it can be invoked with a different ABI) such that seccomp filtering by syscall number doesn't make sense in the first place. In addition, either the syscall is dispatched back to userspace, in which case there is no resource for them to trace, or the syscall will be executed, and seccomp/ptrace will execute next.
Since SUD runs before tracepoints, it needs to be a SYSCALL_WORK_EXIT as well, just to prevent a trace exit event when dispatch was triggered. For that, on_syscall_dispatch() examines the context to skip the tracepoint, audit and other work.
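The entry/exit handshake described above amounts to a one-shot flag: the entry path sets it when a syscall is redirected, and the exit path consumes it exactly once. A small userspace model of that pattern follows. It is only a sketch: `struct sud_state`, `mark_dispatched` and `consume_dispatched` are invented names, with `on_dispatch` borrowed from struct syscall_user_dispatch.

```c
#include <stdbool.h>

/* Userspace model only; the field name follows struct syscall_user_dispatch. */
struct sud_state {
	bool on_dispatch;
};

/* Entry side: remember that this syscall was redirected to userspace. */
static void mark_dispatched(struct sud_state *sd)
{
	sd->on_dispatch = true;
}

/* Exit side: report the flag once and clear it, like on_syscall_dispatch(). */
static bool consume_dispatched(struct sud_state *sd)
{
	if (sd->on_dispatch) {
		sd->on_dispatch = false;
		return true;
	}
	return false;
}
```

When consume_dispatched() returns true, the exit path in this model would return early and skip the audit and tracepoint work. The next syscall exit sees the flag cleared again.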
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
Changes since v6:
  - Update do_syscall_intercept signature (Christian Brauner)
  - Move it to before tracepoints
  - Use SYSCALL_WORK flags
---
 include/linux/entry-common.h |  2 ++
 kernel/entry/common.c        | 17 +++++++++++++++++
 2 files changed, 19 insertions(+)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 49b26b216e4e..a6e98b4ba8e9 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -44,10 +44,12 @@
 				 SYSCALL_WORK_SYSCALL_TRACE |		\
 				 SYSCALL_WORK_SYSCALL_EMU |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
+				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
 				 ARCH_SYSCALL_WORK_ENTER)
 #define SYSCALL_WORK_EXIT	(SYSCALL_WORK_SYSCALL_TRACEPOINT |	\
 				 SYSCALL_WORK_SYSCALL_TRACE |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
+				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
 				 ARCH_SYSCALL_WORK_EXIT)

 /*
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index f1b12dc32ff4..ec20aba3b890 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -6,6 +6,8 @@
 #include <linux/livepatch.h>
 #include <linux/audit.h>

+#include "common.h"
+
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>

@@ -47,6 +49,16 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall,
 {
 	long ret = 0;

+	/*
+	 * Handle Syscall User Dispatch. This must come first, since
+	 * the ABI here can be something that doesn't make sense for
+	 * other syscall_work features.
+	 */
+	if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) {
+		if (do_syscall_user_dispatch(regs))
+			return -1L;
+	}
+
 	/* Handle ptrace */
 	if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
 		ret = arch_syscall_enter_tracehook(regs);
@@ -232,6 +244,11 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long work)
 {
 	bool step;

+	if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) {
+		if (on_syscall_dispatch())
+			return;
+	}
+
 	audit_syscall_exit(regs);

 	if (work & SYSCALL_WORK_SYSCALL_TRACEPOINT)
Implement functionality tests for syscall user dispatch. In order to make the test portable, refrain from open coding syscall dispatchers and calculating glibc memory ranges.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

---
Changes since v6:
  - Update selftests to reflect {offset,len} api change

Changes since v4:
  - Update bad selector test to reflect change in API

Changes since v3:
  - Sort entry in Makefile
  - Add SPDX header
  - Use __NR_syscalls if available
---
 tools/testing/selftests/Makefile              |   1 +
 .../syscall_user_dispatch/.gitignore          |   3 +
 .../selftests/syscall_user_dispatch/Makefile  |   9 +
 .../selftests/syscall_user_dispatch/config    |   1 +
 .../syscall_user_dispatch/sud_test.c          | 310 ++++++++++++++++++
 5 files changed, 324 insertions(+)
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/.gitignore
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/Makefile
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/config
 create mode 100644 tools/testing/selftests/syscall_user_dispatch/sud_test.c
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile index 2e20e30a6faa..e93f10386e76 100644 --- a/tools/testing/selftests/Makefile +++ b/tools/testing/selftests/Makefile @@ -57,6 +57,7 @@ TARGETS += sparc64 TARGETS += splice TARGETS += static_keys TARGETS += sync +TARGETS += syscall_user_dispatch TARGETS += sysctl TARGETS += tc-testing TARGETS += timens diff --git a/tools/testing/selftests/syscall_user_dispatch/.gitignore b/tools/testing/selftests/syscall_user_dispatch/.gitignore new file mode 100644 index 000000000000..f539615ad5da --- /dev/null +++ b/tools/testing/selftests/syscall_user_dispatch/.gitignore @@ -0,0 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only +sud_test +sud_benchmark diff --git a/tools/testing/selftests/syscall_user_dispatch/Makefile b/tools/testing/selftests/syscall_user_dispatch/Makefile new file mode 100644 index 000000000000..8e15fa42bcda --- /dev/null +++ b/tools/testing/selftests/syscall_user_dispatch/Makefile @@ -0,0 +1,9 @@ +# SPDX-License-Identifier: GPL-2.0 +top_srcdir = ../../../.. +INSTALL_HDR_PATH = $(top_srcdir)/usr +LINUX_HDR_PATH = $(INSTALL_HDR_PATH)/include/ + +CFLAGS += -Wall -I$(LINUX_HDR_PATH) + +TEST_GEN_PROGS := sud_test +include ../lib.mk diff --git a/tools/testing/selftests/syscall_user_dispatch/config b/tools/testing/selftests/syscall_user_dispatch/config new file mode 100644 index 000000000000..039e303e59d7 --- /dev/null +++ b/tools/testing/selftests/syscall_user_dispatch/config @@ -0,0 +1 @@ +CONFIG_GENERIC_ENTRY=y diff --git a/tools/testing/selftests/syscall_user_dispatch/sud_test.c b/tools/testing/selftests/syscall_user_dispatch/sud_test.c new file mode 100644 index 000000000000..6498b050ef89 --- /dev/null +++ b/tools/testing/selftests/syscall_user_dispatch/sud_test.c @@ -0,0 +1,310 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (c) 2020 Collabora Ltd. 
+ * + * Test code for syscall user dispatch + */ + +#define _GNU_SOURCE +#include <sys/prctl.h> +#include <sys/sysinfo.h> +#include <sys/syscall.h> +#include <signal.h> + +#include <asm/unistd.h> +#include "../kselftest_harness.h" + +#ifndef PR_SET_SYSCALL_USER_DISPATCH +# define PR_SET_SYSCALL_USER_DISPATCH 59 +# define PR_SYS_DISPATCH_OFF 0 +# define PR_SYS_DISPATCH_ON 1 +#endif + +#ifndef SYS_USER_DISPATCH +# define SYS_USER_DISPATCH 2 +#endif + +#ifdef __NR_syscalls +# define MAGIC_SYSCALL_1 (__NR_syscalls + 1) /* Bad Linux syscall number */ +#else +# define MAGIC_SYSCALL_1 (0xff00) /* Bad Linux syscall number */ +#endif + +#define SYSCALL_DISPATCH_ON(x) ((x) = 1) +#define SYSCALL_DISPATCH_OFF(x) ((x) = 0) + +/* Test Summary: + * + * - dispatch_trigger_sigsys: Verify if PR_SET_SYSCALL_USER_DISPATCH is + * able to trigger SIGSYS on a syscall. + * + * - bad_selector: Test that a bad selector value triggers SIGSYS with + * si_errno EINVAL. + * + * - bad_prctl_param: Test that the API correctly rejects invalid + * parameters on prctl + * + * - dispatch_and_return: Test that a syscall is selectively dispatched + * to userspace depending on the value of selector. + * + * - disable_dispatch: Test that the PR_SYS_DISPATCH_OFF correctly + * disables the dispatcher + * + * - direct_dispatch_range: Test that a syscall within the allowed range + * can bypass the dispatcher. 
+ */ + +TEST_SIGNAL(dispatch_trigger_sigsys, SIGSYS) +{ + char sel = 0; + struct sysinfo info; + int ret; + + ret = sysinfo(&info); + ASSERT_EQ(0, ret); + + ret = prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0, 0, &sel); + ASSERT_EQ(0, ret) { + TH_LOG("Kernel does not support CONFIG_SYSCALL_USER_DISPATCH"); + } + + SYSCALL_DISPATCH_ON(sel); + + sysinfo(&info); + + EXPECT_FALSE(true) { + TH_LOG("Unreachable!"); + } +} + +TEST(bad_prctl_param) +{ + char sel = 0; + int op; + + /* Invalid op */ + op = -1; + prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0, 0, &sel); + ASSERT_EQ(EINVAL, errno); + + /* PR_SYS_DISPATCH_OFF */ + op = PR_SYS_DISPATCH_OFF; + + /* offset != 0 */ + prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x1, 0x0, 0); + EXPECT_EQ(EINVAL, errno); + + /* len != 0 */ + prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0xff, 0); + EXPECT_EQ(EINVAL, errno); + + /* sel != NULL */ + prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0x0, &sel); + EXPECT_EQ(EINVAL, errno); + + /* Valid parameter */ + errno = 0; + prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0x0, 0x0); + EXPECT_EQ(0, errno); + + /* PR_SYS_DISPATCH_ON */ + op = PR_SYS_DISPATCH_ON; + + /* Dispatcher region is bad (offset > 0 && len == 0) */ + prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x1, 0x0, &sel); + EXPECT_EQ(EINVAL, errno); + prctl(PR_SET_SYSCALL_USER_DISPATCH, op, -1L, 0x0, &sel); + EXPECT_EQ(EINVAL, errno); + + /* Invalid selector */ + prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0x1, (void *) -1); + ASSERT_EQ(EFAULT, errno); + + /* + * Dispatcher range overflows unsigned long + */ + prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 1, -1L, &sel); + ASSERT_EQ(EINVAL, errno) { + TH_LOG("Should reject bad syscall range"); + } + + /* + * Allowed range overflows usigned long + */ + prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, -1L, 0x1, &sel); + ASSERT_EQ(EINVAL, errno) { + TH_LOG("Should reject bad syscall range"); + } +} + +/* + * Use global selector for handle_sigsys tests, to avoid passing + 
* selector to signal handler + */ +char glob_sel; +int nr_syscalls_emulated; +int si_code; +int si_errno; + +static void handle_sigsys(int sig, siginfo_t *info, void *ucontext) +{ + si_code = info->si_code; + si_errno = info->si_errno; + + if (info->si_syscall == MAGIC_SYSCALL_1) + nr_syscalls_emulated++; + + /* In preparation for sigreturn. */ + SYSCALL_DISPATCH_OFF(glob_sel); +} + +TEST(dispatch_and_return) +{ + long ret; + struct sigaction act; + sigset_t mask; + + glob_sel = 0; + nr_syscalls_emulated = 0; + si_code = 0; + si_errno = 0; + + memset(&act, 0, sizeof(act)); + sigemptyset(&mask); + + act.sa_sigaction = handle_sigsys; + act.sa_flags = SA_SIGINFO; + act.sa_mask = mask; + + ret = sigaction(SIGSYS, &act, NULL); + ASSERT_EQ(0, ret); + + /* Make sure selector is good prior to prctl. */ + SYSCALL_DISPATCH_OFF(glob_sel); + + ret = prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0, 0, &glob_sel); + ASSERT_EQ(0, ret) { + TH_LOG("Kernel does not support CONFIG_SYSCALL_USER_DISPATCH"); + } + + /* MAGIC_SYSCALL_1 doesn't exist. */ + SYSCALL_DISPATCH_OFF(glob_sel); + ret = syscall(MAGIC_SYSCALL_1); + EXPECT_EQ(-1, ret) { + TH_LOG("Dispatch triggered unexpectedly"); + } + + /* MAGIC_SYSCALL_1 should be emulated. 
*/ + nr_syscalls_emulated = 0; + SYSCALL_DISPATCH_ON(glob_sel); + + ret = syscall(MAGIC_SYSCALL_1); + EXPECT_EQ(MAGIC_SYSCALL_1, ret) { + TH_LOG("Failed to intercept syscall"); + } + EXPECT_EQ(1, nr_syscalls_emulated) { + TH_LOG("Failed to emulate syscall"); + } + ASSERT_EQ(SYS_USER_DISPATCH, si_code) { + TH_LOG("Bad si_code in SIGSYS"); + } + ASSERT_EQ(0, si_errno) { + TH_LOG("Bad si_errno in SIGSYS"); + } +} + +TEST_SIGNAL(bad_selector, SIGSYS) +{ + long ret; + struct sigaction act; + sigset_t mask; + struct sysinfo info; + + glob_sel = 0; + nr_syscalls_emulated = 0; + si_code = 0; + si_errno = 0; + + memset(&act, 0, sizeof(act)); + sigemptyset(&mask); + + act.sa_sigaction = handle_sigsys; + act.sa_flags = SA_SIGINFO; + act.sa_mask = mask; + + ret = sigaction(SIGSYS, &act, NULL); + ASSERT_EQ(0, ret); + + /* Make sure selector is good prior to prctl. */ + SYSCALL_DISPATCH_OFF(glob_sel); + + ret = prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0, 0, &glob_sel); + ASSERT_EQ(0, ret) { + TH_LOG("Kernel does not support CONFIG_SYSCALL_USER_DISPATCH"); + } + + glob_sel = -1; + + sysinfo(&info); + + /* Even though it is ready to catch SIGSYS, the signal is + * supposed to be uncatchable. + */ + + EXPECT_FALSE(true) { + TH_LOG("Unreachable!"); + } +} + +TEST(disable_dispatch) +{ + int ret; + struct sysinfo info; + char sel = 0; + + ret = prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0, 0, &sel); + ASSERT_EQ(0, ret) { + TH_LOG("Kernel does not support CONFIG_SYSCALL_USER_DISPATCH"); + } + + /* MAGIC_SYSCALL_1 doesn't exist. */ + SYSCALL_DISPATCH_OFF(glob_sel); + + ret = prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF, 0, 0, 0); + EXPECT_EQ(0, ret) { + TH_LOG("Failed to unset syscall user dispatch"); + } + + /* Shouldn't have any effect... 
*/ + SYSCALL_DISPATCH_ON(glob_sel); + + ret = syscall(__NR_sysinfo, &info); + EXPECT_EQ(0, ret) { + TH_LOG("Dispatch triggered unexpectedly"); + } +} + +TEST(direct_dispatch_range) +{ + int ret = 0; + struct sysinfo info; + char sel = 0; + + /* + * Instead of calculating libc addresses; allow the entire + * memory map and lock the selector. + */ + ret = prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0, -1L, &sel); + ASSERT_EQ(0, ret) { + TH_LOG("Kernel does not support CONFIG_SYSCALL_USER_DISPATCH"); + } + + SYSCALL_DISPATCH_ON(sel); + + ret = sysinfo(&info); + ASSERT_EQ(0, ret) { + TH_LOG("Dispatch triggered unexpectedly"); + } +} + +TEST_HARNESS_MAIN
This is the patch I'm using to evaluate the impact syscall user dispatch has on native syscalls (syscalls not redirected to userspace) when it is enabled for the process and syscalls are submitted through the unblocked dispatch selector. It works by running a step to define a baseline of the cost of executing sysinfo, then enabling SUD, and rerunning that step.
On my test machine, an AMD Ryzen 5 1500X, I have the following results with the latest version of syscall user dispatch patches.
root@olga:~# syscall_user_dispatch/sud_benchmark Calibrating test set to last ~5 seconds... test iterations = 37500000 Avg syscall time 134ns. Caught sys_ff00 trapped_call_count 1, native_call_count 0. Avg syscall time 147ns. Interception overhead: 9.7% (+13ns).
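The last line of that output follows directly from the two averages. A minimal sketch of the arithmetic, assuming nothing beyond the two reported numbers; the helper names overhead_percent, overhead_ns and print_overhead are illustrative and are not part of sud_benchmark.c:

```c
#include <assert.h>
#include <stdio.h>

/* Relative slowdown of the SUD-enabled run over the baseline, in percent. */
static double overhead_percent(double base_ns, double sud_ns)
{
	assert(base_ns > 0.0);
	return 100.0 * (sud_ns / base_ns - 1.0);
}

/* Absolute per-syscall cost added with SUD enabled, in nanoseconds. */
static double overhead_ns(double base_ns, double sud_ns)
{
	return sud_ns - base_ns;
}

/* Print a summary in the same shape as the benchmark's final line. */
static void print_overhead(double base_ns, double sud_ns)
{
	printf("Interception overhead: %.1f%% (+%.0fns).\n",
	       overhead_percent(base_ns, sud_ns),
	       overhead_ns(base_ns, sud_ns));
}
```

Calling print_overhead(134.0, 147.0) corresponds to the run above: 13ns on top of a 134ns baseline is a little under 10%.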
Signed-off-by: Gabriel Krisman Bertazi krisman@collabora.com Acked-by: Peter Zijlstra (Intel) peterz@infradead.org --- .../selftests/syscall_user_dispatch/Makefile | 2 +- .../syscall_user_dispatch/sud_benchmark.c | 200 ++++++++++++++++++ 2 files changed, 201 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/syscall_user_dispatch/sud_benchmark.c
diff --git a/tools/testing/selftests/syscall_user_dispatch/Makefile b/tools/testing/selftests/syscall_user_dispatch/Makefile index 8e15fa42bcda..03c120270953 100644 --- a/tools/testing/selftests/syscall_user_dispatch/Makefile +++ b/tools/testing/selftests/syscall_user_dispatch/Makefile @@ -5,5 +5,5 @@ LINUX_HDR_PATH = $(INSTALL_HDR_PATH)/include/
CFLAGS += -Wall -I$(LINUX_HDR_PATH)
-TEST_GEN_PROGS := sud_test +TEST_GEN_PROGS := sud_test sud_benchmark include ../lib.mk diff --git a/tools/testing/selftests/syscall_user_dispatch/sud_benchmark.c b/tools/testing/selftests/syscall_user_dispatch/sud_benchmark.c new file mode 100644 index 000000000000..6689f1183dbf --- /dev/null +++ b/tools/testing/selftests/syscall_user_dispatch/sud_benchmark.c @@ -0,0 +1,200 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (c) 2020 Collabora Ltd. + * + * Benchmark and test syscall user dispatch + */ + +#define _GNU_SOURCE +#include <stdio.h> +#include <string.h> +#include <stdlib.h> +#include <signal.h> +#include <errno.h> +#include <time.h> +#include <sys/time.h> +#include <unistd.h> +#include <sys/sysinfo.h> +#include <sys/prctl.h> +#include <sys/syscall.h> + +#ifndef PR_SET_SYSCALL_USER_DISPATCH +# define PR_SET_SYSCALL_USER_DISPATCH 59 +# define PR_SYS_DISPATCH_OFF 0 +# define PR_SYS_DISPATCH_ON 1 +#endif + +#ifdef __NR_syscalls +# define MAGIC_SYSCALL_1 (__NR_syscalls + 1) /* Bad Linux syscall number */ +#else +# define MAGIC_SYSCALL_1 (0xff00) /* Bad Linux syscall number */ +#endif + +/* + * To test returning from a sigsys with selector blocked, the test + * requires some per-architecture support (i.e. knowledge about the + * signal trampoline address). On i386, we know it is on the vdso, and + * a small trampoline is open-coded for x86_64. Other architectures + * that have a trampoline in the vdso will support TEST_BLOCKED_RETURN + * out of the box, but don't enable them until they support syscall user + * dispatch. 
+ */ +#if defined(__x86_64__) || defined(__i386__) +#define TEST_BLOCKED_RETURN +#endif + +#ifdef __x86_64__ +void* (syscall_dispatcher_start)(void); +void* (syscall_dispatcher_end)(void); +#else +unsigned long syscall_dispatcher_start = 0; +unsigned long syscall_dispatcher_end = 0; +#endif + +unsigned long trapped_call_count = 0; +unsigned long native_call_count = 0; + +char selector; +#define SYSCALL_BLOCK (selector = PR_SYS_DISPATCH_ON) +#define SYSCALL_UNBLOCK (selector = PR_SYS_DISPATCH_OFF) + +#define CALIBRATION_STEP 100000 +#define CALIBRATE_TO_SECS 5 +int factor; + +static double one_sysinfo_step(void) +{ + struct timespec t1, t2; + int i; + struct sysinfo info; + + clock_gettime(CLOCK_MONOTONIC, &t1); + for (i = 0; i < CALIBRATION_STEP; i++) + sysinfo(&info); + clock_gettime(CLOCK_MONOTONIC, &t2); + return (t2.tv_sec - t1.tv_sec) + 1.0e-9 * (t2.tv_nsec - t1.tv_nsec); +} + +static void calibrate_set(void) +{ + double elapsed = 0; + + printf("Calibrating test set to last ~%d seconds...\n", CALIBRATE_TO_SECS); + + while (elapsed < 1) { + elapsed += one_sysinfo_step(); + factor += CALIBRATE_TO_SECS; + } + + printf("test iterations = %d\n", CALIBRATION_STEP * factor); +} + +static double perf_syscall(void) +{ + unsigned int i; + double partial = 0; + + for (i = 0; i < factor; ++i) + partial += one_sysinfo_step()/(CALIBRATION_STEP*factor); + return partial; +} + +static void handle_sigsys(int sig, siginfo_t *info, void *ucontext) +{ + char buf[1024]; + int len; + + SYSCALL_UNBLOCK; + + /* printf and friends are not signal-safe. 
*/ + len = snprintf(buf, 1024, "Caught sys_%x\n", info->si_syscall); + write(1, buf, len); + + if (info->si_syscall == MAGIC_SYSCALL_1) + trapped_call_count++; + else + native_call_count++; + +#ifdef TEST_BLOCKED_RETURN + SYSCALL_BLOCK; +#endif + +#ifdef __x86_64__ + __asm__ volatile("movq $0xf, %rax"); + __asm__ volatile("leaveq"); + __asm__ volatile("add $0x8, %rsp"); + __asm__ volatile("syscall_dispatcher_start:"); + __asm__ volatile("syscall"); + __asm__ volatile("nop"); /* Landing pad within dispatcher area */ + __asm__ volatile("syscall_dispatcher_end:"); +#endif + +} + +int main(void) +{ + struct sigaction act; + double time1, time2; + int ret; + sigset_t mask; + + memset(&act, 0, sizeof(act)); + sigemptyset(&mask); + + act.sa_sigaction = handle_sigsys; + act.sa_flags = SA_SIGINFO; + act.sa_mask = mask; + + calibrate_set(); + + time1 = perf_syscall(); + printf("Avg syscall time %.0lfns.\n", time1 * 1.0e9); + + ret = sigaction(SIGSYS, &act, NULL); + if (ret) { + perror("Error sigaction:"); + exit(-1); + } + + fprintf(stderr, "Enabling syscall trapping.\n"); + + if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, + syscall_dispatcher_start, + (syscall_dispatcher_end - syscall_dispatcher_start + 1), + &selector)) { + perror("prctl failed\n"); + exit(-1); + } + + SYSCALL_BLOCK; + syscall(MAGIC_SYSCALL_1); + +#ifdef TEST_BLOCKED_RETURN + if (selector == PR_SYS_DISPATCH_OFF) { + fprintf(stderr, "Failed to return with selector blocked.\n"); + exit(-1); + } +#endif + + SYSCALL_UNBLOCK; + + if (!trapped_call_count) { + fprintf(stderr, "syscall trapping does not work.\n"); + exit(-1); + } + + time2 = perf_syscall(); + + if (native_call_count) { + perror("syscall trapping intercepted more syscalls than expected\n"); + exit(-1); + } + + printf("trapped_call_count %lu, native_call_count %lu.\n", + trapped_call_count, native_call_count); + printf("Avg syscall time %.0lfns.\n", time2 * 1.0e9); + printf("Interception overhead: %.1lf%% (+%.0lfns).\n", + 100.0 * 
(time2 / time1 - 1.0), 1.0e9 * (time2 - time1)); + return 0; + +}
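As a reading aid for calibrate_set() above: the loop counts how many 100000-call steps fit in roughly one second and adds CALIBRATE_TO_SECS to factor for each of them, so that perf_syscall() later runs for about five seconds. The sketch below replays that loop with a fixed step duration instead of a measured one. The 13.4 ms step time is an assumption picked only so that the arithmetic lands on the iteration count printed in the commit message; calibrate_factor and total_iterations are hypothetical names.

```c
#include <assert.h>

#define CALIBRATION_STEP	100000	/* sysinfo() calls per step, as in the benchmark */
#define CALIBRATE_TO_SECS	5	/* target run time, as in the benchmark */

/*
 * Same loop shape as calibrate_set(), but driven by a fixed, assumed
 * step duration so that the result is deterministic.
 */
static int calibrate_factor(double step_secs)
{
	double elapsed = 0;
	int factor = 0;

	assert(step_secs > 0);
	while (elapsed < 1) {
		elapsed += step_secs;
		factor += CALIBRATE_TO_SECS;
	}
	return factor;
}

/* Total number of sysinfo() calls issued by one measurement run. */
static long total_iterations(int factor)
{
	return (long)CALIBRATION_STEP * factor;
}
```

With an assumed step of 0.0134 seconds the loop runs 75 times, giving a factor of 375 and 37500000 iterations in total.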
Explain the interface, provide some background and security notes.
Signed-off-by: Gabriel Krisman Bertazi krisman@collabora.com Reviewed-by: Kees Cook keescook@chromium.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org
--- Changes since v7: - Change process -> thread (Florian Weimer) - Drop bogus reference to CONFIG_SYSCALL_USER_DISPATCH (me) - Document the interval as a half-open interval (me) --- .../admin-guide/syscall-user-dispatch.rst | 87 +++++++++++++++++++ 1 file changed, 87 insertions(+) create mode 100644 Documentation/admin-guide/syscall-user-dispatch.rst
diff --git a/Documentation/admin-guide/syscall-user-dispatch.rst b/Documentation/admin-guide/syscall-user-dispatch.rst
new file mode 100644
index 000000000000..0ee7491440b3
--- /dev/null
+++ b/Documentation/admin-guide/syscall-user-dispatch.rst
@@ -0,0 +1,87 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Syscall User Dispatch
+=====================
+
+Background
+----------
+
+Compatibility layers like Wine need a way to efficiently emulate system
+calls of only a part of their process - the part that has the
+incompatible code - while being able to execute native syscalls without
+a high performance penalty on the native part of the process. Seccomp
+falls short on this task, since it has limited support to efficiently
+filter syscalls based on memory regions, and it doesn't support removing
+filters. Therefore a new mechanism is necessary.
+
+Syscall User Dispatch brings the filtering of the syscall dispatcher
+address back to userspace. The application is in control of a flip
+switch, indicating the current personality of the process. A
+multiple-personality application can then flip the switch without
+invoking the kernel, when crossing the compatibility layer API
+boundaries, to enable/disable the syscall redirection and execute
+syscalls directly (disabled) or send them to be emulated in userspace
+through a SIGSYS.
+
+The goal of this design is to provide very quick compatibility layer
+boundary crosses, which is achieved by not executing a syscall to change
+personality every time the compatibility layer executes. Instead, a
+userspace memory region exposed to the kernel indicates the current
+personality, and the application simply modifies that variable to
+configure the mechanism.
+
+There is a relatively high cost associated with handling signals on most
+architectures, like x86, but at least for Wine, syscalls issued by
+native Windows code are currently not known to be a performance problem,
+since they are quite rare, at least for modern gaming applications.
+
+Since this mechanism is designed to capture syscalls issued by
+non-native applications, it must function on syscalls whose invocation
+ABI is completely unexpected to Linux. Syscall User Dispatch, therefore
+doesn't rely on any of the syscall ABI to make the filtering. It uses
+only the syscall dispatcher address and the userspace key.
+
+Interface
+---------
+
+A thread can set up this mechanism on supported kernels by executing the
+following prctl:
+
+  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
+
+<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
+disable the mechanism globally for that thread. When
+PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
+
+[<offset>, <offset>+<length>) delimit a memory region interval
+from which syscalls are always executed directly, regardless of the
+userspace selector. This provides a fast path for the C library, which
+includes the most common syscall dispatchers in the native code
+applications, and also provides a way for the signal handler to return
+without triggering a nested SIGSYS on (rt_)sigreturn. Users of this
+interface should make sure that at least the signal trampoline code is
+included in this region. In addition, for architectures that implement the
+trampoline code on the vDSO, that trampoline is never intercepted.
+
+[selector] is a pointer to a char-sized region in the process memory
+that provides a quick way to enable/disable syscall redirection
+thread-wide, without the need to invoke the kernel directly. selector
+can be set to PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF. Any other
+value should terminate the program with a SIGSYS.
+
+Security Notes
+--------------
+
+Syscall User Dispatch provides functionality for compatibility layers to
+quickly capture system calls issued by a non-native part of the
+application, while not impacting the Linux native regions of the
+process. It is not a mechanism for sandboxing system calls, and it
+should not be seen as a security mechanism, since it is trivial for a
+malicious application to subvert the mechanism by jumping to an allowed
+dispatcher region prior to executing the syscall, or to discover the
+address and modify the selector value. If the use case requires any
+kind of security sandboxing, Seccomp should be used instead.
+
+Any fork or exec of the existing process resets the mechanism to
+PR_SYS_DISPATCH_OFF.
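For readers who want to see the interface described in this document from the userspace side, here is a minimal sketch. It assumes the prctl numbers used by the selftests in this series (59, 0 and 1) when the installed headers lack them; sud_selector, sud_enable, sud_block, sud_unblock and sud_disable are hypothetical names chosen for illustration. It passes no always-allowed region, so a real user would still need a SIGSYS handler and a region covering the signal trampoline before ever setting the selector to PR_SYS_DISPATCH_ON.

```c
#include <assert.h>
#include <sys/prctl.h>

/* Fallback values, taken from the selftests in this series. */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
# define PR_SET_SYSCALL_USER_DISPATCH 59
#endif
#ifndef PR_SYS_DISPATCH_OFF
# define PR_SYS_DISPATCH_OFF 0
#endif
#ifndef PR_SYS_DISPATCH_ON
# define PR_SYS_DISPATCH_ON 1
#endif

/* Hypothetical selector; the kernel reads it on syscall entry. */
static volatile char sud_selector = PR_SYS_DISPATCH_OFF;
static_assert(sizeof(sud_selector) == 1, "selector must be char-sized");

/*
 * Enable SUD for the calling thread with an empty allowed region
 * (offset = 0, length = 0).  Returns 0 on success, or -1 with errno
 * set, for example on kernels that lack the feature.
 */
static int sud_enable(void)
{
	sud_selector = PR_SYS_DISPATCH_OFF;	/* start with native syscalls */
	return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
		     0UL, 0UL, (unsigned long)&sud_selector);
}

/* Flip the switch without entering the kernel. */
static inline void sud_block(void)
{
	sud_selector = PR_SYS_DISPATCH_ON;
}

static inline void sud_unblock(void)
{
	sud_selector = PR_SYS_DISPATCH_OFF;
}

/* Turn the mechanism off again; all other arguments must be zero. */
static int sud_disable(void)
{
	return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF,
		     0UL, 0UL, 0UL);
}
```

A caller would run sud_enable() once per thread, wrap the non-native code in sud_block()/sud_unblock(), and call sud_disable() before the selector's storage goes away.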
On Fri, 27 Nov 2020 14:32:38 -0500 Gabriel Krisman Bertazi krisman@collabora.com wrote:
Explain the interface, provide some background and security notes.
Signed-off-by: Gabriel Krisman Bertazi krisman@collabora.com Reviewed-by: Kees Cook keescook@chromium.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org
Nice to see documentation included...:) One nit:
Changes since v7:
- Change process -> thread (Florian Weimer)
- Drop bogus reference to CONFIG_SYSCALL_USER_DISPATCH (me)
- Document the interval as a half-open interval (me)
.../admin-guide/syscall-user-dispatch.rst | 87 +++++++++++++++++++ 1 file changed, 87 insertions(+) create mode 100644 Documentation/admin-guide/syscall-user-dispatch.rst
You need to add this file to index.rst in that directory as well so it gets included in the docs build.
Thanks,
jon
On Fri, Nov 27 2020 at 14:32, Gabriel Krisman Bertazi wrote:
+Compatibility layers like Wine need a way to efficiently emulate system +calls of only a part of their process - the part that has the +incompatible code - while being able to execute native syscalls without +a high performance penalty on the native part of the process. Seccomp +falls short on this task, since it has limited support to efficiently +filter syscalls based on memory regions, and it doesn't support removing +filters. Therefore a new mechanism is necessary.
+Syscall User Dispatch brings the filtering of the syscall dispatcher +address back to userspace. The application is in control of a flip +switch, indicating the current personality of the process. A +multiple-personality application can then flip the switch without +invoking the kernel, when crossing the compatibility layer API +boundaries, to enable/disable the syscall redirection and execute +syscalls directly (disabled) or send them to be emulated in userspace +through a SIGSYS.
+The goal of this design is to provide very quick compatibility layer +boundary crosses, which is achieved by not executing a syscall to change +personality every time the compatibility layer executes. Instead, a +userspace memory region exposed to the kernel indicates the current +personality, and the application simply modifies that variable to +configure the mechanism.
+There is a relatively high cost associated with handling signals on most +architectures, like x86, but at least for Wine, syscalls issued by +native Windows code are currently not known to be a performance problem, +since they are quite rare, at least for modern gaming applications.
+Since this mechanism is designed to capture syscalls issued by +non-native applications, it must function on syscalls whose invocation +ABI is completely unexpected to Linux. Syscall User Dispatch, therefore +doesn't rely on any of the syscall ABI to make the filtering. It uses +only the syscall dispatcher address and the userspace key.
I think this lacks information about the non-visibility of these syscalls. Something like this:
As the ABI of these intercepted syscalls is unknown to Linux, these syscalls are not instrumentable via ptrace or the syscall tracepoints.
I'll add that unless someone objects or comes up with better wording before I apply the lot tomorrow morning.
Thanks,
tglx
On Fri, Nov 27, 2020 at 02:32:34PM -0500, Gabriel Krisman Bertazi wrote:
Introduce a mechanism to quickly disable/enable syscall handling for a specific process and redirect to userspace via SIGSYS. This is useful for processes with parts that require syscall redirection and parts that don't, but who need to perform this boundary crossing really fast, without paying the cost of a system call to reconfigure syscall handling on each boundary transition. This is particularly important for Windows games running over Wine.
The proposed interface looks like this:
prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
The range [<offset>,<offset>+<length>) is a part of the process memory map that is allowed to bypass the redirection code and dispatch syscalls directly, such that in fast paths a process doesn't need to disable the trap, nor does the kernel have to check the selector. This is essential to return from SIGSYS to a blocked area without triggering another SIGSYS from rt_sigreturn.
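The half-open interval can be tested with a single unsigned comparison, which is the shape of the check in do_syscall_user_dispatch() quoted later in this thread (instruction_pointer(regs) - sd->offset < sd->len). A small userspace sketch of that comparison follows; ip_in_allowed_range and range_selfcheck are illustrative names, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when ip lies in the half-open interval
 * [offset, offset + len).  The single unsigned comparison also rejects
 * ip < offset, because the subtraction then wraps to a huge value.
 */
static bool ip_in_allowed_range(uintptr_t ip, uintptr_t offset, uintptr_t len)
{
	return ip - offset < len;
}

/* A few boundary cases for a 16-byte region starting at 0x1000. */
static void range_selfcheck(void)
{
	assert(ip_in_allowed_range(0x1000, 0x1000, 0x10));	/* first byte */
	assert(ip_in_allowed_range(0x100f, 0x1000, 0x10));	/* last byte */
	assert(!ip_in_allowed_range(0x1010, 0x1000, 0x10));	/* one past the end */
	assert(!ip_in_allowed_range(0x0fff, 0x1000, 0x10));	/* just below */
	assert(!ip_in_allowed_range(0x1000, 0, 0));		/* empty region */
}
```

An empty region (length 0) never matches, so every syscall then goes through the selector check.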
selector is an optional pointer to a char-sized userspace memory region that has a key switch for the mechanism. This key switch is set to either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF to enable or disable the redirection without calling the kernel.
The feature is meant to be set per-thread and it is disabled on fork/clone/execv.
Internally, this doesn't add overhead to the syscall hot path, and it requires very little per-architecture support. I avoided using seccomp, even though it duplicates some functionality, due to previous feedback that maybe it shouldn't mix with seccomp since it is not a security mechanism. And obviously, this should never be considered a security mechanism, since any part of the program can by-pass it by using the syscall dispatcher.
For the sysinfo benchmark, which measures the overhead added to executing a native syscall that doesn't require interception, the overhead using only the direct dispatcher region to issue syscalls is negligible. The overhead of using the selector is around 40ns for a native (unredirected) syscall on my system, and it is (as expected) dominated by the supervisor-mode user-address access. In fact, with SMAP off, the overhead is consistently less than 5ns on my test box.
Cc: Matthew Wilcox willy@infradead.org Cc: Andy Lutomirski luto@kernel.org Cc: Paul Gofman gofmanp@gmail.com Cc: Kees Cook keescook@chromium.org Cc: linux-api@vger.kernel.org Signed-off-by: Gabriel Krisman Bertazi krisman@collabora.com
Acked-by: Kees Cook keescook@chromium.org
On Fri, Nov 27, 2020 at 02:32:35PM -0500, Gabriel Krisman Bertazi wrote:
Syscall User Dispatch (SUD) must take precedence over seccomp and ptrace, since the use case is emulation (it can be invoked with a different ABI) such that seccomp filtering by syscall number doesn't make sense in the first place. In addition, either the syscall is dispatched back to userspace, in which case there is nothing to trace, or the syscall will be executed, and seccomp/ptrace will execute next.
Since SUD runs before tracepoints, it needs to be a SYSCALL_WORK_EXIT as well, just to prevent a trace exit event when dispatch was triggered. For that, the on_syscall_dispatch() examines context to skip the tracepoint, audit and other work.
Signed-off-by: Gabriel Krisman Bertazi krisman@collabora.com
Acked-by: Kees Cook keescook@chromium.org
On Fri, Nov 27, 2020 at 02:32:37PM -0500, Gabriel Krisman Bertazi wrote:
This is the patch I'm using to evaluate the impact syscall user dispatch has on native syscalls (syscalls not redirected to userspace) when it is enabled for the process and syscalls are submitted through the unblocked dispatch selector. It works by running a step to define a baseline of the cost of executing sysinfo, then enabling SUD, and rerunning that step.
On my test machine, an AMD Ryzen 5 1500X, I have the following results with the latest version of syscall user dispatch patches.
root@olga:~# syscall_user_dispatch/sud_benchmark Calibrating test set to last ~5 seconds... test iterations = 37500000 Avg syscall time 134ns. Caught sys_ff00 trapped_call_count 1, native_call_count 0. Avg syscall time 147ns. Interception overhead: 9.7% (+13ns).
Signed-off-by: Gabriel Krisman Bertazi krisman@collabora.com
Reviewed-by: Kees Cook keescook@chromium.org
On Tue, Dec 01 2020 at 15:21, Jonathan Corbet wrote:
On Fri, 27 Nov 2020 14:32:38 -0500 Gabriel Krisman Bertazi krisman@collabora.com wrote:
Explain the interface, provide some background and security notes.
Signed-off-by: Gabriel Krisman Bertazi krisman@collabora.com Reviewed-by: Kees Cook keescook@chromium.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org
Nice to see documentation included...:) One nit:
Changes since v7:
- Change process -> thread (Florian Weimer)
- Drop bogus reference to CONFIG_SYSCALL_USER_DISPATCH (me)
- Document the interval as a half-open interval (me)
.../admin-guide/syscall-user-dispatch.rst | 87 +++++++++++++++++++ 1 file changed, 87 insertions(+) create mode 100644 Documentation/admin-guide/syscall-user-dispatch.rst
You need to add this file to index.rst in that directory as well so it gets included in the docs build.
Fixed that already after trying to build it :)
On Fri, Nov 27, 2020 at 11:33 AM Gabriel Krisman Bertazi krisman@collabora.com wrote:
Syscall User Dispatch (SUD) must take precedence over seccomp and ptrace, since the use case is emulation (it can be invoked with a different ABI) such that seccomp filtering by syscall number doesn't make sense in the first place. In addition, either the syscall is dispatched back to userspace, in which case there is nothing to trace, or the syscall will be executed, and seccomp/ptrace will execute next.
Since SUD runs before tracepoints, it needs to be a SYSCALL_WORK_EXIT as well, just to prevent a trace exit event when dispatch was triggered. For that, the on_syscall_dispatch() examines context to skip the tracepoint, audit and other work.
Signed-off-by: Gabriel Krisman Bertazi krisman@collabora.com Acked-by: Peter Zijlstra (Intel) peterz@infradead.org
Changes since v6:
- Update do_syscall_intercept signature (Christian Brauner)
- Move it to before tracepoints
- Use SYSCALL_WORK flags
include/linux/entry-common.h | 2 ++ kernel/entry/common.c | 17 +++++++++++++++++ 2 files changed, 19 insertions(+)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index 49b26b216e4e..a6e98b4ba8e9 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -44,10 +44,12 @@ SYSCALL_WORK_SYSCALL_TRACE | \ SYSCALL_WORK_SYSCALL_EMU | \ SYSCALL_WORK_SYSCALL_AUDIT | \
SYSCALL_WORK_SYSCALL_USER_DISPATCH | \ ARCH_SYSCALL_WORK_ENTER)
#define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \ SYSCALL_WORK_SYSCALL_TRACE | \ SYSCALL_WORK_SYSCALL_AUDIT | \
SYSCALL_WORK_SYSCALL_USER_DISPATCH | \ ARCH_SYSCALL_WORK_EXIT)
/* diff --git a/kernel/entry/common.c b/kernel/entry/common.c index f1b12dc32ff4..ec20aba3b890 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -6,6 +6,8 @@ #include <linux/livepatch.h> #include <linux/audit.h>
+#include "common.h"
#define CREATE_TRACE_POINTS #include <trace/events/syscalls.h>
@@ -47,6 +49,16 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall, { long ret = 0;
/*
* Handle Syscall User Dispatch. This must come first, since
* the ABI here can be something that doesn't make sense for
* other syscall_work features.
*/
if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) {
if (do_syscall_user_dispatch(regs))
return -1L;
}
/* Handle ptrace */ if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) { ret = arch_syscall_enter_tracehook(regs);
@@ -232,6 +244,11 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long work) { bool step;
if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) {
if (on_syscall_dispatch())
return;
}
I think this would be less confusing if you just open-coded the body of on_syscall_dispatch here and got rid of the helper.
--Andy
On Fri, Nov 27, 2020 at 11:32 AM Gabriel Krisman Bertazi krisman@collabora.com wrote:
Hi,
This is v8 of syscall user dispatch. Last version got some acks but there was one small documentation fix I wanted to do, as requested by Florian. This also addresses the commit message fixup Peter requested.
The only actual code change from v7 is solving a trivial merge conflict I myself created with the entry code fixup I made week and with something else in the TIP tree.
I also shared this with glibc and there wasn't any complaints other than the matter about user-notif vs. siginfo, which was discussed in v7 and the understanding is that it is not necessary now and can be added later, if needed, on the same infrastructure without a new api.
I'm not sure about the TIP rules, but is it too late to be queued for the next merge window? I'd love to have this in 5.11 if possible, since it has been flying for quite a while.
Other than my little nitpick about on_syscall_dispatch(), the whole series is:
Reviewed-by: Andy Lutomirski luto@kernel.org
Why does do_syscall_user_dispatch call do_exit(SIGSEGV) and do_exit(SIGSYS) instead of force_sig(SIGSEGV) and force_sig(SIGSYS)?
Looking at the code these cases are not expected to happen, so I would be surprised if userspace depends on any particular behaviour on the failure path so I think we can change this.
Is using do_exit in this way something you copied from seccomp?
The reason I am asking is that by using do_exit you deprive userspace of the chance to catch the signal in a handler and try to fix things.
Also by using do_exit only a single thread of a multi-thread application is terminated which seems wrong.
I am asking because I am going through the callers of do_exit so I can refactor things and clean things up and this use just looks wrong.
Gabriel Krisman Bertazi krisman@collabora.com writes:
<snip>
+bool do_syscall_user_dispatch(struct pt_regs *regs) +{
- struct syscall_user_dispatch *sd = &current->syscall_dispatch;
- char state;
- if (likely(instruction_pointer(regs) - sd->offset < sd->len))
return false;
- if (unlikely(arch_syscall_is_vdso_sigreturn(regs)))
return false;
- if (likely(sd->selector)) {
/*
* access_ok() is performed once, at prctl time, when
* the selector is loaded by userspace.
*/
if (unlikely(__get_user(state, sd->selector)))
do_exit(SIGSEGV);
^^^^^^^^^^^^^^^^
I think it makes more sense if the code does:
if (unlikely(__get_user(state, sd->selector))) { force_sig(SIGSEGV); return true; }
if (likely(state == PR_SYS_DISPATCH_OFF))
return false;
if (state != PR_SYS_DISPATCH_ON)
do_exit(SIGSYS);
^^^^^^^^^^^^^^^
- }
- sd->on_dispatch = true;
- syscall_rollback(current, regs);
- trigger_sigsys(regs);
- return true;
+}
Eric
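The practical difference raised above, that a delivered signal can reach a handler while an exit cannot, has a simple userspace analogy. The sketch below is only that: it uses raise() to deliver SIGSYS to the calling thread and says nothing about the kernel internals of force_sig() or do_exit(). The names on_sigsys, sigsys_is_catchable and sigsys_demo are illustrative, and the sketch assumes SIGSYS is not blocked in the caller's signal mask.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t sigsys_seen;

/* Handler: only record that the signal arrived. */
static void on_sigsys(int sig)
{
	(void)sig;
	sigsys_seen = 1;
}

/*
 * Install a SIGSYS handler, deliver the signal to ourselves and report
 * whether the handler ran.  Returns 1 if it did, 0 if not, -1 on error.
 */
static int sigsys_is_catchable(void)
{
	struct sigaction act, old;

	memset(&act, 0, sizeof(act));
	act.sa_handler = on_sigsys;
	sigemptyset(&act.sa_mask);
	if (sigaction(SIGSYS, &act, &old))
		return -1;

	sigsys_seen = 0;
	if (raise(SIGSYS))
		return -1;

	/* Restore the previous disposition before returning. */
	if (sigaction(SIGSYS, &old, NULL))
		return -1;
	return sigsys_seen ? 1 : 0;
}

/* Example use: with SIGSYS unblocked, the handler is expected to run. */
static void sigsys_demo(void)
{
	assert(sigsys_is_catchable() == 1);
}
```

A thread that is terminated outright never gets to run such a handler, which is the recovery opportunity the message above is concerned about.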
ebiederm@xmission.com (Eric W. Biederman) writes:
Why does do_syscall_user_dispatch call do_exit(SIGSEGV) and do_exit(SIGSYS) instead of force_sig(SIGSEGV) and force_sig(SIGSYS)?
Looking at the code these cases are not expected to happen, so I would be surprised if userspace depends on any particular behaviour on the failure path so I think we can change this.
Hi Eric,
There is not really a good reason, and the use case that originated the feature doesn't rely on it.
Unless I'm missing yet another problem and others correct me, I think it makes sense to change it as you described.
Is using do_exit in this way something you copied from seccomp?
I'm not sure, it's been a while, but I think it might be just that. The first prototype of SUD was implemented as a seccomp mode.
The reason I am asking is that by using do_exit you deprive userspace of the chance to catch the signal in a handler and try to fix things.
Also by using do_exit only a single thread of a multi-thread application is terminated which seems wrong.
I am asking because I am going through the callers of do_exit so I can refactor things and clean things up and this use just looks wrong.
Thanks,