Hi,
this series is the result of the discussion of the RFC patch found at [1].
The goal is to make x86' ftrace_int3_handler() not simply skip over the trapping instruction, as this is problematic in the context of the live patching consistency model. For details, cf. the commit message of [3/4] ("x86/ftrace: make ftrace_int3_handler() not to skip fops invocation").
Everything is based on v5.1-rc6, please let me know in case you want me to rebase on something else.
For x86_64, the live patching selftest added in [4/4] succeeds with this series applied and fails without it. On 32-bit x86, I have only compile-tested.
checkpatch reports warnings about:
- an overlong line in assembly -- I chose to ignore that,
- MAINTAINERS perhaps needing updates due to the new files
  arch/x86/kernel/ftrace_int3_stubs.S and
  tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh.
  As the existing arch/x86/kernel/ftrace_{32,64}.S haven't got an explicit
  entry either, this one is probably Ok? The selftest definitely is.
Changes to the RFC patch:
- s/trampoline/stub/ to avoid confusion with the ftrace_ops' trampolines,
- use a fixed size stack kept in struct thread_info for passing the
  (adjusted) ->ip values from ftrace_int3_handler() to the stubs,
- provide one stub for each of the two possible jump targets and hardcode
  those,
- add the live patching selftest.
Thanks,
Nicolai
Nicolai Stange (4):
  x86/thread_info: introduce ->ftrace_int3_stack member
  ftrace: drop 'static' qualifier from ftrace_ops_list_func()
  x86/ftrace: make ftrace_int3_handler() not to skip fops invocation
  selftests/livepatch: add "ftrace a live patched function" test
 arch/x86/include/asm/thread_info.h                |  11 +++
 arch/x86/kernel/Makefile                          |   1 +
 arch/x86/kernel/asm-offsets.c                     |   8 +++
 arch/x86/kernel/ftrace.c                          |  79 +++++++++++++++++++---
 arch/x86/kernel/ftrace_int3_stubs.S               |  61 +++++++++++++++++
 kernel/trace/ftrace.c                             |   8 +--
 tools/testing/selftests/livepatch/Makefile        |   3 +-
 .../livepatch/test-livepatch-vs-ftrace.sh         |  44 ++++++++++++
 8 files changed, 199 insertions(+), 16 deletions(-)
 create mode 100644 arch/x86/kernel/ftrace_int3_stubs.S
 create mode 100755 tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh
Before actually rewriting an insn, x86' DYNAMIC_FTRACE implementation places an int3 breakpoint on it. Currently, ftrace_int3_handler() simply treats the insn in question as nop and advances %rip past it. An upcoming patch will improve this by making the int3 trap handler emulate the call insn. To this end, ftrace_int3_handler() will be made to change its iret frame's ->ip to some stub which will then mimic the function call in the original context.
Somehow the trapping ->ip address will have to get communicated from ftrace_int3_handler() to these stubs though. Note that at any given point in time, there can be at most four such call insn emulations pending: namely at most one per "process", "irq", "softirq" and "nmi" context.
Introduce struct ftrace_int3_stack providing four entries for storing the instruction pointer.
In principle, it could be made per-cpu, but this would require making ftrace_int3_handler() return with preemption disabled and re-enabling preemption from those emulation stubs only after the stack's top entry has been consumed. I've been told that this would "break a lot of norms" and that making this stack part of struct thread_info instead would be less fragile. Follow this advice and add a struct ftrace_int3_stack instance to x86's struct thread_info. Note that these stacks will get accessed only rarely (only during ftrace's code modifications), and thus the cache line dirtying won't have any significant impact on the neighbouring fields.
Initialization will take place implicitly through INIT_THREAD_INFO as per the rules for missing elements in initializers. The memcpy() in arch_dup_task_struct() will propagate the initial state properly, because it's always run in process context and won't ever see a non-zero ->depth value.
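For illustration only (not part of the patch): the C rule relied upon here is that members omitted from a designated initializer are zero-initialized, so ->ftrace_int3_stack.depth starts out as 0 without INIT_THREAD_INFO having to mention it. A standalone userspace sketch, with names mirroring the patch:

	#include <assert.h>

	struct ftrace_int3_stack {
		int depth;
		unsigned long slots[4];
	};

	struct thread_info_like {
		unsigned long flags;
		struct ftrace_int3_stack ftrace_int3_stack;
	};

	int main(void)
	{
		/* Only .flags is mentioned; all other members become zero. */
		struct thread_info_like ti = { .flags = 0x1 };

		assert(ti.ftrace_int3_stack.depth == 0);
		return 0;
	}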
Finally, add the necessary bits to asm-offsets for making struct ftrace_int3_stack accessible from assembly.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Nicolai Stange <nstange@suse.de>
---
 arch/x86/include/asm/thread_info.h | 11 +++++++++++
 arch/x86/kernel/asm-offsets.c      |  8 ++++++++
 2 files changed, 19 insertions(+)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index e0eccbcb8447..83434a88cfbb 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -56,6 +56,17 @@ struct task_struct;
 struct thread_info {
 	unsigned long		flags;		/* low level flags */
 	u32			status;		/* thread synchronous flags */
+#ifdef CONFIG_DYNAMIC_FTRACE
+	struct ftrace_int3_stack {
+		int depth;
+		/*
+		 * There can be at most one slot in use per context,
+		 * i.e. at most one for "normal", "irq", "softirq" and
+		 * "nmi" each.
+		 */
+		unsigned long slots[4];
+	} ftrace_int3_stack;
+#endif
 };
 
 #define INIT_THREAD_INFO(tsk)			\
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 168543d077d7..ca6ee24a0c6e 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -105,4 +105,12 @@ static void __used common(void)
 	OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
 	OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
 	OFFSET(TSS_sp2, tss_struct, x86_tss.sp2);
+
+#ifdef CONFIG_DYNAMIC_FTRACE
+	BLANK();
+	OFFSET(TASK_TI_ftrace_int3_depth, task_struct,
+	       thread_info.ftrace_int3_stack.depth);
+	OFFSET(TASK_TI_ftrace_int3_slots, task_struct,
+	       thread_info.ftrace_int3_stack.slots);
+#endif
 }
On Apr 27, 2019, at 3:06 AM, Nicolai Stange nstange@suse.de wrote:
Before actually rewriting an insn, x86' DYNAMIC_FTRACE implementation places an int3 breakpoint on it. Currently, ftrace_int3_handler() simply treats the insn in question as nop and advances %rip past it.
How does this not crash all the time?
An upcoming patch will improve this by making the int3 trap handler emulate the call insn. To this end, ftrace_int3_handler() will be made to change its iret frame's ->ip to some stub which will then mimic the function call in the original context.
Somehow the trapping ->ip address will have to get communicated from ftrace_int3_handler() to these stubs though.
Cute. But this should get “ftrace” removed from the name, since it’s potentially more generally useful. Josh wanted something like this for static_call.
Note that at any given point in time, there can be at most four such call insn emulations pending: namely at most one per "process", "irq", "softirq" and "nmi" context.
That’s quite an assumption. I think your list should also contain exception, exceptions nested inside that exception, and machine check, at the very least. I’m also wondering why irq and softirq are treated separately.
All this makes me think that one of the other solutions we came up with last time we discussed this might be better.
On Sun, 28 Apr 2019 10:41:10 -0700 Andy Lutomirski luto@amacapital.net wrote:
Note that at any given point in time, there can be at most four such call insn emulations pending: namely at most one per "process", "irq", "softirq" and "nmi" context.
That’s quite an assumption. I think your list should also contain exception, exceptions nested inside that exception, and machine check, at the very least. I’m also wondering why irq and softirq are treated separately.
4 has usually been the context count we choose. But I guess in theory, if we get exceptions then it could potentially be more.
As for irq vs softirq, an interrupt can preempt a softirq. Interrupts are enabled while softirqs are running. When softirqs run, softirqs are disabled to prevent nested softirqs. But interrupts are enabled again, and another interrupt may come in while a softirq is executing.
All this makes me think that one of the other solutions we came up with last time we discussed this might be better.
+100
Perhaps adding another slot into pt_regs that gets used by int3 to store a slot to emulate a call on return?
-- Steve
On Apr 28, 2019, at 10:51 AM, Steven Rostedt rostedt@goodmis.org wrote:
On Sun, 28 Apr 2019 10:41:10 -0700 Andy Lutomirski luto@amacapital.net wrote:
Note that at any given point in time, there can be at most four such call insn emulations pending: namely at most one per "process", "irq", "softirq" and "nmi" context.
That’s quite an assumption. I think your list should also contain exception, exceptions nested inside that exception, and machine check, at the very least. I’m also wondering why irq and softirq are treated separately.
4 has usually been the context count we choose. But I guess in theory, if we get exceptions then I could potentially be more.
As for irq vs softirq, an interrupt can preempt a softirq. Interrupts are enabled while softirqs are running. When sofirqs run, softirqs are disabled to prevent nested softirqs. But interrupts are enabled again, and another interrupt may come in while a softirq is executing.
All this makes me think that one of the other solutions we came up with last time we discussed this might be better.
+100
Perhaps adding another slot into pt_regs that gets used by int3 to store a slot to emulate a call on return?
That’s not totally nuts, although finding pt_regs isn’t entirely trivial.
I still think I prefer an approach where we just emulate the call directly.
On Sun, 28 Apr 2019 11:08:34 -0700 Andy Lutomirski luto@amacapital.net wrote:
Perhaps adding another slot into pt_regs that gets used by int3 to store a slot to emulate a call on return?
That’s not totally nuts, although finding pt_regs isn’t entirely trivial.
I meant on the int3 handler (which stores the pt_regs).
I still think I prefer an approach where we just emulate the call directly.
Then, on the return of int3, if there's anything in that slot, then we could possibly shift the exception handler frame (that was added by the hardware), insert the slot data into the top of the stack, and then call iret (with the int3 handler having set the return ip to the function being called), which would in essence emulate the call directly.
I believe the complexity comes from the fact that the exception frame added by the hardware is where we need to put the return address of the call for the emulation.
-- Steve
On Apr 28, 2019, at 12:43 PM, Steven Rostedt rostedt@goodmis.org wrote:
On Sun, 28 Apr 2019 11:08:34 -0700 Andy Lutomirski luto@amacapital.net wrote:
Perhaps adding another slot into pt_regs that gets used by int3 to store a slot to emulate a call on return?
That’s not totally nuts, although finding pt_regs isn’t entirely trivial.
I meant on the int3 handler (which stores the pt_regs).
But that’s below the stub’s RSP, so it’s toast if another interrupt happens. Or am I misunderstanding you?
I still think I prefer an approach where we just emulate the call directly.
Then, on the return of int3, if there's anything in that slot, then we could possibly shift the exception handler frame (that was added by the hardware), insert the slot data into the top of the stack, and then call iret (which the int3 handler, would add the return ip to be the function being called), which would in essence emulate the call directly.
Oh, I get it.
I liked Josh’s old proposal of unconditionally shifting the #BP frame 8 bytes better. It will be interesting when kernel shadow stacks are thrown in the mix, but that’s a problem for another day.
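For concreteness, a minimal C sketch of what emulating the call from the #BP handler itself could look like, assuming the entry code guarantees a usable gap below the interrupted frame so that the saved ->sp may be moved (the helper name is illustrative, not something this series provides):

	/* Emulate "call target" on behalf of the trapping instruction. */
	static void int3_emulate_call(struct pt_regs *regs, unsigned long ret,
				      unsigned long target)
	{
		/* An emulated call pushes the return address... */
		regs->sp -= sizeof(unsigned long);
		*(unsigned long *)regs->sp = ret;
		/* ...and the subsequent iret then "jumps" to the call target. */
		regs->ip = target;
	}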
Steven Rostedt rostedt@goodmis.org writes:
On Sun, 28 Apr 2019 10:41:10 -0700 Andy Lutomirski luto@amacapital.net wrote:
Note that at any given point in time, there can be at most four such call insn emulations pending: namely at most one per "process", "irq", "softirq" and "nmi" context.
That’s quite an assumption. I think your list should also contain exception, exceptions nested inside that exception, and machine check, at the very least. I’m also wondering why irq and softirq are treated separately.
You're right, I missed the machine check case.
4 has usually been the context count we choose. But I guess in theory, if we get exceptions then I could potentially be more.
After having seen the static_call discussion, I'm in no way defending this limited approach here, but out of curiosity: can the code between the push onto the stack from ftrace_int3_handler() and the subsequent pop from the stub actually trigger a (non-#MC) exception? There's an iret in between, but that can fault only when returning to user space, correct?
As for irq vs softirq, an interrupt can preempt a softirq. Interrupts are enabled while softirqs are running. When sofirqs run, softirqs are disabled to prevent nested softirqs. But interrupts are enabled again, and another interrupt may come in while a softirq is executing.
Thanks,
Nicolai
On Apr 28, 2019, at 2:22 PM, Nicolai Stange nstange@suse.de wrote:
Steven Rostedt rostedt@goodmis.org writes:
On Sun, 28 Apr 2019 10:41:10 -0700 Andy Lutomirski luto@amacapital.net wrote:
Note that at any given point in time, there can be at most four such call insn emulations pending: namely at most one per "process", "irq", "softirq" and "nmi" context.
That’s quite an assumption. I think your list should also contain exception, exceptions nested inside that exception, and machine check, at the very least. I’m also wondering why irq and softirq are treated separately.
You're right, I missed the machine check case.
4 has usually been the context count we choose. But I guess in theory, if we get exceptions then I could potentially be more.
After having seen the static_call discussion, I'm in no way defending this limited approach here, but out of curiosity: can the code between the push onto the stack from ftrace_int3_handler() and the subsequent pop from the stub actually trigger an (non-#MC) exception? There's an iret inbetween, but that can fault only when returning to user space, correct?
IRET doesn’t do any fancy masking, so #DB, NMI and regular IRQs should all be possible.
As for irq vs softirq, an interrupt can preempt a softirq. Interrupts are enabled while softirqs are running. When sofirqs run, softirqs are disabled to prevent nested softirqs. But interrupts are enabled again, and another interrupt may come in while a softirq is executing.
Thanks,
Nicolai
-- SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
An upcoming patch will improve x86' ftrace_int3_handler() to not simply skip over the insn being updated; for that, ftrace_ops_list_func() will have to get referenced from arch/x86 code. Drop its 'static' qualifier.
Signed-off-by: Nicolai Stange <nstange@suse.de>
---
 kernel/trace/ftrace.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index b920358dd8f7..ed3c20811d9a 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -125,8 +125,8 @@ ftrace_func_t ftrace_trace_function __read_mostly = ftrace_stub;
 struct ftrace_ops global_ops;
 
 #if ARCH_SUPPORTS_FTRACE_OPS
-static void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
-				 struct ftrace_ops *op, struct pt_regs *regs);
+void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
+			  struct ftrace_ops *op, struct pt_regs *regs);
 #else
 /* See comment below, where ftrace_ops_list_func is defined */
 static void ftrace_ops_no_ops(unsigned long ip, unsigned long parent_ip);
@@ -6302,8 +6302,8 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
  * set the ARCH_SUPPORTS_FTRACE_OPS.
  */
 #if ARCH_SUPPORTS_FTRACE_OPS
-static void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
-				 struct ftrace_ops *op, struct pt_regs *regs)
+void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
+			  struct ftrace_ops *op, struct pt_regs *regs)
 {
 	__ftrace_ops_list_func(ip, parent_ip, NULL, regs);
 }
With dynamic ftrace, ftrace patches call sites on x86 in three steps:
1. put a breakpoint at the to-be-patched location,
2. update the call site, and
3. finally remove the breakpoint again.
Note that while the breakpoint is in place, ftrace_int3_handler() simply makes execution skip over the to-be-patched instruction.
This patching happens in the following circumstances:
a.) the global ftrace_trace_function changes and the call sites at
    ftrace_call and ftrace_regs_call get patched,
b.) an ftrace_ops' ->func changes and the call site in its trampoline gets
    patched,
c.) graph tracing gets enabled/disabled and the jump site at
    ftrace_graph_call gets patched,
d.) a mcount site gets converted from nop -> call, call -> nop, or
    call -> call.
The latter case, i.e. a mcount call getting redirected, is possible in e.g. a transition from trampolined to not trampolined upon a user enabling function tracing on a live patched function.
ftrace_int3_handler() simply skipping over the updated insn is quite problematic in the context of live patching, because it means that a live patched function temporarily gets reverted to its unpatched original, which breaks the live patching consistency model. But even without live patching, it is desirable to avoid missing traces when making changes to the tracing setup.
Make ftrace_int3_handler() not skip over the fops invocation, but modify the interrupted control flow to issue a call as needed.
Case c.) from the list above can be ignored, because there a jmp instruction gets changed to a nop or vice versa; treating the insn as a nop and skipping it during the transition remains fine in that case.
The remaining cases a.), b.) and d.) all involve call instructions. For a.) and b.), the call always goes to some ftrace_func_t and the generic ftrace_ops_list_func() implementation will be able to demultiplex and do the right thing. For case d.), the call target is either of ftrace_caller(), ftrace_regs_caller() or some ftrace_ops' trampoline. Because providing the register state won't cause any harm for !FTRACE_OPS_FL_SAVE_REGS ftrace_ops, ftrace_regs_caller() would be a suitable target capable of handling any case.
ftrace_int3_handler()'s context is different from the interrupted call instruction's one, obviously. In order to be able to emulate the call within the original context, make ftrace_int3_handler() set its iret frame's ->ip to some helper stub. Upon return from the trap, this stub will then mimic the call by pushing the return address onto the stack and issuing a jmp to the target address. As described above, the jmp target will be either ftrace_ops_list_func() or ftrace_regs_caller(). Provide one such stub implementation for each of the two cases.
Finally, the desired return address, which is derived from the trapping ->ip, must get passed from ftrace_int3_handler() to these stubs. Make ftrace_int3_handler() push it onto the ftrace_int3_stack introduced by an earlier patch and let the stubs consume it. Be careful to use proper compiler barriers such that nested int3 handling from e.g. irqs won't clobber entries owned by outer instances.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Nicolai Stange <nstange@suse.de>
---
 arch/x86/kernel/Makefile            |  1 +
 arch/x86/kernel/ftrace.c            | 79 +++++++++++++++++++++++++++++++------
 arch/x86/kernel/ftrace_int3_stubs.S | 61 ++++++++++++++++++++++++++++
 3 files changed, 130 insertions(+), 11 deletions(-)
 create mode 100644 arch/x86/kernel/ftrace_int3_stubs.S
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 00b7e27bc2b7..0b63ae02b1f3 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -93,6 +93,7 @@ obj-$(CONFIG_LIVEPATCH)	+= livepatch.o
 obj-$(CONFIG_FUNCTION_TRACER)	+= ftrace_$(BITS).o
 obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o
 obj-$(CONFIG_FTRACE_SYSCALLS)	+= ftrace.o
+obj-$(CONFIG_DYNAMIC_FTRACE)	+= ftrace_int3_stubs.o
 obj-$(CONFIG_X86_TSC)		+= trace_clock.o
 obj-$(CONFIG_KEXEC_CORE)	+= machine_kexec_$(BITS).o
 obj-$(CONFIG_KEXEC_CORE)	+= relocate_kernel_$(BITS).o crash.o
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index ef49517f6bb2..917494f35cba 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -280,25 +280,84 @@ static nokprobe_inline int is_ftrace_caller(unsigned long ip)
 	return 0;
 }
 
-/*
- * A breakpoint was added to the code address we are about to
- * modify, and this is the handle that will just skip over it.
- * We are either changing a nop into a trace call, or a trace
- * call to a nop. While the change is taking place, we treat
- * it just like it was a nop.
- */
+extern void ftrace_graph_call(void);
+
+asmlinkage void ftrace_int3_stub_regs_caller(void);
+asmlinkage void ftrace_int3_stub_list_func(void);
+
+/* A breakpoint was added to the code address we are about to modify. */
 int ftrace_int3_handler(struct pt_regs *regs)
 {
 	unsigned long ip;
+	bool is_ftrace_location;
+	struct ftrace_int3_stack *stack;
+	int slot;
 
 	if (WARN_ON_ONCE(!regs))
 		return 0;
 
 	ip = regs->ip - 1;
-	if (!ftrace_location(ip) && !is_ftrace_caller(ip))
+	is_ftrace_location = ftrace_location(ip);
+	if (!is_ftrace_location && !is_ftrace_caller(ip))
 		return 0;
 
-	regs->ip += MCOUNT_INSN_SIZE - 1;
+	ip += MCOUNT_INSN_SIZE;
+
+	if (!is_ftrace_location &&
+	    ftrace_update_func == (unsigned long)ftrace_graph_call) {
+		/*
+		 * The insn at ftrace_graph_call is being changed from a
+		 * nop to a jmp or vice versa. Treat it as a nop and
+		 * skip over it.
+		 */
+		regs->ip = ip;
+		return 1;
+	}
+
+	/*
+	 * The insn having the breakpoint on it is either some mcount
+	 * call site or one of ftrace_call, ftrace_regs_call and their
+	 * equivalents within some trampoline. The currently pending
+	 * transition is known to turn the insn from a nop to a call,
+	 * from a call to a nop or to change the target address of an
+	 * existing call. We're going to emulate a call to the most
+	 * generic implementation capable of handling any possible
+	 * configuration. For the mcount sites that would be
+	 * ftrace_regs_caller() and for the remaining calls, which all
+	 * have got some ftrace_func_t target, ftrace_ops_list_func()
+	 * will do the right thing.
+	 *
+	 * However, the call insn can't get emulated from this trap
+	 * handler here. Rewrite the iret frame's ->ip value to one of
+	 * the ftrace_int3_stub instances which will then setup
+	 * everything in the original context. The address following
+	 * the current insn will be passed to the stub via the
+	 * ftrace_int3_stack.
+	 */
+	stack = &current_thread_info()->ftrace_int3_stack;
+	if (WARN_ON_ONCE(stack->depth >= 4)) {
+		/*
+		 * This should not happen as at most one stack slot is
+		 * required per the contexts "normal", "irq",
+		 * "softirq" and "nmi" each. However, be conservative
+		 * and treat it like a nop.
+		 */
+		regs->ip = ip;
+		return 1;
+	}
+
+	/*
+	 * Make sure interrupts will see the incremented ->depth value
+	 * before writing the stack entry.
+	 */
+	slot = stack->depth;
+	WRITE_ONCE(stack->depth, slot + 1);
+	WRITE_ONCE(stack->slots[slot], ip);
+
+	if (is_ftrace_location)
+		regs->ip = (unsigned long)ftrace_int3_stub_regs_caller;
+	else
+		regs->ip = (unsigned long)ftrace_int3_stub_list_func;
 
 	return 1;
 }
@@ -949,8 +1008,6 @@ void arch_ftrace_trampoline_free(struct ftrace_ops *ops)
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 
 #ifdef CONFIG_DYNAMIC_FTRACE
-extern void ftrace_graph_call(void);
-
 static unsigned char *ftrace_jmp_replace(unsigned long ip, unsigned long addr)
 {
 	return ftrace_text_replace(0xe9, ip, addr);
diff --git a/arch/x86/kernel/ftrace_int3_stubs.S b/arch/x86/kernel/ftrace_int3_stubs.S
new file mode 100644
index 000000000000..ef5f580450bb
--- /dev/null
+++ b/arch/x86/kernel/ftrace_int3_stubs.S
@@ -0,0 +1,61 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2019 SUSE Linux GmbH */
+
+#include <asm/asm.h>
+#include <asm/percpu.h>
+#include <asm/unwind_hints.h>
+#include <asm/asm-offsets.h>
+#include <linux/linkage.h>
+
+#ifdef CONFIG_X86_64
+#define WORD_SIZE 8
+#else
+#define WORD_SIZE 4
+#endif
+
+.macro ftrace_int3_stub call_target
+	/*
+	 * We got here from ftrace_int3_handler() because a breakpoint
+	 * on a call insn currently being modified has been
+	 * hit. ftrace_int3_handler() can't emulate the function call
+	 * directly, because it's running at a different position on
+	 * the stack, obviously. Hence it sets the regs->ip to this
+	 * stub so that we end up here upon the iret from the int3
+	 * handler. The stack is now in its original state and we can
+	 * emulate the function call insn by pushing the return
+	 * address onto the stack and jumping to the call target. The
+	 * desired return address has been put onto the ftrace_int3_stack
+	 * kept within struct thread_info.
+	 */
+	UNWIND_HINT_EMPTY
+	/* Reserve room for the emulated call's return address. */
+	sub	$WORD_SIZE, %_ASM_SP
+	/*
+	 * Pop the return address from ftrace_int3_ip_stack and write
+	 * it to the location just reserved. Be careful to retrieve
+	 * the address before decrementing ->depth in order to protect
+	 * against nested contexts clobbering it.
+	 */
+	push	%_ASM_AX
+	push	%_ASM_CX
+	push	%_ASM_DX
+	mov	PER_CPU_VAR(current_task), %_ASM_AX
+	mov	TASK_TI_ftrace_int3_depth(%_ASM_AX), %_ASM_CX
+	dec	%_ASM_CX
+	mov	TASK_TI_ftrace_int3_slots(%_ASM_AX, %_ASM_CX, WORD_SIZE), %_ASM_DX
+	mov	%_ASM_CX, TASK_TI_ftrace_int3_depth(%_ASM_AX)
+	mov	%_ASM_DX, 3*WORD_SIZE(%_ASM_SP)
+	pop	%_ASM_DX
+	pop	%_ASM_CX
+	pop	%_ASM_AX
+	/* Finally, transfer control to the target function. */
+	jmp	\call_target
+.endm
+
+ENTRY(ftrace_int3_stub_regs_caller)
+	ftrace_int3_stub ftrace_regs_caller
+END(ftrace_int3_stub_regs_caller)
+
+ENTRY(ftrace_int3_stub_list_func)
+	ftrace_int3_stub ftrace_ops_list_func
+END(ftrace_int3_stub_list_func)
On Sat, Apr 27, 2019 at 12:06:38PM +0200, Nicolai Stange wrote:
ftrace_int3_handler()'s context is different from the interrupted call instruction's one, obviously. In order to be able to emulate the call within the original context, make ftrace_int3_handler() set its iret frame's ->ip to some helper stub. Upon return from the trap, this stub will then mimic the call by pushing the the return address onto the stack and issuing a jmp to the target address. As describe above, the jmp target will be either of ftrace_ops_list_func() or ftrace_regs_caller(). Provide one such stub implementation for each of the two cases.
Yuck; I'd much rather we get that static_call() stuff sorted such that text_poke() and poke_int3_handler() can do CALL emulation.
Given all the back and forth, I think the solution where we shift pt_regs a bit to allow the emulated PUSH is a viable solution; eg. I think we collectively hated it least.
[ Added Linus ]
On Sat, 27 Apr 2019 12:26:57 +0200 Peter Zijlstra peterz@infradead.org wrote:
On Sat, Apr 27, 2019 at 12:06:38PM +0200, Nicolai Stange wrote:
ftrace_int3_handler()'s context is different from the interrupted call instruction's one, obviously. In order to be able to emulate the call within the original context, make ftrace_int3_handler() set its iret frame's ->ip to some helper stub. Upon return from the trap, this stub will then mimic the call by pushing the the return address onto the stack and issuing a jmp to the target address. As describe above, the jmp target will be either of ftrace_ops_list_func() or ftrace_regs_caller(). Provide one such stub implementation for each of the two cases.
Yuck; I'd much rather we get that static_call() stuff sorted such that text_poke() and poke_int3_handler() can do CALL emulation.
Given all the back and forth, I think the solution where we shift pt_regs a bit to allow the emulated PUSH is a viable solution; eg. I think we collectively hated it least.
Note, this use case is slightly different than the static calls but has the same issue. That issue is that we need a way to simulate a "call" function from int3, and there's no good way to handle that case right now.
The static call needs to modify the call site but make the call still work from int3 context.
This use case is similar, but the issue is with a "bug" in the function tracing implementation, that has become an issue with live kernel patching which is built on top of it.
The issue is this:
For optimization reasons, if there's only a single user of a function it gets its own trampoline that sets up the call to its callback and calls that callback directly:
  <function being traced>
      call custom trampoline

  <custom trampoline>
      save regs to call C code
      call custom_callback
      restore regs
      ret
If more than one callback is attached to that function, then we need to call the iterator that loops through all registered callbacks of the function tracer and checks their hashes to see if they want to trace this function or not.
  <function being traced>
      call iterator_trampoline

  <iterator_trampoline>
      save regs to call C code
      call iterator
      restore regs
      ret
What happens in that transition? We add an "int3" at the function being traced, and change:
call custom_trampoline
to
call iterator_trampoline
But if the function being traced is executed during this transition, the int3 just makes it act like a nop and returns past the call.
For tracing this is a bug because we just missed a function that should be traced. For live kernel patching, this could be more devastating because the original "buggy" patched function is called, and this could cause subtle problems.
If we can add a slot to the int3 handler that we can use to emulate a call (perhaps add another slot to the pt_regs structure, call it a link register), it would make it easier to solve both this and the static call problems.
-- Steve
On Sun, Apr 28, 2019 at 10:38 AM Steven Rostedt rostedt@goodmis.org wrote:
For optimization reasons, if there's only a single user of a function it gets its own trampoline that sets up the call to its callback and calls that callback directly:
So this is the same issue as the static calls, and it has exactly the same solution.
Which I already outlined once, and nobody wrote the code for.
So here's a COMPLETELY UNTESTED patch that only works (_if_ it works) for
(a) 64-bit
(b) SMP
but that's just because I've hardcoded the percpu segment handling.
It does *not* emulate the "call" in the BP handler itself; instead it replaces the %ip (the same way all the other BP handlers replace the %ip) with a code sequence that just does
	push %gs:bp_call_return
	jmp *%gs:bp_call_target
after having filled in those per-cpu things.
The reason they are percpu is that after the %ip has been changed, the target CPU goes its merry way, and doesn't wait for the text-poke semaphore serialization. But since we have interrupts disabled on that CPU, we know that *another* text poke won't be coming around and changing the values.
THIS IS ENTIRELY UNTESTED! I've built it, and it at least seems to build, although with warnings
  arch/x86/kernel/alternative.o: warning: objtool: emulate_call_irqoff()+0x9: indirect jump found in RETPOLINE build
  arch/x86/kernel/alternative.o: warning: objtool: emulate_call_irqon()+0x8: indirect jump found in RETPOLINE build
  arch/x86/kernel/alternative.o: warning: objtool: emulate_call_irqoff()+0x9: sibling call from callable instruction with modified stack frame
  arch/x86/kernel/alternative.o: warning: objtool: emulate_call_irqon()+0x8: sibling call from callable instruction with modified stack frame
that will need the appropriate "ignore this case" annotations that I didn't do.
Do I expect it to work? No. I'm sure there's some silly mistake here, but the point of the patch is to show it as an example, so that it can actually be tested.
With this, it should be possible (under the text rewriting lock) to do
	replace_call(callsite, newcallopcode, callsize, calltarget);
to do the static rewriting of the call at "callsite" to have the new call target.
And again. Untested. But doesn't need any special code in the entry path, and the concept is simple even if there are probably stupid bugs just because it's entirely untested.
Oh, and did I mention that I didn't test this?
Linus
On Mon, Apr 29, 2019 at 11:06 AM Linus Torvalds torvalds@linux-foundation.org wrote:
It does *not* emulate the "call" in the BP handler itself, instead if replace the %ip (the same way all the other BP handlers replace the %ip) with a code sequence that just does
	push %gs:bp_call_return
	jmp *%gs:bp_call_target
after having filled in those per-cpu things.
Note that if you read the patch, you'll see that my explanation glossed over the "what if an interrupt happens" part. Which is handled by having two handlers, one for "interrupts were already disabled" and one for "interrupts were enabled, so I disabled them before entering the handler".
The second handler does the same push/jmp sequence, but has a "sti" before the jmp. Because of the one-instruction sti shadow, interrupts won't actually be enabled until after the jmp instruction has completed, and thus the "push/jmp" is atomic wrt regular interrupts.
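Since the patch body itself isn't quoted in this thread, here is a rough sketch of the two stubs as described, reusing the bp_call_return/bp_call_target per-cpu names from the snippet above (the exact form in the actual patch may differ):

	/* Both stubs finish the emulated call outside of the #BP handler. */
	asm(
		".pushsection .text\n"
		".global emulate_call_irqoff\n"
		"emulate_call_irqoff:\n\t"
			"push %gs:bp_call_return\n\t"
			"jmp *%gs:bp_call_target\n"
		".global emulate_call_irqon\n"
		"emulate_call_irqon:\n\t"
			"sti\n\t"	/* irqs stay masked for one more insn (sti shadow) */
			"push %gs:bp_call_return\n\t"
			"jmp *%gs:bp_call_target\n"
		".popsection\n");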
It's not safe wrt NMI, of course, but since NMI won't be rescheduling, and since any SMP IPI won't be punching through that sequence anyway, it's still atomic wrt _another_ text_poke() attempt coming in and re-using the bp_call_return/target slots.
So yeah, it's not "one-liner" trivial, but it's not like it's complicated either, and it actually matches the existing "call this code to emulate the replaced instruction". So I'd much rather have a couple of tens of lines of code here that still acts pretty much exactly like all the other rewriting does, rather than play subtle games with the entry stack frame.
Finally: there might be other situations where you want to have this kind of "pseudo-atomic" replacement sequence, so I think while it's a hack specific to emulating a "call" instruction, I don't think it is conceptually limited to just that case.
Linus
On Mon, Apr 29, 2019 at 11:29 AM Linus Torvalds torvalds@linux-foundation.org wrote:
On Mon, Apr 29, 2019 at 11:06 AM Linus Torvalds torvalds@linux-foundation.org wrote:
It does *not* emulate the "call" in the BP handler itself, instead if replace the %ip (the same way all the other BP handlers replace the %ip) with a code sequence that just does
	push %gs:bp_call_return
	jmp *%gs:bp_call_target
after having filled in those per-cpu things.
Note that if you read the patch, you'll see that my explanation glossed over the "what if an interrupt happens" part. Which is handled by having two handlers, one for "interrupts were already disabled" and one for "interrupts were enabled, so I disabled them before entering the handler".
This is quite a cute solution.
The second handler does the same push/jmp sequence, but has a "sti" before the jmp. Because of the one-instruction sti shadow, interrupts won't actually be enabled until after the jmp instruction has completed, and thus the "push/jmp" is atomic wrt regular interrupts.
It's not safe wrt NMI, of course, but since NMI won't be rescheduling, and since any SMP IPI won't be punching through that sequence anyway, it's still atomic wrt _another_ text_poke() attempt coming in and re-using the bp_call_return/tyarget slots.
I'm less than 100% convinced about this argument. Sure, an NMI right there won't cause a problem. But an NMI followed by an interrupt will kill us if preemption is on. I can think of three solutions:
1. Assume that all CPUs (and all relevant hypervisors!) either mask NMIs in the STI shadow or return from NMIs with interrupts masked for one instruction. Ditto for MCE. This seems too optimistic.
2. Put a fixup in the NMI handler and MCE handler: if the return address is one of these magic jumps, clear IF and back up IP by one byte. This should *work*, but it's a bit ugly.
3. Use a different magic sequence:
	push %gs:bp_call_return
	int3
and have the int3 handler adjust regs->ip and regs->flags as appropriate.
I think I like #3 the best, even though it's twice as slow.
FWIW, kernel shadow stack patches will show up eventually, and none of these approaches are compatible as is. Perhaps the actual sequence should be this, instead:
bp_int3_fixup_irqsoff:
	call 1f
1:	int3
bp_int3_fixup_irqson:
	call 1f
1:	int3
and the int3 handler will update the conventional return address *and* the shadow return address. Linus, what do you think about this variant?
Finally, with my maintainer hat on: if anyone actually wants to merge this thing, I want to see a test case, exercised regularly (every boot if configured, perhaps), that actually *runs* this code. Hitting it in practice will be rare, and I want bugs to be caught.
On Mon, 29 Apr 2019 11:06:58 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
+void replace_call(void *addr, const void *opcode, size_t len, void *target)
+{
+	bp_int3_call_target = target;
+	bp_int3_call_return = addr + len;
+	bp_int3_handler_irqoff = emulate_call_irqoff;
+	text_poke_bp(addr, opcode, len, emulate_call_irqon);
+}
Note, the function tracer does not use text poke. It does it in batch mode. It can update over 40,000 calls in one go:
	add int3 breakpoint to all 40,000 call sites.
	sync()
	change the last four bytes of each of those call sites
	sync()
	remove int3 from the 40,000 call sites with new call.
It's a bit more intrusive than the static call updates we were discussing before.
-- Steve
On Mon, Apr 29, 2019 at 02:52:50PM -0400, Steven Rostedt wrote:
On Mon, 29 Apr 2019 11:06:58 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
+void replace_call(void *addr, const void *opcode, size_t len, void *target)
+{
+	bp_int3_call_target = target;
+	bp_int3_call_return = addr + len;
+	bp_int3_handler_irqoff = emulate_call_irqoff;
+	text_poke_bp(addr, opcode, len, emulate_call_irqon);
+}
Note, the function tracer does not use text poke. It does it in batch mode. It can update over 40,000 calls in one go:
Note that Daniel is adding batch stuff to text_poke(), because jump_label/static_key stuffs also end up wanting to change more than one site at a time and IPI spraying the machine for every single instance is hurting.
So ideally ftrace would start to use the 'normal' code when that happens.
On Tue, 30 Apr 2019 12:46:48 +0200 Peter Zijlstra peterz@infradead.org wrote:
On Mon, Apr 29, 2019 at 02:52:50PM -0400, Steven Rostedt wrote:
On Mon, 29 Apr 2019 11:06:58 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
+void replace_call(void *addr, const void *opcode, size_t len, void *target)
+{
+	bp_int3_call_target = target;
+	bp_int3_call_return = addr + len;
+	bp_int3_handler_irqoff = emulate_call_irqoff;
+	text_poke_bp(addr, opcode, len, emulate_call_irqon);
+}
Note, the function tracer does not use text poke. It does it in batch mode. It can update over 40,000 calls in one go:
Note that Daniel is adding batch stuff to text_poke(), because jump_label/static_key stuffs also end up wanting to change more than one site at a time and IPI spraying the machine for every single instance is hurting.
So ideally ftrace would start to use the 'normal' code when that happens.
Sure, but that's a completely different issue. We would need to solve the 'emulate' call for batch mode first.
-- Steve
On Tue, Apr 30, 2019 at 09:44:45AM -0400, Steven Rostedt wrote:
On Tue, 30 Apr 2019 12:46:48 +0200 Peter Zijlstra peterz@infradead.org wrote:
On Mon, Apr 29, 2019 at 02:52:50PM -0400, Steven Rostedt wrote:
On Mon, 29 Apr 2019 11:06:58 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
+void replace_call(void *addr, const void *opcode, size_t len, void *target)
+{
+	bp_int3_call_target = target;
+	bp_int3_call_return = addr + len;
+	bp_int3_handler_irqoff = emulate_call_irqoff;
+	text_poke_bp(addr, opcode, len, emulate_call_irqon);
+}
Note, the function tracer does not use text poke. It does it in batch mode. It can update over 40,000 calls in one go:
Note that Daniel is adding batch stuff to text_poke(), because jump_label/static_key stuffs also end up wanting to change more than one site at a time and IPI spraying the machine for every single instance is hurting.
So ideally ftrace would start to use the 'normal' code when that happens.
Sure, but that's a completely different issue. We would need to solve the 'emulate' call for batch mode first.
I don't see a problem there; when INT3 happens, you bsearch() the batch array to find the one you hit, you frob that into the percpu variables and do the thing. Or am I losing the plot again?
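A hedged sketch of that lookup, assuming the batch is kept as an array of descriptors sorted by address (the struct/field names here are made up for illustration, not the actual batch API, and the irqon variant/IF handling is omitted for brevity):

	#include <linux/bsearch.h>
	#include <linux/percpu.h>
	#include <linux/ptrace.h>
	#include <linux/types.h>

	/* Per-cpu slots and stub from the quoted approach. */
	DECLARE_PER_CPU(unsigned long, bp_call_return);
	DECLARE_PER_CPU(unsigned long, bp_call_target);
	extern void emulate_call_irqoff(void);

	struct batch_poke_loc {
		void	*addr;		/* start of the patched call insn */
		void	*call_target;	/* new call destination */
	};

	static int poke_loc_cmp(const void *key, const void *elt)
	{
		const struct batch_poke_loc *tp = elt;

		if (key < tp->addr)
			return -1;
		if (key > tp->addr)
			return 1;
		return 0;
	}

	/* Called from the int3 handler; regs->ip - 1 is the breakpointed address. */
	static bool batch_bp_handler(struct pt_regs *regs,
				     struct batch_poke_loc *vec, unsigned int nr)
	{
		struct batch_poke_loc *tp;

		tp = bsearch((void *)(regs->ip - 1), vec, nr, sizeof(*vec),
			     poke_loc_cmp);
		if (!tp)
			return false;

		/* Frob the per-cpu slots, then divert to the emulation stub. */
		this_cpu_write(bp_call_return, (unsigned long)tp->addr + 5); /* call insn is 5 bytes */
		this_cpu_write(bp_call_target, (unsigned long)tp->call_target);
		regs->ip = (unsigned long)&emulate_call_irqoff;
		return true;
	}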
On Tue, 30 Apr 2019 16:20:47 +0200 Peter Zijlstra peterz@infradead.org wrote:
Sure, but that's a completely different issue. We would need to solve the 'emulate' call for batch mode first.
I don't see a problem there; when INT3 happens; you bsearch() the batch array to find the one you hit, you frob that into the percpu variables and do the thing. Or am I loosing the plot again?
I may need to tweak Linus's patch slightly, but I may be able to get it to work with the batch mode.
-- Steve
There had been an issue with interactions between tracing and live patching due to how x86' CONFIG_DYNAMIC_FTRACE used to handle the breakpoints at the updated instructions from its ftrace_int3_handler().
More specifically, starting to trace a live patched function caused a short period of time where the live patching redirection became ineffective. In particular, the guarantees from the consistency model couldn't be held up in this situation.
Implement a testcase for verifying that a function's live patch replacement is kept effective when enabling tracing on it.
Reuse the existing 'test_klp_livepatch' live patch module which patches cmdline_proc_show(), the handler for /proc/cmdline.
Let the testcase, in a loop,
- apply this live patch,
- launch a background shell job enabling tracing on that function,
- while continuously verifying that the contents of /proc/cmdline still
  match what would be expected when the live patch is applied.
Signed-off-by: Nicolai Stange <nstange@suse.de>
---
 tools/testing/selftests/livepatch/Makefile         |  3 +-
 .../livepatch/test-livepatch-vs-ftrace.sh          | 44 ++++++++++++++++++++++
 2 files changed, 46 insertions(+), 1 deletion(-)
 create mode 100755 tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh
diff --git a/tools/testing/selftests/livepatch/Makefile b/tools/testing/selftests/livepatch/Makefile
index af4aee79bebb..bfa5353f6d17 100644
--- a/tools/testing/selftests/livepatch/Makefile
+++ b/tools/testing/selftests/livepatch/Makefile
@@ -3,6 +3,7 @@
 TEST_GEN_PROGS := \
 	test-livepatch.sh \
 	test-callbacks.sh \
-	test-shadow-vars.sh
+	test-shadow-vars.sh \
+	test-livepatch-vs-ftrace.sh
 
 include ../lib.mk
diff --git a/tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh b/tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh
new file mode 100755
index 000000000000..5c982ec56373
--- /dev/null
+++ b/tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh
@@ -0,0 +1,44 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2019 SUSE Linux GmbH
+
+. $(dirname $0)/functions.sh
+
+set -e
+
+MOD_LIVEPATCH=test_klp_livepatch
+
+# TEST: ftrace a live patched function
+# - load a livepatch that modifies the output from /proc/cmdline
+# - install a function tracer at the live patched function
+# - verify that the function is still patched by reading /proc/cmdline
+# - unload the livepatch and make sure the patch was removed
+
+echo -n "TEST: ftrace a live patched function ... "
+dmesg -C
+
+for i in $(seq 1 3); do
+	load_lp $MOD_LIVEPATCH
+
+	( echo cmdline_proc_show > /sys/kernel/debug/tracing/set_ftrace_filter;
+	  echo function > /sys/kernel/debug/tracing/current_tracer ) &
+
+	for j in $(seq 1 200); do
+		if [[ "$(cat /proc/cmdline)" != \
+		      "$MOD_LIVEPATCH: this has been live patched" ]] ; then
+			echo -e "FAIL\n\n"
+			die "livepatch kselftest(s) failed"
+		fi
+	done
+
+	wait %1
+
+	echo nop > /sys/kernel/debug/tracing/current_tracer
+	echo > /sys/kernel/debug/tracing/set_ftrace_filter
+
+	disable_lp $MOD_LIVEPATCH
+	unload_lp $MOD_LIVEPATCH
+done
+
+echo "ok"
+exit 0