On Mon, Apr 29, 2019 at 12:13 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Mon, Apr 29, 2019, 12:02 Linus Torvalds torvalds@linux-foundation.org wrote:
If nmi were to break it, it would be a cpu bug.
Side note: we *already* depend on sti shadow working in other parts of the kernel, namely sti->iret.
Where? STI; IRET would be nuts.
Before:
commit 4214a16b02971c60960afd675d03544e109e0d75
Author: Andy Lutomirski luto@kernel.org
Date:   Thu Apr 2 17:12:12 2015 -0700
x86/asm/entry/64/compat: Use SYSRETL to return from compat mode SYSENTER
we did sti; sysexit, but, when we discussed this, I don't recall anyone speaking up in favor of the safety of the old code.
Not to mention that the crash we'll get if we get an NMI and a rescheduling interrupt in this path will be very, very hard to debug.
On Mon, Apr 29, 2019 at 12:24 PM Andy Lutomirski luto@kernel.org wrote:
Side note: we *already* depend on sti shadow working in other parts of the kernel, namely sti->iret.
Where? STI; IRET would be nuts.
Sorry, not 'sti;iret' but 'sti;sysexit'
before commit 4214a16b02971c60960afd675d03544e109e0d75 ("x86/asm/entry/64/compat: Use SYSRETL to return from compat mode SYSENTER")
we did sti; sysexit, but, when we discussed this, I don't recall anyone speaking up in favor of the safety of the old code.
We still have that sti sysexit in the 32-bit code.
Linus
On Mon, Apr 29, 2019 at 01:07:33PM -0700, Linus Torvalds wrote:
On Mon, Apr 29, 2019 at 12:24 PM Andy Lutomirski luto@kernel.org wrote:
Side note: we *already* depend on sti shadow working in other parts of the kernel, namely sti->iret.
Where? STI; IRET would be nuts.
Sorry, not 'sti;iret' but 'sti;sysexit'
before commit 4214a16b02971c60960afd675d03544e109e0d75 ("x86/asm/entry/64/compat: Use SYSRETL to return from compat mode SYSENTER")
we did sti; sysexit, but, when we discussed this, I don't recall anyone speaking up in favor of the safety of the old code.
We still have that sti sysexit in the 32-bit code.
We also have both: "STI; HLT" and "STI; MWAIT" where we rely on the STI shadow. Getting an NMI in between shouldn't hurt too much, but if that in turn can lead to an actual interrupt happening, we're up some creek without no paddle.
Most modern systems don't use either anymore though, as mwait_idle_with_hints() relies on MWAIT ECX[0]=1 to allow MWAIT to wake from pending interrupts.
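[Editorial aside: for context, a minimal sketch of the two idle idioms being discussed. native_safe_halt() follows the real helper in <asm/irqflags.h>; mwait_idle_sketch() is a simplified stand-in for mwait_idle_with_hints(), with MONITOR setup and the surrounding idle loop omitted, not the kernel's actual code.]

static inline void native_safe_halt(void)
{
	/*
	 * The STI shadow delays interrupt delivery until HLT has started,
	 * so a wakeup interrupt cannot slip in between the two instructions
	 * and leave the CPU halted with the wakeup already consumed.
	 */
	asm volatile("sti; hlt" : : : "memory");
}

/* Simplified stand-in for mwait_idle_with_hints(); a sketch only. */
static inline void mwait_idle_sketch(unsigned long eax, unsigned long ecx)
{
	/*
	 * ECX[0]=1 ("interrupt break event") makes MWAIT wake on a pending
	 * interrupt even while IF is clear, which is why the STI shadow is
	 * no longer load-bearing here on most modern systems.
	 */
	asm volatile("mwait" : : "a" (eax), "c" (ecx | 1) : "memory");
}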
On Tue, Apr 30, 2019 at 6:56 AM Peter Zijlstra peterz@infradead.org wrote:
On Mon, Apr 29, 2019 at 01:07:33PM -0700, Linus Torvalds wrote:
We still have that sti sysexit in the 32-bit code.
We also have both: "STI; HLT" and "STI; MWAIT" where we rely on the STI shadow.
I guess the good news is that in all cases we really only ever protect against a very unlikely race, and if the race happens it's not actually fatal.
Yes, if we get an NMI and then an interrupt in between the "sti;hlt" we might wait for the next interrupt and get a (potentially fairly horrible) latency issue. I guess that with maximal bad luck it might be a one-shot timer and not get re-armed, but it sounds very very very unlikely.
Googling around, I actually find a patch from Avi Kivity from back in 2010 for this exact issue, apparently because kvm got this case wrong and somebody hit it. The patch never made it upstream exactly because kvm could be fixed and people decided that most real hardware didn't have the issue in the first place.
In the discussion I found, Peter Anvin tried to get confirmation from AMD engineers about this too, but I don't see any resolution.
Realistically, I don't think you can hit the problem in practice. The only way to hit that incredibly small race of "one instruction, *both* NMI and interrupts" is to have a lot of interrupts going all at the same time, but that will also then solve the latency problem, so the very act of triggering it will also fix it.
I don't see any case where it's really bad. The "sti sysexit" race is similar, just about latency of user space signal reporting (and perhaps any pending TIF_WORK_xyz flags).
So maybe we don't care deeply about the sti shadow. It's a potential latency problem when broken, but not a huge issue. And for the instruction rewriting hack, moving to "push+sti+ret" also makes a lost sti shadow just a "possibly odd stack frame visibility" issue rather than anything deeply fatal.
We can probably just write it off as "some old CPUs (and a smattering of very rare and not relevant new ones) have potential but unlikely latency issues because of a historical CPU mis-design - don't do perf on them".
Linus
On Tue, Apr 30, 2019 at 9:06 AM Linus Torvalds torvalds@linux-foundation.org wrote:
On Tue, Apr 30, 2019 at 6:56 AM Peter Zijlstra peterz@infradead.org wrote:
Realistically, I don't think you can hit the problem in practice. The only way to hit that incredibly small race of "one instruction, *both* NMI and interrupts" is to have a lot of interrupts going all at the same time, but that will also then solve the latency problem, so the very act of triggering it will also fix it.
I don't see any case where it's really bad. The "sti sysexit" race is similar, just about latency of user space signal reporting (and perhaps any pending TIF_WORK_xyz flags).
In the worst case, it actually kills the machine. Last time I tracked a bug like this down, I think the issue was that we got preempted after the last TIF_ check, entered a VM, exited, context switched back, and switched to user mode without noticing that there was a pending KVM user return notifier. This left us with bogus CPU state and the machine exploded.
Linus, can I ask you to reconsider your opposition to Josh's other approach of just shifting the stack on int3 entry? I agree that it's ugly, but the ugliness is easily manageable and fairly self-contained. We add a little bit of complication to the entry asm (but it's not like it's unprecedented -- the entry asm does all kinds of stack rearrangement due to IST and PTI crap already), and we add an int3_emulate_call(struct pt_regs *regs, unsigned long target) helper that has appropriate assertions that the stack is okay and emulates the call. And that's it.
In contrast, your approach involves multiple asm trampolines, hash tables, batching complications, and sti shadows.
As an additional argument, with the stack-shifting approach, it runs on *every int3 from kernel mode*. This means that we can do something like this:
static bool int3_emulate_call_okay(struct pt_regs *regs)
{
	unsigned long available_stack = regs->sp - (unsigned long);
	return available_stack >= sizeof(long);
}

void do_int3(...)
{
	WARN_ON_ONCE(!user_mode(regs) && !int3_emulate_call_okay(regs));
	...;
}

static void int3_emulate_call(struct pt_regs *regs, unsigned long target)
{
	BUG_ON(user_mode(regs) || !int3_emulate_call_okay(regs));
	regs->sp -= sizeof(unsigned long);
	*(unsigned long *)regs->sp = target;
	/* CET SHSTK fixup goes here */
}
Obviously the CET SHSTK fixup might be rather nasty, but I suspect it's a solvable problem.
A major benefit of this is that the entry asm nastiness will get exercised all the time, and, if we screw it up, the warning will fire. This is the basic principle behind why the entry stuff *works* these days. I've put a lot of effort into making sure that running kernels with CONFIG_DEBUG_ENTRY and running the selftests actually exercises the nasty cases.
--Andy
On Tue, 30 Apr 2019 09:33:51 -0700 Andy Lutomirski luto@kernel.org wrote:
Linus, can I ask you to reconsider your opposition to Josh's other approach of just shifting the stack on int3 entry? I agree that it's ugly, but the ugliness is easily manageable and fairly self-contained. We add a little bit of complication to the entry asm (but it's not like it's unprecedented -- the entry asm does all kinds of stack rearrangement due to IST and PTI crap already), and we add an int3_emulate_call(struct pt_regs *regs, unsigned long target) helper that has appropriate assertions that the stack is okay and emulates the call. And that's it.
I also prefer Josh's stack shift solution, as I personally believe that's a cleaner solution. But I went ahead and implemented Linus's version to get it working for ftrace. Here's the code, and it survived some preliminary tests.
There are three places that use the update code. One is the start of every function call (yes, I counted that as one, and that case is determined by: ftrace_location(ip)). The second is the update of the call inside the trampoline itself. That could also be converted to a text poke, but for now it's here as it was written before text poke existed. The third place is actually a jump (to the function graph code). But that can be safely skipped if we are converting it, as it only goes from jump to nop, or nop to jump.
The trampolines reflect this. Also, as NMI code is traced by ftrace, I had to duplicate the trampolines for the nmi case (but only for the interrupts disabled case as NMIs don't have interrupts enabled).
-- Steve
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c index ef49517f6bb2..bf320bf791dd 100644 --- a/arch/x86/kernel/ftrace.c +++ b/arch/x86/kernel/ftrace.c @@ -17,6 +17,7 @@ #include <linux/uaccess.h> #include <linux/ftrace.h> #include <linux/percpu.h> +#include <linux/frame.h> #include <linux/sched.h> #include <linux/slab.h> #include <linux/init.h> @@ -232,6 +233,9 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
static unsigned long ftrace_update_func;
+/* Used within inline asm below */ +unsigned long ftrace_update_func_call; + static int update_ftrace_func(unsigned long ip, void *new) { unsigned char old[MCOUNT_INSN_SIZE]; @@ -259,6 +263,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func) unsigned char *new; int ret;
+ ftrace_update_func_call = (unsigned long)func; + new = ftrace_call_replace(ip, (unsigned long)func); ret = update_ftrace_func(ip, new);
@@ -280,6 +286,70 @@ static nokprobe_inline int is_ftrace_caller(unsigned long ip) return 0; }
+extern asmlinkage void ftrace_emulate_call_irqon(void); +extern asmlinkage void ftrace_emulate_call_irqoff(void); +extern asmlinkage void ftrace_emulate_call_nmi(void); +extern asmlinkage void ftrace_emulate_call_update_irqoff(void); +extern asmlinkage void ftrace_emulate_call_update_irqon(void); +extern asmlinkage void ftrace_emulate_call_update_nmi(void); + +static DEFINE_PER_CPU(void *, ftrace_bp_call_return); +static DEFINE_PER_CPU(void *, ftrace_bp_call_nmi_return); + +asm( + ".text\n" + ".global ftrace_emulate_call_irqoff\n" + ".type ftrace_emulate_call_irqoff, @function\n" + "ftrace_emulate_call_irqoff:\n\t" + "push %gs:ftrace_bp_call_return\n\t" + "sti\n\t" + "jmp ftrace_caller\n" + ".size ftrace_emulate_call_irqoff, .-ftrace_emulate_call_irqoff\n" + + ".global ftrace_emulate_call_irqon\n" + ".type ftrace_emulate_call_irqon, @function\n" + "ftrace_emulate_call_irqon:\n\t" + "push %gs:ftrace_bp_call_return\n\t" + "jmp ftrace_caller\n" + ".size ftrace_emulate_call_irqon, .-ftrace_emulate_call_irqon\n" + + ".global ftrace_emulate_call_nmi\n" + ".type ftrace_emulate_call_nmi, @function\n" + "ftrace_emulate_call_nmi:\n\t" + "push %gs:ftrace_bp_call_nmi_return\n\t" + "jmp ftrace_caller\n" + ".size ftrace_emulate_call_nmi, .-ftrace_emulate_call_nmi\n" + + ".global ftrace_emulate_call_update_irqoff\n" + ".type ftrace_emulate_call_update_irqoff, @function\n" + "ftrace_emulate_call_update_irqoff:\n\t" + "push %gs:ftrace_bp_call_return\n\t" + "sti\n\t" + "jmp *ftrace_update_func_call\n" + ".size ftrace_emulate_call_update_irqoff, .-ftrace_emulate_call_update_irqoff\n" + + ".global ftrace_emulate_call_update_irqon\n" + ".type ftrace_emulate_call_update_irqon, @function\n" + "ftrace_emulate_call_update_irqon:\n\t" + "push %gs:ftrace_bp_call_return\n\t" + "jmp *ftrace_update_func_call\n" + ".size ftrace_emulate_call_update_irqon, .-ftrace_emulate_call_update_irqon\n" + + ".global ftrace_emulate_call_update_nmi\n" + ".type ftrace_emulate_call_update_nmi, @function\n" + "ftrace_emulate_call_update_nmi:\n\t" + "push %gs:ftrace_bp_call_nmi_return\n\t" + "jmp *ftrace_update_func_call\n" + ".size ftrace_emulate_call_update_nmi, .-ftrace_emulate_call_update_nmi\n" + ".previous\n"); + +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_irqoff); +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_irqon); +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_nmi); +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_irqoff); +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_irqon); +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_nmi); + /* * A breakpoint was added to the code address we are about to * modify, and this is the handle that will just skip over it. @@ -295,10 +365,40 @@ int ftrace_int3_handler(struct pt_regs *regs) return 0;
ip = regs->ip - 1; - if (!ftrace_location(ip) && !is_ftrace_caller(ip)) + if (ftrace_location(ip)) { + if (in_nmi()) { + this_cpu_write(ftrace_bp_call_nmi_return, (void *)ip + MCOUNT_INSN_SIZE); + regs->ip = (unsigned long) ftrace_emulate_call_nmi; + return 1; + } + this_cpu_write(ftrace_bp_call_return, (void *)ip + MCOUNT_INSN_SIZE); + if (regs->flags & X86_EFLAGS_IF) { + regs->flags &= ~X86_EFLAGS_IF; + regs->ip = (unsigned long) ftrace_emulate_call_irqoff; + } else { + regs->ip = (unsigned long) ftrace_emulate_call_irqon; + } + } else if (is_ftrace_caller(ip)) { + /* If it's a jump, just need to skip it */ + if (!ftrace_update_func_call) { + regs->ip += MCOUNT_INSN_SIZE -1; + return 1; + } + if (in_nmi()) { + this_cpu_write(ftrace_bp_call_nmi_return, (void *)ip + MCOUNT_INSN_SIZE); + regs->ip = (unsigned long) ftrace_emulate_call_update_nmi; + return 1; + } + this_cpu_write(ftrace_bp_call_return, (void *)ip + MCOUNT_INSN_SIZE); + if (regs->flags & X86_EFLAGS_IF) { + regs->flags &= ~X86_EFLAGS_IF; + regs->ip = (unsigned long) ftrace_emulate_call_update_irqoff; + } else { + regs->ip = (unsigned long) ftrace_emulate_call_update_irqon; + } + } else { return 0; - - regs->ip += MCOUNT_INSN_SIZE - 1; + }
return 1; } @@ -859,6 +959,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
func = ftrace_ops_get_func(ops);
+ ftrace_update_func_call = (unsigned long)func; + /* Do a safe modify in case the trampoline is executing */ new = ftrace_call_replace(ip, (unsigned long)func); ret = update_ftrace_func(ip, new); @@ -960,6 +1062,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func) { unsigned char *new;
+ ftrace_update_func_call = 0; new = ftrace_jmp_replace(ip, (unsigned long)func);
return update_ftrace_func(ip, new);
On Tue, 30 Apr 2019 13:03:59 -0400 Steven Rostedt rostedt@goodmis.org wrote:
I also prefer Josh's stack shift solution, as I personally believe that's a cleaner solution. But I went ahead and implemented Linus's version to get it working for ftrace. Here's the code, and it survived some preliminary tests.
Well, it passed the second level of tests.
If this is the way we are going, I could add comments to the code and apply it to my queue and run it through the rest of my test suite and make it ready for the merge window. I may add a stable tag to it to go back to where live kernel patching was added, as it fixes a potential bug there.
-- Steve
From: "Steven Rostedt (VMware)" rostedt@goodmis.org
Nicolai Stange discovered[1] that if live kernel patching is enabled, and the function tracer started tracing the same function that was patched, the conversion of the fentry call site, during the transition from calling the live kernel patch trampoline to calling the iterator trampoline, would have a slight window where it didn't call anything.
As live kernel patching depends on ftrace always calling its code (to prevent the function being traced from being called, as it will redirect it), this small window would allow the old buggy function to be called, and this can cause undesirable results.
Nicolai submitted new patches[2] but these were controversial. As this is similar to the static call emulation issues that came up a while ago[3], Linus suggested using per CPU data along with special trampolines[4] to emulate the calls.
Linus's solution was for text poke (which was mostly what the static_call code did), but as ftrace has its own mechanism, it required doing its own thing.
Having ftrace use its own per CPU data and having its own set of specialized trampolines solves the issue of missed calls that live kernel patching suffers.
[1] http://lkml.kernel.org/r/20180726104029.7736-1-nstange@suse.de
[2] http://lkml.kernel.org/r/20190427100639.15074-1-nstange@suse.de
[3] http://lkml.kernel.org/r/3cf04e113d71c9f8e4be95fb84a510f085aa4afa.1541711457...
[4] http://lkml.kernel.org/r/CAHk-=wh5OpheSU8Em_Q3Hg8qw_JtoijxOdPtHru6d+5K8TWM=A...
Inspired-by: Linus Torvalds torvalds@linux-foundation.org
Cc: stable@vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org
---
 arch/x86/kernel/ftrace.c | 146 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 143 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c index ef49517f6bb2..835277043348 100644 --- a/arch/x86/kernel/ftrace.c +++ b/arch/x86/kernel/ftrace.c @@ -17,6 +17,7 @@ #include <linux/uaccess.h> #include <linux/ftrace.h> #include <linux/percpu.h> +#include <linux/frame.h> #include <linux/sched.h> #include <linux/slab.h> #include <linux/init.h> @@ -232,6 +233,9 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
static unsigned long ftrace_update_func;
+/* Used within inline asm below */ +unsigned long ftrace_update_func_call; + static int update_ftrace_func(unsigned long ip, void *new) { unsigned char old[MCOUNT_INSN_SIZE]; @@ -259,6 +263,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func) unsigned char *new; int ret;
+ ftrace_update_func_call = (unsigned long)func; + new = ftrace_call_replace(ip, (unsigned long)func); ret = update_ftrace_func(ip, new);
@@ -280,6 +286,103 @@ static nokprobe_inline int is_ftrace_caller(unsigned long ip) return 0; }
+/* + * We need to handle the "call func1" -> "call func2" case. + * Just skipping the call is not sufficient as it will be like + * changing to "nop" first and then updating the call. But some + * users of ftrace require calls never to be missed. + * + * To emulate the call while converting the call site with a breakpoint, + * some trampolines are used along with per CPU buffers. + * There are three trampolines for the call sites and three trampolines + * for the updating of the call in ftrace trampoline. The three + * trampolines are: + * + * 1) Interrupts are enabled when the breakpoint is hit + * 2) Interrupts are disabled when the breakpoint is hit + * 3) The breakpoint was hit in an NMI + * + * As per CPU data is used, interrupts must be disabled to prevent them + * from corrupting the data. A separate NMI trampoline is used for the + * NMI case. If interrupts are already disabled, then the return path + * of where the breakpoint was hit (saved in the per CPU data) is pushed + * on the stack and then a jump to either the ftrace_caller (which will + * loop through all registered ftrace_ops handlers depending on the ip + * address), or if its a ftrace trampoline call update, it will call + * ftrace_update_func_call which will hold the call that should be + * called. + */ +extern asmlinkage void ftrace_emulate_call_irqon(void); +extern asmlinkage void ftrace_emulate_call_irqoff(void); +extern asmlinkage void ftrace_emulate_call_nmi(void); +extern asmlinkage void ftrace_emulate_call_update_irqoff(void); +extern asmlinkage void ftrace_emulate_call_update_irqon(void); +extern asmlinkage void ftrace_emulate_call_update_nmi(void); + +static DEFINE_PER_CPU(void *, ftrace_bp_call_return); +static DEFINE_PER_CPU(void *, ftrace_bp_call_nmi_return); + +asm( + ".text\n" + + /* Trampoline for function update with interrupts enabled */ + ".global ftrace_emulate_call_irqoff\n" + ".type ftrace_emulate_call_irqoff, @function\n" + "ftrace_emulate_call_irqoff:\n\t" + "push %gs:ftrace_bp_call_return\n\t" + "sti\n\t" + "jmp ftrace_caller\n" + ".size ftrace_emulate_call_irqoff, .-ftrace_emulate_call_irqoff\n" + + /* Trampoline for function update with interrupts disabled*/ + ".global ftrace_emulate_call_irqon\n" + ".type ftrace_emulate_call_irqon, @function\n" + "ftrace_emulate_call_irqon:\n\t" + "push %gs:ftrace_bp_call_return\n\t" + "jmp ftrace_caller\n" + ".size ftrace_emulate_call_irqon, .-ftrace_emulate_call_irqon\n" + + /* Trampoline for function update in an NMI */ + ".global ftrace_emulate_call_nmi\n" + ".type ftrace_emulate_call_nmi, @function\n" + "ftrace_emulate_call_nmi:\n\t" + "push %gs:ftrace_bp_call_nmi_return\n\t" + "jmp ftrace_caller\n" + ".size ftrace_emulate_call_nmi, .-ftrace_emulate_call_nmi\n" + + /* Trampoline for ftrace trampoline call update with interrupts enabled */ + ".global ftrace_emulate_call_update_irqoff\n" + ".type ftrace_emulate_call_update_irqoff, @function\n" + "ftrace_emulate_call_update_irqoff:\n\t" + "push %gs:ftrace_bp_call_return\n\t" + "sti\n\t" + "jmp *ftrace_update_func_call\n" + ".size ftrace_emulate_call_update_irqoff, .-ftrace_emulate_call_update_irqoff\n" + + /* Trampoline for ftrace trampoline call update with interrupts disabled */ + ".global ftrace_emulate_call_update_irqon\n" + ".type ftrace_emulate_call_update_irqon, @function\n" + "ftrace_emulate_call_update_irqon:\n\t" + "push %gs:ftrace_bp_call_return\n\t" + "jmp *ftrace_update_func_call\n" + ".size ftrace_emulate_call_update_irqon, .-ftrace_emulate_call_update_irqon\n" + + /* Trampoline for ftrace 
trampoline call update in an NMI */ + ".global ftrace_emulate_call_update_nmi\n" + ".type ftrace_emulate_call_update_nmi, @function\n" + "ftrace_emulate_call_update_nmi:\n\t" + "push %gs:ftrace_bp_call_nmi_return\n\t" + "jmp *ftrace_update_func_call\n" + ".size ftrace_emulate_call_update_nmi, .-ftrace_emulate_call_update_nmi\n" + ".previous\n"); + +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_irqoff); +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_irqon); +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_nmi); +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_irqoff); +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_irqon); +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_nmi); + /* * A breakpoint was added to the code address we are about to * modify, and this is the handle that will just skip over it. @@ -295,10 +398,44 @@ int ftrace_int3_handler(struct pt_regs *regs) return 0;
ip = regs->ip - 1; - if (!ftrace_location(ip) && !is_ftrace_caller(ip)) + if (ftrace_location(ip)) { + /* A breakpoint at the beginning of the function was hit */ + if (in_nmi()) { + /* NMIs have their own trampoline */ + this_cpu_write(ftrace_bp_call_nmi_return, (void *)ip + MCOUNT_INSN_SIZE); + regs->ip = (unsigned long) ftrace_emulate_call_nmi; + return 1; + } + this_cpu_write(ftrace_bp_call_return, (void *)ip + MCOUNT_INSN_SIZE); + if (regs->flags & X86_EFLAGS_IF) { + regs->flags &= ~X86_EFLAGS_IF; + regs->ip = (unsigned long) ftrace_emulate_call_irqoff; + } else { + regs->ip = (unsigned long) ftrace_emulate_call_irqon; + } + } else if (is_ftrace_caller(ip)) { + /* An ftrace trampoline is being updated */ + if (!ftrace_update_func_call) { + /* If it's a jump, just need to skip it */ + regs->ip += MCOUNT_INSN_SIZE -1; + return 1; + } + if (in_nmi()) { + /* NMIs have their own trampoline */ + this_cpu_write(ftrace_bp_call_nmi_return, (void *)ip + MCOUNT_INSN_SIZE); + regs->ip = (unsigned long) ftrace_emulate_call_update_nmi; + return 1; + } + this_cpu_write(ftrace_bp_call_return, (void *)ip + MCOUNT_INSN_SIZE); + if (regs->flags & X86_EFLAGS_IF) { + regs->flags &= ~X86_EFLAGS_IF; + regs->ip = (unsigned long) ftrace_emulate_call_update_irqoff; + } else { + regs->ip = (unsigned long) ftrace_emulate_call_update_irqon; + } + } else { return 0; - - regs->ip += MCOUNT_INSN_SIZE - 1; + }
return 1; } @@ -859,6 +996,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
func = ftrace_ops_get_func(ops);
+ ftrace_update_func_call = (unsigned long)func; + /* Do a safe modify in case the trampoline is executing */ new = ftrace_call_replace(ip, (unsigned long)func); ret = update_ftrace_func(ip, new); @@ -960,6 +1099,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func) { unsigned char *new;
+ ftrace_update_func_call = 0; new = ftrace_jmp_replace(ip, (unsigned long)func);
return update_ftrace_func(ip, new);
On Tue, Apr 30, 2019 at 10:49 AM Steven Rostedt rostedt@goodmis.org wrote:
+asm(
".text\n"
/* Trampoline for function update with interrupts enabled */
".global ftrace_emulate_call_irqoff\n"
".type ftrace_emulate_call_irqoff, @function\n"
"ftrace_emulate_call_irqoff:\n\t"
"push %gs:ftrace_bp_call_return\n\t"
Well, as mentioned in my original suggestion, this won't work on 32-bit, or on UP. They have different models for per-cpu data (32-bit uses %fs, and UP doesn't use a segment override at all).
Maybe we just don't care about UP at all for this code, of course.
And maybe we can make the decision to also make 32-bit just not use this either - so maybe the code is ok per se, just needs to make sure it never triggers for the cases that it's not written for..
"ftrace_emulate_call_update_irqoff:\n\t"
"push %gs:ftrace_bp_call_return\n\t"
"sti\n\t"
"jmp *ftrace_update_func_call\n"
.. and this should then use the "push push sti ret" model instead.
Plus get updated for objtool complaints.
Anyway, since Andy really likes the entry code change, can we have that patch in parallel and judge the difference that way? Iirc, that was x86-64 specific too.
Linus
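[Editorial aside: a hedged sketch of what that "push push sti ret" shape might look like for the update trampoline, reusing the ftrace_bp_call_return and ftrace_update_func_call names from Steve's patch above. The _sketch suffix is made up; this is an illustration, not code from the thread, though Steve's v2 later in the thread ends up doing essentially this, with a macro hiding the %gs/%fs/UP differences.]

asm(
	".text\n"
	/* Illustrative only: push-push-sti-ret form of the update trampoline */
	".global ftrace_emulate_call_update_irqoff_sketch\n"
	".type ftrace_emulate_call_update_irqoff_sketch, @function\n"
	"ftrace_emulate_call_update_irqoff_sketch:\n\t"
	"push %gs:ftrace_bp_call_return\n\t"	/* return address of the emulated call */
	"push ftrace_update_func_call\n\t"	/* target, consumed by the ret below */
	"sti\n\t"				/* per-CPU slot was already read above */
	"ret\n\t"				/* jumps to the pushed target */
	".size ftrace_emulate_call_update_irqoff_sketch, .-ftrace_emulate_call_update_irqoff_sketch\n"
	".previous\n");

The ret consumes the pushed target, so there is no indirect jmp for objtool/retpolines to complain about, and the per-CPU slot is read before interrupts get re-enabled; the target then runs with only the return address on the stack, just as a real call would leave it.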
On Tue, 30 Apr 2019 11:33:21 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
On Tue, Apr 30, 2019 at 10:49 AM Steven Rostedt rostedt@goodmis.org wrote:
+asm(
".text\n"
/* Trampoline for function update with interrupts enabled */
".global ftrace_emulate_call_irqoff\n"
".type ftrace_emulate_call_irqoff, @function\n"
"ftrace_emulate_call_irqoff:\n\t"
"push %gs:ftrace_bp_call_return\n\t"
Well, as mentioned in my original suggestion, this won't work on 32-bit, or on UP. They have different models for per-cpu data (32-bit uses %fs, and UP doesn't use a segment override at all).
Ah, yeah, I forgot about 32-bit. I could easily make this use fs as well, and for UP, just use a static variable.
Maybe we just don't care about UP at all for this code, of course.
And maybe we can make the decision to also make 32-bit just not use this either - so maybe the code is ok per se, just needs to make sure it never triggers for the cases that it's not written for..
"ftrace_emulate_call_update_irqoff:\n\t"
"push %gs:ftrace_bp_call_return\n\t"
"sti\n\t"
"jmp *ftrace_update_func_call\n"
.. and this should then use the "push push sti ret" model instead.
Plus get updated for objtool complaints.
Yeah, I see that now. Somehow it disappeared when I looked for it after making some other changes. I can update it.
Anyway, since Andy really likes the entry code change, can we have that patch in parallel and judge the difference that way? Iirc, that was x86-64 specific too.
Note, I don't think live kernel patching supports 32 bit anyway, so that may not be an issue.
Josh,
When you come back to the office, can you look into that method?
-- Steve
On Tue, 30 Apr 2019 11:33:21 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
"ftrace_emulate_call_update_irqoff:\n\t"
"push %gs:ftrace_bp_call_return\n\t"
"sti\n\t"
"jmp *ftrace_update_func_call\n"
.. and this should then use the "push push sti ret" model instead.
Plus get updated for objtool complaints.
And unfortunately, this blows up on lockdep. Lockdep notices that the return from the breakpoint handler has interrupts enabled, and will not enable them in its shadow irqs-disabled variable. But then we enable them in the trampoline without telling lockdep, and we trigger something like this:
------------[ cut here ]------------
IRQs not enabled as expected
WARNING: CPU: 2 PID: 0 at kernel/time/tick-sched.c:979 tick_nohz_idle_enter+0x44/0x8c
Modules linked in:
CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.1.0-rc3-test+ #123
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
EIP: tick_nohz_idle_enter+0x44/0x8c
Code: f0 05 00 00 00 75 26 83 b8 c4 05 00 00 00 75 1d 80 3d 5f 0f 43 c1 00 75 14 68 72 74 16 c1 c6 05 5f 0f 43 c1 01 e8 33 d7 f8 ff <0f> 0b 58 fa e8 4e 2c 04 00 bb e0 36 6b c1 64 03 1d 28 81 56 c1 8b
EAX: 0000001c EBX: ee769f84 ECX: 00000000 EDX: 00000006
ESI: 00000000 EDI: 00000002 EBP: ee769f50 ESP: ee769f48
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210292
CR0: 80050033 CR2: 00000000 CR3: 016c4000 CR4: 001406f0
Call Trace:
 do_idle+0x2a/0x1fc
 cpu_startup_entry+0x1e/0x20
 start_secondary+0x1d3/0x1ec
 startup_32_smp+0x164/0x168
I have to fool lockdep with the following:
	if (regs->flags & X86_EFLAGS_IF) {
		regs->flags &= ~X86_EFLAGS_IF;
		regs->ip = (unsigned long) ftrace_emulate_call_irqoff;
		/* Tell lockdep here we are enabling interrupts */
		trace_hardirqs_on();
	} else {
		regs->ip = (unsigned long) ftrace_emulate_call_irqon;
	}
-- Steve
On Tue, Apr 30, 2019 at 11:33:21AM -0700, Linus Torvalds wrote:
Anyway, since Andy really likes the entry code change, can we have that patch in parallel and judge the difference that way? Iirc, that was x86-64 specific too.
Here goes, compile tested only...
It obviously needs a self-test, but that shouldn't be too hard to arrange.
---
 arch/x86/entry/entry_32.S            |  7 +++++++
 arch/x86/entry/entry_64.S            | 14 ++++++++++++--
 arch/x86/include/asm/text-patching.h | 20 ++++++++++++++++++++
 arch/x86/kernel/ftrace.c             | 24 +++++++++++++++++++-----
 4 files changed, 58 insertions(+), 7 deletions(-)
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index 7b23431be5cb..d246302085a3 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -1479,6 +1479,13 @@ ENTRY(int3) ASM_CLAC pushl $-1 # mark this as an int
+ testl $SEGMENT_RPL_MASK, PT_CS(%esp) + jnz .Lfrom_usermode_no_gap + .rept 6 + pushl 5*4(%esp) + .endr +.Lfrom_usermode_no_gap: + SAVE_ALL switch_stacks=1 ENCODE_FRAME_POINTER TRACE_IRQS_OFF diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 20e45d9b4e15..268cd9affe04 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -878,7 +878,7 @@ apicinterrupt IRQ_WORK_VECTOR irq_work_interrupt smp_irq_work_interrupt * @paranoid == 2 is special: the stub will never switch stacks. This is for * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS. */ -.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0 +.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0 create_gap=0 ENTRY(\sym) UNWIND_HINT_IRET_REGS offset=\has_error_code*8
@@ -898,6 +898,16 @@ ENTRY(\sym) jnz .Lfrom_usermode_switch_stack_@ .endif
+ .if \create_gap == 1 + testb $3, CS-ORIG_RAX(%rsp) + jnz .Lfrom_usermode_no_gap_@ + .rept 6 + pushq 5*8(%rsp) + .endr + UNWIND_HINT_IRET_REGS offset=8 +.Lfrom_usermode_no_gap_@: + .endif + .if \paranoid call paranoid_entry .else @@ -1129,7 +1139,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \ #endif /* CONFIG_HYPERV */
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=IST_INDEX_DB ist_offset=DB_STACK_OFFSET -idtentry int3 do_int3 has_error_code=0 +idtentry int3 do_int3 has_error_code=0 create_gap=1 idtentry stack_segment do_stack_segment has_error_code=1
#ifdef CONFIG_XEN_PV diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h index e85ff65c43c3..ba275b6292db 100644 --- a/arch/x86/include/asm/text-patching.h +++ b/arch/x86/include/asm/text-patching.h @@ -39,4 +39,24 @@ extern int poke_int3_handler(struct pt_regs *regs); extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler); extern int after_bootmem;
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val) +{ + regs->sp -= sizeof(unsigned long); + *(unsigned long *)regs->sp = val; +} + +static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip) +{ + regs->ip = ip; +} + +#define INT3_INSN_SIZE 1 +#define CALL_INSN_SIZE 5 + +static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func) +{ + int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE); + int3_emulate_jmp(regs, func); +} + #endif /* _ASM_X86_TEXT_PATCHING_H */ diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c index ef49517f6bb2..90d319687d7e 100644 --- a/arch/x86/kernel/ftrace.c +++ b/arch/x86/kernel/ftrace.c @@ -29,6 +29,7 @@ #include <asm/kprobes.h> #include <asm/ftrace.h> #include <asm/nops.h> +#include <asm/text-patching.h>
#ifdef CONFIG_DYNAMIC_FTRACE
@@ -231,6 +232,7 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr, }
static unsigned long ftrace_update_func; +static unsigned long ftrace_update_func_call;
static int update_ftrace_func(unsigned long ip, void *new) { @@ -259,6 +261,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func) unsigned char *new; int ret;
+ ftrace_update_func_call = (unsigned long)func; + new = ftrace_call_replace(ip, (unsigned long)func); ret = update_ftrace_func(ip, new);
@@ -295,12 +299,19 @@ int ftrace_int3_handler(struct pt_regs *regs) return 0;
ip = regs->ip - 1; - if (!ftrace_location(ip) && !is_ftrace_caller(ip)) - return 0; - - regs->ip += MCOUNT_INSN_SIZE - 1; + if (ftrace_location(ip)) { + int3_emulate_call(regs, ftrace_update_func_call); + return 1; + } else if (is_ftrace_caller(ip)) { + if (!ftrace_update_func_call) { + int3_emulate_jmp(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE); + return 1; + } + int3_emulate_call(regs, ftrace_update_func_call); + return 1; + }
- return 1; + return 0; } NOKPROBE_SYMBOL(ftrace_int3_handler);
@@ -859,6 +870,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
func = ftrace_ops_get_func(ops);
+ ftrace_update_func_call = (unsigned long)func; + /* Do a safe modify in case the trampoline is executing */ new = ftrace_call_replace(ip, (unsigned long)func); ret = update_ftrace_func(ip, new); @@ -960,6 +973,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func) { unsigned char *new;
+ ftrace_update_func_call = 0UL; new = ftrace_jmp_replace(ip, (unsigned long)func);
return update_ftrace_func(ip, new);
On Wed, 1 May 2019 15:11:17 +0200 Peter Zijlstra peterz@infradead.org wrote:
On Tue, Apr 30, 2019 at 11:33:21AM -0700, Linus Torvalds wrote:
Anyway, since Andy really likes the entry code change, can we have that patch in parallel and judge the difference that way? Iirc, that was x86-64 specific too.
Here goes, compile tested only...
It obviously needs a self-test, but that shouldn't be too hard to arrange.
I was able to get it applied (with slight tweaking) but it then crashed. But that was due to incorrect updates in the ftrace_int3_handler().
 arch/x86/entry/entry_32.S            |  7 +++++++
 arch/x86/entry/entry_64.S            | 14 ++++++++++++--
 arch/x86/include/asm/text-patching.h | 20 ++++++++++++++++++++
 arch/x86/kernel/ftrace.c             | 24 +++++++++++++++++++-----
 4 files changed, 58 insertions(+), 7 deletions(-)
#endif /* _ASM_X86_TEXT_PATCHING_H */ diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c index ef49517f6bb2..90d319687d7e 100644 --- a/arch/x86/kernel/ftrace.c +++ b/arch/x86/kernel/ftrace.c @@ -29,6 +29,7 @@ #include <asm/kprobes.h> #include <asm/ftrace.h> #include <asm/nops.h> +#include <asm/text-patching.h> #ifdef CONFIG_DYNAMIC_FTRACE @@ -231,6 +232,7 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr, } static unsigned long ftrace_update_func; +static unsigned long ftrace_update_func_call; static int update_ftrace_func(unsigned long ip, void *new) { @@ -259,6 +261,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func) unsigned char *new; int ret;
- ftrace_update_func_call = (unsigned long)func;
- new = ftrace_call_replace(ip, (unsigned long)func); ret = update_ftrace_func(ip, new);
@@ -295,12 +299,19 @@ int ftrace_int3_handler(struct pt_regs *regs) return 0; ip = regs->ip - 1;
- if (!ftrace_location(ip) && !is_ftrace_caller(ip))
return 0;
- regs->ip += MCOUNT_INSN_SIZE - 1;
- if (ftrace_location(ip)) {
int3_emulate_call(regs, ftrace_update_func_call);
Should be:
int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
return 1;
- } else if (is_ftrace_caller(ip)) {
if (!ftrace_update_func_call) {
int3_emulate_jmp(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
I see what you did here, but I think:
int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
looks better. But that said, we could in the beginning do:
ip = regs->ip - INT3_INSN_SIZE;
instead of
ip = regs->ip - 1;
I made these updates and posted them to Linus.
-- Steve
return 1;
}
int3_emulate_call(regs, ftrace_update_func_call);
return 1;
- }
- return 1;
- return 0;
} NOKPROBE_SYMBOL(ftrace_int3_handler); @@ -859,6 +870,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops) func = ftrace_ops_get_func(ops);
- ftrace_update_func_call = (unsigned long)func;
- /* Do a safe modify in case the trampoline is executing */ new = ftrace_call_replace(ip, (unsigned long)func); ret = update_ftrace_func(ip, new);
@@ -960,6 +973,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func) { unsigned char *new;
- ftrace_update_func_call = 0UL; new = ftrace_jmp_replace(ip, (unsigned long)func);
return update_ftrace_func(ip, new);
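[Editorial aside: putting Steve's corrections together, here is a consolidated sketch of how ftrace_int3_handler() looks with this approach - ftrace_regs_caller for function entry sites, INT3_INSN_SIZE instead of the bare 1, and int3_emulate_call()/int3_emulate_jmp() from Peter's text-patching.h patch above. This is a reconstruction for readability; the patch that was actually posted may differ in detail. The WARN_ON_ONCE precondition is the one already present in the existing handler.]

int ftrace_int3_handler(struct pt_regs *regs)
{
	unsigned long ip;

	if (WARN_ON_ONCE(!regs || user_mode(regs)))
		return 0;

	ip = regs->ip - INT3_INSN_SIZE;

	if (ftrace_location(ip)) {
		/*
		 * A breakpoint at a function entry site: emulate a call to
		 * the register-saving list iterator, which never misses ops.
		 */
		int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
		return 1;
	} else if (is_ftrace_caller(ip)) {
		if (!ftrace_update_func_call) {
			/*
			 * A jump (function graph) going to or from a nop:
			 * just skip over the instruction being patched.
			 */
			int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
			return 1;
		}
		/*
		 * The call inside an ftrace trampoline is being updated:
		 * emulate a call to the function it should be calling.
		 */
		int3_emulate_call(regs, ftrace_update_func_call);
		return 1;
	}

	return 0;
}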
On Wed, May 01, 2019 at 02:58:24PM -0400, Steven Rostedt wrote:
- if (ftrace_location(ip)) {
int3_emulate_call(regs, ftrace_update_func_call);
Should be:
int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
Ah, I lost the plot a little there.
return 1;
- } else if (is_ftrace_caller(ip)) {
if (!ftrace_update_func_call) {
int3_emulate_jmp(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
I see what you did here, but I think:
int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
looks better. But that said, we could in the beginning do:
ip = regs->ip - INT3_INSN_SIZE;
instead of
ip = regs->ip - 1;
I made these updates and posted them to Linus.
I was actually considering:
static inline void int3_emulate_nop(struct pt_regs *regs, unsigned long size)
{
	int3_emulate_jmp(regs, regs->ip - INT3_INSN_SIZE + size);
}
And then the above becomes:
int3_emulate_nop(regs, CALL_INSN_SIZE);
Hmm?
On Wed, May 1, 2019 at 6:11 AM Peter Zijlstra peterz@infradead.org wrote:
Here goes, compile tested only...
Ugh, two different threads. This has the same bug (same source) as the one Steven posted:
--- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -1479,6 +1479,13 @@ ENTRY(int3) ASM_CLAC pushl $-1 # mark this as an int
testl $SEGMENT_RPL_MASK, PT_CS(%esp)
jnz .Lfrom_usermode_no_gap
.rept 6
pushl 5*4(%esp)
.endr
+.Lfrom_usermode_no_gap:
This will corrupt things horribly if you still use vm86 mode. Checking CS RPL is simply not correct.
Linus
On Wed, May 01, 2019 at 12:03:52PM -0700, Linus Torvalds wrote:
On Wed, May 1, 2019 at 6:11 AM Peter Zijlstra peterz@infradead.org wrote:
Here goes, compile tested only...
Ugh, two different threads. This has the same bug (same source) as the one Steven posted:
This is what Steve started from; lets continue in the other thread.
--- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -1479,6 +1479,13 @@ ENTRY(int3) ASM_CLAC pushl $-1 # mark this as an int
testl $SEGMENT_RPL_MASK, PT_CS(%esp)
jnz .Lfrom_usermode_no_gap
.rept 6
pushl 5*4(%esp)
.endr
+.Lfrom_usermode_no_gap:
This will corrupt things horribly if you still use vm86 mode. Checking CS RPL is simply not correct.
I'll go fix; I never really understood that vm86 crud and I cobbled this 32bit thing together based on the 64bit version (that Josh did a while ago).
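[Editorial aside: one plausible shape for such a fix is to also treat a set VM flag as "from user space", since the saved CS:RPL means nothing in vm86 mode. The sketch below builds on the hunk quoted above and uses the PT_EFLAGS/PT_CS asm-offsets that entry_32.S already uses; it is an illustration only, not necessarily what was eventually merged.]

	/* Sketch: treat vm86 mode like user mode and skip the gap there too */
	testl	$X86_EFLAGS_VM, PT_EFLAGS(%esp)
	jnz	.Lfrom_usermode_no_gap
	testl	$SEGMENT_RPL_MASK, PT_CS(%esp)
	jnz	.Lfrom_usermode_no_gap
	.rept	6
	pushl	5*4(%esp)
	.endr
.Lfrom_usermode_no_gap: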
On Wed, 1 May 2019 12:03:52 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
On Wed, May 1, 2019 at 6:11 AM Peter Zijlstra peterz@infradead.org wrote:
Here goes, compile tested only...
Ugh, two different threads. This has the same bug (same source) as the one Steven posted:
--- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -1479,6 +1479,13 @@ ENTRY(int3) ASM_CLAC pushl $-1 # mark this as an int
testl $SEGMENT_RPL_MASK, PT_CS(%esp)
jnz .Lfrom_usermode_no_gap
.rept 6
pushl 5*4(%esp)
.endr
+.Lfrom_usermode_no_gap:
This will corrupt things horribly if you still use vm86 mode. Checking CS RPL is simply not correct.
I never tested the 32 bit version of this. And we could just not implement it (I don't think there's live kernel patching for it either).
But this doesn't make it any worse than my version, because under the full testing of my patch with the trampolines, I would easily crash the 32 bit version. That was one reason I made my last patch only support 64 bit.
Under light load, 32 bit works, but when I stress it (running perf and ftrace together) it blows up. Could be an NMI issue.
-- Steve
On Wed, 1 May 2019, Steven Rostedt wrote:
I never tested the 32 bit version of this. And we could just not implement it (I don't think there's live kernel patching for it either).
That's correct, there is no livepatching on x86_32 (and no plans for it). CONFIG_LIVEPATCH is not available for 32bit builds at all.
On Wed, May 01, 2019 at 09:33:28PM +0200, Jiri Kosina wrote:
On Wed, 1 May 2019, Steven Rostedt wrote:
I never tested the 32 bit version of this. And we could just not implement it (I don't think there's live kernel patching for it either).
That's correct, there is no livepatching on x86_32 (and no plans for it). CONFIG_LIVEPATCH is not available for 32bit builds at all.
We still want this for static_call(), even on 32bit.
From: "Steven Rostedt (VMware)" rostedt@goodmis.org
Nicolai Stange discovered[1] that if live kernel patching is enabled, and the function tracer started tracing the same function that was patched, the conversion of the fentry call site, during the transition from calling the live kernel patch trampoline to calling the iterator trampoline, would have a slight window where it didn't call anything.
As live kernel patching depends on ftrace always calling its code (to prevent the function being traced from being called, as it will redirect it), this small window would allow the old buggy function to be called, and this can cause undesirable results.
Nicolai submitted new patches[2] but these were controversial. As this is similar to the static call emulation issues that came up a while ago[3], Linus suggested using per CPU data along with special trampolines[4] to emulate the calls.
Linus's solution was for text poke (which was mostly what the static_call code did), but as ftrace has its own mechanism, it required doing its own thing.
Having ftrace use its own per CPU data and having its own set of specialized trampolines solves the issue of missed calls that live kernel patching suffers.
[1] http://lkml.kernel.org/r/20180726104029.7736-1-nstange@suse.de
[2] http://lkml.kernel.org/r/20190427100639.15074-1-nstange@suse.de
[3] http://lkml.kernel.org/r/3cf04e113d71c9f8e4be95fb84a510f085aa4afa.1541711457...
[4] http://lkml.kernel.org/r/CAHk-=wh5OpheSU8Em_Q3Hg8qw_JtoijxOdPtHru6d+5K8TWM=A...
Inspired-by: Linus Torvalds torvalds@linux-foundation.org
Cc: stable@vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org
---
Changes since v1:
- Use "push push ret" instead of indirect jumps (Linus) - Handle 32 bit as well as non SMP - Fool lockdep into thinking interrupts are enabled
 arch/x86/kernel/ftrace.c | 175 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 170 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c index ef49517f6bb2..9160f5cc3b6d 100644 --- a/arch/x86/kernel/ftrace.c +++ b/arch/x86/kernel/ftrace.c @@ -17,6 +17,7 @@ #include <linux/uaccess.h> #include <linux/ftrace.h> #include <linux/percpu.h> +#include <linux/frame.h> #include <linux/sched.h> #include <linux/slab.h> #include <linux/init.h> @@ -232,6 +233,9 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
static unsigned long ftrace_update_func;
+/* Used within inline asm below */ +unsigned long ftrace_update_func_call; + static int update_ftrace_func(unsigned long ip, void *new) { unsigned char old[MCOUNT_INSN_SIZE]; @@ -259,6 +263,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func) unsigned char *new; int ret;
+ ftrace_update_func_call = (unsigned long)func; + new = ftrace_call_replace(ip, (unsigned long)func); ret = update_ftrace_func(ip, new);
@@ -280,6 +286,125 @@ static nokprobe_inline int is_ftrace_caller(unsigned long ip) return 0; }
+/* + * We need to handle the "call func1" -> "call func2" case. + * Just skipping the call is not sufficient as it will be like + * changing to "nop" first and then updating the call. But some + * users of ftrace require calls never to be missed. + * + * To emulate the call while converting the call site with a breakpoint, + * some trampolines are used along with per CPU buffers. + * There are three trampolines for the call sites and three trampolines + * for the updating of the call in ftrace trampoline. The three + * trampolines are: + * + * 1) Interrupts are enabled when the breakpoint is hit + * 2) Interrupts are disabled when the breakpoint is hit + * 3) The breakpoint was hit in an NMI + * + * As per CPU data is used, interrupts must be disabled to prevent them + * from corrupting the data. A separate NMI trampoline is used for the + * NMI case. If interrupts are already disabled, then the return path + * of where the breakpoint was hit (saved in the per CPU data) is pushed + * on the stack and then a jump to either the ftrace_caller (which will + * loop through all registered ftrace_ops handlers depending on the ip + * address), or if its a ftrace trampoline call update, it will call + * ftrace_update_func_call which will hold the call that should be + * called. + */ +extern asmlinkage void ftrace_emulate_call_irqon(void); +extern asmlinkage void ftrace_emulate_call_irqoff(void); +extern asmlinkage void ftrace_emulate_call_nmi(void); +extern asmlinkage void ftrace_emulate_call_update_irqoff(void); +extern asmlinkage void ftrace_emulate_call_update_irqon(void); +extern asmlinkage void ftrace_emulate_call_update_nmi(void); + +static DEFINE_PER_CPU(void *, ftrace_bp_call_return); +static DEFINE_PER_CPU(void *, ftrace_bp_call_nmi_return); + +#ifdef CONFIG_SMP +#ifdef CONFIG_X86_64 +# define BP_CALL_RETURN "%gs:ftrace_bp_call_return" +# define BP_CALL_NMI_RETURN "%gs:ftrace_bp_call_nmi_return" +#else +# define BP_CALL_RETURN "%fs:ftrace_bp_call_return" +# define BP_CALL_NMI_RETURN "%fs:ftrace_bp_call_nmi_return" +#endif +#else /* SMP */ +# define BP_CALL_RETURN "ftrace_bp_call_return" +# define BP_CALL_NMI_RETURN "ftrace_bp_call_nmi_return" +#endif + +/* To hold the ftrace_caller address to push on the stack */ +void *ftrace_caller_func = (void *)ftrace_caller; + +asm( + ".text\n" + + /* Trampoline for function update with interrupts enabled */ + ".global ftrace_emulate_call_irqoff\n" + ".type ftrace_emulate_call_irqoff, @function\n" + "ftrace_emulate_call_irqoff:\n\t" + "push "BP_CALL_RETURN"\n\t" + "push ftrace_caller_func\n" + "sti\n\t" + "ret\n\t" + ".size ftrace_emulate_call_irqoff, .-ftrace_emulate_call_irqoff\n" + + /* Trampoline for function update with interrupts disabled*/ + ".global ftrace_emulate_call_irqon\n" + ".type ftrace_emulate_call_irqon, @function\n" + "ftrace_emulate_call_irqon:\n\t" + "push "BP_CALL_RETURN"\n\t" + "push ftrace_caller_func\n" + "ret\n\t" + ".size ftrace_emulate_call_irqon, .-ftrace_emulate_call_irqon\n" + + /* Trampoline for function update in an NMI */ + ".global ftrace_emulate_call_nmi\n" + ".type ftrace_emulate_call_nmi, @function\n" + "ftrace_emulate_call_nmi:\n\t" + "push "BP_CALL_NMI_RETURN"\n\t" + "push ftrace_caller_func\n" + "ret\n\t" + ".size ftrace_emulate_call_nmi, .-ftrace_emulate_call_nmi\n" + + /* Trampoline for ftrace trampoline call update with interrupts enabled */ + ".global ftrace_emulate_call_update_irqoff\n" + ".type ftrace_emulate_call_update_irqoff, @function\n" + "ftrace_emulate_call_update_irqoff:\n\t" + "push 
"BP_CALL_RETURN"\n\t" + "push ftrace_update_func_call\n" + "sti\n\t" + "ret\n\t" + ".size ftrace_emulate_call_update_irqoff, .-ftrace_emulate_call_update_irqoff\n" + + /* Trampoline for ftrace trampoline call update with interrupts disabled */ + ".global ftrace_emulate_call_update_irqon\n" + ".type ftrace_emulate_call_update_irqon, @function\n" + "ftrace_emulate_call_update_irqon:\n\t" + "push "BP_CALL_RETURN"\n\t" + "push ftrace_update_func_call\n" + "ret\n\t" + ".size ftrace_emulate_call_update_irqon, .-ftrace_emulate_call_update_irqon\n" + + /* Trampoline for ftrace trampoline call update in an NMI */ + ".global ftrace_emulate_call_update_nmi\n" + ".type ftrace_emulate_call_update_nmi, @function\n" + "ftrace_emulate_call_update_nmi:\n\t" + "push "BP_CALL_NMI_RETURN"\n\t" + "push ftrace_update_func_call\n" + "ret\n\t" + ".size ftrace_emulate_call_update_nmi, .-ftrace_emulate_call_update_nmi\n" + ".previous\n"); + +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_irqoff); +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_irqon); +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_nmi); +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_irqoff); +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_irqon); +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_nmi); + /* * A breakpoint was added to the code address we are about to * modify, and this is the handle that will just skip over it. @@ -295,12 +420,49 @@ int ftrace_int3_handler(struct pt_regs *regs) return 0;
ip = regs->ip - 1; - if (!ftrace_location(ip) && !is_ftrace_caller(ip)) - return 0; - - regs->ip += MCOUNT_INSN_SIZE - 1; + if (ftrace_location(ip)) { + /* A breakpoint at the beginning of the function was hit */ + if (in_nmi()) { + /* NMIs have their own trampoline */ + this_cpu_write(ftrace_bp_call_nmi_return, (void *)ip + MCOUNT_INSN_SIZE); + regs->ip = (unsigned long) ftrace_emulate_call_nmi; + return 1; + } + this_cpu_write(ftrace_bp_call_return, (void *)ip + MCOUNT_INSN_SIZE); + if (regs->flags & X86_EFLAGS_IF) { + regs->flags &= ~X86_EFLAGS_IF; + regs->ip = (unsigned long) ftrace_emulate_call_irqoff; + /* Tell lockdep here we are enabling interrupts */ + trace_hardirqs_on(); + } else { + regs->ip = (unsigned long) ftrace_emulate_call_irqon; + } + return 1; + } else if (is_ftrace_caller(ip)) { + /* An ftrace trampoline is being updated */ + if (!ftrace_update_func_call) { + /* If it's a jump, just need to skip it */ + regs->ip += MCOUNT_INSN_SIZE -1; + return 1; + } + if (in_nmi()) { + /* NMIs have their own trampoline */ + this_cpu_write(ftrace_bp_call_nmi_return, (void *)ip + MCOUNT_INSN_SIZE); + regs->ip = (unsigned long) ftrace_emulate_call_update_nmi; + return 1; + } + this_cpu_write(ftrace_bp_call_return, (void *)ip + MCOUNT_INSN_SIZE); + if (regs->flags & X86_EFLAGS_IF) { + regs->flags &= ~X86_EFLAGS_IF; + regs->ip = (unsigned long) ftrace_emulate_call_update_irqoff; + trace_hardirqs_on(); + } else { + regs->ip = (unsigned long) ftrace_emulate_call_update_irqon; + } + return 1; + }
- return 1; + return 0; } NOKPROBE_SYMBOL(ftrace_int3_handler);
@@ -859,6 +1021,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
func = ftrace_ops_get_func(ops);
+ ftrace_update_func_call = (unsigned long)func; + /* Do a safe modify in case the trampoline is executing */ new = ftrace_call_replace(ip, (unsigned long)func); ret = update_ftrace_func(ip, new); @@ -960,6 +1124,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func) { unsigned char *new;
+ ftrace_update_func_call = 0; new = ftrace_jmp_replace(ip, (unsigned long)func);
return update_ftrace_func(ip, new);
On Tue, 30 Apr 2019 17:53:34 -0400 Steven Rostedt rostedt@goodmis.org wrote:
- if (ftrace_location(ip)) {
/* A breakpoint at the beginning of the function was hit */
if (in_nmi()) {
/* NMIs have their own trampoline */
this_cpu_write(ftrace_bp_call_nmi_return, (void *)ip + MCOUNT_INSN_SIZE);
regs->ip = (unsigned long) ftrace_emulate_call_nmi;
return 1;
}
this_cpu_write(ftrace_bp_call_return, (void *)ip + MCOUNT_INSN_SIZE);
if (regs->flags & X86_EFLAGS_IF) {
regs->flags &= ~X86_EFLAGS_IF;
regs->ip = (unsigned long) ftrace_emulate_call_irqoff;
/* Tell lockdep here we are enabling interrupts */
trace_hardirqs_on();
This isn't good enough. The return from interrupt does call lockdep saying interrupts are disabled. Need to add the lockdep tracking in the asm as well.
Probably easier to move it from inline asm to ftrace_X.S and use the lockdep TRACE_ON/OFF macros.
-- Steve
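[Editorial aside: to make the suggestion concrete, a rough sketch of what one trampoline might look like if moved into a .S file, assuming the TRACE_IRQS_ON macro from <asm/irqflags.h> and PER_CPU_VAR from <asm/percpu.h> can be used here the way the entry code uses them. The file name is made up and this is not code from the thread; as a bonus, PER_CPU_VAR would also hide the %gs/%fs/UP difference Linus pointed out earlier.]

/* Hypothetical arch/x86/kernel/ftrace_trampolines.S fragment -- a sketch only */
#include <linux/linkage.h>
#include <asm/percpu.h>
#include <asm/irqflags.h>

ENTRY(ftrace_emulate_call_irqoff)
	push	PER_CPU_VAR(ftrace_bp_call_return)	/* emulated call's return address */
	push	ftrace_caller_func			/* target, consumed by ret */
	TRACE_IRQS_ON					/* tell lockdep before really enabling */
	sti
	ret
ENDPROC(ftrace_emulate_call_irqoff)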
} else {
regs->ip = (unsigned long) ftrace_emulate_call_irqon;
}
return 1;
- } else if (is_ftrace_caller(ip)) {
/* An ftrace trampoline is being updated */
if (!ftrace_update_func_call) {
/* If it's a jump, just need to skip it */
regs->ip += MCOUNT_INSN_SIZE -1;
return 1;
}
if (in_nmi()) {
/* NMIs have their own trampoline */
this_cpu_write(ftrace_bp_call_nmi_return, (void *)ip + MCOUNT_INSN_SIZE);
regs->ip = (unsigned long) ftrace_emulate_call_update_nmi;
return 1;
}
this_cpu_write(ftrace_bp_call_return, (void *)ip + MCOUNT_INSN_SIZE);
if (regs->flags & X86_EFLAGS_IF) {
regs->flags &= ~X86_EFLAGS_IF;
regs->ip = (unsigned long) ftrace_emulate_call_update_irqoff;
trace_hardirqs_on();
} else {
regs->ip = (unsigned long) ftrace_emulate_call_update_irqon;
}
return 1;
- }
- return 1;
- return 0;
}
On Tue, Apr 30, 2019 at 6:35 PM Steven Rostedt rostedt@goodmis.org wrote:
Probably easier to move it from inline asm to ftrace_X.S and use the lockdep TRACE_ON/OFF macros.
Yeah, that should clean up the percpu stuff too since we have helper macros for it for *.S files anyway.
I only did the asm() in C because it made the "look, something like this" patch simpler to test (and it made it easy to check the generated asm file). Not because it was a good idea ;)
Linus
Hi Steve,
many thanks for moving this forward!
Steven Rostedt rostedt@goodmis.org writes:
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c index ef49517f6bb2..9160f5cc3b6d 100644 --- a/arch/x86/kernel/ftrace.c +++ b/arch/x86/kernel/ftrace.c @@ -17,6 +17,7 @@ #include <linux/uaccess.h> #include <linux/ftrace.h> #include <linux/percpu.h> +#include <linux/frame.h> #include <linux/sched.h> #include <linux/slab.h> #include <linux/init.h> @@ -232,6 +233,9 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr, static unsigned long ftrace_update_func; +/* Used within inline asm below */ +unsigned long ftrace_update_func_call;
static int update_ftrace_func(unsigned long ip, void *new) { unsigned char old[MCOUNT_INSN_SIZE]; @@ -259,6 +263,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func) unsigned char *new; int ret;
- ftrace_update_func_call = (unsigned long)func;
- new = ftrace_call_replace(ip, (unsigned long)func); ret = update_ftrace_func(ip, new);
@@ -280,6 +286,125 @@ static nokprobe_inline int is_ftrace_caller(unsigned long ip) return 0; } +/*
- We need to handle the "call func1" -> "call func2" case.
- Just skipping the call is not sufficient as it will be like
- changing to "nop" first and then updating the call. But some
- users of ftrace require calls never to be missed.
- To emulate the call while converting the call site with a breakpoint,
- some trampolines are used along with per CPU buffers.
- There are three trampolines for the call sites and three trampolines
- for the updating of the call in ftrace trampoline. The three
- trampolines are:
- Interrupts are enabled when the breakpoint is hit
- Interrupts are disabled when the breakpoint is hit
- The breakpoint was hit in an NMI
- As per CPU data is used, interrupts must be disabled to prevent them
- from corrupting the data. A separate NMI trampoline is used for the
- NMI case. If interrupts are already disabled, then the return path
- of where the breakpoint was hit (saved in the per CPU data) is pushed
- on the stack and then a jump to either the ftrace_caller (which will
- loop through all registered ftrace_ops handlers depending on the ip
- address), or if its a ftrace trampoline call update, it will call
- ftrace_update_func_call which will hold the call that should be
- called.
- */
+extern asmlinkage void ftrace_emulate_call_irqon(void); +extern asmlinkage void ftrace_emulate_call_irqoff(void); +extern asmlinkage void ftrace_emulate_call_nmi(void); +extern asmlinkage void ftrace_emulate_call_update_irqoff(void); +extern asmlinkage void ftrace_emulate_call_update_irqon(void); +extern asmlinkage void ftrace_emulate_call_update_nmi(void);
+static DEFINE_PER_CPU(void *, ftrace_bp_call_return); +static DEFINE_PER_CPU(void *, ftrace_bp_call_nmi_return);
Andy mentioned #DB and #MC exceptions here: https://lkml.kernel.org/r/C55DED25-C60D-4731-9A6B-92BDA8771766@amacapital.ne...
I think that #DB won't be possible, provided the trampolines below get tagged as NOKPROBE (do_int3() and ftrace_int3_handler() already have it).
It's highly theoretic, but tracing do_machine_check() could clobber ftrace_bp_call_return or ftrace_bp_call_nmi_return?
+#ifdef CONFIG_SMP +#ifdef CONFIG_X86_64 +# define BP_CALL_RETURN "%gs:ftrace_bp_call_return" +# define BP_CALL_NMI_RETURN "%gs:ftrace_bp_call_nmi_return" +#else +# define BP_CALL_RETURN "%fs:ftrace_bp_call_return" +# define BP_CALL_NMI_RETURN "%fs:ftrace_bp_call_nmi_return" +#endif +#else /* SMP */ +# define BP_CALL_RETURN "ftrace_bp_call_return" +# define BP_CALL_NMI_RETURN "ftrace_bp_call_nmi_return" +#endif
+/* To hold the ftrace_caller address to push on the stack */ +void *ftrace_caller_func = (void *)ftrace_caller;
The live patching ftrace_ops need ftrace_regs_caller.
+asm(
- ".text\n"
- /* Trampoline for function update with interrupts enabled */
- ".global ftrace_emulate_call_irqoff\n"
- ".type ftrace_emulate_call_irqoff, @function\n"
- "ftrace_emulate_call_irqoff:\n\t"
"push "BP_CALL_RETURN"\n\t"
"push ftrace_caller_func\n"
"sti\n\t"
"ret\n\t"
- ".size ftrace_emulate_call_irqoff, .-ftrace_emulate_call_irqoff\n"
- /* Trampoline for function update with interrupts disabled*/
- ".global ftrace_emulate_call_irqon\n"
The naming is perhaps a bit confusing, i.e. "update with interrupts disabled" vs. "irqon"... How about swapping irqoff<->irqon?
Thanks,
Nicolai
- ".type ftrace_emulate_call_irqon, @function\n"
- "ftrace_emulate_call_irqon:\n\t"
"push "BP_CALL_RETURN"\n\t"
"push ftrace_caller_func\n"
"ret\n\t"
- ".size ftrace_emulate_call_irqon, .-ftrace_emulate_call_irqon\n"
- /* Trampoline for function update in an NMI */
- ".global ftrace_emulate_call_nmi\n"
- ".type ftrace_emulate_call_nmi, @function\n"
- "ftrace_emulate_call_nmi:\n\t"
"push "BP_CALL_NMI_RETURN"\n\t"
"push ftrace_caller_func\n"
"ret\n\t"
- ".size ftrace_emulate_call_nmi, .-ftrace_emulate_call_nmi\n"
On Wed, 01 May 2019 10:26:32 +0200 Nicolai Stange nstange@suse.de wrote:
+extern asmlinkage void ftrace_emulate_call_irqon(void); +extern asmlinkage void ftrace_emulate_call_irqoff(void); +extern asmlinkage void ftrace_emulate_call_nmi(void); +extern asmlinkage void ftrace_emulate_call_update_irqoff(void); +extern asmlinkage void ftrace_emulate_call_update_irqon(void); +extern asmlinkage void ftrace_emulate_call_update_nmi(void);
+static DEFINE_PER_CPU(void *, ftrace_bp_call_return); +static DEFINE_PER_CPU(void *, ftrace_bp_call_nmi_return);
Andy mentioned #DB and #MC exceptions here: https://lkml.kernel.org/r/C55DED25-C60D-4731-9A6B-92BDA8771766@amacapital.ne...
I think that #DB won't be possible, provided the trampolines below get tagged as NOKPROBE (do_int3() and ftrace_int3_handler() already have it).
It's highly theoretic, but tracing do_machine_check() could clobber ftrace_bp_call_return or ftrace_bp_call_nmi_return?
Probably shouldn't trace do_machine_check() then ;-)
+#ifdef CONFIG_SMP +#ifdef CONFIG_X86_64 +# define BP_CALL_RETURN "%gs:ftrace_bp_call_return" +# define BP_CALL_NMI_RETURN "%gs:ftrace_bp_call_nmi_return" +#else +# define BP_CALL_RETURN "%fs:ftrace_bp_call_return" +# define BP_CALL_NMI_RETURN "%fs:ftrace_bp_call_nmi_return" +#endif +#else /* SMP */ +# define BP_CALL_RETURN "ftrace_bp_call_return" +# define BP_CALL_NMI_RETURN "ftrace_bp_call_nmi_return" +#endif
+/* To hold the ftrace_caller address to push on the stack */ +void *ftrace_caller_func = (void *)ftrace_caller;
The live patching ftrace_ops need ftrace_regs_caller.
Ah, you're right. Luckily ftrace_regs_caller is a superset of ftrace_caller. That is, those only needing ftrace_caller can do fine with ftrace_regs_caller (but not vice versa).
Easy enough to fix.
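Presumably that fix amounts to pointing the pushed target at the superset caller, something like (a sketch):

/* Address the trampolines push as the emulated call target; use the
 * register-saving variant so live patching ftrace_ops keep working. */
void *ftrace_caller_func = (void *)ftrace_regs_caller;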
+asm(
- ".text\n"
- /* Trampoline for function update with interrupts enabled */
- ".global ftrace_emulate_call_irqoff\n"
- ".type ftrace_emulate_call_irqoff, @function\n"
- "ftrace_emulate_call_irqoff:\n\t"
"push "BP_CALL_RETURN"\n\t"
"push ftrace_caller_func\n"
"sti\n\t"
"ret\n\t"
- ".size ftrace_emulate_call_irqoff, .-ftrace_emulate_call_irqoff\n"
- /* Trampoline for function update with interrupts disabled*/
- ".global ftrace_emulate_call_irqon\n"
The naming is perhaps a bit confusing, i.e. "update with interrupts disabled" vs. "irqon"... How about swapping irqoff<->irqon?
I just used the terminology Linus used. It is confusing. Perhaps just call it ftrace_emulate_call (for non sti case) and ftrace_emulate_call_sti for the sti case. That should remove the confusion.
-- Steve