From: Peter Zijlstra peterz@infradead.org
In order to allow breakpoints to emulate call instructions, they need to push the return address onto the stack. But because the breakpoint exception frame is added to the stack when the breakpoint is hit, there's no room to add the address onto the stack and return to the address of the emulated called function.
To handle this, copy the exception frame on entry to the breakpoint handler and leave a gap that can be used to add a return address to the stack frame and return from the breakpoint to the emulated called function, allowing that called function to return back to the location after the breakpoint was placed.
The following helper functions were also added:

  int3_emulate_push(): push an address into the gap in the stack
  int3_emulate_jmp():  change regs->ip to return to the given location
  int3_emulate_call(): push the return address and change regs->ip
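For illustration only (not part of the patch): a minimal sketch of how an int3 handler might use these helpers to redirect a trapped instruction into a call. The names my_int3_handler() and my_target() are hypothetical; only the int3_emulate_*() helpers come from the patch below.

#include <asm/ptrace.h>
#include <asm/text-patching.h>

static void my_target(void)
{
	/* where the emulated CALL should land */
}

static int my_int3_handler(struct pt_regs *regs)
{
	/*
	 * regs->ip points just past the 1-byte INT3.  int3_emulate_call()
	 * pushes the address after the 5-byte CALL being emulated into the
	 * gap left by the entry code, then points regs->ip at the target.
	 */
	int3_emulate_call(regs, (unsigned long)my_target);
	return 1;	/* handled */
}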
Cc: Andy Lutomirski luto@kernel.org
Cc: Nicolai Stange nstange@suse.de
Cc: Thomas Gleixner tglx@linutronix.de
Cc: Ingo Molnar mingo@redhat.com
Cc: Borislav Petkov bp@alien8.de
Cc: "H. Peter Anvin" hpa@zytor.com
Cc: the arch/x86 maintainers x86@kernel.org
Cc: Josh Poimboeuf jpoimboe@redhat.com
Cc: Jiri Kosina jikos@kernel.org
Cc: Miroslav Benes mbenes@suse.cz
Cc: Petr Mladek pmladek@suse.com
Cc: Joe Lawrence joe.lawrence@redhat.com
Cc: Shuah Khan shuah@kernel.org
Cc: Konrad Rzeszutek Wilk konrad.wilk@oracle.com
Cc: Tim Chen tim.c.chen@linux.intel.com
Cc: Sebastian Andrzej Siewior bigeasy@linutronix.de
Cc: Mimi Zohar zohar@linux.ibm.com
Cc: Juergen Gross jgross@suse.com
Cc: Nick Desaulniers ndesaulniers@google.com
Cc: Nayna Jain nayna@linux.ibm.com
Cc: Masahiro Yamada yamada.masahiro@socionext.com
Cc: Joerg Roedel jroedel@suse.de
Cc: "open list:KERNEL SELFTEST FRAMEWORK" linux-kselftest@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Signed-off-by: *** Need Peter Zijlstra's SoB here! ***
Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org
---
 arch/x86/entry/entry_32.S            | 11 +++++++++++
 arch/x86/entry/entry_64.S            | 14 ++++++++++++--
 arch/x86/include/asm/text-patching.h | 20 ++++++++++++++++++++
 3 files changed, 43 insertions(+), 2 deletions(-)
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index d309f30cf7af..50bbf4035baf 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1478,6 +1478,17 @@ ENTRY(int3)
 	ASM_CLAC
 	pushl	$-1				# mark this as an int
 
+#ifdef CONFIG_VM86
+	testl	$X86_EFLAGS_VM, PT_EFLAGS(%esp)
+	jnz	.Lfrom_usermode_no_gap
+#endif
+	testl	$SEGMENT_RPL_MASK, PT_CS(%esp)
+	jnz	.Lfrom_usermode_no_gap
+	.rept	6
+	pushl	5*4(%esp)
+	.endr
+.Lfrom_usermode_no_gap:
+
 	SAVE_ALL switch_stacks=1
 	ENCODE_FRAME_POINTER
 	TRACE_IRQS_OFF
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..834ec1397dab 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -879,7 +879,7 @@ apicinterrupt IRQ_WORK_VECTOR		irq_work_interrupt	smp_irq_work_interrupt
  * @paranoid == 2 is special: the stub will never switch stacks.  This is for
  * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
  */
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
+.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 create_gap=0
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
 
@@ -899,6 +899,16 @@ ENTRY(\sym)
 	jnz	.Lfrom_usermode_switch_stack_\@
 	.endif
 
+	.if \create_gap == 1
+	testb	$3, CS-ORIG_RAX(%rsp)
+	jnz	.Lfrom_usermode_no_gap_\@
+	.rept	6
+	pushq	5*8(%rsp)
+	.endr
+	UNWIND_HINT_IRET_REGS offset=8
+.Lfrom_usermode_no_gap_\@:
+	.endif
+
 	.if \paranoid
 	call	paranoid_entry
 	.else
@@ -1130,7 +1140,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \
 #endif /* CONFIG_HYPERV */
 
 idtentry debug			do_debug		has_error_code=0	paranoid=1 shift_ist=DEBUG_STACK
-idtentry int3			do_int3			has_error_code=0
+idtentry int3			do_int3			has_error_code=0	create_gap=1
 idtentry stack_segment		do_stack_segment	has_error_code=1
 
 #ifdef CONFIG_XEN_PV
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..ba275b6292db 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -39,4 +39,24 @@ extern int poke_int3_handler(struct pt_regs *regs);
 extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
 extern int after_bootmem;
 
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
+{
+	regs->sp -= sizeof(unsigned long);
+	*(unsigned long *)regs->sp = val;
+}
+
+static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
+{
+	regs->ip = ip;
+}
+
+#define INT3_INSN_SIZE 1
+#define CALL_INSN_SIZE 5
+
+static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func)
+{
+	int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
+	int3_emulate_jmp(regs, func);
+}
+
 #endif /* _ASM_X86_TEXT_PATCHING_H */
On Wed, 01 May 2019 16:28:31 -0400 Steven Rostedt rostedt@goodmis.org wrote:
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index d309f30cf7af..50bbf4035baf 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1478,6 +1478,17 @@ ENTRY(int3)
 	ASM_CLAC
 	pushl	$-1				# mark this as an int
 
+#ifdef CONFIG_VM86
+	testl	$X86_EFLAGS_VM, PT_EFLAGS(%esp)
+	jnz	.Lfrom_usermode_no_gap
+#endif
+	testl	$SEGMENT_RPL_MASK, PT_CS(%esp)
+	jnz	.Lfrom_usermode_no_gap
+	.rept	6
+	pushl	5*4(%esp)
+	.endr
+.Lfrom_usermode_no_gap:
+
 	SAVE_ALL switch_stacks=1
 	ENCODE_FRAME_POINTER
 	TRACE_IRQS_OFF
This failed to work on 32 bit at all (crashed and burned badly - triple fault!). Looking at it I found one issue: this code runs before the regs are saved, but PT_EFLAGS(%esp) and PT_CS(%esp) expect %esp to already contain them. I applied the patch below on top of this, but it didn't totally fix the problems. It still constantly crashes (although with this update I can put in some printks to get some ideas). I haven't spent too much time on it, but it looks like there's an issue with the entry stack that int3 switches to. I'm not sure it's handling the copy well.
-- Steve
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 50bbf4035baf..4f427285e421 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1479,10 +1479,10 @@ ENTRY(int3)
 	pushl	$-1				# mark this as an int
 
 #ifdef CONFIG_VM86
-	testl	$X86_EFLAGS_VM, PT_EFLAGS(%esp)
+	testl	$X86_EFLAGS_VM, PT_EFLAGS-PT_ORIG_EAX(%esp)
 	jnz	.Lfrom_usermode_no_gap
 #endif
-	testl	$SEGMENT_RPL_MASK, PT_CS(%esp)
+	testl	$SEGMENT_RPL_MASK, PT_CS-PT_ORIG_EAX(%esp)
 	jnz	.Lfrom_usermode_no_gap
 	.rept	6
 	pushl	5*4(%esp)
On Wed, May 01, 2019 at 11:24:12PM -0400, Steven Rostedt wrote:
On Wed, 01 May 2019 16:28:31 -0400 Steven Rostedt rostedt@goodmis.org wrote:
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index d309f30cf7af..50bbf4035baf 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1478,6 +1478,17 @@ ENTRY(int3)
 	ASM_CLAC
 	pushl	$-1				# mark this as an int
 
+#ifdef CONFIG_VM86
+	testl	$X86_EFLAGS_VM, PT_EFLAGS(%esp)
+	jnz	.Lfrom_usermode_no_gap
+#endif
+	testl	$SEGMENT_RPL_MASK, PT_CS(%esp)
+	jnz	.Lfrom_usermode_no_gap
+	.rept	6
+	pushl	5*4(%esp)
+	.endr
+.Lfrom_usermode_no_gap:
+
 	SAVE_ALL switch_stacks=1
 	ENCODE_FRAME_POINTER
 	TRACE_IRQS_OFF
This failed to work on 32 bit at all (crashed and burned badly - triple fault!).
Indeed so; find a working version below (albeit with a lot of debug garbage still in).
It also includes the self-test code that Andy wanted -- it's what I used to debug this mess.
Much thanks to Joerg Roedel for talking entry_32.S with me.
TL;DR: on x86_32, kernel->kernel IRET frames are only 3 entries and do not include ESP/SS, so not only was regs->sp never set up, but changing it wouldn't take effect and would corrupt random stack state.
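To make that concrete, here is a small sketch (my own, not from the thread) of what "regs->sp was never set up" means for pre-patch 32-bit kernel traps; it is essentially what kernel_stack_pointer() does for the kernel case:

#include <asm/ptrace.h>

/*
 * Where was the interrupted context's stack pointer?
 *
 * x86_64: the hardware always pushes SS:RSP, so regs->sp is real and a
 * virtual push can simply decrement it.
 *
 * x86_32 (kernel->kernel, pre-patch): the hardware frame stops at EFLAGS;
 * the old stack just continues above it, i.e. at the address where the
 * regs->sp slot would live.  Writing to regs->sp therefore changes nothing
 * that the later IRET looks at -- it only scribbles on the interrupted stack.
 */
static unsigned long interrupted_kernel_sp(struct pt_regs *regs)
{
#ifdef CONFIG_X86_64
	return regs->sp;
#else
	return (unsigned long)&regs->sp;
#endif
}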
---
 arch/x86/entry/entry_32.S            |  87 +++++++++++++++++++++++---
 arch/x86/entry/entry_64.S            |  14 ++++-
 arch/x86/include/asm/text-patching.h |  20 ++++++
 arch/x86/kernel/alternative.c        | 116 +++++++++++++++++++++++++++++++++--
 arch/x86/kernel/traps.c              |   1 +
 5 files changed, 225 insertions(+), 13 deletions(-)
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index 7b23431be5cb..01c5bdbe5f39 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -203,7 +203,7 @@ .Lend_@: .endm
-.macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0 +.macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0 clear_cs=1 cld PUSH_GS pushl %fs @@ -225,7 +225,7 @@
/* Switch to kernel stack if necessary */ .if \switch_stacks > 0 - SWITCH_TO_KERNEL_STACK + SWITCH_TO_KERNEL_STACK \clear_cs .endif
.endm @@ -377,8 +377,9 @@
#define CS_FROM_ENTRY_STACK (1 << 31) #define CS_FROM_USER_CR3 (1 << 30) +#define CS_FROM_INT3 (1 << 29)
-.macro SWITCH_TO_KERNEL_STACK +.macro SWITCH_TO_KERNEL_STACK clear_cs=1
ALTERNATIVE "", "jmp .Lend_@", X86_FEATURE_XENPV
@@ -391,12 +392,13 @@ * that register for the time this macro runs */
+ .if \clear_cs /* - * The high bits of the CS dword (__csh) are used for - * CS_FROM_ENTRY_STACK and CS_FROM_USER_CR3. Clear them in case - * hardware didn't do this for us. + * The high bits of the CS dword (__csh) are used for CS_FROM_*. Clear + * them in case hardware didn't do this for us. */ andl $(0x0000ffff), PT_CS(%esp) + .endif
/* Are we on the entry stack? Bail out if not! */ movl PER_CPU_VAR(cpu_entry_area), %ecx @@ -1019,6 +1021,29 @@ ENTRY(entry_INT80_32) /* Restore user state */ RESTORE_REGS pop=4 # skip orig_eax/error_code .Lirq_return: + testl $CS_FROM_INT3, 4(%esp) + jz .Lno_iret_fixup + + /* + * Undo the magic from ENTRY(int3), in particular consider the case + * where regs->sp has been modified. + */ + + pushl %eax + movl %esp, %eax + + movl 4*4(%eax), %esp # restore (modified) regs->sp + + /* rebuild IRET frame */ + pushl 3*4(%eax) # flags + pushl 2*4(%eax) # cs + pushl 1*4(%eax) # ip + + andl $0x0000ffff, 4(%esp) # clear high CS bits + + movl (%eax), %eax # restore eax + +.Lno_iret_fixup: /* * ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization * when returning from IPI handler and when returning from @@ -1477,9 +1502,57 @@ END(nmi)
ENTRY(int3) ASM_CLAC + + /* + * The high bits of the CS dword (__csh) are used for CS_FROM_*. Clear + * them in case hardware didn't do this for us. + */ + andl $0x0000ffff, 4(%esp) + +#ifdef CONFIG_VM86 + testl $X86_EFLAGS_VM, 8(%esp) + jnz .Lfrom_usermode_no_gap +#endif + testl $SEGMENT_RPL_MASK, 4(%esp) + jnz .Lfrom_usermode_no_gap + + /* + * Here from kernel mode; so the (exception) stack looks like: + * + * 12(esp) - <previous context> + * 8(esp) - flags + * 4(esp) - cs + * 0(esp) - ip + * + * Lets build a 5 entry IRET frame after that, such that struct pt_regs + * is complete and in particular regs->sp is correct. This gives us + * the original 3 enties as gap: + * + * 32(esp) - <previous context> + * 28(esp) - orig_flags / gap + * 24(esp) - orig_cs / gap + * 20(esp) - orig_ip / gap + * 16(esp) - ss + * 12(esp) - sp + * 8(esp) - flags + * 4(esp) - cs + * 0(esp) - ip + */ + pushl %ss # ss + pushl %esp # sp (points at ss) + pushl 4*4(%esp) # flags + pushl 4*4(%esp) # cs + pushl 4*4(%esp) # ip + + add $16, 12(%esp) # point sp back at the previous context + + orl $CS_FROM_INT3, 4(%esp) # mark magic IRET + +.Lfrom_usermode_no_gap: + pushl $-1 # mark this as an int
- SAVE_ALL switch_stacks=1 + SAVE_ALL switch_stacks=1 clear_cs=0 ENCODE_FRAME_POINTER TRACE_IRQS_OFF xorl %edx, %edx # zero error code diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 20e45d9b4e15..268cd9affe04 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -878,7 +878,7 @@ apicinterrupt IRQ_WORK_VECTOR irq_work_interrupt smp_irq_work_interrupt * @paranoid == 2 is special: the stub will never switch stacks. This is for * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS. */ -.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0 +.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0 create_gap=0 ENTRY(\sym) UNWIND_HINT_IRET_REGS offset=\has_error_code*8
@@ -898,6 +898,16 @@ ENTRY(\sym) jnz .Lfrom_usermode_switch_stack_@ .endif
+ .if \create_gap == 1 + testb $3, CS-ORIG_RAX(%rsp) + jnz .Lfrom_usermode_no_gap_@ + .rept 6 + pushq 5*8(%rsp) + .endr + UNWIND_HINT_IRET_REGS offset=8 +.Lfrom_usermode_no_gap_@: + .endif + .if \paranoid call paranoid_entry .else @@ -1129,7 +1139,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \ #endif /* CONFIG_HYPERV */
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=IST_INDEX_DB ist_offset=DB_STACK_OFFSET -idtentry int3 do_int3 has_error_code=0 +idtentry int3 do_int3 has_error_code=0 create_gap=1 idtentry stack_segment do_stack_segment has_error_code=1
#ifdef CONFIG_XEN_PV diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h index c90678fd391a..6aac6abf931e 100644 --- a/arch/x86/include/asm/text-patching.h +++ b/arch/x86/include/asm/text-patching.h @@ -42,4 +42,24 @@ extern int after_bootmem; extern __ro_after_init struct mm_struct *poking_mm; extern __ro_after_init unsigned long poking_addr;
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val) +{ + regs->sp -= sizeof(unsigned long); + *(unsigned long *)regs->sp = val; +} + +static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip) +{ + regs->ip = ip; +} + +#define INT3_INSN_SIZE 1 +#define CALL_INSN_SIZE 5 + +static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func) +{ + int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE); + int3_emulate_jmp(regs, func); +} + #endif /* _ASM_X86_TEXT_PATCHING_H */ diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 4db9c0d29bc1..1e11076c3a2b 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -613,11 +613,118 @@ extern struct paravirt_patch_site __start_parainstructions[], __stop_parainstructions[]; #endif /* CONFIG_PARAVIRT */
+static __always_inline void print_stack(struct pt_regs *regs) +{ +#if 1 + unsigned long *end = (unsigned long *)current_stack_pointer; + unsigned long *frame = (unsigned long *)__builtin_frame_address(0); + unsigned long *stack = (unsigned long *)(current_stack_pointer & ~(THREAD_SIZE - 1)); + int i, j; + + stack += THREAD_SIZE / sizeof(unsigned long); + + printk("stack dump from: %lx\n", stack); + + for (i=0; ; i++) { + pr_info("stack[%03d]: ", 16*i); + for (j=0; j<16; j++) { + if (i==0 && j==0) { + pr_cont(" %08lx ", 0UL); + stack--; + continue; + } + if (stack == end) + pr_cont(">%08lx< ", *(stack--)); + else if (stack == frame) + pr_cont("*%08lx* ", *(stack--)); + else if (stack == regs) + pr_cont("r%08lxr ", *(stack--)); + else if (regs && stack == regs->sp) + pr_cont("s%08lxs ", *(stack--)); + else + pr_cont(" %08lx ", *(stack--)); + } + pr_cont("\n"); + + if (stack < end) + break; + } +#endif +} + +static void __init int3_magic(unsigned int *ptr) +{ + printk("*************** %lx\n", (unsigned long)ptr); + print_stack(NULL); + *ptr = 1; +} + +static __initdata unsigned long int3_ip; + +static int __init int3_exception_notify(struct notifier_block *self, unsigned long val, void *data) +{ + struct die_args *args = data; + struct pt_regs *regs = args->regs; + + if (!regs || user_mode(regs)) + return NOTIFY_DONE; + + if (val != DIE_INT3) + return NOTIFY_DONE; + + printk("XXXXXXXXXXXXXXXXXXXXXXXXXX %lx %lx\n", regs->ip, int3_ip); + if (regs->ip - INT3_INSN_SIZE != int3_ip) + return NOTIFY_DONE; + + print_stack(regs); + int3_emulate_call(regs, (unsigned long)&int3_magic); + print_stack(regs); + + return NOTIFY_STOP; +} + +static void __init int3_selftest(void) +{ + static __initdata struct notifier_block int3_exception_nb = { + .notifier_call = int3_exception_notify, + .priority = INT_MAX-1, /* last */ + }; + unsigned int val = 0; + + BUG_ON(register_die_notifier(&int3_exception_nb)); + + printk("+++++++++++++++++++ %lx %lx\n", (unsigned long)&val, (unsigned long)&int3_ip); + + print_stack(NULL); + + /* + * Basically: int3_magic(&val); but really complicated :-) + * + * Stick the address of the INT3 instruction into int3_ip, then trigger + * the INT3, padded with NOPs to match a CALL instruction length. + */ +#ifdef CONFIG_X86_32 + asm volatile ("call 1f; 1: pop (%%edx); add $5, (%%edx);" + "int3; nop; nop; nop; nop" : : "d" (&int3_ip), "a" (&val) : "memory"); +#else /* CONFIG_X86_64 */ + asm volatile ("call 1f; 1: pop (%%rdx); add $5, (%%rdx);" + "int3; nop; nop; nop; nop" : : "d" (&int3_ip), "D" (&val) : "memory"); +#endif + + BUG_ON(val != 1); + + unregister_die_notifier(&int3_exception_nb); +} + void __init alternative_instructions(void) { - /* The patching is not fully atomic, so try to avoid local interruptions - that might execute the to be patched code. - Other CPUs are not running. */ + int3_selftest(); + + /* + * The patching is not fully atomic, so try to avoid local + * interruptions that might execute the to be patched code. + * Other CPUs are not running. + */ stop_nmi();
/* @@ -642,10 +749,11 @@ void __init alternative_instructions(void) _text, _etext); }
- if (!uniproc_patched || num_possible_cpus() == 1) + if (!uniproc_patched || num_possible_cpus() == 1) { free_init_pages("SMP alternatives", (unsigned long)__smp_locks, (unsigned long)__smp_locks_end); + } #endif
apply_paravirt(__parainstructions, __parainstructions_end); diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 8b6d03e55d2f..e072cdd07284 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -572,6 +572,7 @@ NOKPROBE_SYMBOL(do_general_protection);
dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code) { + printk("int3 frame: %lx\n", __builtin_frame_address(0)); #ifdef CONFIG_DYNAMIC_FTRACE /* * ftrace must be first, everything else may cause a recursive crash.
On Thu, May 02, 2019 at 06:21:33PM +0200, Peter Zijlstra wrote:
Much thanks to Joerg Roedel for talking entry_32.S with me.
TL;DR: on x86_32, kernel->kernel IRET frames are only 3 entries and do not include ESP/SS, so not only was regs->sp never set up, but changing it wouldn't take effect and would corrupt random stack state.
Also, i386 is bloody insane...
On Thu, May 2, 2019 at 9:21 AM Peter Zijlstra peterz@infradead.org wrote:
TL;DR: on x86_32, kernel->kernel IRET frames are only 3 entries and do not include ESP/SS, so not only was regs->sp never set up, but changing it wouldn't take effect and would corrupt random stack state.
Indeed, the 32-bit case for same-RPL exceptions/iret is entirely different, and I'd forgotten about that.
And honestly, this makes the 32-bit case much worse. Now the entry stack modifications of int3 suddenly affect not just the entry, but every exit too.
This is _exactly_ the kind of subtle kernel entry/exit code I wanted us to avoid.
And while your code looks kind of ok, it's subtly buggy. This sequence:
+	pushl	%eax
+	movl	%esp, %eax
+
+	movl	4*4(%eax), %esp		# restore (modified) regs->sp
+
+	/* rebuild IRET frame */
+	pushl	3*4(%eax)		# flags
+	pushl	2*4(%eax)		# cs
+	pushl	1*4(%eax)		# ip
+
+	andl	$0x0000ffff, 4(%esp)	# clear high CS bits
+
+	movl	(%eax), %eax		# restore eax
looks very wrong to me. When you do that "restore (modified) regs->sp", isn't that now resetting %esp to the point where %eax now points below the stack? So if we get an NMI in this sequence, that will overwrite the parts you are trying to copy from?
Am I missing something? doesn't it need to be done something like
	pushl %eax
	pushl %ecx
	movl 20(%esp),%eax	# possibly modified regs->sp
	movl 16(%esp),%ecx	# flags
	movl %ecx,-4(%eax)
	movl 12(%esp),%ecx	# cs
	movl %ecx,-8(%eax)
	movl 8(%esp),%ecx	# ip
	movl %ecx,-12(%eax)
	movl 4(%esp),%ecx	# eax
	movl %ecx,-16(%eax)
	popl %ecx
	lea -16(%eax),%esp
	popl %eax
(NOTE NOTE NOTE I might have gotten the offsets and the direction of the moves *completely* wrong, this is not a serious patch, it's meant as a "something like this" thing!!)
But now I confused myself, and maybe I'm wrong.
Linus
On Thu, May 02, 2019 at 11:02:40AM -0700, Linus Torvalds wrote:
On Thu, May 2, 2019 at 9:21 AM Peter Zijlstra peterz@infradead.org wrote:
TL;DR: on x86_32, kernel->kernel IRET frames are only 3 entries and do not include ESP/SS, so not only was regs->sp never set up, but changing it wouldn't take effect and would corrupt random stack state.
Indeed, the 32-bit case for same-RPL exceptions/iret is entirely different, and I'd forgotten about that.
And honestly, this makes the 32-bit case much worse. Now the entry stack modifications of int3 suddenly affect not just the entry, but every exit too.
We could fix this by not using the common exit path on int3; not sure we want to go there, but that is an option.
This is _exactly_ the kind of subtle kernel entry/exit code I wanted us to avoid.
And while your code looks kind of ok, it's subtly buggy. This sequence:
pushl %eax
movl %esp, %eax
movl 4*4(%eax), %esp # restore (modified) regs->sp
/* rebuild IRET frame */
pushl 3*4(%eax) # flags
pushl 2*4(%eax) # cs
pushl 1*4(%eax) # ip
andl $0x0000ffff, 4(%esp) # clear high CS bits
movl (%eax), %eax # restore eax
looks very wrong to me. When you do that "restore (modified) regs->sp", isn't that now resetting %esp to the point where %eax now points below the stack? So if we get an NMI in this sequence, that will overwrite the parts you are trying to copy from?
ARGH; I knew it was too pretty :/ Yes, something like what you suggest will be needed, I'll go look at that once my brain recovers a bit from staring at entry code all day.
On Thu, May 02, 2019 at 08:18:11PM +0200, Peter Zijlstra wrote:
ARGH; I knew it was too pretty :/ Yes, something like what you suggest will be needed, I'll go look at that once my brain recovers a bit from staring at entry code all day.
I forgot I can just run the thing, and it works!
--- diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index 7b23431be5cb..73b7bca8712f 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -203,7 +203,7 @@ .Lend_@: .endm
-.macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0 +.macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0 clear_csh=1 cld PUSH_GS pushl %fs @@ -225,7 +225,7 @@
/* Switch to kernel stack if necessary */ .if \switch_stacks > 0 - SWITCH_TO_KERNEL_STACK + SWITCH_TO_KERNEL_STACK \clear_csh .endif
.endm @@ -377,8 +377,9 @@
#define CS_FROM_ENTRY_STACK (1 << 31) #define CS_FROM_USER_CR3 (1 << 30) +#define CS_FROM_INT3 (1 << 29)
-.macro SWITCH_TO_KERNEL_STACK +.macro SWITCH_TO_KERNEL_STACK clear_csh=1
ALTERNATIVE "", "jmp .Lend_@", X86_FEATURE_XENPV
@@ -391,12 +392,13 @@ * that register for the time this macro runs */
+ .if \clear_csh /* - * The high bits of the CS dword (__csh) are used for - * CS_FROM_ENTRY_STACK and CS_FROM_USER_CR3. Clear them in case - * hardware didn't do this for us. + * The high bits of the CS dword (__csh) are used for CS_FROM_*. Clear + * them in case hardware didn't do this for us. */ andl $(0x0000ffff), PT_CS(%esp) + .endif
/* Are we on the entry stack? Bail out if not! */ movl PER_CPU_VAR(cpu_entry_area), %ecx @@ -1019,6 +1021,40 @@ ENTRY(entry_INT80_32) /* Restore user state */ RESTORE_REGS pop=4 # skip orig_eax/error_code .Lirq_return: + testl $CS_FROM_INT3, 4(%esp) + jz .Lno_iret_fixup + + /* + * Undo the magic from ENTRY(int3), in particular consider the case + * where regs->sp has been modified. + * + * Reconstruct the 3 entry IRET frame right after the (modified) + * regs->sp without lowering %esp in between, such that an NMI in the + * middle doesn't scribble our stack. + */ + + pushl %eax + pushl %ecx + movl 5*4(%esp), %eax # (modified) regs->sp + + movl 4*4(%esp), %ecx # flags + movl %ecx, -4(%eax) + + movl 3*4(%esp), %ecx # cs + andl $0x0000ffff, %ecx + movl %ecx, -8(%eax) + + movl 2*4(%esp), %ecx # ip + movl %ecx, -12(%eax) + + movl 1*4(%esp), %ecx # eax + movl %ecx, -16(%eax) + + popl %ecx + lea -16(%eax), %esp + popl %eax + +.Lno_iret_fixup: /* * ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization * when returning from IPI handler and when returning from @@ -1477,9 +1513,57 @@ END(nmi)
ENTRY(int3) ASM_CLAC + + /* + * The high bits of the CS dword (__csh) are used for CS_FROM_*. Clear + * them in case hardware didn't do this for us. + */ + andl $0x0000ffff, 4(%esp) + +#ifdef CONFIG_VM86 + testl $X86_EFLAGS_VM, 8(%esp) + jnz .Lfrom_usermode_no_gap +#endif + testl $SEGMENT_RPL_MASK, 4(%esp) + jnz .Lfrom_usermode_no_gap + + /* + * Here from kernel mode; so the (exception) stack looks like: + * + * 12(esp) - <previous context> + * 8(esp) - flags + * 4(esp) - cs + * 0(esp) - ip + * + * Lets build a 5 entry IRET frame after that, such that struct pt_regs + * is complete and in particular regs->sp is correct. This gives us + * the original 3 enties as gap: + * + * 32(esp) - <previous context> + * 28(esp) - orig_flags / gap + * 24(esp) - orig_cs / gap + * 20(esp) - orig_ip / gap + * 16(esp) - ss + * 12(esp) - sp + * 8(esp) - flags + * 4(esp) - cs + * 0(esp) - ip + */ + pushl %ss # ss + pushl %esp # sp (points at ss) + pushl 4*4(%esp) # flags + pushl 4*4(%esp) # cs + pushl 4*4(%esp) # ip + + add $16, 12(%esp) # point sp back at the previous context + + orl $CS_FROM_INT3, 4(%esp) # mark magic IRET + +.Lfrom_usermode_no_gap: + pushl $-1 # mark this as an int
- SAVE_ALL switch_stacks=1 + SAVE_ALL switch_stacks=1 clear_csh=0 ENCODE_FRAME_POINTER TRACE_IRQS_OFF xorl %edx, %edx # zero error code
On Thu, May 2, 2019 at 11:18 AM Peter Zijlstra peterz@infradead.org wrote:
We could fix this by not using the common exit path on int3; not sure we want to go there, but that is an option.
I don't think it's an option in general, because *some* int3 invocations will need all the usual error return.
But I guess we could make "int3 from kernel space" special.
I'm not sure how much that would help, but it might be worth looking into.
ARGH; I knew it was too pretty :/ Yes, something like what you suggest will be needed, I'll go look at that once my brain recovers a bit from staring at entry code all day.
Looks like it works based on your other email.
What would it look like with the "int3-from-kernel is special" modification?
Because *if* we can make the "kernel int3" entirely special, that would make the "Eww factor" much less of this whole thing.
I forget: is #BP _only_ for the "int3" instruction? I know we have really nasty cases with #DB (int1) because of "pending exceptions happen on the first instruction in kernel space", and that makes it really really nasty to handle with all the stack switch and %cr3 handling etc.
But if "int3 from kernel space" _only_ happens on actual "int3" instructions, then we really could just special-case that case. We'd know that %cr3 has been switched, we'd know that we don't need to do fsgs switching, we'd know we already have a good stack and percpu data etc set up.
So then special casing #BP would actually allow us to have a simple and straightforward kernel-int3-only sequence?
And then having that odd stack setup special case would be *much* more palatable to me.
Linus
On Thu, 2 May 2019, Linus Torvalds wrote:
I forget: is #BP _only_ for the "int3" instruction?
Hmm, according to 17.3.2 in vol 3 of SDM (and table 6-1 there), that indeed seems to be the case, so we should be fine.
But if "int3 from kernel space" _only_ happens on actual "int3" instructions, then we really could just special-case that case. We'd know that %cr3 has been switched, we'd know that we don't need to do fsgs switching, we'd know we already have a good stack and percpu data etc set up.
That should indeed be doable, under the assumption that no one is doing any int3 games before we've switched away from the entry trampoline.
I've briefly looked, and it seems we have proper notrace annotations for stackleak_erase(), which seems to be the only C (ftrace-able) code that's running on a trampoline (off-topic: where does the PTI code make sure that we actually map this symbol into user pagetables in case we're not doing global mapping?).
On May 2, 2019, at 12:28 PM, Jiri Kosina jikos@kernel.org wrote:
On Thu, 2 May 2019, Linus Torvalds wrote:
I forget: is #BP _only_ for the "int3" instruction?
Hmm, according to 17.3.2 in vol 3 of SDM (and table 6-1 there), that indeed seems to be the case, so we should be fine.
I’m reasonably confident that the absurd MOV SS; INT3 sequence results in #BP from user mode and then #DB inside that (as opposed to vice versa), so this should be okay.
On Thu, May 02, 2019 at 11:43:37AM -0700, Linus Torvalds wrote:
What would it look like with the "int3-from-kernel is special" modification?
Something like so; it boots; but I could've made some horrible mistake (again).
--- diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index 7b23431be5cb..4de51cff5b8a 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -67,9 +67,20 @@ # define preempt_stop(clobbers) DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF #else # define preempt_stop(clobbers) -# define resume_kernel restore_all_kernel #endif
+.macro RETINT_PREEMPT +#ifdef CONFIG_PREEMPT + DISABLE_INTERRUPTS(CLBR_ANY) + cmpl $0, PER_CPU_VAR(__preempt_count) + jnz .Lend_@ + testl $X86_EFLAGS_IF, PT_EFLAGS(%esp) # interrupts off (exception path) ? + jz .Lend_@ + call preempt_schedule_irq +.Lend_@: +#endif +.endm + .macro TRACE_IRQS_IRET #ifdef CONFIG_TRACE_IRQFLAGS testl $X86_EFLAGS_IF, PT_EFLAGS(%esp) # interrupts off? @@ -755,7 +766,7 @@ END(ret_from_fork) andl $SEGMENT_RPL_MASK, %eax #endif cmpl $USER_RPL, %eax - jb resume_kernel # not returning to v8086 or userspace + jb restore_all_kernel # not returning to v8086 or userspace
ENTRY(resume_userspace) DISABLE_INTERRUPTS(CLBR_ANY) @@ -765,18 +776,6 @@ ENTRY(resume_userspace) jmp restore_all END(ret_from_exception)
-#ifdef CONFIG_PREEMPT -ENTRY(resume_kernel) - DISABLE_INTERRUPTS(CLBR_ANY) - cmpl $0, PER_CPU_VAR(__preempt_count) - jnz restore_all_kernel - testl $X86_EFLAGS_IF, PT_EFLAGS(%esp) # interrupts off (exception path) ? - jz restore_all_kernel - call preempt_schedule_irq - jmp restore_all_kernel -END(resume_kernel) -#endif - GLOBAL(__begin_SYSENTER_singlestep_region) /* * All code from here through __end_SYSENTER_singlestep_region is subject @@ -1027,6 +1026,7 @@ ENTRY(entry_INT80_32) INTERRUPT_RETURN
restore_all_kernel: + RETINT_PREEMPT TRACE_IRQS_IRET PARANOID_EXIT_TO_KERNEL_MODE BUG_IF_WRONG_CR3 @@ -1477,6 +1477,94 @@ END(nmi)
ENTRY(int3) ASM_CLAC + +#ifdef CONFIG_VM86 + testl $X86_EFLAGS_VM, 8(%esp) + jnz .Lfrom_usermode_no_gap +#endif + testl $SEGMENT_RPL_MASK, 4(%esp) + jnz .Lfrom_usermode_no_gap + + /* + * Here from kernel mode; so the (exception) stack looks like: + * + * 12(esp) - <previous context> + * 8(esp) - flags + * 4(esp) - cs + * 0(esp) - ip + * + * Lets build a 5 entry IRET frame after that, such that struct pt_regs + * is complete and in particular regs->sp is correct. This gives us + * the original 3 enties as gap: + * + * 32(esp) - <previous context> + * 28(esp) - orig_flags / gap + * 24(esp) - orig_cs / gap + * 20(esp) - orig_ip / gap + * 16(esp) - ss + * 12(esp) - sp + * 8(esp) - flags + * 4(esp) - cs + * 0(esp) - ip + */ + pushl %ss # ss + pushl %esp # sp (points at ss) + pushl 4*4(%esp) # flags + pushl 4*4(%esp) # cs + pushl 4*4(%esp) # ip + + add $16, 12(%esp) # point sp back at the previous context + + pushl $-1 # orig_eax; mark as interrupt + + SAVE_ALL + ENCODE_FRAME_POINTER + TRACE_IRQS_OFF + xorl %edx, %edx # zero error code + movl %esp, %eax # pt_regs pointer + call do_int3 + + RETINT_PREEMPT + TRACE_IRQS_IRET + /* + * If we really never INT3 from entry code, it looks like + * we can skip this one. + PARANOID_EXIT_TO_KERNEL_MODE + */ + BUG_IF_WRONG_CR3 + RESTORE_REGS 4 # consume orig_eax + + /* + * Reconstruct the 3 entry IRET frame right after the (modified) + * regs->sp without lowering %esp in between, such that an NMI in the + * middle doesn't scribble our stack. + */ + + pushl %eax + pushl %ecx + movl 5*4(%esp), %eax # (modified) regs->sp + + movl 4*4(%esp), %ecx # flags + movl %ecx, -4(%eax) + + movl 3*4(%esp), %ecx # cs + andl $0x0000ffff, %ecx + movl %ecx, -8(%eax) + + movl 2*4(%esp), %ecx # ip + movl %ecx, -12(%eax) + + movl 1*4(%esp), %ecx # eax + movl %ecx, -16(%eax) + + popl %ecx + lea -16(%eax), %esp + popl %eax + + jmp .Lirq_return + +.Lfrom_usermode_no_gap: + pushl $-1 # mark this as an int
SAVE_ALL switch_stacks=1
On Thu, May 2, 2019 at 1:22 PM Peter Zijlstra peterz@infradead.org wrote:
Something like so; it boots; but I could've made some horrible mistake (again).
This actually looks much better to me.
Maybe it's more lines (I didn't check), but it's a lot simpler in that now the magic of the int3 stack doesn't get exposed to anything else.
We *could* also make this kernel-mode-only do_int3() be a special function, and do something like
	# args: pt_regs pointer (no error code for int3)
	movl %esp,%eax
	# allocate a bit of extra room on the stack, so that
	# 'kernel_int3' can move the pt_regs
	subl $8,%esp
	call kernel_int3
	movl %eax,%esp
and not do any stack switching magic in the asm code AT ALL. We'd do
struct pt_regs *kernel_int3(struct pt_regs *regs) { .. return regs; }
and now the rule for call emulation ends up being that you need to "memmove()" the pt_regs up and down properly, and return the new pt_regs pointer.
Hmm? That would simplify the asm code further, but some people might find it objectionable?
Linus
On Thu, May 02, 2019 at 01:49:29PM -0700, Linus Torvalds wrote:
We *could* also make this kernel-mode-only do_int3() be a special function, and do something like
I think I prefer the variant we have now. The int3_emulate_*() things work uniformly and as expected on 32 and 64 bit (it would even work for userspace if it weren't for SMAP).
So while the 32bit kernel entry is 'special' all the INT3 handlers can uniformly prod at pt_regs in a natural way and have it work.
Making it special -- just for 32bit, seems like the wrong thing to me.
On Thu, 2 May 2019 13:49:29 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
On Thu, May 2, 2019 at 1:22 PM Peter Zijlstra peterz@infradead.org wrote:
Something like so; it boots; but I could've made some horrible mistake (again).
This actually looks much better to me.
Maybe it's more lines (I didn't check), but it's a lot simpler in that now the magic of the int3 stack doesn't get exposed to anything else.
We *could* also make this kernel-mode-only do_int3() be a special function, and do something like
	# args: pt_regs pointer (no error code for int3)
	movl %esp,%eax
	# allocate a bit of extra room on the stack, so that
	# 'kernel_int3' can move the pt_regs
	subl $8,%esp
	call kernel_int3
	movl %eax,%esp
and not do any stack switching magic in the asm code AT ALL. We'd do
struct pt_regs *kernel_int3(struct pt_regs *regs) { .. return regs; }
and now you the rule for call emulation ends up being that you need to "memmove()" the ptregs up and down properly, and return the new pt_regs pointer.
Hmm? That would simplify the asm code further, but some people might find it objectionable?
The problem with this approach is that it would require doing the same for x86_64, as the int3 C code is the same for both. And that may be a bit more difficult on the x86_64 side because it's all done with a simple flag in the idtentry macro to add the gap.
-- Steve
On Fri, May 3, 2019 at 12:24 PM Steven Rostedt rostedt@goodmis.org wrote:
The problem with this approach is that it would require doing the same for x86_64, as the int3 C code is the same for both. And that may be a bit more difficult on the x86_64 side because it's all done with a simple flag in the idtentry macro to add the gap.
That argument is weakened by the fact that we have to do _other_ things differently on 32-bit and 64-bit anyway.
So we might as well have a "on 32-bit, the call emulation needs to move the pt_regs to make space" special case in the call emulation code. It's very easy to explain why.
And then we'd limit the special case to where it matters (with a big comment about what's going on), rather than adding random special case handling to random _other_ places.
Having to add a magic special case to kernel_stack_pointer() is certainly not obvious. Neither is adding magic special cases to system call exit paths etc.
This has been why I've been arguing against the entry code changes. Exactly because they tend to have these kind of odd cascading effects. The entry code is fragile not just because it's a complex hardware interface, but also because we know about those complex hardware interfaces in random other places.
I'd much rather have the code that does special things be in one place, and be the place that *needs* to do the special thing. If we copy the pt_regs around when we do the "call" emulation, it's *really* easy to explain *exactly* what we're doing and why in *exactly* that one context where we are doing it. And it won't affect anything else, and our existing code that looks at pt_regs will work both before and after.
Would it need a #ifdef CONFIG_X86_32 around it because it's not needed on x86-64? Sure. But that #ifdef would be right there, and the comment that explains why the pt_regs need to be moved would also make it very obvious why it is only needed for x86-32.
There's a lot of advantages to keeping your problems localized, instead of letting your random hacks escape and become problems for other, entirely unrelated, code.
Linus
On Fri, 3 May 2019 14:46:11 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
On Fri, May 3, 2019 at 12:24 PM Steven Rostedt rostedt@goodmis.org wrote:
The problem with this approach is that it would require doing the same for x86_64, as the int3 C code is the same for both. And that may be a bit more difficult on the x86_64 side because it's all done with a simple flag in the idtentry macro to add the gap.
That argument is weakened by the fact that we have to do _other_ things differently on 32-bit and 64-bit anyway.
So we might as well have a "on 32-bit, the call emulation needs to move the pt_regs to make space" special case in the call emulation code. It's very easy to explain why.
So if I understand correctly, what you are implying is to have the int3 code be different for 32-bit and 64-bit? This would require handling text_poke, ftrace and kprobes differently for the two. Or perhaps we can build hacks on top.
And then we'd limit the special case to where it matters (with a big comment about what's going on), rather than adding random special case handling to random _other_ places.
Having to add a magic special case to kernel_stack_pointer() is certainly not obvious. Neither is adding magic special cases to system call exit paths etc.
Honestly, this sounds more like an argument for doing the buffered space for all exceptions on 32 bit, because it removes one of the special cases. If we were to move the frame and give it a full frame like x86_64, then we could remove kernel_stack_pointer() altogether. I've hated that function for some time, as I tripped over it in the past too, and it keeps coming up. It's sad that regs->sp is unreliable as it is.
This has been why I've been arguing against the entry code changes. Exactly because they tend to have these kind of odd cascading effects. The entry code is fragile not just because it's a complex hardware interface, but also because we know about those complex hardware interfaces in random other places.
IMO, getting rid of the kernel_stack_pointer() function is a positive side effect of these changes.
I'd much rather have the code that does special things be in one place, and be the place that *needs* to do the special thing. If we copy the pt_regs around when we do the "call" emulation, it's *really* easy to explain *exactly* what we're doing and why in *exactly* that one context where we are doing it. And it won't affect anything else, and our existing code that looks at pt_regs will work both before and after.
Would it need a #ifdef CONFIG_X86_32 around it because it's not needed on x86-64? Sure. But that #ifdef would be right there, and the comment that explains why the pt_regs need to be moved would also make it very obvious why it is only needed for x86-32.
There's a lot of advantages to keeping your problems localized, instead of letting your random hacks escape and become problems for other, entirely unrelated, code.
But is it localized? It would definitely affect do_int3().
You are saying that we have a do_int3() for user space int3, and do_kernel_int3() for kernel space. That would need to be done in asm for both, because having x86_64 call do_int3() for kernel and user would be interesting. Looking at the do_int3() code, I'm not sure how to safely separate kernel and user int3 handlers if 64bit doesn't call do_kernel_int3() directly. It may end up looking something like this:
dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
{
#ifdef CONFIG_X86_64
	do_kernel_int3(&regs);
#endif
	/*
	 * Use ist_enter despite the fact that we don't use an IST stack.
	 * We can be called from a kprobe in non-CONTEXT_KERNEL kernel
	 * mode or even during context tracking state changes.
	 *
	 * This means that we can't schedule.  That's okay.
	 */
	ist_enter(regs);
	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");

#ifdef CONFIG_X86_64
#ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
	if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
			 SIGTRAP) == NOTIFY_STOP)
		goto exit;
#endif /* CONFIG_KGDB_LOW_LEVEL_TRAP */

#ifdef CONFIG_KPROBES
	if (kprobe_int3_handler(&regs))
		goto exit;
#endif
#endif

	if (notify_die(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
		       SIGTRAP) == NOTIFY_STOP)
		goto exit;

	cond_local_irq_enable(regs);
	do_trap(X86_TRAP_BP, SIGTRAP, "int3", regs, error_code, 0, NULL);
	cond_local_irq_disable(regs);

exit:
	ist_exit(regs);
}

dotraplinkage void notrace do_kernel_int3(struct pt_regs **regs)
{
#ifdef CONFIG_DYNAMIC_FTRACE
	/*
	 * ftrace must be first, everything else may cause a recursive crash.
	 * See note by declaration of modifying_ftrace_code in ftrace.c
	 */
	if (unlikely(atomic_read(&modifying_ftrace_code)) &&
	    ftrace_int3_handler(regs))
		return;
#endif
	if (poke_int3_handler(regs))
		return;

#ifdef CONFIG_X86_64
	return;
#endif

	/*
	 * Use ist_enter despite the fact that we don't use an IST stack.
	 * We can be called from a kprobe in non-CONTEXT_KERNEL kernel
	 * mode or even during context tracking state changes.
	 *
	 * This means that we can't schedule.  That's okay.
	 */
	ist_enter(*regs);
	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");

#ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
	if (kgdb_ll_trap(DIE_INT3, "int3", *regs, error_code, X86_TRAP_BP,
			 SIGTRAP) == NOTIFY_STOP)
		goto exit;
#endif /* CONFIG_KGDB_LOW_LEVEL_TRAP */

#ifdef CONFIG_KPROBES
	if (kprobe_int3_handler(regs))
		goto exit;
#endif

	notify_die(DIE_INT3, "int3", *regs, error_code, X86_TRAP_BP, SIGTRAP);
	ist_exit(*regs);
}
Or maybe I misunderstood what you envision :-/
-- Steve
On Fri, May 3, 2019 at 3:49 PM Steven Rostedt rostedt@goodmis.org wrote:
You are saying that we have a do_int3() for user space int3, and do_kernel_int3() for kernel space. That would need to be done in asm for both, because having x86_64 call do_int3() for kernel and user would be interesting.
The clean/simple way is to just do this
- x86-32 does the special asm for the kernel_do_int3() case, and calls user_do_int3 otherwise.
- x86-64 doesn't care, and just calls "do_int3()".
We have a trivial helper function like
dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
{
	if (user_mode(regs))
		user_int3(regs);
	else
		WARN_ON_ONCE(kernel_int3(regs) != regs);
}
which adds that warning just for debug purposes.
Then we make the rule be that user_int3() does the normal stuff, and kernel_int3() returns the pt_regs it was passed in.
Easy-peasy, there is absolutely no difference between x86-64 and x86-32 here except for the trivial case that x86-32 does its thing at the asm layer, which is what allows "kernel_int3()" to move pt_regs around by a small amount.
Now, the _real_ difference is when you do the "call_emulate()" case, which will have to do something like this
static struct pt_regs *emulate_call(struct pt_regs *regs,
				    unsigned long return, unsigned long target)
{
#ifdef CONFIG_X86_32
	/*
	 * BIG comment about how we need to move pt_regs to make room
	 * and to update the return 'sp'
	 */
	struct pt_regs *new = (void *)regs - 4;
	unsigned long *sp = (unsigned long *)(new + 1);

	memmove(new, regs, sizeof(*regs));
	regs = new;
#else
	unsigned long *sp = regs->sp;

	regs->sp -= 4;
#endif
	*sp = value;
	regs->ip = target;
	return regs;
}
but look, the above isn't that complicated, is it? And notice how the subtle pt_regs movement is exactly where it needs to be and nowhere else.
And what's the cost of all of this? NOTHING. The x86-32 entry code has to do the test for kernel space anyway, and *all* it does now is to call "kernel_int3" for the kernel case after having made a bit of extra room on the stack so that you *can* move pt_regs around (maybe people want to pop things too? It would work as well).
See what I mean by "localized to the cases the need it"?
Linus
On Fri, 3 May 2019 16:07:59 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
static struct pt_regs *emulate_call(struct pt_regs *regs,
				    unsigned long return, unsigned long target)
{
#ifdef CONFIG_X86_32
	/*
	 * BIG comment about how we need to move pt_regs to make room
	 * and to update the return 'sp'
	 */
	struct pt_regs *new = (void *)regs - 4;
	unsigned long *sp = (unsigned long *)(new + 1);
I tried this, and it crashed. After a bit of debugging, I found that the issue is that we can't use sizeof(*regs), because we don't have a full pt_regs. We are missing the first entry (which is why we have kernel_stack_pointer() in the first place!)
memmove(new, regs, sizeof(*regs));
This would need to be something like:
#define SAVED_REGS_SIZE (sizeof(struct pt_regs) - 4)
	struct pt_regs *new = (void *)regs - 4;
	unsigned long *sp = (unsigned long *)((void *)new + SAVED_REGS_SIZE);
memmove(new, regs, SAVED_REGS_SIZE);
	regs = new;
#else
	unsigned long *sp = regs->sp;

	regs->sp -= 4;
should be:
unsigned long *sp;
	regs->sp -= sizeof(long);
	sp = (unsigned long *)regs->sp;
#endif
	*sp = value;
	regs->ip = target;
	return regs;
}
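Putting Linus's sketch and these fixups together, a consolidated version might look roughly like the following. This is my own sketch of the idea, not code from the thread; in particular the SAVED_REGS_SIZE expression via offsetof() is an assumption about which pt_regs fields the 32-bit entry actually saved, and ret_addr replaces the reserved word used as a parameter name above.

#include <linux/stddef.h>
#include <linux/string.h>
#include <asm/ptrace.h>

/*
 * Everything the 32-bit kernel entry actually saved: up to, but not
 * including, the sp/ss slots that kernel->kernel traps never push.
 */
#define SAVED_REGS_SIZE offsetof(struct pt_regs, sp)

static struct pt_regs *emulate_call(struct pt_regs *regs,
				    unsigned long ret_addr, unsigned long target)
{
#ifdef CONFIG_X86_32
	/*
	 * Move the saved registers down one word to open a slot for the
	 * pushed return address, and hand back the new pt_regs pointer so
	 * the asm can switch to it.
	 */
	struct pt_regs *new = (void *)regs - sizeof(long);
	unsigned long *sp = (void *)new + SAVED_REGS_SIZE;

	memmove(new, regs, SAVED_REGS_SIZE);
	regs = new;
#else
	unsigned long *sp;

	regs->sp -= sizeof(long);
	sp = (unsigned long *)regs->sp;
#endif
	*sp = ret_addr;		/* the emulated CALL's return address */
	regs->ip = target;
	return regs;
}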
I got it sorta working. And it appears that printk() no longer flushes partial lines, and I couldn't figure out where my self tests broke as it does:
pr_info("Testing ....: "); ret = do_test(); if (ret) pr_info("PASSED\n"); else pr_info("FAILED\n");
but because it doesn't flush anymore, I don't see what test it is :-p But that's another issue.
With some "early_printk" in the code, I do see it doing the calls, but for some reason, the start up test never moves forward.
Below is the patch that I tried. I started with Peter's changes (this works on x86_64) and tried to work your ideas in for x86_32. This still gets stuck (but doesn't crash).
Also note that this method prevents the die notifiers from doing this, which means I had to disable Peter's selftest because that no longer will work the way it is written.
-- Steve
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index d309f30cf7af..f0fd3e17bedd 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -67,9 +67,20 @@ # define preempt_stop(clobbers) DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF #else # define preempt_stop(clobbers) -# define resume_kernel restore_all_kernel #endif
+.macro RETINT_PREEMPT +#ifdef CONFIG_PREEMPT + DISABLE_INTERRUPTS(CLBR_ANY) + cmpl $0, PER_CPU_VAR(__preempt_count) + jnz .Lend_@ + testl $X86_EFLAGS_IF, PT_EFLAGS(%esp) # interrupts off (exception path) ? + jz .Lend_@ + call preempt_schedule_irq +.Lend_@: +#endif +.endm + .macro TRACE_IRQS_IRET #ifdef CONFIG_TRACE_IRQFLAGS testl $X86_EFLAGS_IF, PT_EFLAGS(%esp) # interrupts off? @@ -377,6 +388,7 @@
#define CS_FROM_ENTRY_STACK (1 << 31) #define CS_FROM_USER_CR3 (1 << 30) +#define CS_FROM_INT3 (1 << 29)
.macro SWITCH_TO_KERNEL_STACK
@@ -753,7 +765,7 @@ ret_from_intr: andl $SEGMENT_RPL_MASK, %eax #endif cmpl $USER_RPL, %eax - jb resume_kernel # not returning to v8086 or userspace + jb restore_all_kernel # not returning to v8086 or userspace
ENTRY(resume_userspace) DISABLE_INTERRUPTS(CLBR_ANY) @@ -763,19 +775,6 @@ ENTRY(resume_userspace) jmp restore_all END(ret_from_exception)
-#ifdef CONFIG_PREEMPT -ENTRY(resume_kernel) - DISABLE_INTERRUPTS(CLBR_ANY) -.Lneed_resched: - cmpl $0, PER_CPU_VAR(__preempt_count) - jnz restore_all_kernel - testl $X86_EFLAGS_IF, PT_EFLAGS(%esp) # interrupts off (exception path) ? - jz restore_all_kernel - call preempt_schedule_irq - jmp .Lneed_resched -END(resume_kernel) -#endif - GLOBAL(__begin_SYSENTER_singlestep_region) /* * All code from here through __end_SYSENTER_singlestep_region is subject @@ -1026,6 +1025,7 @@ restore_all: INTERRUPT_RETURN
restore_all_kernel: + RETINT_PREEMPT TRACE_IRQS_IRET PARANOID_EXIT_TO_KERNEL_MODE BUG_IF_WRONG_CR3 @@ -1476,6 +1476,27 @@ END(nmi)
ENTRY(int3) ASM_CLAC + +#ifdef CONFIG_VM86 + testl $X86_EFLAGS_VM, 8(%esp) + jnz .Lfrom_usermode_no_gap +#endif + testl $SEGMENT_RPL_MASK, 4(%esp) + jnz .Lfrom_usermode_no_gap + + pushl $-1 # mark this as an int + + SAVE_ALL switch_stacks=1 + ENCODE_FRAME_POINTER + TRACE_IRQS_OFF + movl %esp, %eax # pt_regs pointer + subl $8, %esp + call do_kernel_int3 + movl %eax, %esp + jmp ret_from_exception + +.Lfrom_usermode_no_gap: + pushl $-1 # mark this as an int
SAVE_ALL switch_stacks=1 diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 1f0efdb7b629..834ec1397dab 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -879,7 +879,7 @@ apicinterrupt IRQ_WORK_VECTOR irq_work_interrupt smp_irq_work_interrupt * @paranoid == 2 is special: the stub will never switch stacks. This is for * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS. */ -.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 +.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 create_gap=0 ENTRY(\sym) UNWIND_HINT_IRET_REGS offset=\has_error_code*8
@@ -899,6 +899,16 @@ ENTRY(\sym) jnz .Lfrom_usermode_switch_stack_@ .endif
+ .if \create_gap == 1 + testb $3, CS-ORIG_RAX(%rsp) + jnz .Lfrom_usermode_no_gap_@ + .rept 6 + pushq 5*8(%rsp) + .endr + UNWIND_HINT_IRET_REGS offset=8 +.Lfrom_usermode_no_gap_@: + .endif + .if \paranoid call paranoid_entry .else @@ -1130,7 +1140,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \ #endif /* CONFIG_HYPERV */
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK -idtentry int3 do_int3 has_error_code=0 +idtentry int3 do_int3 has_error_code=0 create_gap=1 idtentry stack_segment do_stack_segment has_error_code=1
#ifdef CONFIG_XEN_PV diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h index cf350639e76d..4b335ac5afcc 100644 --- a/arch/x86/include/asm/ftrace.h +++ b/arch/x86/include/asm/ftrace.h @@ -37,7 +37,7 @@ struct dyn_arch_ftrace { /* No extra data needed for x86 */ };
-int ftrace_int3_handler(struct pt_regs *regs); +int ftrace_int3_handler(struct pt_regs **regs);
#define FTRACE_GRAPH_TRAMP_ADDR FTRACE_GRAPH_ADDR
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h index e85ff65c43c3..8e638231cf52 100644 --- a/arch/x86/include/asm/text-patching.h +++ b/arch/x86/include/asm/text-patching.h @@ -39,4 +39,33 @@ extern int poke_int3_handler(struct pt_regs *regs); extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler); extern int after_bootmem;
+static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip) +{ + regs->ip = ip; +} + +#define INT3_INSN_SIZE 1 +#define CALL_INSN_SIZE 5 + +static inline struct pt_regs * int3_emulate_call(struct pt_regs *regs, unsigned long func) +{ +#ifdef CONFIG_X86_32 +#define SAVED_REGS_SIZE (sizeof(struct pt_regs) - 8) + /* BIG comment about how we need to move pt_regs to make + room and to update the return 'sp' */ + struct pt_regs *new = (void *)regs - sizeof(long); + unsigned long *sp = (unsigned long *) + ((void *)new + SAVED_REGS_SIZE); + memmove(new, regs, SAVED_REGS_SIZE); + regs = new; +#else + unsigned long *sp; + regs->sp -= sizeof(long); + sp = (unsigned long *)regs->sp; +#endif + *sp = regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE; + regs->ip = func; + return regs; +} + #endif /* _ASM_X86_TEXT_PATCHING_H */ diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 9a79c7808f9c..c0e9002e2708 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -613,11 +613,84 @@ extern struct paravirt_patch_site __start_parainstructions[], __stop_parainstructions[]; #endif /* CONFIG_PARAVIRT */
+/* + * Self-test for the INT3 based CALL emulation code. + * + * This exercises int3_emulate_call() to make sure INT3 pt_regs are set up + * properly and that there is a stack gap between the INT3 frame and the + * previous context. Without this gap doing a virtual PUSH on the interrupted + * stack would corrupt the INT3 IRET frame. + * + * See entry_{32,64}.S for more details. + */ +static void __init int3_magic(unsigned int *ptr) +{ + *ptr = 1; +} + +extern __initdata unsigned long int3_selftest_ip; /* defined in asm below */ + +static int __init +int3_exception_notify(struct notifier_block *self, unsigned long val, void *data) +{ + struct die_args *args = data; + struct pt_regs *regs = args->regs; + + if (!regs || user_mode(regs)) + return NOTIFY_DONE; + + if (val != DIE_INT3) + return NOTIFY_DONE; + + if (regs->ip - INT3_INSN_SIZE != int3_selftest_ip) + return NOTIFY_DONE; + + int3_emulate_call(regs, (unsigned long)&int3_magic); + return NOTIFY_STOP; +} + +static void __init int3_selftest(void) +{ + static __initdata struct notifier_block int3_exception_nb = { + .notifier_call = int3_exception_notify, + .priority = INT_MAX-1, /* last */ + }; + unsigned int val = 0; + + return; + BUG_ON(register_die_notifier(&int3_exception_nb)); + + /* + * Basically: int3_magic(&val); but really complicated :-) + * + * Stick the address of the INT3 instruction into int3_selftest_ip, + * then trigger the INT3, padded with NOPs to match a CALL instruction + * length. + */ + asm volatile ("1: int3; nop; nop; nop; nop\n\t" + ".pushsection .init.data,"aw"\n\t" + ".align " __ASM_SEL(4, 8) "\n\t" + ".type int3_selftest_ip, @object\n\t" + ".size int3_selftest_ip, " __ASM_SEL(4, 8) "\n\t" + "int3_selftest_ip:\n\t" + __ASM_SEL(.long, .quad) " 1b\n\t" + ".popsection\n\t" + : : __ASM_SEL_RAW(a, D) (&val) : "memory"); + + BUG_ON(val != 1); + + unregister_die_notifier(&int3_exception_nb); +} + void __init alternative_instructions(void) { - /* The patching is not fully atomic, so try to avoid local interruptions - that might execute the to be patched code. - Other CPUs are not running. */ + int3_selftest(); + + /* + * The patching is not fully atomic, so try to avoid local + * interruptions that might execute the to be patched code. + * Other CPUs are not running. + */ stop_nmi();
/* @@ -642,10 +715,11 @@ void __init alternative_instructions(void) _text, _etext); }
- if (!uniproc_patched || num_possible_cpus() == 1) + if (!uniproc_patched || num_possible_cpus() == 1) { free_init_pages("SMP alternatives", (unsigned long)__smp_locks, (unsigned long)__smp_locks_end); + } #endif
apply_paravirt(__parainstructions, __parainstructions_end); diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c index ef49517f6bb2..c5e2ae1efbd8 100644 --- a/arch/x86/kernel/ftrace.c +++ b/arch/x86/kernel/ftrace.c @@ -29,6 +29,7 @@ #include <asm/kprobes.h> #include <asm/ftrace.h> #include <asm/nops.h> +#include <asm/text-patching.h>
#ifdef CONFIG_DYNAMIC_FTRACE
@@ -231,6 +232,7 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr, }
static unsigned long ftrace_update_func; +static unsigned long ftrace_update_func_call;
static int update_ftrace_func(unsigned long ip, void *new) { @@ -259,6 +261,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func) unsigned char *new; int ret;
+ ftrace_update_func_call = (unsigned long)func; + new = ftrace_call_replace(ip, (unsigned long)func); ret = update_ftrace_func(ip, new);
@@ -287,20 +291,30 @@ static nokprobe_inline int is_ftrace_caller(unsigned long ip) * call to a nop. While the change is taking place, we treat * it just like it was a nop. */ -int ftrace_int3_handler(struct pt_regs *regs) +int ftrace_int3_handler(struct pt_regs **pregs) { + struct pt_regs *regs = *pregs; unsigned long ip;
if (WARN_ON_ONCE(!regs)) return 0;
- ip = regs->ip - 1; - if (!ftrace_location(ip) && !is_ftrace_caller(ip)) - return 0; + ip = regs->ip - INT3_INSN_SIZE;
- regs->ip += MCOUNT_INSN_SIZE - 1; + if (ftrace_location(ip)) { + *pregs = int3_emulate_call(regs, (unsigned long)ftrace_regs_caller); +// early_printk("ip=%pS was=%px is=%px\n", (void *)ip, regs, *pregs); + return 1; + } else if (is_ftrace_caller(ip)) { + if (!ftrace_update_func_call) { + int3_emulate_jmp(regs, ip + CALL_INSN_SIZE); + return 1; + } + *pregs = int3_emulate_call(regs, ftrace_update_func_call); + return 1; + }
- return 1; + return 0; } NOKPROBE_SYMBOL(ftrace_int3_handler);
@@ -859,6 +873,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
func = ftrace_ops_get_func(ops);
+ ftrace_update_func_call = (unsigned long)func; + /* Do a safe modify in case the trampoline is executing */ new = ftrace_call_replace(ip, (unsigned long)func); ret = update_ftrace_func(ip, new); @@ -960,6 +976,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func) { unsigned char *new;
+ ftrace_update_func_call = 0UL; new = ftrace_jmp_replace(ip, (unsigned long)func);
return update_ftrace_func(ip, new); diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 4b8ee05dd6ad..03ba69a6e594 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -163,6 +163,9 @@ static inline bool invalid_selector(u16 value) * stack pointer we fall back to regs as stack if no previous stack * exists. * + * There is a special case for INT3, there we construct a full pt_regs + * environment. We can detect this case by a high bit in regs->cs + * * This is valid only for kernel mode traps. */ unsigned long kernel_stack_pointer(struct pt_regs *regs) diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index d26f9e9c3d83..17e0c56326c9 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -570,7 +570,7 @@ do_general_protection(struct pt_regs *regs, long error_code) } NOKPROBE_SYMBOL(do_general_protection);
-dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code) +dotraplinkage struct pt_regs * notrace do_kernel_int3(struct pt_regs *regs) { #ifdef CONFIG_DYNAMIC_FTRACE /* @@ -578,11 +578,11 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code) * See note by declaration of modifying_ftrace_code in ftrace.c */ if (unlikely(atomic_read(&modifying_ftrace_code)) && - ftrace_int3_handler(regs)) - return; + ftrace_int3_handler(®s)) + return regs; #endif if (poke_int3_handler(regs)) - return; + return regs;
/* * Use ist_enter despite the fact that we don't use an IST stack. @@ -594,7 +594,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code) ist_enter(regs); RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU"); #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP - if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP, + if (kgdb_ll_trap(DIE_INT3, "int3", regs, 0, X86_TRAP_BP, SIGTRAP) == NOTIFY_STOP) goto exit; #endif /* CONFIG_KGDB_LOW_LEVEL_TRAP */ @@ -604,16 +604,26 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code) goto exit; #endif
- if (notify_die(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP, + if (notify_die(DIE_INT3, "int3", regs, 0, X86_TRAP_BP, SIGTRAP) == NOTIFY_STOP) goto exit;
cond_local_irq_enable(regs); - do_trap(X86_TRAP_BP, SIGTRAP, "int3", regs, error_code, 0, NULL); + do_trap(X86_TRAP_BP, SIGTRAP, "int3", regs, 0, 0, NULL); cond_local_irq_disable(regs);
exit: ist_exit(regs); + return regs; +} +NOKPROBE_SYMBOL(do_kernel_int3); + +dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code) +{ + struct pt_regs *ret_regs; + + ret_regs = do_kernel_int3(regs); + WARN_ON_ONCE(ret_regs != regs); } NOKPROBE_SYMBOL(do_int3);
On May 3, 2019, at 2:46 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Fri, May 3, 2019 at 12:24 PM Steven Rostedt rostedt@goodmis.org wrote:
The problem with this approach is that it would require doing the same for x86_64, as the int3 C code is the same for both. And that may be a bit more difficult on the x86_64 side because it's all done with a simple flag in the idtentry macro to add the gap.
That argument is weakened by the fact that we have to do _other_ things differently on 32-bit and 64-bit anyway.
So we might as well have a "on 32-bit, the call emulation needs to move the pt_regs to make space" special case in the call emulation code. It's very easy to explain why.
And then we'd limit the special case to where it matters (with a big comment about what's going on), rather than adding random special case handling to random _other_ places.
If we do this, it should IMO look like this:
struct pt_regs *change_kernel_stack_pointer(struct pt_regs *, unsigned long new_sp);
And that helper should be used on both variants.
But I think this will end up worse than the version where the entry code fixes it up. This is because, if the C code moves pt_regs, then we need some way to pass the new pointer back to the asm. We will also have a much harder time with runtime sanity checks. In the model where the C code merely updates regs->sp, it’s very easy to compare sp and &regs to check for overlap, but it’s much harder to tell whether memmove()ing it is going to overwrite something important. And we have to worry about whether there’s some other code that assumes that pt_regs stays put.

So my intuition is that the pure asm fixup will result in more maintainable code.
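To make that comparison concrete, here is a minimal, purely illustrative sketch (not code from this thread; the helper name is made up) of the overlap sanity check that is cheap in the model Andy prefers, where the entry asm reserves the gap and C only lowers regs->sp:

/*
 * Hypothetical sketch: with the stack gap reserved by the entry code,
 * pt_regs never moves and an emulated push only lowers regs->sp.  The
 * overlap check is then a single comparison against the end of the
 * saved pt_regs.
 */
static void int3_emulate_push_checked(struct pt_regs *regs, unsigned long val)
{
	/* would the pushed word land inside the saved registers? */
	WARN_ON_ONCE(regs->sp - sizeof(unsigned long) < (unsigned long)(regs + 1));

	regs->sp -= sizeof(unsigned long);
	*(unsigned long *)regs->sp = val;
}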
On Fri, May 3, 2019 at 3:55 PM Andy Lutomirski luto@amacapital.net wrote:
But I think this will end up worse than the version where the entry code fixes it up. This is because, if the C code moves pt_regs, then we need some way to pass the new pointer back to the asm.
What? I already posted that code. Let me quote it again:
Message-ID: CAHk-=wh8bi5c_GkyjPtDAiaXaZRqtmhWs30usUvs4qK_F+c9tg@mail.gmail.com
	# args: pt_regs pointer (no error code for int3)
	movl %esp,%eax
	# allocate a bit of extra room on the stack, so that
	# 'kernel_int3' can move the pt_regs
	subl $8,%esp
	call kernel_int3
	movl %eax,%esp
It's that easy (this is with the assumption that we've already applied the "standalone simple int3" case, but I think the above might work even with the current code model, just the "call do_int3" needs to have the kernel/not-kernel distinction and do the above for the kernel case)
That's *MUCH* easier than your code to move entries around on the stack just as you return, and has the advantage of not changing any C-visible layout.
The C interface looks like this
/* Note: on x86-32, we can move 'regs' around for push/pop emulation */
struct pt_regs *kernel_int3(struct pt_regs *regs)
{
	..
	.. need to pass regs to emulation functions
	.. and call emulation needs to return it
	..
	return regs;
}
and I just posted as a response to Stephen the *trivial* do_int3() wrapper (so that x86-64 doesn't need to care), and the *trivial* code to actually emulate a call instruction.
And when I say "trivial", I obviously mean "totally untested and probably buggy", but it sure seems *simple*.
Notice? Simple and minimal changes to entry code that only affect int3, and nothing else.
Linus
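For illustration only, here is a rough sketch (not Linus's actual posted code, untested in exactly the spirit of the quote above) of what the "move pt_regs" push emulation could look like on x86-32 behind the kernel_int3() interface he describes. It assumes the asm wrapper left spare room below pt_regs (the subl $8,%esp) and reloads %esp from the returned pointer:

/*
 * On a 32-bit kernel-mode trap the saved frame stops short of sp/ss, and
 * the interrupted stack pointer is &regs->sp.  Sliding the whole frame
 * down one word frees exactly the slot a real PUSH would have written.
 */
static struct pt_regs *int3_emulate_push_32(struct pt_regs *regs,
					    unsigned long val)
{
	unsigned long size = offsetof(struct pt_regs, sp);	/* sp/ss not saved */
	struct pt_regs *new_regs = (void *)regs - sizeof(unsigned long);

	memmove(new_regs, regs, size);				/* needs room below regs */
	*(unsigned long *)((void *)new_regs + size) = val;	/* == &regs->sp - 4 */

	return new_regs;					/* asm reloads %esp from this */
}

static struct pt_regs *int3_emulate_call_32(struct pt_regs *regs,
					    unsigned long func)
{
	regs = int3_emulate_push_32(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
	regs->ip = func;
	return regs;
}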
On May 3, 2019, at 4:16 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Fri, May 3, 2019 at 3:55 PM Andy Lutomirski luto@amacapital.net wrote:
But I think this will end up worse than the version where the entry code fixes it up. This is because, if the C code moves pt_regs, then we need some way to pass the new pointer back to the asm.
What? I already posted that code. Let me quote it again:
Message-ID: CAHk-=wh8bi5c_GkyjPtDAiaXaZRqtmhWs30usUvs4qK_F+c9tg@mail.gmail.com
	# args: pt_regs pointer (no error code for int3)
	movl %esp,%eax
	# allocate a bit of extra room on the stack, so that
	# 'kernel_int3' can move the pt_regs
	subl $8,%esp
	call kernel_int3
	movl %eax,%esp
It's that easy (this is with the assumption that we've already applied the "standalone simple int3" case, but I think the above might work even with the current code model, just the "call do_int3" needs to have the kernel/not-kernel distinction and do the above for the kernel case)
That's *MUCH* easier than your code to move entries around on the stack just as you return, and has the advantage of not changing any C-visible layout.
The C interface looks like this
/* Note: on x86-32, we can move 'regs' around for push/pop emulation */
struct pt_regs *kernel_int3(struct pt_regs *regs)
{
	..
	.. need to pass regs to emulation functions
	.. and call emulation needs to return it
	..
	return regs;
}
and I just posted as a response to Stephen the *trivial* do_int3() wrapper (so that x86-64 doesn't need to care), and the *trivial* code to actually emulate a call instruction.
And when I say "trivial", I obviously mean "totally untested and probably buggy", but it sure seems *simple*.
Notice? Simple and minimal changes to entry code that only affect int3, and nothing else.
I can get on board with this.
On Thu, 2 May 2019 22:21:46 +0200 Peter Zijlstra peterz@infradead.org wrote:
On Thu, May 02, 2019 at 11:43:37AM -0700, Linus Torvalds wrote:
What would it look like with the "int3-from-kernel is special" modification?
Something like so; it boots; but I could've made some horrible mistake (again).
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index 7b23431be5cb..4de51cff5b8a 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S
Oh, and so close!
I was running this on my i386 tests and for test 8 of 9 (passed 1-7) I hit this:
Testing all events: OK Testing ftrace filter: OK trace_kprobe: Testing kprobe tracing: BUG: unable to handle kernel paging request at 10ec839c #PF error: [INSTR] *pdpt = 0000000000000000 *pde = 0000000000000000 Oops: 0010 [#1] SMP PTI CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.1.0-rc3-test+ #1 Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014 EIP: 0x10ec839c Code: Bad RIP value. EAX: 54e24bbc EBX: 00000000 ECX: 00000003 EDX: 00000002 ESI: ebcae400 EDI: c1513a88 EBP: ee641f30 ESP: c0439a07 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210202 CR0: 80050033 CR2: 10ec8392 CR3: 0167e000 CR4: 001406f0 Call Trace: Modules linked in: CR2: 0000000010ec839c ---[ end trace 8aa996061578b437 ]--- EIP: 0x10ec839c Code: Bad RIP value. EAX: 54e24bbc EBX: 00000000 ECX: 00000003 EDX: 00000002 ESI: ebcae400 EDI: c1513a88 EBP: ee641f30 ESP: c16834bc DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210202 CR0: 80050033 CR2: 10ec8392 CR3: 0167e000 CR4: 001406f0 Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 Kernel Offset: disabled ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 ]--- ------------[ cut here ]------------ sched: Unexpected reschedule of offline CPU#3! WARNING: CPU: 1 PID: 1 at arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x1c/0x37 Modules linked in: CPU: 1 PID: 1 Comm: swapper/0 Tainted: G D 5.1.0-rc3-test+ #1 Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014 EIP: native_smp_send_reschedule+0x1c/0x37 Code: 4a 18 ba fb 00 00 00 e8 71 d6 9c 00 5d c3 55 89 e5 e8 37 6f 00 00 0f a3 05 e0 20 44 c1 72 11 50 68 d2 75 15 c1 e8 c7 a4 01 00 <0f> 0b 58 5a eb 13 8b 15 c0 36 28 c1 8b 4a 18 ba fd 00 00 00 e8 3a EAX: 0000002e EBX: 00000003 ECX: c049977e EDX: ee638000 ESI: 00000003 EDI: 00000003 EBP: ee641e04 ESP: ee641dfc DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210092 CR0: 80050033 CR2: 10ec8392 CR3: 0167e000 CR4: 001406f0 Call Trace: kick_ilb+0x77/0x7c trigger_load_balance+0x279/0x280 ? __pick_first_entity+0x18/0x18 scheduler_tick+0x90/0xa9 ? __pick_first_entity+0x18/0x18 update_process_times+0x3f/0x4a tick_sched_handle+0x4c/0x5a tick_sched_timer+0x3b/0x79 ? tick_sched_do_timer+0x44/0x44 __hrtimer_run_queues+0x180/0x26a ? tick_sched_do_timer+0x44/0x44 hrtimer_interrupt+0xa6/0x18e ? rcu_irq_enter+0x60/0x7e smp_apic_timer_interrupt+0xdf/0x179 apic_timer_interrupt+0xda/0xe0 EIP: panic+0x208/0x24a Code: 83 c3 64 eb ad 83 3d a8 d1 68 c1 00 74 05 e8 01 bf 01 00 68 e0 d1 68 c1 68 cc e6 15 c1 e8 bd e7 04 00 e8 02 ff 0a 00 fb 31 db <58> 5a e8 c5 a9 09 00 39 fb 7c 18 83 f6 01 8b 15 a0 d1 68 c1 89 f0 EAX: c044c764 EBX: 00000000 ECX: c049977e EDX: 318e4000 ESI: 00000000 EDI: 00000000 EBP: ee641f6c ESP: ee641f54 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00200246 ? console_unlock+0x466/0x4b8 ? panic+0x205/0x24a ? do_exit+0x4bd/0x8d1 do_exit+0x4e3/0x8d1 ? copy_oldmem_page+0x63/0x9b rewind_stack_do_exit+0x11/0x13 irq event stamp: 1400780 hardirqs last enabled at (1400779): [<c0401568>] trace_hardirqs_on_thunk+0xc/0x10 hardirqs last disabled at (1400780): [<c0401578>] trace_hardirqs_off_thunk+0xc/0x10 softirqs last enabled at (1400760): [<c0dfefb2>] __do_softirq+0x2a2/0x2d2 softirqs last disabled at (1400753): [<c0416fb7>] call_on_stack+0x45/0x4b ---[ end trace 8aa996061578b438 ]--- ------------[ cut here ]------------
That's one of the startup tests that happens on boot up.
Config attached.
Good news is, it appears very reproducible.
-- Steve
Here's the patch I applied (both folded together):
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index d309f30cf7af..2885acd691ac 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -67,9 +67,20 @@ # define preempt_stop(clobbers) DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF #else # define preempt_stop(clobbers) -# define resume_kernel restore_all_kernel #endif
+.macro RETINT_PREEMPT +#ifdef CONFIG_PREEMPT + DISABLE_INTERRUPTS(CLBR_ANY) + cmpl $0, PER_CPU_VAR(__preempt_count) + jnz .Lend_@ + testl $X86_EFLAGS_IF, PT_EFLAGS(%esp) # interrupts off (exception path) ? + jz .Lend_@ + call preempt_schedule_irq +.Lend_@: +#endif +.endm + .macro TRACE_IRQS_IRET #ifdef CONFIG_TRACE_IRQFLAGS testl $X86_EFLAGS_IF, PT_EFLAGS(%esp) # interrupts off? @@ -753,7 +764,7 @@ ret_from_intr: andl $SEGMENT_RPL_MASK, %eax #endif cmpl $USER_RPL, %eax - jb resume_kernel # not returning to v8086 or userspace + jb restore_all_kernel # not returning to v8086 or userspace
ENTRY(resume_userspace) DISABLE_INTERRUPTS(CLBR_ANY) @@ -763,19 +774,6 @@ ENTRY(resume_userspace) jmp restore_all END(ret_from_exception)
-#ifdef CONFIG_PREEMPT -ENTRY(resume_kernel) - DISABLE_INTERRUPTS(CLBR_ANY) -.Lneed_resched: - cmpl $0, PER_CPU_VAR(__preempt_count) - jnz restore_all_kernel - testl $X86_EFLAGS_IF, PT_EFLAGS(%esp) # interrupts off (exception path) ? - jz restore_all_kernel - call preempt_schedule_irq - jmp .Lneed_resched -END(resume_kernel) -#endif - GLOBAL(__begin_SYSENTER_singlestep_region) /* * All code from here through __end_SYSENTER_singlestep_region is subject @@ -1026,6 +1024,7 @@ restore_all: INTERRUPT_RETURN
restore_all_kernel: + RETINT_PREEMPT TRACE_IRQS_IRET PARANOID_EXIT_TO_KERNEL_MODE BUG_IF_WRONG_CR3 @@ -1476,6 +1475,94 @@ END(nmi)
ENTRY(int3) ASM_CLAC + +#ifdef CONFIG_VM86 + testl $X86_EFLAGS_VM, 8(%esp) + jnz .Lfrom_usermode_no_gap +#endif + testl $SEGMENT_RPL_MASK, 4(%esp) + jnz .Lfrom_usermode_no_gap + + /* + * Here from kernel mode; so the (exception) stack looks like: + * + * 12(esp) - <previous context> + * 8(esp) - flags + * 4(esp) - cs + * 0(esp) - ip + * + * Lets build a 5 entry IRET frame after that, such that struct pt_regs + * is complete and in particular regs->sp is correct. This gives us + * the original 3 enties as gap: + * + * 32(esp) - <previous context> + * 28(esp) - orig_flags / gap + * 24(esp) - orig_cs / gap + * 20(esp) - orig_ip / gap + * 16(esp) - ss + * 12(esp) - sp + * 8(esp) - flags + * 4(esp) - cs + * 0(esp) - ip + */ + pushl %ss # ss + pushl %esp # sp (points at ss) + pushl 4*4(%esp) # flags + pushl 4*4(%esp) # cs + pushl 4*4(%esp) # ip + + add $16, 12(%esp) # point sp back at the previous context + + pushl $-1 # orig_eax; mark as interrupt + + SAVE_ALL + ENCODE_FRAME_POINTER + TRACE_IRQS_OFF + xorl %edx, %edx # zero error code + movl %esp, %eax # pt_regs pointer + call do_int3 + + RETINT_PREEMPT + TRACE_IRQS_IRET + /* + * If we really never INT3 from entry code, it looks like + * we can skip this one. + PARANOID_EXIT_TO_KERNEL_MODE + */ + BUG_IF_WRONG_CR3 + RESTORE_REGS 4 # consume orig_eax + + /* + * Reconstruct the 3 entry IRET frame right after the (modified) + * regs->sp without lowering %esp in between, such that an NMI in the + * middle doesn't scribble our stack. + */ + + pushl %eax + pushl %ecx + movl 5*4(%esp), %eax # (modified) regs->sp + + movl 4*4(%esp), %ecx # flags + movl %ecx, -4(%eax) + + movl 3*4(%esp), %ecx # cs + andl $0x0000ffff, %ecx + movl %ecx, -8(%eax) + + movl 2*4(%esp), %ecx # ip + movl %ecx, -12(%eax) + + movl 1*4(%esp), %ecx # eax + movl %ecx, -16(%eax) + + popl %ecx + lea -16(%eax), %esp + popl %eax + + jmp .Lirq_return + +.Lfrom_usermode_no_gap: + pushl $-1 # mark this as an int
SAVE_ALL switch_stacks=1 diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 1f0efdb7b629..834ec1397dab 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -879,7 +879,7 @@ apicinterrupt IRQ_WORK_VECTOR irq_work_interrupt smp_irq_work_interrupt * @paranoid == 2 is special: the stub will never switch stacks. This is for * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS. */ -.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 +.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 create_gap=0 ENTRY(\sym) UNWIND_HINT_IRET_REGS offset=\has_error_code*8
@@ -899,6 +899,16 @@ ENTRY(\sym) jnz .Lfrom_usermode_switch_stack_@ .endif
+ .if \create_gap == 1 + testb $3, CS-ORIG_RAX(%rsp) + jnz .Lfrom_usermode_no_gap_@ + .rept 6 + pushq 5*8(%rsp) + .endr + UNWIND_HINT_IRET_REGS offset=8 +.Lfrom_usermode_no_gap_@: + .endif + .if \paranoid call paranoid_entry .else @@ -1130,7 +1140,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \ #endif /* CONFIG_HYPERV */
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK -idtentry int3 do_int3 has_error_code=0 +idtentry int3 do_int3 has_error_code=0 create_gap=1 idtentry stack_segment do_stack_segment has_error_code=1
#ifdef CONFIG_XEN_PV diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h index e85ff65c43c3..ba275b6292db 100644 --- a/arch/x86/include/asm/text-patching.h +++ b/arch/x86/include/asm/text-patching.h @@ -39,4 +39,24 @@ extern int poke_int3_handler(struct pt_regs *regs); extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler); extern int after_bootmem;
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val) +{ + regs->sp -= sizeof(unsigned long); + *(unsigned long *)regs->sp = val; +} + +static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip) +{ + regs->ip = ip; +} + +#define INT3_INSN_SIZE 1 +#define CALL_INSN_SIZE 5 + +static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func) +{ + int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE); + int3_emulate_jmp(regs, func); +} + #endif /* _ASM_X86_TEXT_PATCHING_H */ diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c index ef49517f6bb2..fd152f5a937b 100644 --- a/arch/x86/kernel/ftrace.c +++ b/arch/x86/kernel/ftrace.c @@ -29,6 +29,7 @@ #include <asm/kprobes.h> #include <asm/ftrace.h> #include <asm/nops.h> +#include <asm/text-patching.h>
#ifdef CONFIG_DYNAMIC_FTRACE
@@ -231,6 +232,7 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr, }
static unsigned long ftrace_update_func; +static unsigned long ftrace_update_func_call;
static int update_ftrace_func(unsigned long ip, void *new) { @@ -259,6 +261,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func) unsigned char *new; int ret;
+ ftrace_update_func_call = (unsigned long)func; + new = ftrace_call_replace(ip, (unsigned long)func); ret = update_ftrace_func(ip, new);
@@ -294,13 +298,21 @@ int ftrace_int3_handler(struct pt_regs *regs) if (WARN_ON_ONCE(!regs)) return 0;
- ip = regs->ip - 1; - if (!ftrace_location(ip) && !is_ftrace_caller(ip)) - return 0; + ip = regs->ip - INT3_INSN_SIZE;
- regs->ip += MCOUNT_INSN_SIZE - 1; + if (ftrace_location(ip)) { + int3_emulate_call(regs, (unsigned long)ftrace_regs_caller); + return 1; + } else if (is_ftrace_caller(ip)) { + if (!ftrace_update_func_call) { + int3_emulate_jmp(regs, ip + CALL_INSN_SIZE); + return 1; + } + int3_emulate_call(regs, ftrace_update_func_call); + return 1; + }
- return 1; + return 0; } NOKPROBE_SYMBOL(ftrace_int3_handler);
@@ -859,6 +871,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
func = ftrace_ops_get_func(ops);
+ ftrace_update_func_call = (unsigned long)func; + /* Do a safe modify in case the trampoline is executing */ new = ftrace_call_replace(ip, (unsigned long)func); ret = update_ftrace_func(ip, new); @@ -960,6 +974,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func) { unsigned char *new;
+ ftrace_update_func_call = 0UL; new = ftrace_jmp_replace(ip, (unsigned long)func);
return update_ftrace_func(ip, new);
On Thu, 2 May 2019 18:52:25 -0400 Steven Rostedt rostedt@goodmis.org wrote:
On Thu, 2 May 2019 22:21:46 +0200 Peter Zijlstra peterz@infradead.org wrote:
On Thu, May 02, 2019 at 11:43:37AM -0700, Linus Torvalds wrote:
What would it look like with the "int3-from-kernel is special" modification?
Something like so; it boots; but I could've made some horrible mistake (again).
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index 7b23431be5cb..4de51cff5b8a 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S
Oh, and so close!
I was running this on my i386 tests and for test 8 of 9 (passed 1-7) I hit this:
Digging a little further, I pinpointed it out to being kretprobes. The problem I believe is the use of kernel_stack_pointer() which does some magic on x86_32. kretprobes uses this to hijack the return address of the function (much like the function graph tracer does). I do have code that would allow kretprobes to use the function graph tracer instead, but that's still in progress (almost done!). But still, we should not have this break the use of kernel_stack_pointer() either.
Adding some printks in that code, it looks to be returning "&regs->sp" which I think we changed.
-- Steve
On Thu, 2 May 2019 19:31:29 -0400 Steven Rostedt rostedt@goodmis.org wrote:
Digging a little further, I pinpointed it out to being kretprobes. The problem I believe is the use of kernel_stack_pointer() which does some magic on x86_32. kretprobes uses this to hijack the return address of the function (much like the function graph tracer does). I do have code that would allow kretprobes to use the function graph tracer instead, but that's still in progress (almost done!). But still, we should not have this break the use of kernel_stack_pointer() either.
Adding some printks in that code, it looks to be returning "&regs->sp" which I think we changed.
This appears to fix it!
-- Steve
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 4b8ee05dd6ad..600ead178bf4 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -171,8 +171,12 @@ unsigned long kernel_stack_pointer(struct pt_regs *regs) unsigned long sp = (unsigned long)®s->sp; u32 *prev_esp;
- if (context == (sp & ~(THREAD_SIZE - 1))) + if (context == (sp & ~(THREAD_SIZE - 1))) { + /* int3 code adds a gap */ + if (sp == regs->sp - 5*4) + return regs->sp; return sp; + }
prev_esp = (u32 *)(context); if (*prev_esp)
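For readers following the arithmetic, here is why 5*4 is the magic number, derived from the frame layout in the entry_32.S patch quoted earlier in the thread (a commented sketch, not part of the patch):

/*
 * Relative to &regs->sp, which is what kernel_stack_pointer() computes
 * as 'sp' for a kernel trap on 32-bit:
 *
 *	&regs->sp + 0*4 : sp	(saved copy, points at the previous context)
 *	&regs->sp + 1*4 : ss
 *	&regs->sp + 2*4 : gap / original ip
 *	&regs->sp + 3*4 : gap / original cs
 *	&regs->sp + 4*4 : gap / original flags
 *	&regs->sp + 5*4 : <previous context>
 *
 * So for an untouched int3 frame, regs->sp - 5*4 == &regs->sp == sp,
 * which is exactly the equality the new check tests for.
 */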
[ This version of the patch passed all my tests! ]
From: Peter Zijlstra peterz@infradead.org
In order to allow breakpoints to emulate call functions, they need to push a return address onto the stack. But because the breakpoint exception frame is added to the stack when the breakpoint is hit, there's no room to add the address onto the stack and return to the address of the emulated called function.

To handle this, copy the exception frame on entry to the breakpoint handler and leave a gap that can be used to add a return address to the stack frame, so the breakpoint can return to the emulated called function and that called function can return back to the location after the breakpoint was placed.

The following helper functions were also added:

int3_emulate_push(): push a value onto the gap in the stack
int3_emulate_jmp(): change regs->ip so the breakpoint returns to the new location
int3_emulate_call(): push the return address and change regs->ip
Cc: Andy Lutomirski luto@kernel.org Cc: Nicolai Stange nstange@suse.de Cc: Thomas Gleixner tglx@linutronix.de Cc: Ingo Molnar mingo@redhat.com Cc: Borislav Petkov bp@alien8.de Cc: "H. Peter Anvin" hpa@zytor.com Cc: the arch/x86 maintainers x86@kernel.org Cc: Josh Poimboeuf jpoimboe@redhat.com Cc: Jiri Kosina jikos@kernel.org Cc: Miroslav Benes mbenes@suse.cz Cc: Petr Mladek pmladek@suse.com Cc: Joe Lawrence joe.lawrence@redhat.com Cc: Shuah Khan shuah@kernel.org Cc: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Cc: Tim Chen tim.c.chen@linux.intel.com Cc: Sebastian Andrzej Siewior bigeasy@linutronix.de Cc: Mimi Zohar zohar@linux.ibm.com Cc: Juergen Gross jgross@suse.com Cc: Nick Desaulniers ndesaulniers@google.com Cc: Nayna Jain nayna@linux.ibm.com Cc: Masahiro Yamada yamada.masahiro@socionext.com Cc: Joerg Roedel jroedel@suse.de Cc: "open list:KERNEL SELFTEST FRAMEWORK" linux-kselftest@vger.kernel.org Cc: stable@vger.kernel.org Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching") Signed-off-by: *** Need Peter Zijlstra's SoB here! *** Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org ---
Changes since v1:
- Updated the 32bit code with Peter's latest changes
- Added int3 stack check in kernel_stack_pointer()
arch/x86/entry/entry_32.S | 117 +++++++++++++++++++++++---- arch/x86/entry/entry_64.S | 14 +++- arch/x86/include/asm/text-patching.h | 20 +++++ arch/x86/kernel/ptrace.c | 6 +- 4 files changed, 139 insertions(+), 18 deletions(-)
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index d309f30cf7af..2885acd691ac 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -67,9 +67,20 @@ # define preempt_stop(clobbers) DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF #else # define preempt_stop(clobbers) -# define resume_kernel restore_all_kernel #endif
+.macro RETINT_PREEMPT +#ifdef CONFIG_PREEMPT + DISABLE_INTERRUPTS(CLBR_ANY) + cmpl $0, PER_CPU_VAR(__preempt_count) + jnz .Lend_@ + testl $X86_EFLAGS_IF, PT_EFLAGS(%esp) # interrupts off (exception path) ? + jz .Lend_@ + call preempt_schedule_irq +.Lend_@: +#endif +.endm + .macro TRACE_IRQS_IRET #ifdef CONFIG_TRACE_IRQFLAGS testl $X86_EFLAGS_IF, PT_EFLAGS(%esp) # interrupts off? @@ -753,7 +764,7 @@ ret_from_intr: andl $SEGMENT_RPL_MASK, %eax #endif cmpl $USER_RPL, %eax - jb resume_kernel # not returning to v8086 or userspace + jb restore_all_kernel # not returning to v8086 or userspace
ENTRY(resume_userspace) DISABLE_INTERRUPTS(CLBR_ANY) @@ -763,19 +774,6 @@ ENTRY(resume_userspace) jmp restore_all END(ret_from_exception)
-#ifdef CONFIG_PREEMPT -ENTRY(resume_kernel) - DISABLE_INTERRUPTS(CLBR_ANY) -.Lneed_resched: - cmpl $0, PER_CPU_VAR(__preempt_count) - jnz restore_all_kernel - testl $X86_EFLAGS_IF, PT_EFLAGS(%esp) # interrupts off (exception path) ? - jz restore_all_kernel - call preempt_schedule_irq - jmp .Lneed_resched -END(resume_kernel) -#endif - GLOBAL(__begin_SYSENTER_singlestep_region) /* * All code from here through __end_SYSENTER_singlestep_region is subject @@ -1026,6 +1024,7 @@ restore_all: INTERRUPT_RETURN
restore_all_kernel: + RETINT_PREEMPT TRACE_IRQS_IRET PARANOID_EXIT_TO_KERNEL_MODE BUG_IF_WRONG_CR3 @@ -1476,6 +1475,94 @@ END(nmi)
ENTRY(int3) ASM_CLAC + +#ifdef CONFIG_VM86 + testl $X86_EFLAGS_VM, 8(%esp) + jnz .Lfrom_usermode_no_gap +#endif + testl $SEGMENT_RPL_MASK, 4(%esp) + jnz .Lfrom_usermode_no_gap + + /* + * Here from kernel mode; so the (exception) stack looks like: + * + * 12(esp) - <previous context> + * 8(esp) - flags + * 4(esp) - cs + * 0(esp) - ip + * + * Lets build a 5 entry IRET frame after that, such that struct pt_regs + * is complete and in particular regs->sp is correct. This gives us + * the original 3 enties as gap: + * + * 32(esp) - <previous context> + * 28(esp) - orig_flags / gap + * 24(esp) - orig_cs / gap + * 20(esp) - orig_ip / gap + * 16(esp) - ss + * 12(esp) - sp + * 8(esp) - flags + * 4(esp) - cs + * 0(esp) - ip + */ + pushl %ss # ss + pushl %esp # sp (points at ss) + pushl 4*4(%esp) # flags + pushl 4*4(%esp) # cs + pushl 4*4(%esp) # ip + + add $16, 12(%esp) # point sp back at the previous context + + pushl $-1 # orig_eax; mark as interrupt + + SAVE_ALL + ENCODE_FRAME_POINTER + TRACE_IRQS_OFF + xorl %edx, %edx # zero error code + movl %esp, %eax # pt_regs pointer + call do_int3 + + RETINT_PREEMPT + TRACE_IRQS_IRET + /* + * If we really never INT3 from entry code, it looks like + * we can skip this one. + PARANOID_EXIT_TO_KERNEL_MODE + */ + BUG_IF_WRONG_CR3 + RESTORE_REGS 4 # consume orig_eax + + /* + * Reconstruct the 3 entry IRET frame right after the (modified) + * regs->sp without lowering %esp in between, such that an NMI in the + * middle doesn't scribble our stack. + */ + + pushl %eax + pushl %ecx + movl 5*4(%esp), %eax # (modified) regs->sp + + movl 4*4(%esp), %ecx # flags + movl %ecx, -4(%eax) + + movl 3*4(%esp), %ecx # cs + andl $0x0000ffff, %ecx + movl %ecx, -8(%eax) + + movl 2*4(%esp), %ecx # ip + movl %ecx, -12(%eax) + + movl 1*4(%esp), %ecx # eax + movl %ecx, -16(%eax) + + popl %ecx + lea -16(%eax), %esp + popl %eax + + jmp .Lirq_return + +.Lfrom_usermode_no_gap: + pushl $-1 # mark this as an int
SAVE_ALL switch_stacks=1 diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 1f0efdb7b629..834ec1397dab 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -879,7 +879,7 @@ apicinterrupt IRQ_WORK_VECTOR irq_work_interrupt smp_irq_work_interrupt * @paranoid == 2 is special: the stub will never switch stacks. This is for * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS. */ -.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 +.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 create_gap=0 ENTRY(\sym) UNWIND_HINT_IRET_REGS offset=\has_error_code*8
@@ -899,6 +899,16 @@ ENTRY(\sym) jnz .Lfrom_usermode_switch_stack_@ .endif
+ .if \create_gap == 1 + testb $3, CS-ORIG_RAX(%rsp) + jnz .Lfrom_usermode_no_gap_@ + .rept 6 + pushq 5*8(%rsp) + .endr + UNWIND_HINT_IRET_REGS offset=8 +.Lfrom_usermode_no_gap_@: + .endif + .if \paranoid call paranoid_entry .else @@ -1130,7 +1140,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \ #endif /* CONFIG_HYPERV */
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK -idtentry int3 do_int3 has_error_code=0 +idtentry int3 do_int3 has_error_code=0 create_gap=1 idtentry stack_segment do_stack_segment has_error_code=1
#ifdef CONFIG_XEN_PV diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h index e85ff65c43c3..ba275b6292db 100644 --- a/arch/x86/include/asm/text-patching.h +++ b/arch/x86/include/asm/text-patching.h @@ -39,4 +39,24 @@ extern int poke_int3_handler(struct pt_regs *regs); extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler); extern int after_bootmem;
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val) +{ + regs->sp -= sizeof(unsigned long); + *(unsigned long *)regs->sp = val; +} + +static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip) +{ + regs->ip = ip; +} + +#define INT3_INSN_SIZE 1 +#define CALL_INSN_SIZE 5 + +static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func) +{ + int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE); + int3_emulate_jmp(regs, func); +} + #endif /* _ASM_X86_TEXT_PATCHING_H */ diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 4b8ee05dd6ad..600ead178bf4 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -171,8 +171,12 @@ unsigned long kernel_stack_pointer(struct pt_regs *regs) unsigned long sp = (unsigned long)®s->sp; u32 *prev_esp;
- if (context == (sp & ~(THREAD_SIZE - 1))) + if (context == (sp & ~(THREAD_SIZE - 1))) { + /* int3 code adds a gap */ + if (sp == regs->sp - 5*4) + return regs->sp; return sp; + }
prev_esp = (u32 *)(context); if (*prev_esp)
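To illustrate how the int3_emulate_*() helpers added above are meant to be used, here is a hedged sketch modelled on the handlers in this series; is_my_patched_site() and my_target() are made-up placeholders, not functions from the patch:

/* Hypothetical int3 handler that turns the trapped breakpoint into a call. */
static int example_int3_handler(struct pt_regs *regs)
{
	unsigned long ip = regs->ip - INT3_INSN_SIZE;	/* address of the INT3 */

	if (!is_my_patched_site(ip))			/* placeholder check */
		return 0;

	/* behave as if the original 5-byte CALL at 'ip' had executed */
	int3_emulate_call(regs, (unsigned long)my_target);
	return 1;
}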
On Thu, May 02, 2019 at 07:50:52PM -0400, Steven Rostedt wrote:
On Thu, 2 May 2019 19:31:29 -0400 Steven Rostedt rostedt@goodmis.org wrote:
Digging a little further, I pinpointed it out to being kretprobes. The problem I believe is the use of kernel_stack_pointer() which does some magic on x86_32. kretprobes uses this to hijack the return address of the function (much like the function graph tracer does). I do have code that would allow kretprobes to use the function graph tracer instead, but that's still in progress (almost done!). But still, we should not have this break the use of kernel_stack_pointer() either.
Adding some printks in that code, it looks to be returning "&regs->sp" which I think we changed.
This appears to fix it!
-- Steve
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 4b8ee05dd6ad..600ead178bf4 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -171,8 +171,12 @@ unsigned long kernel_stack_pointer(struct pt_regs *regs) unsigned long sp = (unsigned long)®s->sp; u32 *prev_esp;
- if (context == (sp & ~(THREAD_SIZE - 1)))
- if (context == (sp & ~(THREAD_SIZE - 1))) {
/* int3 code adds a gap */
if (sp == regs->sp - 5*4)
return sp;return regs->sp;
- }
prev_esp = (u32 *)(context); if (*prev_esp)
OMG, WTF, ARGH... That code is fsck'ing horrible. I'd almost argue to always do the INT3 thing, just to avoid games like that.
That said; for normal traps &regs->sp is indeed the previous context -- if it doesn't fall off the stack. Your hack detects the regular INT3 frame. However if regs->sp has been modified (int3_emulate_push, for example) your detectoring comes unstuck.
Now, it is rather unlikely these two code paths interact, but just to be safe, something like so might be more reliable:
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 4b8ee05dd6ad..aceaad0cc9a9 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -163,6 +163,9 @@ static inline bool invalid_selector(u16 value) * stack pointer we fall back to regs as stack if no previous stack * exists. * + * There is a special case for INT3, there we construct a full pt_regs + * environment. We can detect this case by a high bit in regs->cs + * * This is valid only for kernel mode traps. */ unsigned long kernel_stack_pointer(struct pt_regs *regs) @@ -171,6 +174,9 @@ unsigned long kernel_stack_pointer(struct pt_regs *regs) unsigned long sp = (unsigned long)®s->sp; u32 *prev_esp;
+ if (regs->__csh & (1 << 13)) /* test CS_FROM_INT3 */ + return regs->sp; + if (context == (sp & ~(THREAD_SIZE - 1))) return sp;
--- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -388,6 +388,7 @@
#define CS_FROM_ENTRY_STACK (1 << 31) #define CS_FROM_USER_CR3 (1 << 30) +#define CS_FROM_INT3 (1 << 29)
.macro SWITCH_TO_KERNEL_STACK
@@ -1515,6 +1516,9 @@ ENTRY(int3)
add $16, 12(%esp) # point sp back at the previous context
+ andl $0x0000ffff, 4(%esp) + orl $CS_FROM_INT3, 4(%esp) + pushl $-1 # orig_eax; mark as interrupt
SAVE_ALL
On Fri, 3 May 2019 11:29:59 +0200 Peter Zijlstra peterz@infradead.org wrote:
OMG, WTF, ARGH... That code is fsck'ing horrible. I'd almost argue to always do the INT3 thing, just to avoid games like that.
Hehe, that's almost the exact same thoughts I had when seeing this code ;-)
That said; for normal traps &regs->sp is indeed the previous context -- if it doesn't fall off the stack. Your hack detects the regular INT3 frame. However if regs->sp has been modified (int3_emulate_push, for example) your detectoring comes unstuck.
Yep. I realized the issue as well. But wanted to make sure this did work when sp wasn't changed.
Now, it is rather unlikely these two code paths interact, but just to be safe, something like so might be more reliable:
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 4b8ee05dd6ad..aceaad0cc9a9 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -163,6 +163,9 @@ static inline bool invalid_selector(u16 value)
 * stack pointer we fall back to regs as stack if no previous stack
 * exists.
 *
 * There is a special case for INT3, there we construct a full pt_regs
 * environment. We can detect this case by a high bit in regs->cs
 *
 * This is valid only for kernel mode traps.
 */
unsigned long kernel_stack_pointer(struct pt_regs *regs)
@@ -171,6 +174,9 @@ unsigned long kernel_stack_pointer(struct pt_regs *regs)
	unsigned long sp = (unsigned long)&regs->sp;
	u32 *prev_esp;

+	if (regs->__csh & (1 << 13)) /* test CS_FROM_INT3 */
+		return regs->sp;
Thanks, I was looking into doing something like this (setting a flag in the int3 code), but didn't have the time to see the best way to do this.
I'll add this version of the code and run it through my tests.
-- Steve
if (context == (sp & ~(THREAD_SIZE - 1))) return sp; --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -388,6 +388,7 @@ #define CS_FROM_ENTRY_STACK (1 << 31) #define CS_FROM_USER_CR3 (1 << 30) +#define CS_FROM_INT3 (1 << 29) .macro SWITCH_TO_KERNEL_STACK @@ -1515,6 +1516,9 @@ ENTRY(int3) add $16, 12(%esp) # point sp back at the previous context
+	andl	$0x0000ffff, 4(%esp)
+	orl	$CS_FROM_INT3, 4(%esp)
+
	pushl	$-1				# orig_eax; mark as interrupt
SAVE_ALL
On May 3, 2019, at 6:22 AM, Steven Rostedt rostedt@goodmis.org wrote:
On Fri, 3 May 2019 11:29:59 +0200 Peter Zijlstra peterz@infradead.org wrote:
OMG, WTF, ARGH... That code is fsck'ing horrible. I'd almost argue to always do the INT3 thing, just to avoid games like that.
Hehe, that's almost the exact same thoughts I had when seeing this code ;-)
That said; for normal traps &regs->sp is indeed the previous context -- if it doesn't fall off the stack. Your hack detects the regular INT3 frame. However if regs->sp has been modified (int3_emulate_push, for example) your detectoring comes unstuck.
Yep. I realized the issue as well. But wanted to make sure this did work when sp wasn't changed.
Now, it is rather unlikely these two code paths interact, but just to be safe, something like so might be more reliable:
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 4b8ee05dd6ad..aceaad0cc9a9 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -163,6 +163,9 @@ static inline bool invalid_selector(u16 value)
 * stack pointer we fall back to regs as stack if no previous stack
 * exists.
 *
 * There is a special case for INT3, there we construct a full pt_regs
 * environment. We can detect this case by a high bit in regs->cs
 *
 * This is valid only for kernel mode traps.
 */
unsigned long kernel_stack_pointer(struct pt_regs *regs)
@@ -171,6 +174,9 @@ unsigned long kernel_stack_pointer(struct pt_regs *regs)
	unsigned long sp = (unsigned long)&regs->sp;
	u32 *prev_esp;

+	if (regs->__csh & (1 << 13)) /* test CS_FROM_INT3 */
+		return regs->sp;
Thanks, I was looking into doing something like this (setting a flag in the int3 code), but didn't have the time to see the best way to do this.
I'll add this version of the code and run it through my tests.
-- Steve
if (context == (sp & ~(THREAD_SIZE - 1))) return sp;
--- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -388,6 +388,7 @@
#define CS_FROM_ENTRY_STACK (1 << 31) #define CS_FROM_USER_CR3 (1 << 30) +#define CS_FROM_INT3 (1 << 29)
.macro SWITCH_TO_KERNEL_STACK
@@ -1515,6 +1516,9 @@ ENTRY(int3)
add $16, 12(%esp) # point sp back at the previous context
+	andl	$0x0000ffff, 4(%esp)
+	orl	$CS_FROM_INT3, 4(%esp)
+
	pushl	$-1				# orig_eax; mark as interrupt
SAVE_ALL
So here’s a somewhat nutty suggestion: how about we tweak the 32-bit entry code to emulate the sane 64-bit frame, not just for int3 but always? Basically, the entry asm for entries from kernel mode would do, roughly:
	push $0			;dummy for call emulation
	push %ss
	push $0			;a dummy for ESP
	push 3*4(%esp)		;EFLAGS
	push 3*4(%esp)		;CS
	push 3*4(%esp)		;EIP
	push %rax
	lea 7*4(%esp), %rax
	mov %rax, 4*4(%esp)	;ESP
And the exit asm would do a little dance to write EFLAGS, CS, and EIP to the right spot, then load ESP-3*4 into %esp and do IRET.
Now the annoying kernel_stack_pointer() hack can just go away, since regs->sp is always correct!
I probably screwed up some arithmetic there, but it’s the idea that counts :)
On Fri, 3 May 2019 09:20:55 -0700 Andy Lutomirski luto@amacapital.net wrote:
So here’s a somewhat nutty suggestion: how about we tweak the 32-bit entry code to emulate the sane 64-bit frame, not just for int3 but always? Basically, the entry asm for entries from kernel mode would do, roughly:
	push $0			;dummy for call emulation
	push %ss
	push $0			;a dummy for ESP
	push 3*4(%esp)		;EFLAGS
	push 3*4(%esp)		;CS
	push 3*4(%esp)		;EIP
	push %rax
	lea 7*4(%esp), %rax
	mov %rax, 4*4(%esp)	;ESP
And the exit asm would do a little dance to write EFLAGS, CS, and EIP to the right spot, then load ESP-3*4 into %esp and do IRET.
Now the annoying kernel_stack_pointer() hack can just go away, since regs->sp is always correct!
I probably screwed up some arithmetic there, but it’s the idea that counts :)
Yeah, as it will end up with:
	$0	; dummy for call emulation
	%ss
	$0	; dummy for ESP
	EIP
	$0
	%ss
	$0
As 3 only gets you over what you already pushed. I think 5*4 is what you want.
I guess the real question is, what's the performance impact of doing that? Although, this is only needed for kernel -> kernel exceptions, which are hopefully a rarity.
-- Steve
On Fri, May 3, 2019 at 9:35 AM Peter Zijlstra peterz@infradead.org wrote:
On Fri, May 03, 2019 at 12:31:26PM -0400, Steven Rostedt wrote:
I guess the real question is, what's the performance impact of doing that?
Is there anyone that considers i386 a performance platform?
Not me. As far as I'm concerned, I will basically always gladly trade several cycles for simplicity on 32-bit.
--Andy
On Fri, 3 May 2019 09:44:35 -0700 Andy Lutomirski luto@kernel.org wrote:
On Fri, May 3, 2019 at 9:35 AM Peter Zijlstra peterz@infradead.org wrote:
On Fri, May 03, 2019 at 12:31:26PM -0400, Steven Rostedt wrote:
I guess the real question is, what's the performance impact of doing that?
Is there anyone that considers i386 a performance platform?
Not me. As far as I'm concerned, I will basically always gladly trade several cycles for simplicity on 32-bit.
So should I wait for a new PATCH 1 from Peter to implement it this way? I'd like to get this into the next merge window (although, I am marking it for stable since it fixes a bug for live kernel patching).
Linus, what's your thoughts?
-- Steve
On Fri, May 03, 2019 at 09:20:55AM -0700, Andy Lutomirski wrote:
On May 3, 2019, at 6:22 AM, Steven Rostedt rostedt@goodmis.org wrote:
On Fri, 3 May 2019 11:29:59 +0200 Peter Zijlstra peterz@infradead.org wrote:
OMG, WTF, ARGH... That code is fsck'ing horrible. I'd almost argue to always do the INT3 thing, just to avoid games like that.
So here’s a somewhat nutty suggestion: how about we tweak the 32-bit entry code to emulate the sane 64-bit frame, not just for int3 but always?
Basically what I wrote above, right? I'd be fine with that... Linus?
On Fri, May 3, 2019 at 9:21 AM Andy Lutomirski luto@amacapital.net wrote:
So here’s a somewhat nutty suggestion: how about we tweak the 32-bit entry code to emulate the sane 64-bit frame, not just for int3 but always?
What would the code actually end up looking like? I don't necessarily object, since that kernel_stack_pointer() thing certainly looks horrible, but honestly, my suggestion to just pass in the 'struct pt_regs' and let the call emulation fix it up would have also worked, and avoided that bug (and who knows what else might be hiding).
I really think that you're now hitting all the special case magic low-level crap that I wanted to avoid.
Linus
On Fri, May 03, 2019 at 11:57:22AM -0700, Linus Torvalds wrote:
On Fri, May 3, 2019 at 9:21 AM Andy Lutomirski luto@amacapital.net wrote:
So here’s a somewhat nutty suggestion: how about we tweak the 32-bit entry code to emulate the sane 64-bit frame, not just for int3 but always?
What would the code actually end up looking like? I don't necessarily object, since that kernel_stack_pointer() thing certainly looks horrible, but honestly, my suggestion to just pass in the 'struct pt_regs' and let the call emulation fix it up would have also worked, and avoided that bug (and who knows what else might be hiding).
I really think that you're now hitting all the special case magic low-level crap that I wanted to avoid.
This did actually boot on first try; so there must be something horribly wrong...
Now, I know you like that other approach; but I figured I should at least show you what this one looks like. Maybe I've been staring at entry_32.S too much, but I really don't dislike this.
--- arch/x86/entry/entry_32.S | 150 +++++++++++++++++++++++++++++------ arch/x86/entry/entry_64.S | 14 +++- arch/x86/include/asm/ptrace.h | 4 - arch/x86/include/asm/text-patching.h | 20 +++++ arch/x86/kernel/alternative.c | 81 ++++++++++++++++++- arch/x86/kernel/ptrace.c | 29 ------- 6 files changed, 235 insertions(+), 63 deletions(-)
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index 7b23431be5cb..d29cf03219c5 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -67,9 +67,20 @@ # define preempt_stop(clobbers) DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF #else # define preempt_stop(clobbers) -# define resume_kernel restore_all_kernel #endif
+.macro RETINT_PREEMPT +#ifdef CONFIG_PREEMPT + DISABLE_INTERRUPTS(CLBR_ANY) + cmpl $0, PER_CPU_VAR(__preempt_count) + jnz .Lend_@ + testl $X86_EFLAGS_IF, PT_EFLAGS(%esp) # interrupts off (exception path) ? + jz .Lend_@ + call preempt_schedule_irq +.Lend_@: +#endif +.endm + .macro TRACE_IRQS_IRET #ifdef CONFIG_TRACE_IRQFLAGS testl $X86_EFLAGS_IF, PT_EFLAGS(%esp) # interrupts off? @@ -203,8 +214,119 @@ .Lend_@: .endm
+#define CS_FROM_ENTRY_STACK (1 << 31) +#define CS_FROM_USER_CR3 (1 << 30) +#define CS_FROM_KERNEL (1 << 29) + +.macro FIXUP_FRAME + +#ifdef CONFIG_VM86 + testl $X86_EFLAGS_VM, 3*4(%esp) + jnz .Lfrom_usermode_no_fixup_@ +#endif + testl $SEGMENT_RPL_MASK, 2*4(%esp) + jnz .Lfrom_usermode_no_fixup_@ + + orl $CS_FROM_KERNEL, 2*4(%esp) + + /* + * When we're here from kernel mode; the (exception) stack looks like: + * + * 4*4(%esp) - <previous context> + * 3*4(%esp) - flags + * 2*4(%esp) - cs + * 1*4(%esp) - ip + * 0*4(%esp) - orig_eax + * + * Lets build a 5 entry IRET frame after that, such that struct pt_regs + * is complete and in particular regs->sp is correct. This gives us + * the original 4 enties as gap: + * + * 10*4(%esp) - <previous context> + * 9*4(%esp) - gap / flags + * 8*4(%esp) - gap / cs + * 7*4(%esp) - gap / ip + * 6*4(%esp) - gap / orig_eax + * 5*4(%esp) - ss + * 4*4(%esp) - sp + * 3*4(%esp) - flags + * 2*4(%esp) - cs + * 1*4(%esp) - ip + * 0*4(%esp) - orig_eax + */ + + pushl %ss # ss + pushl %esp # sp (points at ss) + add $5*4, (%esp) # point sp back at the previous context + pushl 5*4(%esp) # flags + pushl 5*4(%esp) # cs + pushl 5*4(%esp) # ip + pushl 5*4(%esp) # orig_eax + +.Lfrom_usermode_no_fixup_@: +.endm + +.macro IRET_FRAME + + /* orig_eax is already POP'ed when we're here */ + + testl $CS_FROM_KERNEL, 1*4(%esp) + jz .Lfinished_frame_@ + + pushl %eax + + lea 10*4(%esp), %eax # address of <previous context> + cmpl %eax, 4*4(%esp) # if ->sp is unmodified + jnz .Lmodified_sp_do_fixup_@ + + /* + * Fast path; regs->sp wasn't modified, reuse the original IRET frame. + */ + pop %eax + add $6*4, %esp + jmp .Lfinished_frame_@; + +.Lmodified_sp_do_fixup_@: + + /* + * Reconstruct the 3 entry IRET frame right after the (modified) + * regs->sp without lowering %esp in between, such that an NMI in the + * middle doesn't scribble our stack. + */ + pushl %ecx + movl 5*4(%esp), %eax # (modified) regs->sp + + movl 4*4(%esp), %ecx # flags + movl %ecx, -4(%eax) + + movl 3*4(%esp), %ecx # cs + andl $0x0000ffff, %ecx + movl %ecx, -8(%eax) + + movl 2*4(%esp), %ecx # ip + movl %ecx, -12(%eax) + + movl 1*4(%esp), %ecx # eax + movl %ecx, -16(%eax) + + popl %ecx + lea -16(%eax), %esp + popl %eax + +.Lfinished_frame_@: +.endm + .macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0 + cld + /* + * The high bits of the CS dword (__csh) are used for CS_FROM_*. + * Clear them in case hardware didn't do this for us. + */ + andl $(0x0000ffff), 2*4(%esp) + + FIXUP_FRAME + PUSH_GS pushl %fs pushl %es @@ -375,9 +497,6 @@ * switch to it before we do any copying. */
-#define CS_FROM_ENTRY_STACK (1 << 31) -#define CS_FROM_USER_CR3 (1 << 30) - .macro SWITCH_TO_KERNEL_STACK
ALTERNATIVE "", "jmp .Lend_@", X86_FEATURE_XENPV @@ -391,13 +510,6 @@ * that register for the time this macro runs */
- /* - * The high bits of the CS dword (__csh) are used for - * CS_FROM_ENTRY_STACK and CS_FROM_USER_CR3. Clear them in case - * hardware didn't do this for us. - */ - andl $(0x0000ffff), PT_CS(%esp) - /* Are we on the entry stack? Bail out if not! */ movl PER_CPU_VAR(cpu_entry_area), %ecx addl $CPU_ENTRY_AREA_entry_stack + SIZEOF_entry_stack, %ecx @@ -755,7 +867,7 @@ END(ret_from_fork) andl $SEGMENT_RPL_MASK, %eax #endif cmpl $USER_RPL, %eax - jb resume_kernel # not returning to v8086 or userspace + jb restore_all_kernel # not returning to v8086 or userspace
ENTRY(resume_userspace) DISABLE_INTERRUPTS(CLBR_ANY) @@ -765,18 +877,6 @@ ENTRY(resume_userspace) jmp restore_all END(ret_from_exception)
-#ifdef CONFIG_PREEMPT -ENTRY(resume_kernel) - DISABLE_INTERRUPTS(CLBR_ANY) - cmpl $0, PER_CPU_VAR(__preempt_count) - jnz restore_all_kernel - testl $X86_EFLAGS_IF, PT_EFLAGS(%esp) # interrupts off (exception path) ? - jz restore_all_kernel - call preempt_schedule_irq - jmp restore_all_kernel -END(resume_kernel) -#endif - GLOBAL(__begin_SYSENTER_singlestep_region) /* * All code from here through __end_SYSENTER_singlestep_region is subject @@ -1019,6 +1119,7 @@ ENTRY(entry_INT80_32) /* Restore user state */ RESTORE_REGS pop=4 # skip orig_eax/error_code .Lirq_return: + IRET_FRAME /* * ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization * when returning from IPI handler and when returning from @@ -1027,6 +1128,7 @@ ENTRY(entry_INT80_32) INTERRUPT_RETURN
restore_all_kernel: + RETINT_PREEMPT TRACE_IRQS_IRET PARANOID_EXIT_TO_KERNEL_MODE BUG_IF_WRONG_CR3 diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 20e45d9b4e15..268cd9affe04 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -878,7 +878,7 @@ apicinterrupt IRQ_WORK_VECTOR irq_work_interrupt smp_irq_work_interrupt * @paranoid == 2 is special: the stub will never switch stacks. This is for * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS. */ -.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0 +.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0 create_gap=0 ENTRY(\sym) UNWIND_HINT_IRET_REGS offset=\has_error_code*8
@@ -898,6 +898,16 @@ ENTRY(\sym) jnz .Lfrom_usermode_switch_stack_@ .endif
+ .if \create_gap == 1 + testb $3, CS-ORIG_RAX(%rsp) + jnz .Lfrom_usermode_no_gap_@ + .rept 6 + pushq 5*8(%rsp) + .endr + UNWIND_HINT_IRET_REGS offset=8 +.Lfrom_usermode_no_gap_@: + .endif + .if \paranoid call paranoid_entry .else @@ -1129,7 +1139,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \ #endif /* CONFIG_HYPERV */
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=IST_INDEX_DB ist_offset=DB_STACK_OFFSET -idtentry int3 do_int3 has_error_code=0 +idtentry int3 do_int3 has_error_code=0 create_gap=1 idtentry stack_segment do_stack_segment has_error_code=1
#ifdef CONFIG_XEN_PV diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h index 8a7fc0cca2d1..5ff42dc8b396 100644 --- a/arch/x86/include/asm/ptrace.h +++ b/arch/x86/include/asm/ptrace.h @@ -166,14 +166,10 @@ static inline bool user_64bit_mode(struct pt_regs *regs) #define compat_user_stack_pointer() current_pt_regs()->sp #endif
-#ifdef CONFIG_X86_32 -extern unsigned long kernel_stack_pointer(struct pt_regs *regs); -#else static inline unsigned long kernel_stack_pointer(struct pt_regs *regs) { return regs->sp; } -#endif
#define GET_IP(regs) ((regs)->ip) #define GET_FP(regs) ((regs)->bp) diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h index c90678fd391a..6aac6abf931e 100644 --- a/arch/x86/include/asm/text-patching.h +++ b/arch/x86/include/asm/text-patching.h @@ -42,4 +42,24 @@ extern int after_bootmem; extern __ro_after_init struct mm_struct *poking_mm; extern __ro_after_init unsigned long poking_addr;
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val) +{ + regs->sp -= sizeof(unsigned long); + *(unsigned long *)regs->sp = val; +} + +static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip) +{ + regs->ip = ip; +} + +#define INT3_INSN_SIZE 1 +#define CALL_INSN_SIZE 5 + +static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func) +{ + int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE); + int3_emulate_jmp(regs, func); +} + #endif /* _ASM_X86_TEXT_PATCHING_H */ diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 4db9c0d29bc1..9519bc553d6d 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -613,11 +613,83 @@ extern struct paravirt_patch_site __start_parainstructions[], __stop_parainstructions[]; #endif /* CONFIG_PARAVIRT */
+/* + * Self-test for the INT3 based CALL emulation code. + * + * This exercises int3_emulate_call() to make sure INT3 pt_regs are set up + * properly and that there is a stack gap between the INT3 frame and the + * previous context. Without this gap doing a virtual PUSH on the interrupted + * stack would corrupt the INT3 IRET frame. + * + * See entry_{32,64}.S for more details. + */ +static void __init int3_magic(unsigned int *ptr) +{ + *ptr = 1; +} + +extern __initdata unsigned long int3_selftest_ip; /* defined in asm below */ + +static int __init +int3_exception_notify(struct notifier_block *self, unsigned long val, void *data) +{ + struct die_args *args = data; + struct pt_regs *regs = args->regs; + + if (!regs || user_mode(regs)) + return NOTIFY_DONE; + + if (val != DIE_INT3) + return NOTIFY_DONE; + + if (regs->ip - INT3_INSN_SIZE != int3_selftest_ip) + return NOTIFY_DONE; + + int3_emulate_call(regs, (unsigned long)&int3_magic); + return NOTIFY_STOP; +} + +static void __init int3_selftest(void) +{ + static __initdata struct notifier_block int3_exception_nb = { + .notifier_call = int3_exception_notify, + .priority = INT_MAX-1, /* last */ + }; + unsigned int val = 0; + + BUG_ON(register_die_notifier(&int3_exception_nb)); + + /* + * Basically: int3_magic(&val); but really complicated :-) + * + * Stick the address of the INT3 instruction into int3_selftest_ip, + * then trigger the INT3, padded with NOPs to match a CALL instruction + * length. + */ + asm volatile ("1: int3; nop; nop; nop; nop\n\t" + ".pushsection .init.data,"aw"\n\t" + ".align " __ASM_SEL(4, 8) "\n\t" + ".type int3_selftest_ip, @object\n\t" + ".size int3_selftest_ip, " __ASM_SEL(4, 8) "\n\t" + "int3_selftest_ip:\n\t" + __ASM_SEL(.long, .quad) " 1b\n\t" + ".popsection\n\t" + : : __ASM_SEL_RAW(a, D) (&val) : "memory"); + + BUG_ON(val != 1); + + unregister_die_notifier(&int3_exception_nb); +} + void __init alternative_instructions(void) { - /* The patching is not fully atomic, so try to avoid local interruptions - that might execute the to be patched code. - Other CPUs are not running. */ + int3_selftest(); + + /* + * The patching is not fully atomic, so try to avoid local + * interruptions that might execute the to be patched code. + * Other CPUs are not running. + */ stop_nmi();
/* @@ -642,10 +714,11 @@ void __init alternative_instructions(void) _text, _etext); }
- if (!uniproc_patched || num_possible_cpus() == 1) + if (!uniproc_patched || num_possible_cpus() == 1) { free_init_pages("SMP alternatives", (unsigned long)__smp_locks, (unsigned long)__smp_locks_end); + } #endif
apply_paravirt(__parainstructions, __parainstructions_end); diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 4b8ee05dd6ad..d13f892d2c47 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -153,35 +153,6 @@ static inline bool invalid_selector(u16 value)
#define FLAG_MASK FLAG_MASK_32
-/* - * X86_32 CPUs don't save ss and esp if the CPU is already in kernel mode - * when it traps. The previous stack will be directly underneath the saved - * registers, and 'sp/ss' won't even have been saved. Thus the '®s->sp'. - * - * Now, if the stack is empty, '®s->sp' is out of range. In this - * case we try to take the previous stack. To always return a non-null - * stack pointer we fall back to regs as stack if no previous stack - * exists. - * - * This is valid only for kernel mode traps. - */ -unsigned long kernel_stack_pointer(struct pt_regs *regs) -{ - unsigned long context = (unsigned long)regs & ~(THREAD_SIZE - 1); - unsigned long sp = (unsigned long)®s->sp; - u32 *prev_esp; - - if (context == (sp & ~(THREAD_SIZE - 1))) - return sp; - - prev_esp = (u32 *)(context); - if (*prev_esp) - return (unsigned long)*prev_esp; - - return (unsigned long)regs; -} -EXPORT_SYMBOL_GPL(kernel_stack_pointer); - static unsigned long *pt_regs_access(struct pt_regs *regs, unsigned long regno) { BUILD_BUG_ON(offsetof(struct pt_regs, bx) != 0);
On Mon, 6 May 2019 10:19:51 +0200 Peter Zijlstra peterz@infradead.org wrote:
On Fri, May 03, 2019 at 11:57:22AM -0700, Linus Torvalds wrote:
On Fri, May 3, 2019 at 9:21 AM Andy Lutomirski luto@amacapital.net wrote:
So here’s a somewhat nutty suggestion: how about we tweak the 32-bit entry code to emulate the sane 64-bit frame, not just for int3 but always?
What would the code actually end up looking like? I don't necessarily object, since that kernel_stack_pointer() thing certainly looks horrible, but honestly, my suggestion to just pass in the 'struct pt_regs' and let the call emulation fix it up would have also worked, and avoided that bug (and who knows what else might be hiding).
I really think that you're now hitting all the special case magic low-level crap that I wanted to avoid.
This did actually boot on first try; so there must be something horribly wrong...
Now, I know you like that other approach; but I figured I should at least show you what this one looks like. Maybe I've been staring at entry_32.S too much, but I really don't dislike this.
I can test this too. I was hoping to get this in by this merge window. I spent 3 hours yesterday trying to get Linus's version working on i386 with no success. Not sure how much time Linus will have to look at this, as he just opened the merge window.
Again, I think Peter's solution here is the more elegant one. But as long as we get *a* solution, I'll be happy. And my time to work on it has pretty much already been depleted.
-- Steve
On Mon, May 6, 2019 at 6:56 AM Steven Rostedt rostedt@goodmis.org wrote:
I can test this too. I was hoping to get this in by this merge window. I spent 3 hours yesterday trying to get Linus's version working on i386 with no success. Not sure how much time Linus will have to look at this, as he just opened the merge window.
I actually just tested it in qemu, and it worked fine.
Ok, so my test was admittedly really simple and stupid, in that all I did was
printk("Before int3\n"); asm("int3; nop; nop; nop; nop" : : :"ax","dx","cx"); printk("After int3\n");
and then I hacked up do_kernel_int3() to just unconditionally do
return int3_emulate_call(regs, (unsigned long)test_int3);
with a stupid test-function:
void test_int3(void) { printk("In int3 handler"); }
instead of anything fancy.
But it did exactly the expected thing, and resulted in
	Before int3
	In int3 handler
	After int3
on the screen.
So what is it that doesn't actually work? I've looked at the patch even more, and I can't for the life of me see how it wouldn't work.
Of course, I didn't test any of the actual ftrace parts, since I short-circuited them intentionally with the above test function hack. I have no idea what the semantics for those ftrace_location(ip)/is_ftrace_caller(ip) cases are supposed to be, I only tested that yes, the infrastructure clearly emulates a call instruction.
Linus
On Mon, May 6, 2019 at 9:17 AM Linus Torvalds torvalds@linux-foundation.org wrote:
So what is it that doesn't actually work? I've looked at the patch even more, and I can't for the life of me see how it wouldn't work.
And I do still react to PeterZ's
arch/x86/entry/entry_32.S | 150 +++++++++++++++++++++++++++++------
vs
arch/x86/entry/entry_32.S | 7 ++++++-
for mine. And PeterZ's apparently still had bugs.
Yes, mine can have bugs too, but I really *have* done some basic testing, and with 6 lines of code, I think they are more localized.
Linus
On Mon, 6 May 2019 09:17:19 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
So what is it that doesn't actually work? I've looked at the patch even more, and I can't for the life of me see how it wouldn't work.
Of course, I didn't test any of the actual ftrace parts, since I short-circuited them intentionally with the above test function hack. I have no idea what the semantics for those ftrace_location(ip)/is_ftrace_caller(ip) cases are supposed to be, I only tested that yes, the infrastructure clearly emulates a call instruction.
Can you try booting with:
CONFIG_KPROBE_EVENTS=y
CONFIG_UPROBE_EVENTS=y
CONFIG_DYNAMIC_EVENTS=y
CONFIG_PROBE_EVENTS=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_FUNCTION_PROFILER=y
CONFIG_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE_SELFTEST=y
CONFIG_FTRACE_STARTUP_TEST=y
CONFIG_TRACEPOINT_BENCHMARK=y
CONFIG_RING_BUFFER_BENCHMARK=y
CONFIG_RING_BUFFER_STARTUP_TEST=y
And see if it boots?
-- Steve
On Mon, May 6, 2019 at 10:06 AM Steven Rostedt rostedt@goodmis.org wrote:
Can you try booting with: [ snip snip ]
And see if it boots?
No it doesn't.
Dang, I tried to figure out what's up, but now I really have to start handling all the pull requests..
I thought it might be an int3 that happens on the entry stack, but I don't think that should ever happen. Either it's a user-mode int3, or we're in the kernel and have switched stacks. So I still don't see why my patch doesn't work, but now I have no time to debug it.
Linus
On Mon, 6 May 2019 11:06:51 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
On Mon, May 6, 2019 at 10:06 AM Steven Rostedt rostedt@goodmis.org wrote:
Can you try booting with: [ snip snip ]
And see if it boots?
No it doesn't.
Dang, I tried to figure out what's up, but now I really have to start handling all the pull requests..
I thought it might be an int3 that happens on the entry stack, but I don't think that should ever happen. Either it's a user-mode int3, or we're in the kernel and have switched stacks. So I still don't see why my patch doesn't work, but now I have no time to debug it.
You should have waited another week to open that merge window ;-)
-- Steve
On Mon, May 6, 2019 at 11:57 AM Steven Rostedt rostedt@goodmis.org wrote:
You should have waited another week to open that merge window ;-)
Hmm. I'm looking at it while the test builds happen, and since I don't see what's wrong in the low-level entry code, I'm looking at the ftrace code instead.
What's going on here?
*pregs = int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
that line makes no sense. Why would we emulate a call to ftrace_regs_caller()? That function sets up a pt_regs, and then calls ftrace_stub().
But we *have* pt_regs here already with the right values. Why isn't this just a direct call to ftrace_stub() from within the int3 handler?
And the thing is, calling ftrace_regs_caller() looks wrong, because that's not what happens for *real* mcount handling, which uses that "create_trampoline()" to create the thing we're supposed to really use?
Anyway, I simply don't understand the code, so I'm confused. But why is the int3 emulation creating a call that doesn't seem to match what the instruction that we're actually rewriting is supposed to do?
IOW, it looks to me like ftrace_int3_handler() is actually emulating something different than what ftrace_modify_code() is actually modifying the code to do!
Since the only caller of ftrace_modify_code() is update_ftrace_func(), why is that function not just saving the target and we'd emulate the call to that? Using anything else looks crazy?
But as mentioned, I just don't understand the ftrace logic. It looks insane to me, and much more likely to be buggy than the very simple entry code.
Linus
On Mon, 6 May 2019 12:46:11 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
On Mon, May 6, 2019 at 11:57 AM Steven Rostedt rostedt@goodmis.org wrote:
You should have waited another week to open that merge window ;-)
Hmm. I'm looking at it while the test builds happen, and since I don't see what's wrong in the low-level entry code, I'm looking at the ftrace code instead.
What's going on here?
	*pregs = int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
that line makes no sense. Why would we emulate a call to ftrace_regs_caller()? That function sets up a pt_regs, and then calls ftrace_stub().
Because that call to ftrace_stub is also dynamic.
In entry_32.S
.globl ftrace_call
ftrace_call:
	call ftrace_stub
update_ftrace_func() {
	[..]
	} else if (is_ftrace_caller(ip)) {
		if (!ftrace_update_func_call) {
			int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
			return 1;
		}
		*pregs = int3_emulate_call(regs, ftrace_update_func_call);
		return 1;
	}
Part of the code will change it to call the function needed directly.
struct ftrace_ops my_ops = { .func = my_handler };
register_ftrace_function(&my_ops);
Will change "call ftrace_stub" into "call my_handler"
If you register another ftrace_ops, it will change that to
call ftrace_ops_list_func
Which will iterate over all registered ftrace_ops, and depending on the hashs in ftrace_ops, will call the registered handler for them.
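For reference, a minimal sketch of that registration path (illustrative names only; the four-argument callback signature shown is the one ftrace used at the time):

	#include <linux/ftrace.h>

	/* runs for every traced function whose hash matches my_ops */
	static void my_handler(unsigned long ip, unsigned long parent_ip,
			       struct ftrace_ops *op, struct pt_regs *regs)
	{
		/* do something with ip / parent_ip */
	}

	static struct ftrace_ops my_ops = {
		.func = my_handler,
	};

	static int __init my_tracer_init(void)
	{
		return register_ftrace_function(&my_ops);
	}

With only my_ops registered the call site calls my_handler directly; once a second ftrace_ops is registered for the same function, the call site is redirected to the list iterator instead, as described above.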
But we *have* pt_regs here already with the right values. Why isn't this just a direct call to ftrace_stub() from within the int3 handler?
And the thing is, calling ftrace_regs_caller() looks wrong, because that's not what happens for *real* mcount handling, which uses that "create_trampoline()" to create the thing we're supposed to really use?
The ftrace_regs_caller() is what is called if there's two or more callbacks registered to a single function. For example, say you have a function that is being live patched (it uses the ftrace_regs_caller copy of the trampoline). If you then enable function tracing (which doesn't need a copy of regs), the function will call ftrace_regs_caller, which calls ftrace_ops_list_func(). That iterator looks at each ftrace_ops (the descriptor representing callbacks registered with ftrace) and, based on the hash values in them, calls their handlers if the ftrace_ops hashes match the function being called.
Anyway, I simply don't understand the code, so I'm confused. But why is the int3 emulation creating a call that doesn't seem to match what the instruction that we're actually rewriting is supposed to do?
IOW, it looks to me like ftrace_int3_handler() is actually emulating something different than what ftrace_modify_code() is actually modifying the code to do!
Since the only caller of ftrace_modify_code() is update_ftrace_func(), why is that function not just saving the target and we'd emulate the call to that? Using anything else looks crazy?
But as mentioned, I just don't understand the ftrace logic. It looks insane to me, and much more likely to be buggy than the very simple entry code.
Let's go an example. Let's say we live patched do_IRQ() and __migrate_task(). We would have this:
live_patch_trampoline:		(which is a modified copy of the ftrace_regs_caller)
	pushl	$__KERNEL_CS
	pushl	4(%esp)
	pushl	$0
	pushl	%gs
	pushl	%fs
	pushl	%es
	pushl	%ds
	pushl	%eax
	pushf
	popl	%eax
	movl	%eax, 8*4(%esp)
	pushl	%ebp
	pushl	%edi
	pushl	%esi
	pushl	%edx
	pushl	%ecx
	pushl	%ebx
	movl	12*4(%esp), %eax
	subl	$MCOUNT_INSN_SIZE, %eax
	movl	15*4(%esp), %edx	/* Load parent ip (2nd parameter) */
	movl	function_trace_op, %ecx	/* Save ftrace_pos in 3rd parameter */
	pushl	%esp			/* Save pt_regs as 4th parameter */
call live_kernel_patch_func
	addl	$4, %esp		/* Skip pt_regs */
	push	14*4(%esp)
	popf
	movl	12*4(%esp), %eax
	movl	%eax, 14*4(%esp)
	popl	%ebx
	popl	%ecx
	popl	%edx
	popl	%esi
	popl	%edi
	popl	%ebp
	popl	%eax
	popl	%ds
	popl	%es
	popl	%fs
	popl	%gs
	lea	3*4(%esp), %esp		/* Skip orig_ax, ip and cs */
	jmp	.Lftrace_ret
<do_IRQ>:
	call live_patch_trampoline
	[..]
<__migrate_task>:
	call live_patch_trampoline
Now we enable function tracing on all functions that can be traced, and this includes do_IRQ() and __migrate_task(). Thus, we first modify that call to ftrace_stub in the ftrace_regs_caller to point to the ftrace_ops_list_func(), as that will iterate over the ftrace_ops for live kernel patching and the ftrace_ops for the function tracer. That iterator will check the hashes against the called functions; for live kernel patching, it will call its handler if the passed in ip matches either do_IRQ() or __migrate_task(). It will see that the ftrace_ops for function tracing is set to trace all functions and just call its handler in that loop too.
Today, when we place an int3 on those functions, we basically turn them into nops.
<do_IRQ>:
	<int3>		(convert from call live_patch_trampoline to call ftrace_regs_caller)
	[..]
But that int3 handler doesn't call either the live_patch_trampoline or ftrace_regs_caller, which means the live kernel patching doesn't get to make that function call something different. We basically just disabled tracing completely for those functions during that transition.
Remember that ftrace_regs_caller gets updated to not call ftrace_stub but the list iterator if there's more than one handler registered with ftrace (and so does ftrace_caller). Making the int3 handler call it will do the iteration over all registered ftrace_ops, and nothing will be missed.
Does that help explain what's going on?
-- Steve
On Mon, May 6, 2019 at 1:29 PM Steven Rostedt rostedt@goodmis.org wrote:
Because that call to ftrace_stub is also dynamic.
You're missing the point.
We are rewriting a single "call" instruction to point to something.
The "int3" emulation should call THE SAME THING.
Right now it doesn't.
Part of the code will change it to call the function needed directly.
struct ftrace_ops my_ops = { .func = my_handler };
register_ftrace_function(&my_ops);
Will change "call ftrace_stub" into "call my_handler"
But that's not what you're actually *doing*.
Instead, you're now _emulating_ calling ftrace_regs_caller, which will call that ftrace_stub, which in turn will try to update the call site.
But that's insane. It's insane because
- it's not even what your call rewriting is doing

Why aren't you just doing the emulation using the *SAME* target that you're rewriting the actual call instruction with?
- even if ftrace_regs_caller ends up being that same function, you'd be better off just passing the "struct pt_regs" that you *ALREADY HAVE* directly to ftrace_stub in the int3 handler, rather than create *another* pt_regs stack
See? In that second case, why don't you just use "int3_emulate_call()" to do the reguired 'struct pt_regs' updates, and then call ftrace_stub() *with* that fixed-up pt_regs thing?
In other words, I think you should always do "int3_emulate_call()" with the *exact* same address that the instruction you are rewriting is using. There's no "three different cases". The only possible cases are "am I rewriting a jump" or "am I rewriting a call".
There is no "am I rewriting a call to one address, and then emulating it with a call to another address" case that makes sense.
What *can* make sense is "Oh, I'm emulating a call, but I know that call will be rewritten, so let me emulate the call and then short-circuit the emulation immediately".
But that is not what the ftrace code is doing. The ftrace code is doing something odd and insane.
And no, your "explanation" makes no sense. Because it doesn't actually touch on the fundamental insanity.
Linus
On Mon, May 6, 2019 at 1:42 PM Linus Torvalds torvalds@linux-foundation.org wrote:
What *can* make sense is "Oh, I'm emulating a call, but I know that call will be rewritten, so let me emulate the call and then short-circuit the emulation immediately".
That made no sense. The end should have been "and then short-circuit the _rewriting_ immediately" of course.
The "emulate a call" is just to make the "struct pt_regs" state look like it would have after the call. The "short circuit the rewriting" part is the thing that then rewrites the actual instruction based on that pt_regs state.
Linus
On Mon, 6 May 2019 13:42:12 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
On Mon, May 6, 2019 at 1:29 PM Steven Rostedt rostedt@goodmis.org wrote:
Because that call to ftrace_stub is also dynamic.
You're missing the point.
We are rewriting a single "call" instruction to point to something.
The "int3" emulation should call THE SAME THING.
Right now it doesn't.
To do that we would need to rewrite the logic to update each of those 40,000 calls one at a time, or group them together by what gets changed, as each function can call a different trampoline: one of the custom trampolines associated with a ftrace_ops, or the list one when a function has more than one ftrace_ops attached to it.
The ftrace code does this in batches:
1) Add int3 to all functions being affected (could be 40,000 of them) (sync the cores)
(Now all those functions are going through the int3 handler)
2) Modify the address of those call sites (sync the cores)
(Still going through the int3 handlers)
3) Remove the int3
Now each of those functions are calling something, and they may not all be the same thing they are calling.
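Roughly, the shape of that batch update is the following sketch (modeled on ftrace_replace_code() in arch/x86/kernel/ftrace.c of that era; error handling and the sync on failure paths are omitted, and the helper names for phases 2 and 3 are from memory):

	for_ftrace_rec_iter(iter) {
		rec = ftrace_rec_iter_record(iter);
		add_breakpoints(rec, enable);	/* 1) put int3 on the first byte */
	}
	run_sync();

	for_ftrace_rec_iter(iter) {
		rec = ftrace_rec_iter_record(iter);
		add_update(rec, enable);	/* 2) rewrite the 4 bytes after the int3 */
	}
	run_sync();

	for_ftrace_rec_iter(iter) {
		rec = ftrace_rec_iter_record(iter);
		finish_update(rec, enable);	/* 3) replace the int3 with the new opcode */
	}
	run_sync();

Each record in those loops may be getting a different target, which is why a single int3 handler has to cover all of them while the batch is in flight.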
Part of the code will change it to call the function needed directly.
struct ftrace_ops my_ops = { .func = my_handler };
register_ftrace_function(&my_ops);
Will change "call ftrace_stub" into "call my_handler"
But that's not what you're actually *doing*.
Instead, you're now _emulating_ calling ftrace_regs_caller, which will call that ftrace_stub, which in turn will try to update the call site.
The call site gets updated before the loop gets called (at least it should; I have to go and look at the code, but I'm 99% sure that it does). There should not be a breakpoint on the call to ftrace_stub when we add breakpoints to the functions that call the ftrace_regs_caller.
But that's insane. It's insane because
- it's not even what your call rewriting is doing

Why aren't you just doing the emulation using the *SAME* target that you're rewriting the actual call instruction with?
- even if ftrace_regs_caller ends up being that same function, you'd
be better off just passing the "struct pt_regs" that you *ALREADY HAVE* directly to ftrace_stub in the int3 handler, rather than create *another* pt_regs stack
Are you saying to call the ftrace_ops handlers from int3 directly?
See? In that second case, why don't you just use "int3_emulate_call()" to do the required 'struct pt_regs' updates, and then call ftrace_stub() *with* that fixed-up pt_regs thing?
In other words, I think you should always do "int3_emulate_call()" with the *exact* same address that the instruction you are rewriting is using. There's no "three different cases". The only possible cases are "am I rewriting a jump" or "am I rewriting a call".
There is no "am I rewriting a call to one address, and then emulating it with a call to another address" case that makes sense.
What *can* make sense is "Oh, I'm emulating a call, but I know that call will be rewritten, so let me emulate the call and then short-circuit the emulation immediately".
But that is not what the ftrace code is doing. The ftrace code is doing something odd and insane.
And no, your "explanation" makes no sense. Because it doesn't actually touch on the fundamental insanity.
Also, talking with Peter about the ftrace_32.S version of ftrace_regs_caller, makes Peter's patch sound even better.
The "ftrace_regs_caller" was created to allow kprobes to use the function tracer when a probe was added to the start of a function (which is a common occurrence). That is, kprobes uses int3 to inject a handler pretty much anywhere in the code. kprobes can even be used for dynamic trace events.
Now we found that if we use function tracing, it's faster to do the call than to take the int3 hit. So kprobes knows of ftrace, and will register a callback if it happens to be placed on a fentry call site.
Thus, Masami asked me to create a way to have ftrace be able to simulate an int3. As kprobe handlers can do pretty much anything (bpf uses them), I had to make that call from ftrace look like a real int3 just happened.
As ftrace_caller is optimized for fast function tracing, I needed to make another trampoline for the slower "emulate int3" operation, and that was the birth of ftrace_regs_caller. For x86_64, it was really straightforward and I had that done rather quickly. For i386, it was much more difficult, and that was because of not having regs->sp on the stack. I had to play this game to be able to pass in a pt_regs that would be the same regardless if it was called by ftrace or an int3.
The problem was the call to ftrace_regs_caller would place the return code on the stack, but I had to move it, because the kprobes handlers expected &regs->sp to point to the location of the stack just before the call was hit! This means regs->flags was where the return code was.
When we enter ftrace_regs_caller from the function being traced, the top of the stack has the return code. But then we needed to do this:
	pushl	$__KERNEL_CS
	pushl	4(%esp)		/* Save the return ip */
	pushl	$0		/* Load 0 into orig_ax */
	pushl	%gs
	pushl	%fs
	pushl	%es
	pushl	%ds
	pushl	%eax
The above pushes regs->cs, regs->ip (the return code), then regs->gs ... down to regs->ax, where now I finally have saved a scratch register to use.
	/* Get flags and place them into the return ip slot */
	pushf
	popl	%eax
	movl	%eax, 8*4(%esp)
I would then save flags into %eax and move it to where the return address was placed by the call to this trampoline.
At the end, I had to undo this song and dance as well:
	/* restore flags */
	push	14*4(%esp)
	popf
	/* Move return ip back to its original location */
	movl	12*4(%esp), %eax
	movl	%eax, 14*4(%esp)
If we go with Peter's patch, I can make this code much more sane, and not have to worry about having &regs->sp be at the top of the stack. I could simply just push everything in the order of pt_regs and call the handler.
-- Steve
On Mon, May 6, 2019 at 2:45 PM Steven Rostedt rostedt@goodmis.org wrote:
To do that we would need to rewrite the logic to update each of those 40,000 calls one at a time, or group them together to what gets changed.
Stephen, YOU ARE NOT LISTENING.
You are already fixing the value of the call in the instruction as part of the instruction rewriting.
When you do things like this:
	unsigned long ip = (unsigned long)(&ftrace_call);
	unsigned char *new;
	int ret;

	new = ftrace_call_replace(ip, (unsigned long)func);
	ret = update_ftrace_func(ip, new);
you have already decided to rewrite the instruction with one single fixed call target: "func".
I'm just saying that you should ALWAYS use the same call target in the int3 emulation.
Instead, you hardcode something else than what you are AT THE SAME TIME rewriting the instruction with.
See what I'm saying?
You already save off the "ip" of the instruction you modify in update_ftrace_func(). I'm just saying that you should *also* save off the actual target of the call, and use *THAT*.
So that the int3 emulation and the instruction rewriting *match*.
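For what it's worth, a sketch of the shape this could take (illustrative only, error handling omitted; it mirrors the ftrace_update_func_call snippet already quoted earlier in the thread, so the names are assumptions, not the actual patch):

	static unsigned long ftrace_update_func;	/* ip of the call being rewritten */
	static unsigned long ftrace_update_func_call;	/* target being written, 0 when nop'ing */

	/* callers of update_ftrace_func() record the target first, e.g.: */
	int ftrace_update_ftrace_func(ftrace_func_t func)
	{
		unsigned long ip = (unsigned long)(&ftrace_call);
		unsigned char *new = ftrace_call_replace(ip, (unsigned long)func);

		ftrace_update_func_call = (unsigned long)func;
		return update_ftrace_func(ip, new);
	}

	/* and ftrace_int3_handler() then emulates exactly what is being written: */
	if (is_ftrace_caller(ip)) {
		if (!ftrace_update_func_call)
			int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);	/* it is becoming a nop */
		else
			int3_emulate_call(regs, ftrace_update_func_call);
		return 1;
	}

That way the emulation always uses the same target that the rewrite is installing.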
What you do now makes no sense. You're modifying the code with one thing (the "func" argument in update_ftrace_func), so if your modification completed, that's what you'll actually *run*. But you're then _emulating_ doing something completely different, not using "func" at all there.
So let me say one more time: how can it *possibly* make sense to emulate something else than you are changing the instruction to read?
Are you finally understanding what craziness I'm talking about?
Stop with the "there could be thousands of targets" argument. The "call" instruction THAT YOU ARE REWRITING has exactly one target. There aren't 40,000 of them. x86 does not have that kind of "call" instruction that randomly calls 40k different functions. You are replacing FIVE BYTES of memory, and the emulation you do should emulate those FIVE BYTES.
See?
Why are you emulating something different than what you are rewriting?
Linus
On Mon, May 6, 2019 at 3:06 PM Linus Torvalds torvalds@linux-foundation.org wrote:
Why are you emulating something different than what you are rewriting?
Side note: I'm also finding another bug on the ftrace side, which is a simple race condition.
In particular, the logic with 'modifying_ftrace_code' is fundamentally racy.
What can happen is that on one CPU we rewrite one instruction:
	ftrace_update_func = ip;
	/* Make sure the breakpoints see the ftrace_update_func update */
	smp_wmb();

	/* See comment above by declaration of modifying_ftrace_code */
	atomic_inc(&modifying_ftrace_code);
ret = ftrace_modify_code(ip, old, new);
atomic_dec(&modifying_ftrace_code);
but then another CPU hits the 'int3' while the modification is going on, and takes the fault.
The fault handler does that
if (unlikely(atomic_read(&modifying_ftrace_code))..
and sees that "yes, it's in the middle of modifying the ftrace code", and calls ftrace_int3_handler(). All good and "obviously correct" so far, no?
HOWEVER. It's actually buggy. Because in the meantime, the CPU that was rewriting instructions has finished, and decrements the modifying_ftrace_code, which doesn't hurt us (because we already saw that the int3 was due to the modification).
BUT! There are two different races here:
(a) maybe the fault handling was slow, and we saw the 'int3' and took the fault, but the modifying CPU had already finished, so that atomic_read(&modifying_ftrace_code) didn't actually trigger at all.
(b) maybe the int3-faulting CPU *did* see the proper value of modifying_ftrace_code, but the modifying CPU went on and started *another* modification, and has changed ftrace_update_func in the meantime, so now the int3 handling is looking at the wrong values!
In the case of (a), we'll die with an oops due to the inexplicable 'int3' we hit. And in the case of (b) we'll be fixing up using the wrong address.
Things like this is why I'm wondering how much of the problems are due to the entry code, and how much of it is due to simply races and timing differences?
Again, I don't actually know the ftrace code, and maybe I'm missing something, but this really looks like _another_ fundamental bug.
The way to handle that modifying_ftrace_code thing is most likely by using a sequence counter. For example, one way to actually do some thing like this might be
	ftrace_update_func = ip;
	ftrace_update_target = func;
	smp_wmb();
	atomic_inc(&modifying_ftrace_head);
ret = ftrace_modify_code(ip, old, new);
	atomic_inc(&modifying_ftrace_tail);
	smp_wmb();
and now the int3 code could do something like
int head, tail;
	head = atomic_read(&modifying_ftrace_head);
	smp_rmb();
	tail = atomic_read(&modifying_ftrace_tail);

	/* Are we still in the process of modification? */
	if (unlikely(head != tail+1))
		return 0;

	ip = ftrace_update_func;
	func = ftrace_update_target;
	smp_rmb();
	/* Need to re-check that the above two values are consistent and we didn't exit */
	if (atomic_read(&modifying_ftrace_tail) != tail)
		return 0;

	*pregs = int3_emulate_call(regs, ip, func);
	return 1;
although it probably really would be better to use a seqcount instead of writing it out like the above.
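A seqcount-based variant could look roughly like this (a sketch only, with illustrative names; the reader deliberately does not spin, because the writer may be waiting on the cross-CPU IPIs inside ftrace_modify_code() while this CPU sits in the int3 handler):

	static seqcount_t ftrace_update_seq = SEQCNT_ZERO(ftrace_update_seq);
	static unsigned long ftrace_update_func;
	static unsigned long ftrace_update_target;

	static int ftrace_modify_one(unsigned long ip, unsigned long func,
				     unsigned const char *old, unsigned const char *new)
	{
		int ret;

		write_seqcount_begin(&ftrace_update_seq);
		ftrace_update_func = ip;
		ftrace_update_target = func;
		ret = ftrace_modify_code(ip, old, new);
		write_seqcount_end(&ftrace_update_seq);
		return ret;
	}

	/* int3 side: bail out instead of spinning if a modification races with us */
	static int ftrace_int3_emulate(struct pt_regs *regs, unsigned long ip)
	{
		unsigned long uip, target;
		unsigned int seq;

		seq = raw_seqcount_begin(&ftrace_update_seq);
		uip = ftrace_update_func;
		target = ftrace_update_target;
		if (read_seqcount_retry(&ftrace_update_seq, seq))
			return 0;	/* raced; treat as not ours */
		if (ip != uip)
			return 0;

		int3_emulate_call(regs, target);
		return 1;
	}

This only addresses the (b) race described above; the (a) race still needs the "is the int3 still there" check.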
NOTE! The above only fixes the (b) race. The (a) race is probably best handled by actually checking if the 'int3' instruction is still there before dying.
Again, maybe there's something I'm missing, but having looked at that patch now what feels like a million times, I'm finding more worrisome things in the ftrace code than in the kernel entry code..
Linus
On Mon, 6 May 2019 15:31:57 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
On Mon, May 6, 2019 at 3:06 PM Linus Torvalds torvalds@linux-foundation.org wrote:
Why are you emulating something different than what you are rewriting?
Side note: I'm also finding another bug on the ftrace side, which is a simple race condition.
In particular, the logic with 'modifying_ftrace_code' is fundamentally racy.
What can happen is that on one CPU we rewrite one instruction:
	ftrace_update_func = ip;
	/* Make sure the breakpoints see the ftrace_update_func update */
	smp_wmb();
	/* See comment above by declaration of modifying_ftrace_code */
	atomic_inc(&modifying_ftrace_code);
	ret = ftrace_modify_code(ip, old, new);
	atomic_dec(&modifying_ftrace_code);
but then another CPU hits the 'int3' while the modification is going on, and takes the fault.
The fault handler does that
if (unlikely(atomic_read(&modifying_ftrace_code))..
and sees that "yes, it's in the middle of modifying the ftrace code", and calls ftrace_int3_handler(). All good and "obviously correct" so far, no?
HOWEVER. It's actually buggy. Because in the meantime, the CPU that was rewriting instructions has finished, and decrements the modifying_ftrace_code, which doesn't hurt us (because we already saw that the int3 was due to the modification).
But the CPU that was rewriting instructions does a run_sync() after removing the int3:
static void run_sync(void)
{
	int enable_irqs;

	/* No need to sync if there's only one CPU */
	if (num_online_cpus() == 1)
		return;

	enable_irqs = irqs_disabled();

	/* We may be called with interrupts disabled (on bootup). */
	if (enable_irqs)
		local_irq_enable();
	on_each_cpu(do_sync_core, NULL, 1);
	if (enable_irqs)
		local_irq_disable();
}
Which sends an IPI to all CPUs to make sure they no longer see the int3.
BUT! There are two different races here:
(a) maybe the fault handling was slow, and we saw the 'int3' and took the fault, but the modifying CPU had already finished, so that atomic_read(&modifying_ftrace_code) didn't actually trigger at all.
(b) maybe the int3-faulting CPU *did* see the proper value of modifying_ftrace_code, but the modifying CPU went on and started *another* modification, and has changed ftrace_update_func in the meantime, so now the int3 handling is looking at the wrong values!
In the case of (a), we'll die with an oops due to the inexplicable 'int3' we hit. And in the case of (b) we'll be fixing up using the wrong address.
Things like this is why I'm wondering how much of the problems are due to the entry code, and how much of it is due to simply races and timing differences?
Again, I don't actually know the ftrace code, and maybe I'm missing something, but this really looks like _another_ fundamental bug.
The way to handle that modifying_ftrace_code thing is most likely by using a sequence counter. For example, one way to actually do some thing like this might be
	ftrace_update_func = ip;
	ftrace_update_target = func;
	smp_wmb();
	atomic_inc(&modifying_ftrace_head);
	ret = ftrace_modify_code(ip, old, new);
	atomic_inc(&modifying_ftrace_tail);
	smp_wmb();
and now the int3 code could do something like
	int head, tail;

	head = atomic_read(&modifying_ftrace_head);
	smp_rmb();
	tail = atomic_read(&modifying_ftrace_tail);

	/* Are we still in the process of modification? */
	if (unlikely(head != tail+1))
		return 0;

	ip = ftrace_update_func;
	func = ftrace_update_target;
	smp_rmb();
	/* Need to re-check that the above two values are consistent and we didn't exit */
	if (atomic_read(&modifying_ftrace_tail) != tail)
		return 0;

	*pregs = int3_emulate_call(regs, ip, func);
	return 1;
although it probably really would be better to use a seqcount instead of writing it out like the above.
NOTE! The above only fixes the (b) race. The (a) race is probably best handled by actually checking if the 'int3' instruction is still there before dying.
Again, maybe there's something I'm missing, but having looked at that patch now what feels like a million times, I'm finding more worrisome things in the ftrace code than in the kernel entry code..
I think you are missing the run_sync() which is the heavy hammer to make sure all CPUs are in sync. And this is done at each stage:
add int3
run_sync();
update call site outside of int3
run_sync()
remove int3
run_sync()
HPA said that the last run_sync() isn't needed, but I kept it because I wanted to make sure. Looks like your analysis shows that it is needed.
-- Steve
On Mon, May 6, 2019 at 5:10 PM Steven Rostedt rostedt@goodmis.org wrote:
But the CPU that was rewriting instructions does a run_sync() after removing the int3:
static void run_sync(void)
{
	int enable_irqs;

	/* No need to sync if there's only one CPU */
	if (num_online_cpus() == 1)
		return;

	enable_irqs = irqs_disabled();

	/* We may be called with interrupts disabled (on bootup). */
	if (enable_irqs)
		local_irq_enable();
	on_each_cpu(do_sync_core, NULL, 1);
	if (enable_irqs)
		local_irq_disable();
}
Which sends an IPI to all CPUs to make sure they no longer see the int3.
Duh. I have been looking back and forth in that file, and I was mixing ftrace_modify_code_direct() (which only does a local sync) with ftrace_modify_code() (which does run_sync()). The dangers of moving around by searching for function names.
That file is a maze of several functions that are very similarly named and do slightly different things.
But yes, I was looking at the "direct" sequence.
I think you are missing the run_sync() which is the heavy hammer to make sure all CPUs are in sync. And this is done at each stage:
add int3
run_sync();
update call site outside of int3
run_sync()
remove int3
run_sync()
HPA said that the last run_sync() isn't needed, but I kept it because I wanted to make sure. Looks like your analysis shows that it is needed.
Absolutely. I think we could get rid of it, but yes, to then avoid the race we'd need to be a lot more clever.
Yeah, with the three run_sync() things, the races I thought it had can't happen.
Linus
On Mon, 6 May 2019 15:06:57 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
On Mon, May 6, 2019 at 2:45 PM Steven Rostedt rostedt@goodmis.org wrote:
To do that we would need to rewrite the logic to update each of those 40,000 calls one at a time, or group them together to what gets changed.
Stephen, YOU ARE NOT LISTENING.
(note, it's Steven ;-)
I'm listening, I'm just trying to understand.
You are already fixing the value of the call in the instruction as part of the instruction rewriting.
When you do things like this:
	unsigned long ip = (unsigned long)(&ftrace_call);
	unsigned char *new;
	int ret;

	new = ftrace_call_replace(ip, (unsigned long)func);
	ret = update_ftrace_func(ip, new);
you have already decided to rewrite the instruction with one single fixed call target: "func".
I'm just saying that you should ALWAYS use the same call target in the int3 emulation.
Instead, you hardcode something else than what you are AT THE SAME TIME rewriting the instruction with.
See what I'm saying?
Yes, but that's not the code I'm worried about.
ftrace_replace_code() is, which does:
	for_ftrace_rec_iter(iter) {
		rec = ftrace_rec_iter_record(iter);

		ret = add_breakpoints(rec, enable);
		if (ret)
			goto remove_breakpoints;
		count++;
	}
run_sync();
And there's two more iterator loops that will do the modification of the call site, and then the third loop will remove the breakpoint.
That iterator does something special for each individual record. All 40,000 of them.
That add_breakpoint() does:
static int add_breakpoints(struct dyn_ftrace *rec, int enable)
{
	unsigned long ftrace_addr;
	int ret;

	ftrace_addr = ftrace_get_addr_curr(rec);

	ret = ftrace_test_record(rec, enable);

	switch (ret) {
	case FTRACE_UPDATE_IGNORE:
		return 0;

	case FTRACE_UPDATE_MAKE_CALL:
		/* converting nop to call */
		return add_brk_on_nop(rec);

	case FTRACE_UPDATE_MODIFY_CALL:
	case FTRACE_UPDATE_MAKE_NOP:
		/* converting a call to a nop */
		return add_brk_on_call(rec, ftrace_addr);
	}
	return 0;
}
And to get what the target is, we call ftrace_get_addr_curr(), which will return a function based on the flags in the record. Which can be anything from a call to a customized trampoline, to either ftrace_caller, or to ftrace_regs_caller, or it can turn the record into a nop.
This is what I'm talking about. We are adding thousands of int3s throughout the kernel, and we have a single handler to handle each one of them.
The reason I picked ftrace_regs_caller() is because that one does anything that any of the callers can do (albeit slower). If it does not, then ftrace will be broken, because it handles the case that all types of trampolines are attached to a single function, and that code had better do the exact same thing for each of those trampolines as if the trampolines were called directly, because the handlers that those trampolines call, shouldn't care who else is using that function.
Note, the only exception to that rule is that we only allow one ftrace_ops attached to a function to modify the return address (and the record has a flag for that). If a record already modifies the ip address on return, registering another ftrace_ops that modifies the ip address will fail.
Stop with the "there could be thousands of targets" arguyment. The "call" instruction THAT YOU ARE REWRITING has exactly one target. There aren't 40,000 of them. x86 does not have that kind of "call" instruction that randomly calls 40k different functions. You are replacing FIVE BYTES of memory, and the emulation you do should emulate those FIVE BYTES.
See?
Why are you emulating something different than what you are rewriting?
I'm not having one call site call 40,000 different functions. You are right about that. But I have 40,000 different call sites that could be calling different functions and all of them are being processed by a single int3 handler.
That's my point.
-- Steve
On Mon, 6 May 2019 21:04:16 -0400 Steven Rostedt rostedt@goodmis.org wrote:
static int add_breakpoints(struct dyn_ftrace *rec, int enable)
{
	unsigned long ftrace_addr;
	int ret;
ftrace_addr = ftrace_get_addr_curr(rec);
ret = ftrace_test_record(rec, enable);
	switch (ret) {
	case FTRACE_UPDATE_IGNORE:
		return 0;

	case FTRACE_UPDATE_MAKE_CALL:
		/* converting nop to call */
		return add_brk_on_nop(rec);
case FTRACE_UPDATE_MODIFY_CALL:
The FTRACE_UPDATE_MODIFY_CALL is the case where we switch from calling one trampoline to calling another. As you can plainly see from this switch statement, we treat it the same as converting to a nop. That's the bug we are trying to fix. It shouldn't be treated as a nop. It should be treated as a call to the old or new function. Either is fine, but it must be one or the other, not a nop.
-- Steve
	case FTRACE_UPDATE_MAKE_NOP:
		/* converting a call to a nop */
		return add_brk_on_call(rec, ftrace_addr);
	}
	return 0;
}
On Mon, May 6, 2019 at 6:04 PM Steven Rostedt rostedt@goodmis.org wrote:
That iterator does something special for each individual record. All 40,000 of them.
.. yes, but the 'int3' only happens for *one* of them at a time.
Why would it bother with the other 39,999 calls?
You could easily just look up the record at the int3 time, and just use the record. Exactly the same way you use the one-at-a-time ones.
Instead, you emulate a fake call to a function that *wouldn't* get called, which now does the lookup there. That's the part I don't get. Why are you emulating something else than what you'd be rewriting?
Linus
On Mon, 6 May 2019 18:34:59 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
On Mon, May 6, 2019 at 6:04 PM Steven Rostedt rostedt@goodmis.org wrote:
That iterator does something special for each individual record. All 40,000 of them.
.. yes, but the 'int3' only happens for *one* of them at a time.
Why would it bother with the other 39,999 calls?
You could easily just look up the record at the int3 time, and just use the record. Exactly the same way you use the one-at-a-time ones.
Instead, you emulate a fake call to a function that *wouldn't* get called, which now does the lookup there. That's the part I don't get. Why are you emulating something else than what you'd be rewriting?
Ah, now I see what you are saying. Yes, I could pass in what it is supposed to call. But I was trying to use the same code for all the alternative solutions we were passing around, and this became the "default" case that would work with any int3_emulate_call implementation we came up with.
That is, if we call ftrace_regs_caller() for any scenario it should work. Even if the call was supposed to be a nop, because in that case, all the ftrace_ops registered in the iterator would refuse to have their handler be called for that function.
I sent you a single patch, but that was really just a diff of several applied patches against your unmodified tree. The last patch implements the ftrace code. And I had it this way because it should work for any of the implementations.
I could modify it so that it picks what function to call when the int3 is triggered. I think all the solutions we are down to allow that now. Some of the early ideas had me call one function for all int3s due to trampolines and such.
Also, I figured just calling ftrace_regs_caller() was simpler than having that int3 handler do the hash lookups to determine what handler it needs to call.
-- Steve
On Mon, May 6, 2019 at 6:53 PM Steven Rostedt rostedt@goodmis.org wrote:
Also, I figured just calling ftrace_regs_caller() was simpler than having that int3 handler do the hash lookups to determine what handler it needs to call.
So what got me looking at this - and the races (that didn't turn out to be races) - and why I really hate it, is because of the whole notion of "atomic state".
Running an "int3" handler (when the int3 is in the kernel) is in some sense "atomic". There is the state in the caller, of course, and there's the state that the int3 handler has, but you can *mostly* think of the two as independent.
In particular, if the int3 handler doesn't ever enable interrupts, and if it doesn't need to do any stack switches, the int3 handler part looks "obvious". It can be interrupted by NMI, but it can't be interrupted (for example) by the cross-cpu IPI.
That was at least the mental model I was going for.
Similarly, the "caller" side mostly looks obvious too. If we don't take an int3, none of this matter, and if we *do* take an int3, if we can at least make it "atomic" wrt the rewriter (before or after state), it should be easy to think about.
One of the things I was thinking of doing, for example, was to simply make the call emulation do a "load four bytes from the instruction stream, and just use that as the emulation target offset".
Why would that matter?
We do *not* have very strict guarantees for D$-vs-I$ coherency on x86, but we *do* have very strict guarantees for D$-vs-D$ coherency. And so we could use the D$ coherency to give us atomicity guarantees for loading and storing the instruction offset for instruction emulation, in ways we can *not* use the D$-to-I$ guarantees and just executing it directly.
So while we still need those nasty IPI's to guarantee the D$-vs-I$ coherency in the "big picture" model and to get the serialization with the actual 'int3' exception right, we *could* just do all the other parts of the instruction emulation using the D$ coherency.
So we could do the actual "call offset" write with a single atomic 4-byte locked cycle (just use "xchg" to write - it's always locked). And similarly we could do the call offset *read* with a single locked cycle (cmpxchg with a 0 value, for example). It would be atomic even if it crosses a cacheline boundary.
And now we'd be guaranteed that on a D$ side, we'd get either the old value _or_ the new value for the emulation, and never a mix of them. Which is very much unlike the I$ side guarantees that aren't documented and people have been worrying about.
Notice? We'd not even have to look up any values. We'd literally just do something like
	int offset = locked_atomic_read(ip+1);
	return int3_emulate_call(ip, ip+5+offset);
and it would be *atomic* with respect to whatever other user that updates the instruction, as long as they update the offset with a "xchg" instruction.
Wouldn't that be lovely? We'd literally use the emulation as a way to get the kinds of coherency guarantees that we don't get natively with the I$.
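Concretely, the int3 side of that idea could be as small as the following sketch (locked_read32() is a made-up helper name, and ip is the address of the patched call's first byte, i.e. the int3):

	/*
	 * A locked 4-byte read: cmpxchg(p, 0, 0) only stores when the old
	 * value is 0 (and then stores 0 again), so it never changes memory,
	 * but the read is a single locked cycle, atomic even across a
	 * cacheline boundary.
	 */
	static s32 locked_read32(void *p)
	{
		return cmpxchg((s32 *)p, 0, 0);
	}

	/* in ftrace_int3_handler(): emulate whatever call is encoded at ip */
	s32 off = locked_read32((void *)(ip + 1));
	int3_emulate_call(regs, ip + 5 + off);

The writer would then update the rel32 with a single xchg, so a concurrent reader sees either the old or the new offset, never a mix.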
Except that was _so_ not the case with the whole ftrace_regs_caller() thing. That whole hack made me go "this doesn't emulate either the old _or_ the new call". And see above why that would matter. Because a "let's just make the four bytes after the 'int3' instruction be atomic" would very much be all about "emulate either the old or the new call".
So this was why I was then looking at what the heck that code was doing, and it was making zero sense to me. Emulating something that is different from what you're actually rewriting means that tricks like the above don't work.
Linus
On Mon, 6 May 2019 19:22:06 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
Notice? We'd not even have to look up any values. We'd literally just do something like
	int offset = locked_atomic_read(ip+1);
	return int3_emulate_call(ip, ip+5+offset);
and it would be *atomic* with respect to whatever other user that updates the instruction, as long as they update the offset with a "xchg" instruction.
Honestly, I'm not really sure what you are trying to do here.
Are you talking about making the update to the code in the int3 handler? And then just returning back to the regs->ip and executing the new call (not really emulating).
-- Steve
On Mon, May 6, 2019 at 7:58 PM Steven Rostedt rostedt@goodmis.org wrote:
Notice? We'd not even have to look up any values. We'd literally just do something like
	int offset = locked_atomic_read(ip+1);
	return int3_emulate_call(ip, ip+5+offset);
and it would be *atomic* with respect to whatever other user that updates the instruction, as long as they update the offset with a "xchg" instruction.
Honestly, I'm not really sure what you are trying to do here.
Are you talking about making the update to the code in the int3 handler?
No. The above would be pretty much the entirety of the ftrace_int3_handler.
It would emulate the call that has had its first byte overwritten by 'int3'. Without doing any lookups of what it was supposed to change the call to, because it simply depends on what the rewriting code is doing on another CPU (or on the same CPU - it wouldn't care).
So no need to look up anything, not at int3 time, and not at return time. It would just emulate the instruction atomically, with no state, and no need to look up what the 'ip' instruction is at the time.
It could literally just use a single flag: "is ftrace updating call instructions". Add another flag for the "I'm nop'ing out call instructions" so that it knows to emulate a jump-over instead. That's it.
Because all the actual *values* would be entirely be determined by the actual rewriting that is going on independently of the 'int3' exception.
Linus
On Mon, 6 May 2019 20:05:24 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
It would emulate the call that has had its first byte overwritten by 'int3'. Without doing any lookups of what it was supposed to change the call to, because it simply depends on what the rewriting code is doing on another CPU (or on the same CPU - it wouldn't care).
OK, so this is just about what to have it call.
So no need to look up anything, not at int3 time, and not at return time. It would just emulate the instruction atomically, with no state, and no need to look up what the 'ip' instruction is at the time.
It could literally just use a single flag: "is ftrace updating call instructions". Add another flag for the "I'm nop'ing out call instructions" so that it knows to emulate a jump-over instead. That's it.
Well we have that, and we have to look up the record regardless to know if this was a ftrace int3 or not (the ftrace_location(ip) does that). And the record has a counter of the number of attached callers, with zero meaning to turn it into a nop.
Note, if we are going from nop to call or call to nop, it would need to read the offset to see if it is a nop (don't want to call with the nop offset)
Because all the actual *values* would be entirely be determined by the actual rewriting that is going on independently of the 'int3' exception.
But still, we need to emulate the call, which requires pushing the return code back onto the stack. I believe that part is the part we are struggling with.
-- Steve
On Mon, May 6, 2019 at 8:22 PM Steven Rostedt rostedt@goodmis.org wrote:
But still, we need to emulate the call, which requires pushing the return code back onto the stack. I believe that part is the part we are struggling with.
Yes. But I was looking at the ftrace parts because I didn't see the bug in the low-level x86 side, so...
The x86 int3 faulting code itself looks so *trivially* simple, and it does work for the trivial test-case too. Which was what made me go "Hmm, maybe there's timing or something".
But it could obviously also be that the trivial test-case is just too trivial, and doesn't involve nmi etc etc.
Linus
Duh.
I woke up this morning, realizing what was wrong with my patch.
On Mon, May 6, 2019 at 8:28 PM Linus Torvalds torvalds@linux-foundation.org wrote:
Yes. But I was looking at the ftrace parts because I didn't see the bug in the low-level x86 side, so...
There was nothing wrong in the *low-level* parts. There was one thing wrong with the "int3_emulate_push()" code:
memmove(new, regs, SAVED_KERNEL_REGS_SIZE);
which ends up calling an out-of-line function. One that is traced. One that will recursively result in 'int3'. Which will fill up the stack until you get a triple fault and reboot.
Stupid stupid.
Anyway, changing that to just copying things one word at a time makes everything work. The attached patch boots with the whole ftrace test thing. The only change is literally changing that memmove() into
/* Inlined "memmove()" of the pt_regs */ unsigned long *p = (unsigned long *) new; int i = SAVED_KERNEL_REGS_SIZE / sizeof(unsigned long); do { *p = p[1]; p++; } while (--i);
which I'm not entirely proud of, but it sure is still simple.
And honestly, I absolutely despise PeterZ's patch. The notion that we should suddenly say that "oh, the i386 kernel stack is odd" after 28 years of having that standard i386 stack is just crazy. And this:
 arch/x86/entry/entry_32.S | 136 ++++++++++++++++++++++++++++-------
 ...
 12 files changed, 323 insertions(+), 140 deletions(-)
vs this:
 arch/x86/entry/entry_32.S | 7 +++-
 ...
 6 files changed, 120 insertions(+), 13 deletions(-)
is still pretty damn conclusive. Not to mention that the simple approach had a truly mindbogglingly simple solution with no actual subtle changes anywhere else.
So I still claim that we should do my patch. Because it is SIMPLE. It's straightforward, and I can explain every single line in it. Even if I spent *way* too long until I realized that the "trivial" memmove() wasn't so trivial.
Linus
On Tue, 7 May 2019 07:54:53 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
And honestly, I absolutely despise PeterZ's patch. The notion that we should suddenly say that "oh, the i386 kernel stack is odd" after 28 years of having that standard i386 stack is just crazy. And this:
 arch/x86/entry/entry_32.S | 136 ++++++++++++++++++++++++++++-------
 ...
 12 files changed, 323 insertions(+), 140 deletions(-)
vs this:
 arch/x86/entry/entry_32.S | 7 +++-
 ...
 6 files changed, 120 insertions(+), 13 deletions(-)
is still pretty damn conclusive. Not to mention that the simple approach had a truly mindbogglingly simple solution with no actual subtle changes anywhere else.
So I still claim that we should do my patch. Because it is SIMPLE. It's straightforward, and I can explain every single line in it. Even if I spent *way* too long until I realized that the "trivial" memmove() wasn't so trivial.
Yes, band-aids are usually simpler than a proper fix. We have 28 years of hacks built on hacks. There's a lot of hacks in the C code to handle the differences between the crappy way x86_32 does pt_regs and the proper way x86_64 does them.
If the goal was just to add another band-aid to this, we now have one more subtle workaround caused by two different methods being handled by a single code base.
I don't look at Peter's patch and think "this is the solution for int3 emulate calls". I see Peter's patch as "Thank God, we are finally getting rid of the cause of all these workaround hacks all over the place! and oh by the way, we can easily implement int3 call emulation because of it".
To implement your way, we need to change how the int3 handler works. It will be the only exception handler having to return regs, otherwise it will crash.
Sure, it's an easy solution for the one-off change of emulating calls, but it's just another complex workaround that nobody is going to understand in 5 years.
-- Steve
On Tue, 7 May 2019 11:12:27 -0400 Steven Rostedt rostedt@goodmis.org wrote:
I don't look at Peter's patch and think "this is the solution for int3 emulate calls". I see Peter's patch as "Thank God, we are finally getting rid of the cause of all these workaround hacks all over the place! and oh by the way, we can easily implement int3 call emulation because of it".
Note, if you really are adamant on your solution, I can write them up, test them, and get them out for this merge window. I really want a solution for the int3 emulate calls, as there is a real bug here that they fix.
But I will say, I still prefer Peter's solution. Maybe in the future we can revisit it again.
-- Steve
On Tue, 7 May 2019 11:25:13 -0400 Steven Rostedt rostedt@goodmis.org wrote:
Note, if you really are adamant on your solution, I can write them up, test them, and get them out for this merge window. I really want a solution for the int3 emulate calls, as there is a real bug here that they fix.
Thinking about this more: as my real motivation for getting this in (for this merge window) is to fix the live kernel patching bug, we only need to implement int3 call emulation for x86_64. We don't need to implement it for 32-bit. The ftrace code can continue to just make it a nop. Live kernel patching does not support x86_32, and the only issue that happens on 32-bit when we use a nop while converting one call to another is that we might lose a trace. But that's been the case since this started, and it is not a critical issue. With live kernel patching, though, losing a trace could crash the machine.
As I need to mark all this for stable, I'm going to look into just implementing this for x86_64. Then we can continue the debate about how to do this for x86_32 if we care about loss traces. But we don't need to commit to anything yet.
-- Steve
On Tue, May 7, 2019 at 8:12 AM Steven Rostedt rostedt@goodmis.org wrote:
Yes, band-aids are usually simpler than a proper fix.
What? No.
My fix is the *proper* fix.
PeterZ's is the bandaid.
We have 28 years
of hacks built on hacks. There's a lot of hacks in the C code to handle the differences between the crappy way x86_32 does pt_regs and the proper way x86_64 does them.
You're confusing "reality" with "your dream world". They are different.
The reality is that the i386 kernel stack is just how things work. End of story.
The reality is that changing something fundamental like the kernel stack at this point for an architecture that will not change in the future is silly.
The reality is that Peter's patch is much bigger than mine, because it needed a lot of other changes *because* it did that change.
To implement your way, we need to change how the int3 handler works. It will be the only exception handler having to return regs, otherwise it will crash.
What? That's what the patch *does*. It's trivial. It is done.
Linus
On Tue, 7 May 2019 08:31:14 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
The reality is that changing something fundamental like the kernel stack at this point for an architecture that will not change in the future is silly.
x86_32 will no longer have updates, but will x86_64? And we will constantly be adding more workaround hacks to handle the difference of the pt_regs in the future. I see Peter's patch easing that future pain.
The reality is that Peter's patch is much bigger than mine, because it needed a lot of other changes *because* it did that change.
It was much bigger because it removed a lot of the workaround hacks.
But you are the "benevolent dictator", and I don't see me changing your mind.
I'll go and implement it the way you like.
-- Steve
On Tue, May 07, 2019 at 08:31:14AM -0700, Linus Torvalds wrote:
The reality is that changing something fundamental like the kernel stack at this point for an architecture that will not change in the future is silly.
In my eyes it makes sense because i386 is a minority architecture at this point, and 'nobody' wants to care about how it's different if they don't have to.
The reality is that Peter's patch is much bigger than mine, because it needed a lot of other changes *because* it did that change.
Yes, I change the way stack layout works on i386, and yes that affects a lot of code. _However_ all of that code is now more like x86_64 than it was.
Earlier you said that kernel_stack_pointer() was a horrible thing; and most/all the code that I ended up touching was similarly horrible.
Would you consider my approach later on, under the guise of unification? We can work on it for a while, and make sure all the iffy bits are sorted, no need to rush?
On Tue, May 7, 2019 at 9:34 AM Peter Zijlstra peterz@infradead.org wrote:
Would you consider my approach later on, under the guise of unification?
WHY?
The *only* advantage of your patch is that trivial "look up kernel stack" macro.
Seriously. There's absolutely nothing else.
Is that macro ugly? Yes. But it's directly explainable by just pointing to the architecture documentation.
It's a one-liner hack.
And for that, you want to complicate the x86-32 entry and exit code?
Do we have different emulation for "push" on 32-bit and 64-bit? Yes. But again, that's just how the hardware works. This is not some "generic hw-independent code". This is literally emulating instructions that care about instruction encoding and bit size details, where there are certainly _similarities_ (and in the case of 'call', they look bit-identical), but it's also not like "same code" is a big argument. That's why we have a helper function, to hide the details.
I point to my diffstat once again. It's smaller, and I argue that it is actually conceptually *simpler* to simply say "this is how the architecture works".
And yes, I realize that I may be biased by the fact that I simply know i386 so well, so to me it simply makes more sense to just work with what the hardware gives us. The i386 exception model with the kernel stack nesting is a *hell* of a lot simpler than the x86-64 one. The fact is, x86-64 messed things up, and swapgs and friends are an abomination against God.
So the whole "let's clean up x86-32 to look like x86-64, which got things right" is to me a completely bogus argument. x86-64 got the "yes, push ss/sp unconditionally" part right, but got a lot of other things horribly wrong. So this is all just one small detail that differs, across two architectures that are similar but have very different warts.
But that diffstat is still hard, cold, unbiased data.
Linus
On Tue, May 07, 2019 at 10:08:50AM -0700, Linus Torvalds wrote:
On Tue, May 7, 2019 at 9:34 AM Peter Zijlstra peterz@infradead.org wrote:
Would you consider my approach later on, under the guise of unification?
WHY?
The *only* advantage of your patch is that trivial "look up kernel stack" macro.
Seriously. There's absolutely nothing else.
Is that macro ugly? Yes. But it's directly explainable by just pointing to the architecture documentation.
It's a one-liner hack.
And for that, you want to complicate the x86-32 entry and exit code?
Do we have different emulation for "push" on 32-bit and 64-bit? Yes. But again, that's just how the hardware works. This is not some "generic hw-independent code". This is literally emulating instructions that care about instruction encoding and bit size details, where there are certainly _similarities_ (and in the case of 'call', they look bit-identical), but it's also not like "same code" is a big argument. That's why we have a helper function, to hide the details.
I point to my diffstat once again. It's smaller, and I argue that it is actually conceptually *simpler* to simply say "this is how the architecture works".
And yes, I realize that I may be biased by the fact that I simply know i386 so well, so to me it simply makes more sense to just work with what the hardware gives us. The i386 exception model with the kernel stack nesting is a *hell* of a lot simpler than the x86-64 one. The fact is, x86-64 messed things up, and swapgs and friends are an abomination against God.
So the whole "let's clean up x86-32 to look like x86-64, which got things right" is to me a completely bogus argument. x86-64 got the "yes, push ss/sp unconditionally" part right, but got a lot of other things horribly wrong. So this is all just one small detail that differs, across two architectures that are similar but have very different warts.
But that diffstat is still hard, cold, unbiased data.
regs->sp is *undefined* on x86-32. We're damning our future selves to have to always remember to use that darn kernel_stack_pointer() helper for eternity just because of x86-32.
There have already been several bugs related to that. Because regs->sp is there, so why wouldn't you use it?
If we truly want the code to reflect the HW, then we should have a pt_regs_kernel and a pt_regs_user on 32-bit. I'm pretty sure we don't want to go there...
IMO, we either need to make the pt_regs struct(s) match the HW behavior, or make entry code match pt_regs. But this in-between thing just creates a bunch of headaches.
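To make the &regs->sp lookup mentioned above concrete, here is a minimal C sketch (simplified; the real pre-patch helper is the kernel_stack_pointer() in arch/x86/kernel/ptrace.c that Peter's patch removes further down, which also handles the empty-stack case):

#include <asm/ptrace.h>

/*
 * Simplified sketch, not the real helper: on pre-patch x86-32 a trap
 * from kernel mode does not push ss/sp, so the interrupted context's
 * stack begins exactly where pt_regs::sp would live.  The previous
 * stack pointer is therefore the *address* of the sp member, and
 * dereferencing regs->sp reads whatever happens to sit on the old stack.
 */
static unsigned long x86_32_kernel_sp(struct pt_regs *regs)
{
	if (!user_mode(regs))
		return (unsigned long)&regs->sp;	/* address, not value */
	return regs->sp;				/* user frames are complete */
}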
On Tue, 7 May 2019 12:21:59 -0500 Josh Poimboeuf jpoimboe@redhat.com wrote:
regs->sp is *undefined* on x86-32. We're damning our future selves to have to always remember to use that darn kernel_stack_pointer() helper for eternity just because of x86-32.
And there have been several times I forgot that regs->sp can not be read directly. Especially since most of my bug reports are for x86_64 these days. But when I had that rare x86_32 one and went debugging, I would print out "regs->sp" and then the system would crash. And I would spend some time wondering why.
It's been a bane of mine for some time.
-- Steve
On Tue, May 7, 2019 at 2:24 PM Steven Rostedt rostedt@goodmis.org wrote:
And there have been several times I forgot that regs->sp can not be read directly. Especially since most of my bug reports are for x86_64 these days. But when I had that rare x86_32 one and went debugging, I would print out "regs->sp" and then the system would crash. And I would spend some time wondering why.
It's been a bane of mine for some time.
Guys, I have basically a one-liner patch for your hangups.
It's called "rename 'sp' to 'user_sp' on x86-32".
Then we make the 'sp' field on x86-64 be a union, so that you can call it user_sp or sp as you wish.
Yeah, it's really more than one line, because obviously the users will need changing, but honestly, that would be a _real_ cleanup. Make the register match what it actually is.
And it doesn't mess up the stack frame, and it doesn't change the entry code. It just reminds people that the entry is the USER stack pointer.
Problem solved.
Linus
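A rough sketch of what that rename plus union could look like (hypothetical layout, only to make the suggestion concrete; just the tail of pt_regs is shown):

/* Hypothetical sketch of the suggestion above, not an actual patch. */
struct pt_regs {
	/* ... general purpose registers, segments, orig_ax, ip, cs, flags ... */
#ifdef CONFIG_X86_32
	unsigned long user_sp;		/* only valid when the entry came from user mode */
	unsigned long ss;
#else
	union {
		unsigned long sp;	/* always valid on 64-bit */
		unsigned long user_sp;	/* same storage, name shared with 32-bit */
	};
	unsigned long ss;
#endif
};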
On Tue, 7 May 2019 21:50:52 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
It's been a bane of mine for some time.
Guys, I have basically a one-liner patch for your hangups.
It's called "rename 'sp' to 'user_sp' on x86-32".
Then we make the 'sp' field on x86-64 be a union, so that you can call it user_sp or sp as you wish.
Yeah, it's really more than one line, because obviously the users will need changing, but honestly, that would be a _real_ cleanup. Make the register match what it actually is.
But is it? Sure, it will be a reminder that it's different for x86-32, but that still doesn't take away the fact that pt_regs on x86_32 is an anomaly! Where else do we have part of a data structure that can't be read because it can possibly fault? If regs is a valid pointer, one would think that simply reading regs->sp (or regs->user_sp) would be no more dangerous than reading regs->ip.
The differences between entry_32.S and entry_64.S spill into C code, making the x86_64 code more difficult to deal with. Sure, 10 to 15 years ago, all your arguments would make sense. But today, who uses x86_32? Perhaps you may use it in a VM, but I asked a few developers when the last time they used one was, and they told me 5 to 7 years ago. I only boot x86_32 to test to make sure I didn't break it.
Yes, your diffstat is really nice for the changes to entry_32.S, but at what cost? To make the x86_64 code more complex? That whole returning the regs in the int3 handler makes no sense on x86_64, yet we would need to do it to handle x86_32. Why burden the architecture of today and tomorrow with the architecture of yesterday? x86_32 is becoming more obsolete by the day. It baffles me why we wouldn't want to contain its complexity in a single file rather than spread it out like wildfire across the generic x86 code.
The &regs->sp hack is just one more rung in the complex learning curve ladder of becoming a Linux kernel developer.
-- Steve
On Tue, May 07, 2019 at 10:08:50AM -0700, Linus Torvalds wrote:
On Tue, May 7, 2019 at 9:34 AM Peter Zijlstra peterz@infradead.org wrote:
Would you consider my approach later on, under the guise of unification?
WHY?
The *only* advantage of your patch is that trivial "look up kernel stack" macro.
Seriously. There's absolutely nothing else.
The ftrace_regs_caller, the kprobe trampolines, the unwinder: they all have 'funny' bits because pt_regs isn't 'right'.
So the whole "let's clean up x86-32 to look like x86-64, which got things right" is to me a completely bogus argument. x86-64 got the "yes, push ss/sp unconditionally" part right, but got a lot of other things horribly wrong. So this is all just one small detail that differs, across two architectures that are similar but have very different warts.
It's a detail that leaks into the C code. Yes, SWAPGS is horrible crap, but C code doesn't much care. The partial pt_regs thing, OTOH, comes up a fair number of times.
Anyway; I think we're at the point where we'll have to agree to disagree (or maybe slightly past it).
On Mon, May 06, 2019 at 07:22:06PM -0700, Linus Torvalds wrote:
We do *not* have very strict guarantees for D$-vs-I$ coherency on x86, but we *do* have very strict guarantees for D$-vs-D$ coherency. And so we could use the D$ coherency to give us atomicity guarantees for loading and storing the instruction offset for instruction emulation, in ways we can *not* use the D$-to-I$ guarantees and just executing it directly.
So while we still need those nasty IPI's to guarantee the D$-vs-I$ coherency in the "big picture" model and to get the serialization with the actual 'int3' exception right, we *could* just do all the other parts of the instruction emulation using the D$ coherency.
So we could do the actual "call offset" write with a single atomic 4-byte locked cycle (just use "xchg" to write - it's always locked). And similarly we could do the call offset *read* with a single locked cycle (cmpxchg with a 0 value, for example). It would be atomic even if it crosses a cacheline boundary.
Very 'soon', x86 will start to #AC if you do unaligned LOCK-prefixed instructions. The problem is that while aligned LOCK instructions can provide the atomicity through the coherency protocol, unaligned ones (especially line- or page-boundary-crossing ones) need that bus-lock thing the SDM talks about.
For giggles, write yourself a while(1) loop that XCHGs across a page-boundary and see what it does to the rest of the system.
So _please_, do not rely on unaligned atomic ops. We really want them to go the way of the Dodo.
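For illustration only, a tiny user-space sketch (GCC builtins, hypothetical helper names) of the locked 4-byte accesses being discussed; the real code would go through the text-poking machinery, and per the note above the 4-byte slot must be naturally aligned so the access never degenerates into a bus lock:

#include <stdint.h>

/* Write the new rel32 call offset with a single locked 4-byte cycle;
 * on x86, xchg with a memory operand is always locked. */
static void write_call_offset(int32_t *slot, int32_t rel)
{
	__atomic_exchange_n(slot, rel, __ATOMIC_SEQ_CST);
}

/* Read it back with a single 4-byte access; the cmpxchg-with-0 variant
 * mentioned above would work as well. */
static int32_t read_call_offset(const int32_t *slot)
{
	return __atomic_load_n(slot, __ATOMIC_SEQ_CST);
}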
On May 6, 2019, at 7:22 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Mon, May 6, 2019 at 6:53 PM Steven Rostedt rostedt@goodmis.org wrote:
Also, I figured just calling ftrace_regs_caller() was simpler then having that int3 handler do the hash look ups to determine what handler it needs to call.
So what got me looking at this - and the races (that didn't turn out to be races) - and why I really hate it, is because of the whole notion of "atomic state".
Running an "int3" handler (when the int3 is in the kernel) is in some sense "atomic". There is the state in the caller, of course, and there's the state that the int3 handler has, but you can *mostly* think of the two as independent.
In particular, if the int3 handler doesn't ever enable interrupts, and if it doesn't need to do any stack switches, the int3 handler part looks "obvious". It can be interrupted by NMI, but it can't be interrupted (for example) by the cross-cpu IPI.
That was at least the mental model I was going for.
Similarly, the "caller" side mostly looks obvious too. If we don't take an int3, none of this matters, and if we *do* take an int3, if we can at least make it "atomic" wrt the rewriter (before or after state), it should be easy to think about.
One of the things I was thinking of doing, for example, was to simply make the call emulation do a "load four bytes from the instruction stream, and just use that as the emulation target offset".
Why would that matter?
We do *not* have very strict guarantees for D$-vs-I$ coherency on x86, but we *do* have very strict guarantees for D$-vs-D$ coherency. And so we could use the D$ coherency to give us atomicity guarantees for loading and storing the instruction offset for instruction emulation, in ways we can *not* use the D$-to-I$ guarantees and just executing it directly.
So while we still need those nasty IPI's to guarantee the D$-vs-I$ coherency in the "big picture" model and to get the serialization with the actual 'int3' exception right, we *could* just do all the other parts of the instruction emulation using the D$ coherency.
So we could do the actual "call offset" write with a single atomic 4-byte locked cycle (just use "xchg" to write - it's always locked). And similarly we could do the call offset *read* with a single locked cycle (cmpxchg with a 0 value, for example). It would be atomic even if it crosses a cacheline boundary.
I don’t quite get how this could work. Suppose we start with a five-byte NOP (0F 1F ...). Then we change the first byte to INT3 (CC). Now we can atomically change the other four bytes, but the INT3 could happen first. I suppose that we could treat 1F 00 00 00 or similar as a known-bogus call target, but that seems dangerous.
IOW I think your trick only works if the old and new states are CALL, but we don’t know that until we’ve looked up the record, at which point we can just use the result of the lookup.
Am I missing something clever? IMO it’s a bummer that there isn’t a way to turn NOP into CALL by changing only one byte.
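For reference, the encodings behind that complaint (standard x86 byte values; the zero rel32 is just a placeholder):

/* A 5-byte NOP and a 5-byte CALL share no common byte in the first
 * position, so the only single-byte pivot is INT3 (0xcc) over byte 0. */
static const unsigned char nop5[]  = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };	/* nopl 0x0(%rax,%rax,1) */
static const unsigned char call5[] = { 0xe8, 0x00, 0x00, 0x00, 0x00 };	/* call rel32 */
static const unsigned char int3    = 0xcc;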
On Tue, May 7, 2019 at 7:48 AM Andy Lutomirski luto@amacapital.net wrote:
IOW I think your trick only works if the old and new states are CALL, but we don’t know that until we’ve looked up the record, at which point we can just use the result of the lookup.
It would indeed only work for call instructions. I was thinking we'd know that because we only ever batch up call instructions, though.
But it doesn't matter. I was looking at the ftrace code because I thought there was some subtle timing bug or race or similar. But it turned out my "memmove()" was the problem. See the patch I just sent out.
Linus
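A sketch of what that "read the offset out of the instruction stream" emulation could look like, built on the int3_emulate_*() helpers from the patch (illustration only: it assumes the poked site is known to be a 5-byte CALL and that the rel32 is fetched with a single aligned 4-byte read, as discussed above):

#include <linux/types.h>
#include <linux/compiler.h>
#include <asm/text-patching.h>

static void emulate_poked_call(struct pt_regs *regs)
{
	/* regs->ip points just past the 0xcc, i.e. at the rel32 operand */
	s32 rel = READ_ONCE(*(s32 *)regs->ip);
	unsigned long next = regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE;

	int3_emulate_push(regs, next);		/* return address after the CALL */
	int3_emulate_jmp(regs, next + rel);	/* rel32 is relative to the next insn */
}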
On Mon, 6 May 2019 17:45:11 -0400 Steven Rostedt rostedt@goodmis.org wrote:
If we go with Peter's patch, I can make this code much more sane, and not have to worry about having &regs->sp be at the top of the stack. I could simply just push everything in the order of pt_regs and call the handler.
Hi Steve, I need to catch up with the origin of this series, but it also seems good for optprobes, which do a similar trick on pt_regs. If we can assume that the int3 pt_regs can have a gap, optprobes can also make a gap, and it can also be used for storing the destination address.
Thank you,
On Tue, 7 May 2019 23:13:40 +0900 Masami Hiramatsu mhiramat@kernel.org wrote:
On Mon, 6 May 2019 17:45:11 -0400 Steven Rostedt rostedt@goodmis.org wrote:
If we go with Peter's patch, I can make this code much more sane, and not have to worry about having &regs->sp be at the top of the stack. I could simply just push everything in the order of pt_regs and call the handler.
Hi Steve, I need to catch up with the origin of this series, but it also seems good for optprobes, which do a similar trick on pt_regs. If we can assume that the int3 pt_regs can have a gap, optprobes can also make a gap, and it can also be used for storing the destination address.
Sorry, I misunderstood. I see the issue ( https://lkml.org/lkml/2019/5/1/497 ) and the solutions in the thread. If we really need to fix this trace-livepatch combination issue, it may be good to backport it to the stable trees.
From this viewpoint, Linus's suggestion (no pt_regs changes on x86-32) seems to have a point.
BTW, I think Peter's patch (unifying pt_regs behavior) will also be good for us for more general reasons (not only for fixing the actual issue).
Thank you,
On Mon, May 06, 2019 at 10:19:51AM +0200, Peter Zijlstra wrote:
+.Lfrom_usermode_no_fixup_@: +.endm
+.macro IRET_FRAME
- /* orig_eax is already POP'ed when we're here */
- testl $CS_FROM_KERNEL, 1*4(%esp)
- jz .Lfinished_frame_@
- pushl %eax
From there..
- lea 10*4(%esp), %eax # address of <previous context>
- cmpl %eax, 4*4(%esp) # if ->sp is unmodified
- jnz .Lmodified_sp_do_fixup_@
- /*
* Fast path; regs->sp wasn't modified, reuse the original IRET frame.
*/
- pop %eax
- add $6*4, %esp
- jmp .Lfinished_frame_@;
+.Lmodified_sp_do_fixup_@:
... until here, needs to go, it is buggy. While a clever idea, it loses updates to regs->ip and ->flags.
- /*
* Reconstruct the 3 entry IRET frame right after the (modified)
* regs->sp without lowering %esp in between, such that an NMI in the
* middle doesn't scribble our stack.
*/
- pushl %ecx
- movl 5*4(%esp), %eax # (modified) regs->sp
- movl 4*4(%esp), %ecx # flags
- movl %ecx, -4(%eax)
- movl 3*4(%esp), %ecx # cs
- andl $0x0000ffff, %ecx
- movl %ecx, -8(%eax)
- movl 2*4(%esp), %ecx # ip
- movl %ecx, -12(%eax)
- movl 1*4(%esp), %ecx # eax
- movl %ecx, -16(%eax)
- popl %ecx
- lea -16(%eax), %esp
- popl %eax
+.Lfinished_frame_@: +.endm
On Mon, May 06, 2019 at 10:19:51AM +0200, Peter Zijlstra wrote:
On Fri, May 03, 2019 at 11:57:22AM -0700, Linus Torvalds wrote:
On Fri, May 3, 2019 at 9:21 AM Andy Lutomirski luto@amacapital.net wrote:
So here’s a somewhat nutty suggestion: how about we tweak the 32-bit entry code to emulate the sane 64-bit frame, not just for int3 but always?
What would the code actually end up looking like? I don't necessarily object, since that kernel_stack_pointer() thing certainly looks horrible, but honestly, my suggestion to just pass in the 'struct pt_regs' and let the call emulation fix it up would have also worked, and avoided that bug (and who knows what else might be hiding).
I really think that you're now hitting all the special case magic low-level crap that I wanted to avoid.
This did actually boot on first try; so there must be something horribly wrong...
Now, I know you like that other approach; but I figured I should at least show you what this one looks like. Maybe I've been staring at entry_32.S too much, but I really don't dislike this.
So this one boots with all of Steve's self-test code enabled.
Yes, it's fairly huge, and it really should be multiple patches. But it does make a lot of the 32bit code a lot more like the 64bit code.
It also fixes a bunch of bugs in the various trampolines (notably the EBP frame pointer crud, which was wrong or missing). And all the weird 32bit exceptions in the unwinder can go. However, I still cannot get an unwind from trace_selftest_test_regs_func() for some reason. Josh?
--- arch/x86/entry/entry_32.S | 136 ++++++++++++++++++++++++++++------- arch/x86/entry/entry_64.S | 14 +++- arch/x86/include/asm/ptrace.h | 4 -- arch/x86/include/asm/text-patching.h | 20 ++++++ arch/x86/kernel/alternative.c | 81 +++++++++++++++++++-- arch/x86/kernel/ftrace_32.S | 85 +++++++++++++--------- arch/x86/kernel/kprobes/common.h | 36 +++++++--- arch/x86/kernel/kprobes/core.c | 27 +++---- arch/x86/kernel/kprobes/opt.c | 20 +++--- arch/x86/kernel/ptrace.c | 29 -------- arch/x86/kernel/unwind_frame.c | 8 --- kernel/trace/trace_selftest.c | 3 + 12 files changed, 323 insertions(+), 140 deletions(-)
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index 7b23431be5cb..183d0cf5c167 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -67,9 +67,20 @@ # define preempt_stop(clobbers) DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF #else # define preempt_stop(clobbers) -# define resume_kernel restore_all_kernel #endif
+.macro RETINT_PREEMPT +#ifdef CONFIG_PREEMPT + DISABLE_INTERRUPTS(CLBR_ANY) + cmpl $0, PER_CPU_VAR(__preempt_count) + jnz .Lend_@ + testl $X86_EFLAGS_IF, PT_EFLAGS(%esp) # interrupts off (exception path) ? + jz .Lend_@ + call preempt_schedule_irq +.Lend_@: +#endif +.endm + .macro TRACE_IRQS_IRET #ifdef CONFIG_TRACE_IRQFLAGS testl $X86_EFLAGS_IF, PT_EFLAGS(%esp) # interrupts off? @@ -203,8 +214,105 @@ .Lend_@: .endm
+#define CS_FROM_ENTRY_STACK (1 << 31) +#define CS_FROM_USER_CR3 (1 << 30) +#define CS_FROM_KERNEL (1 << 29) + +.macro FIXUP_FRAME + +#ifdef CONFIG_VM86 + testl $X86_EFLAGS_VM, 3*4(%esp) + jnz .Lfrom_usermode_no_fixup_@ +#endif + testl $SEGMENT_RPL_MASK, 2*4(%esp) + jnz .Lfrom_usermode_no_fixup_@ + + orl $CS_FROM_KERNEL, 2*4(%esp) + + /* + * When we're here from kernel mode; the (exception) stack looks like: + * + * 4*4(%esp) - <previous context> + * 3*4(%esp) - flags + * 2*4(%esp) - cs + * 1*4(%esp) - ip + * 0*4(%esp) - orig_eax + * + * Lets build a 5 entry IRET frame after that, such that struct pt_regs + * is complete and in particular regs->sp is correct. This gives us + * the original 4 enties as gap: + * + * 10*4(%esp) - <previous context> + * 9*4(%esp) - gap / flags + * 8*4(%esp) - gap / cs + * 7*4(%esp) - gap / ip + * 6*4(%esp) - gap / orig_eax + * 5*4(%esp) - ss + * 4*4(%esp) - sp + * 3*4(%esp) - flags + * 2*4(%esp) - cs + * 1*4(%esp) - ip + * 0*4(%esp) - orig_eax + */ + + pushl %ss # ss + pushl %esp # sp (points at ss) + addl $5*4, (%esp) # point sp back at the previous context + pushl 5*4(%esp) # flags + pushl 5*4(%esp) # cs + pushl 5*4(%esp) # ip + pushl 5*4(%esp) # orig_eax + +.Lfrom_usermode_no_fixup_@: +.endm + +.macro IRET_FRAME + + /* orig_eax is already POP'ed when we're here */ + + testl $CS_FROM_KERNEL, 1*4(%esp) + jz .Lfinished_frame_@ + + /* + * Reconstruct the 3 entry IRET frame right after the (modified) + * regs->sp without lowering %esp in between, such that an NMI in the + * middle doesn't scribble our stack. + */ + pushl %eax + pushl %ecx + movl 5*4(%esp), %eax # (modified) regs->sp + + movl 4*4(%esp), %ecx # flags + movl %ecx, -4(%eax) + + movl 3*4(%esp), %ecx # cs + andl $0x0000ffff, %ecx + movl %ecx, -8(%eax) + + movl 2*4(%esp), %ecx # ip + movl %ecx, -12(%eax) + + movl 1*4(%esp), %ecx # eax + movl %ecx, -16(%eax) + + popl %ecx + lea -16(%eax), %esp + popl %eax + +.Lfinished_frame_@: +.endm + .macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0 + cld + /* + * The high bits of the CS dword (__csh) are used for CS_FROM_*. + * Clear them in case hardware didn't do this for us. + */ + andl $(0x0000ffff), 2*4(%esp) + + FIXUP_FRAME + PUSH_GS pushl %fs pushl %es @@ -375,9 +483,6 @@ * switch to it before we do any copying. */
-#define CS_FROM_ENTRY_STACK (1 << 31) -#define CS_FROM_USER_CR3 (1 << 30) - .macro SWITCH_TO_KERNEL_STACK
ALTERNATIVE "", "jmp .Lend_@", X86_FEATURE_XENPV @@ -391,13 +496,6 @@ * that register for the time this macro runs */
- /* - * The high bits of the CS dword (__csh) are used for - * CS_FROM_ENTRY_STACK and CS_FROM_USER_CR3. Clear them in case - * hardware didn't do this for us. - */ - andl $(0x0000ffff), PT_CS(%esp) - /* Are we on the entry stack? Bail out if not! */ movl PER_CPU_VAR(cpu_entry_area), %ecx addl $CPU_ENTRY_AREA_entry_stack + SIZEOF_entry_stack, %ecx @@ -755,7 +853,7 @@ END(ret_from_fork) andl $SEGMENT_RPL_MASK, %eax #endif cmpl $USER_RPL, %eax - jb resume_kernel # not returning to v8086 or userspace + jb restore_all_kernel # not returning to v8086 or userspace
ENTRY(resume_userspace) DISABLE_INTERRUPTS(CLBR_ANY) @@ -765,18 +863,6 @@ ENTRY(resume_userspace) jmp restore_all END(ret_from_exception)
-#ifdef CONFIG_PREEMPT -ENTRY(resume_kernel) - DISABLE_INTERRUPTS(CLBR_ANY) - cmpl $0, PER_CPU_VAR(__preempt_count) - jnz restore_all_kernel - testl $X86_EFLAGS_IF, PT_EFLAGS(%esp) # interrupts off (exception path) ? - jz restore_all_kernel - call preempt_schedule_irq - jmp restore_all_kernel -END(resume_kernel) -#endif - GLOBAL(__begin_SYSENTER_singlestep_region) /* * All code from here through __end_SYSENTER_singlestep_region is subject @@ -1019,6 +1105,7 @@ ENTRY(entry_INT80_32) /* Restore user state */ RESTORE_REGS pop=4 # skip orig_eax/error_code .Lirq_return: + IRET_FRAME /* * ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization * when returning from IPI handler and when returning from @@ -1027,6 +1114,7 @@ ENTRY(entry_INT80_32) INTERRUPT_RETURN
restore_all_kernel: + RETINT_PREEMPT TRACE_IRQS_IRET PARANOID_EXIT_TO_KERNEL_MODE BUG_IF_WRONG_CR3 diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 20e45d9b4e15..268cd9affe04 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -878,7 +878,7 @@ apicinterrupt IRQ_WORK_VECTOR irq_work_interrupt smp_irq_work_interrupt * @paranoid == 2 is special: the stub will never switch stacks. This is for * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS. */ -.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0 +.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0 create_gap=0 ENTRY(\sym) UNWIND_HINT_IRET_REGS offset=\has_error_code*8
@@ -898,6 +898,16 @@ ENTRY(\sym) jnz .Lfrom_usermode_switch_stack_@ .endif
+ .if \create_gap == 1 + testb $3, CS-ORIG_RAX(%rsp) + jnz .Lfrom_usermode_no_gap_@ + .rept 6 + pushq 5*8(%rsp) + .endr + UNWIND_HINT_IRET_REGS offset=8 +.Lfrom_usermode_no_gap_@: + .endif + .if \paranoid call paranoid_entry .else @@ -1129,7 +1139,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \ #endif /* CONFIG_HYPERV */
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=IST_INDEX_DB ist_offset=DB_STACK_OFFSET -idtentry int3 do_int3 has_error_code=0 +idtentry int3 do_int3 has_error_code=0 create_gap=1 idtentry stack_segment do_stack_segment has_error_code=1
#ifdef CONFIG_XEN_PV diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h index 8a7fc0cca2d1..5ff42dc8b396 100644 --- a/arch/x86/include/asm/ptrace.h +++ b/arch/x86/include/asm/ptrace.h @@ -166,14 +166,10 @@ static inline bool user_64bit_mode(struct pt_regs *regs) #define compat_user_stack_pointer() current_pt_regs()->sp #endif
-#ifdef CONFIG_X86_32 -extern unsigned long kernel_stack_pointer(struct pt_regs *regs); -#else static inline unsigned long kernel_stack_pointer(struct pt_regs *regs) { return regs->sp; } -#endif
#define GET_IP(regs) ((regs)->ip) #define GET_FP(regs) ((regs)->bp) diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h index c90678fd391a..6aac6abf931e 100644 --- a/arch/x86/include/asm/text-patching.h +++ b/arch/x86/include/asm/text-patching.h @@ -42,4 +42,24 @@ extern int after_bootmem; extern __ro_after_init struct mm_struct *poking_mm; extern __ro_after_init unsigned long poking_addr;
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val) +{ + regs->sp -= sizeof(unsigned long); + *(unsigned long *)regs->sp = val; +} + +static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip) +{ + regs->ip = ip; +} + +#define INT3_INSN_SIZE 1 +#define CALL_INSN_SIZE 5 + +static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func) +{ + int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE); + int3_emulate_jmp(regs, func); +} + #endif /* _ASM_X86_TEXT_PATCHING_H */ diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 7b9b49dfc05a..8e1fafffb926 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -614,11 +614,83 @@ extern struct paravirt_patch_site __start_parainstructions[], __stop_parainstructions[]; #endif /* CONFIG_PARAVIRT */
+/* + * Self-test for the INT3 based CALL emulation code. + * + * This exercises int3_emulate_call() to make sure INT3 pt_regs are set up + * properly and that there is a stack gap between the INT3 frame and the + * previous context. Without this gap doing a virtual PUSH on the interrupted + * stack would corrupt the INT3 IRET frame. + * + * See entry_{32,64}.S for more details. + */ +static void __init int3_magic(unsigned int *ptr) +{ + *ptr = 1; +} + +extern __initdata unsigned long int3_selftest_ip; /* defined in asm below */ + +static int __init +int3_exception_notify(struct notifier_block *self, unsigned long val, void *data) +{ + struct die_args *args = data; + struct pt_regs *regs = args->regs; + + if (!regs || user_mode(regs)) + return NOTIFY_DONE; + + if (val != DIE_INT3) + return NOTIFY_DONE; + + if (regs->ip - INT3_INSN_SIZE != int3_selftest_ip) + return NOTIFY_DONE; + + int3_emulate_call(regs, (unsigned long)&int3_magic); + return NOTIFY_STOP; +} + +static void __init int3_selftest(void) +{ + static __initdata struct notifier_block int3_exception_nb = { + .notifier_call = int3_exception_notify, + .priority = INT_MAX-1, /* last */ + }; + unsigned int val = 0; + + BUG_ON(register_die_notifier(&int3_exception_nb)); + + /* + * Basically: int3_magic(&val); but really complicated :-) + * + * Stick the address of the INT3 instruction into int3_selftest_ip, + * then trigger the INT3, padded with NOPs to match a CALL instruction + * length. + */ + asm volatile ("1: int3; nop; nop; nop; nop\n\t" + ".pushsection .init.data,"aw"\n\t" + ".align " __ASM_SEL(4, 8) "\n\t" + ".type int3_selftest_ip, @object\n\t" + ".size int3_selftest_ip, " __ASM_SEL(4, 8) "\n\t" + "int3_selftest_ip:\n\t" + __ASM_SEL(.long, .quad) " 1b\n\t" + ".popsection\n\t" + : : __ASM_SEL_RAW(a, D) (&val) : "memory"); + + BUG_ON(val != 1); + + unregister_die_notifier(&int3_exception_nb); +} + void __init alternative_instructions(void) { - /* The patching is not fully atomic, so try to avoid local interruptions - that might execute the to be patched code. - Other CPUs are not running. */ + int3_selftest(); + + /* + * The patching is not fully atomic, so try to avoid local + * interruptions that might execute the to be patched code. + * Other CPUs are not running. + */ stop_nmi();
/* @@ -643,10 +715,11 @@ void __init alternative_instructions(void) _text, _etext); }
- if (!uniproc_patched || num_possible_cpus() == 1) + if (!uniproc_patched || num_possible_cpus() == 1) { free_init_pages("SMP alternatives", (unsigned long)__smp_locks, (unsigned long)__smp_locks_end); + } #endif
apply_paravirt(__parainstructions, __parainstructions_end); diff --git a/arch/x86/kernel/ftrace_32.S b/arch/x86/kernel/ftrace_32.S index 4c8440de3355..dbf929489b08 100644 --- a/arch/x86/kernel/ftrace_32.S +++ b/arch/x86/kernel/ftrace_32.S @@ -9,6 +9,7 @@ #include <asm/export.h> #include <asm/ftrace.h> #include <asm/nospec-branch.h> +#include <asm/asm-offsets.h>
#ifdef CC_USING_FENTRY # define function_hook __fentry__ @@ -104,26 +105,38 @@ END(ftrace_caller)
ENTRY(ftrace_regs_caller) /* - * i386 does not save SS and ESP when coming from kernel. - * Instead, to get sp, ®s->sp is used (see ptrace.h). - * Unfortunately, that means eflags must be at the same location - * as the current return ip is. We move the return ip into the - * regs->ip location, and move flags into the return ip location. + * We're here from an mcount/fentry CALL, and the stack frame looks like: + * + * <previous context> + * RET-IP + * + * The purpose of this function is to call out in an emulated INT3 + * environment with a stack frame like: + * + * <previous context> + * gap + * gap + * gap + * gap + * pt_regs + * + * We do _NOT_ restore: ss, flags, cs, gs, fs, es, ds */ - pushl $__KERNEL_CS - pushl 4(%esp) /* Save the return ip */ - pushl $0 /* Load 0 into orig_ax */ + subl $3*4, %esp # RET-IP + 3 gaps + pushl %ss # ss + pushl %esp # points at ss + addl $5*4, (%esp) # make it point at <previous context> + pushfl # flags + pushl $__KERNEL_CS # cs + pushl 8*4(%esp) # ip <- RET-IP + pushl $0 # orig_eax + pushl %gs pushl %fs pushl %es pushl %ds - pushl %eax - - /* Get flags and place them into the return ip slot */ - pushf - popl %eax - movl %eax, 8*4(%esp)
+ pushl %eax pushl %ebp pushl %edi pushl %esi @@ -131,28 +144,36 @@ ENTRY(ftrace_regs_caller) pushl %ecx pushl %ebx
- movl 12*4(%esp), %eax /* Load ip (1st parameter) */ - subl $MCOUNT_INSN_SIZE, %eax /* Adjust ip */ + movl PT_EIP(%esp), %eax # 1st argument: IP + subl $MCOUNT_INSN_SIZE, %eax + #ifdef CC_USING_FENTRY - movl 15*4(%esp), %edx /* Load parent ip (2nd parameter) */ + movl 21*4(%esp), %edx # 2nd argument: parent ip #else - movl 0x4(%ebp), %edx /* Load parent ip (2nd parameter) */ + movl 1*4(%ebp), %edx # 2nd argument: parent ip #endif - movl function_trace_op, %ecx /* Save ftrace_pos in 3rd parameter */ - pushl %esp /* Save pt_regs as 4th parameter */ + +#ifdef CONFIG_FRAME_POINTER + movl %esp, %ebp + andl $0x7fffffff, %ebp +#endif + + movl function_trace_op, %ecx # 3rd argument: ftrace_pos + pushl %esp # 4th argument: pt_regs
GLOBAL(ftrace_regs_call) call ftrace_stub
- addl $4, %esp /* Skip pt_regs */ + addl $4, %esp # skip 4th argument
- /* restore flags */ - push 14*4(%esp) - popf + /* place IP below the new SP */ + movl PT_OLDESP(%esp), %eax + movl PT_EIP(%esp), %ecx + movl %ecx, -4(%eax)
- /* Move return ip back to its original location */ - movl 12*4(%esp), %eax - movl %eax, 14*4(%esp) + /* place EAX below that */ + movl PT_EAX(%esp), %ecx + movl %ecx, -8(%eax)
popl %ebx popl %ecx @@ -160,16 +181,12 @@ GLOBAL(ftrace_regs_call) popl %esi popl %edi popl %ebp - popl %eax - popl %ds - popl %es - popl %fs - popl %gs
- /* use lea to not affect flags */ - lea 3*4(%esp), %esp /* Skip orig_ax, ip and cs */ + lea -8(%eax), %esp + popl %eax
jmp .Lftrace_ret + #else /* ! CONFIG_DYNAMIC_FTRACE */
ENTRY(function_hook) diff --git a/arch/x86/kernel/kprobes/common.h b/arch/x86/kernel/kprobes/common.h index 2b949f4fd4d8..f54b5d666169 100644 --- a/arch/x86/kernel/kprobes/common.h +++ b/arch/x86/kernel/kprobes/common.h @@ -6,14 +6,15 @@
#include <asm/asm.h>
+#ifdef CONFIG_X86_64 + #ifdef CONFIG_FRAME_POINTER -# define SAVE_RBP_STRING " push %" _ASM_BP "\n" \ - " mov %" _ASM_SP ", %" _ASM_BP "\n" +#define ENCODE_FRAME_POINTER \ + " leaq 1(%rsp), %rbp\n" #else -# define SAVE_RBP_STRING " push %" _ASM_BP "\n" +#define ENCODE_FRAME_POINTER #endif
-#ifdef CONFIG_X86_64 #define SAVE_REGS_STRING \ /* Skip cs, ip, orig_ax. */ \ " subq $24, %rsp\n" \ @@ -27,11 +28,13 @@ " pushq %r10\n" \ " pushq %r11\n" \ " pushq %rbx\n" \ - SAVE_RBP_STRING \ + " pushq %rbp\n" \ " pushq %r12\n" \ " pushq %r13\n" \ " pushq %r14\n" \ - " pushq %r15\n" + " pushq %r15\n" \ + ENCODE_FRAME_POINTER + #define RESTORE_REGS_STRING \ " popq %r15\n" \ " popq %r14\n" \ @@ -51,19 +54,30 @@ /* Skip orig_ax, ip, cs */ \ " addq $24, %rsp\n" #else + +#ifdef CONFIG_FRAME_POINTER +#define ENCODE_FRAME_POINTER \ + " movl %esp, %ebp\n" \ + " andl $0x7fffffff, %ebp\n" +#else +#define ENCODE_FRAME_POINTER +#endif + #define SAVE_REGS_STRING \ /* Skip cs, ip, orig_ax and gs. */ \ - " subl $16, %esp\n" \ + " subl $4*4, %esp\n" \ " pushl %fs\n" \ " pushl %es\n" \ " pushl %ds\n" \ " pushl %eax\n" \ - SAVE_RBP_STRING \ + " pushl %ebp\n" \ " pushl %edi\n" \ " pushl %esi\n" \ " pushl %edx\n" \ " pushl %ecx\n" \ - " pushl %ebx\n" + " pushl %ebx\n" \ + ENCODE_FRAME_POINTER + #define RESTORE_REGS_STRING \ " popl %ebx\n" \ " popl %ecx\n" \ @@ -72,8 +86,8 @@ " popl %edi\n" \ " popl %ebp\n" \ " popl %eax\n" \ - /* Skip ds, es, fs, gs, orig_ax, and ip. Note: don't pop cs here*/\ - " addl $24, %esp\n" + /* Skip ds, es, fs, gs, orig_ax, ip, and cs. */\ + " addl $7*4, %esp\n" #endif
/* Ensure if the instruction can be boostable */ diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c index 9e4fa2484d10..28d8ba3b9add 100644 --- a/arch/x86/kernel/kprobes/core.c +++ b/arch/x86/kernel/kprobes/core.c @@ -731,29 +731,27 @@ asm( ".global kretprobe_trampoline\n" ".type kretprobe_trampoline, @function\n" "kretprobe_trampoline:\n" -#ifdef CONFIG_X86_64 /* We don't bother saving the ss register */ +#ifdef CONFIG_X86_64 " pushq %rsp\n" " pushfq\n" SAVE_REGS_STRING " movq %rsp, %rdi\n" " call trampoline_handler\n" /* Replace saved sp with true return address. */ - " movq %rax, 152(%rsp)\n" + " movq %rax, 19*8(%rsp)\n" RESTORE_REGS_STRING " popfq\n" #else - " pushf\n" + " pushl %esp\n" + " pushfl\n" SAVE_REGS_STRING " movl %esp, %eax\n" " call trampoline_handler\n" - /* Move flags to cs */ - " movl 56(%esp), %edx\n" - " movl %edx, 52(%esp)\n" - /* Replace saved flags with true return address. */ - " movl %eax, 56(%esp)\n" + /* Replace saved sp with true return address. */ + " movl %eax, 15*4(%esp)\n" RESTORE_REGS_STRING - " popf\n" + " popfl\n" #endif " ret\n" ".size kretprobe_trampoline, .-kretprobe_trampoline\n" @@ -794,16 +792,13 @@ __used __visible void *trampoline_handler(struct pt_regs *regs) INIT_HLIST_HEAD(&empty_rp); kretprobe_hash_lock(current, &head, &flags); /* fixup registers */ -#ifdef CONFIG_X86_64 regs->cs = __KERNEL_CS; - /* On x86-64, we use pt_regs->sp for return address holder. */ - frame_pointer = ®s->sp; -#else - regs->cs = __KERNEL_CS | get_kernel_rpl(); +#ifdef CONFIG_X86_32 + regs->cs |= get_kernel_rpl(); regs->gs = 0; - /* On x86-32, we use pt_regs->flags for return address holder. */ - frame_pointer = ®s->flags; #endif + /* We use pt_regs->sp for return address holder. */ + frame_pointer = ®s->sp; regs->ip = trampoline_address; regs->orig_ax = ~0UL;
diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c index f14262952015..c1010207d036 100644 --- a/arch/x86/kernel/kprobes/opt.c +++ b/arch/x86/kernel/kprobes/opt.c @@ -115,14 +115,15 @@ asm ( "optprobe_template_call:\n" ASM_NOP5 /* Move flags to rsp */ - " movq 144(%rsp), %rdx\n" - " movq %rdx, 152(%rsp)\n" + " movq 18*8(%rsp), %rdx\n" + " movq %rdx, 19*8(%rsp)\n" RESTORE_REGS_STRING /* Skip flags entry */ " addq $8, %rsp\n" " popfq\n" #else /* CONFIG_X86_32 */ - " pushf\n" + " pushl %esp\n" + " pushfl\n" SAVE_REGS_STRING " movl %esp, %edx\n" ".global optprobe_template_val\n" @@ -131,9 +132,13 @@ asm ( ".global optprobe_template_call\n" "optprobe_template_call:\n" ASM_NOP5 + /* Move flags into esp */ + " movl 14*4(%esp), %edx\n" + " movl %edx, 15*4(%esp)\n" RESTORE_REGS_STRING - " addl $4, %esp\n" /* skip cs */ - " popf\n" + /* Skip flags entry */ + " addl $4, %esp\n" + " popfl\n" #endif ".global optprobe_template_end\n" "optprobe_template_end:\n" @@ -165,10 +170,9 @@ optimized_callback(struct optimized_kprobe *op, struct pt_regs *regs) } else { struct kprobe_ctlblk *kcb = get_kprobe_ctlblk(); /* Save skipped registers */ -#ifdef CONFIG_X86_64 regs->cs = __KERNEL_CS; -#else - regs->cs = __KERNEL_CS | get_kernel_rpl(); +#ifdef CONFIG_X86_32 + regs->cs |= get_kernel_rpl(); regs->gs = 0; #endif regs->ip = (unsigned long)op->kp.addr + INT3_SIZE; diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 4b8ee05dd6ad..d13f892d2c47 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -153,35 +153,6 @@ static inline bool invalid_selector(u16 value)
#define FLAG_MASK FLAG_MASK_32
-/* - * X86_32 CPUs don't save ss and esp if the CPU is already in kernel mode - * when it traps. The previous stack will be directly underneath the saved - * registers, and 'sp/ss' won't even have been saved. Thus the '®s->sp'. - * - * Now, if the stack is empty, '®s->sp' is out of range. In this - * case we try to take the previous stack. To always return a non-null - * stack pointer we fall back to regs as stack if no previous stack - * exists. - * - * This is valid only for kernel mode traps. - */ -unsigned long kernel_stack_pointer(struct pt_regs *regs) -{ - unsigned long context = (unsigned long)regs & ~(THREAD_SIZE - 1); - unsigned long sp = (unsigned long)®s->sp; - u32 *prev_esp; - - if (context == (sp & ~(THREAD_SIZE - 1))) - return sp; - - prev_esp = (u32 *)(context); - if (*prev_esp) - return (unsigned long)*prev_esp; - - return (unsigned long)regs; -} -EXPORT_SYMBOL_GPL(kernel_stack_pointer); - static unsigned long *pt_regs_access(struct pt_regs *regs, unsigned long regno) { BUILD_BUG_ON(offsetof(struct pt_regs, bx) != 0); diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c index 9b9fd4826e7a..df51040d1689 100644 --- a/arch/x86/kernel/unwind_frame.c +++ b/arch/x86/kernel/unwind_frame.c @@ -71,10 +71,6 @@ static void unwind_dump(struct unwind_state *state)
static size_t regs_size(struct pt_regs *regs) { - /* x86_32 regs from kernel mode are two words shorter: */ - if (IS_ENABLED(CONFIG_X86_32) && !user_mode(regs)) - return sizeof(*regs) - 2*sizeof(long); - return sizeof(*regs); }
@@ -197,11 +193,7 @@ static struct pt_regs *decode_frame_pointer(unsigned long *bp) } #endif
-#ifdef CONFIG_X86_32 -#define KERNEL_REGS_SIZE (sizeof(struct pt_regs) - 2*sizeof(long)) -#else #define KERNEL_REGS_SIZE (sizeof(struct pt_regs)) -#endif
static bool update_stack_state(struct unwind_state *state, unsigned long *next_bp) diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c index 9d402e7fc949..acfd7a174337 100644 --- a/kernel/trace/trace_selftest.c +++ b/kernel/trace/trace_selftest.c @@ -548,6 +548,9 @@ static void trace_selftest_test_regs_func(unsigned long ip, struct ftrace_ops *op, struct pt_regs *pt_regs) { + + WARN(1, "whoomp!\n"); + if (pt_regs) trace_selftest_regs_stat = TRACE_SELFTEST_REGS_FOUND; else
From: Peter Zijlstra
Sent: 07 May 2019 09:58
...
- /*
* When we're here from kernel mode; the (exception) stack looks like:
*
* 4*4(%esp) - <previous context>
* 3*4(%esp) - flags
* 2*4(%esp) - cs
* 1*4(%esp) - ip
* 0*4(%esp) - orig_eax
Am I right in thinking that this is the only 'INT3' stack frame that needs to be 'fiddled' with? And that the 'emulate a call instruction' has verified that is the case?? So the %cs is always the kernel %cs.
If the 'call target' address is saved in a per-cpu location it ought to be possible to get the code that returns from the INT3 with the call target address (or zero) in %ax. If non-zero, instead of 'pop %ax; iret' execute:

    xchg %eax, 4(%esp)   - swap function address and caller's ip
    push 12(%esp)        - old flags
    mov 14(%esp), %eax   - caller's address over flags
    popf                 - enables interrupts (etc)
    pop %eax
    retf                 - jump to called function and remove %cs
Nothing else needs to be moved.
David
On Tue, May 07, 2019 at 09:18:51AM +0000, David Laight wrote:
From: Peter Zijlstra
Sent: 07 May 2019 09:58
...
- /*
* When we're here from kernel mode; the (exception) stack looks like:
*
* 4*4(%esp) - <previous context>
* 3*4(%esp) - flags
* 2*4(%esp) - cs
* 1*4(%esp) - ip
* 0*4(%esp) - orig_eax
Am I right in thinking that this is the only 'INT3' stack frame that needs to be 'fiddled' with? And that the 'emulate a call instruction' has verified that is the case?? So the %cs is always the kernel %cs.
Only the INT3 thing needs 'the gap', but the far bigger change here is that kernel frames now have a complete pt_regs set and all sorts of horrible crap can go away.
For 32bit 'the gap' happens naturally when building a 5 entry frame. Yes it is possible to build a 5 entry frame on top of the old 3 entry one, but why bother...
From: Peter Zijlstra
Sent: 07 May 2019 12:31 To: David Laight On Tue, May 07, 2019 at 09:18:51AM +0000, David Laight wrote:
From: Peter Zijlstra
Sent: 07 May 2019 09:58
...
- /*
* When we're here from kernel mode; the (exception) stack looks like:
*
* 4*4(%esp) - <previous context>
* 3*4(%esp) - flags
* 2*4(%esp) - cs
* 1*4(%esp) - ip
* 0*4(%esp) - orig_eax
Am I right in thinking that this is the only 'INT3' stack frame that needs to be 'fiddled' with? And that the 'emulate a call instruction' has verified that is the case?? So the %cs is always the kernel %cs.
Only the INT3 thing needs 'the gap', but the far bigger change here is that kernel frames now have a complete pt_regs set and all sorts of horrible crap can go away.
I'm not doubting that generating the 'five register' interrupt stack frame for faults in kernel space makes life simpler; I'm just suggesting that the 'emulated call' can be done by emulating the 'iret' rather than generating a gap in the stack.
For 32bit 'the gap' happens naturally when building a 5 entry frame. Yes it is possible to build a 5 entry frame on top of the old 3 entry one, but why bother...
Presumably there is 'horrid' code to generate the gap in 64bit mode? (less horrid than 32bit, but still horrid?) Or does it copy the entire pt_regs into a local stack frame and use that for the iret?
I've just tried to parse the pseudo code for IRET in the intel docs. Does anyone find that readable? I wonder if you can force 32bit mode to do a stack switch 'iret' by doing something like a far jump to a different %cs ?
David
On Tue, 7 May 2019 12:57:15 +0000 David Laight David.Laight@ACULAB.COM wrote:
Only the INT3 thing needs 'the gap', but the far bigger change here is that kernel frames now have a complete pt_regs set and all sorts of horrible crap can go away.
I'm not doubting that generating the 'five register' interrupt stack frame for faults in kernel space makes life simpler; I'm just suggesting that the 'emulated call' can be done by emulating the 'iret' rather than generating a gap in the stack.
But how would the user put something on the stack? I don't see how emulating an iret helps here. Can you write some pseudo code to explain what you mean? I also believe the gap is only added for kernel->kernel entries.
For 32bit 'the gap' happens naturally when building a 5 entry frame. Yes it is possible to build a 5 entry frame on top of the old 3 entry one, but why bother...
Presumably there is 'horrid' code to generate the gap in 64bit mode? (less horrid than 32bit, but still horrid?) Or does it copy the entire pt_regs into a local stack frame and use that for the iret?
On x86_64, the gap is only done for int3 and nothing else, thus it is much less horrid. That's because x86_64 has a sane pt_regs storage for all exceptions.
-- Steve
From: Steven Rostedt
Sent: 07 May 2019 14:14 On Tue, 7 May 2019 12:57:15 +0000 David Laight David.Laight@ACULAB.COM wrote:
Only the INT3 thing needs 'the gap', but the far bigger change here is that kernel frames now have a complete pt_regs set and all sorts of horrible crap can go away.
I'm not doubting that generating the 'five register' interrupt stack frame for faults in kernel space makes life simpler; I'm just suggesting that the 'emulated call' can be done by emulating the 'iret' rather than generating a gap in the stack.
But how would the user put something on the stack? I don't see how emulating an iret helps here. Can you write some pseudo code to explain what you mean? I also believe the gap is only added for kernel->kernel entries.
The 'user' (ie the kernel code that needs to emulate the call) doesn't write the data to the stack, just to some per-cpu location. (Actually it could be on the stack at the other end of pt-regs.) So you get to the 'register restore and iret' code with the stack unaltered. It is then a SMOP to replace the %flags saved by the int3 with the %ip saved by the int3, the %ip with the address of the function to call, restore the flags (push and popf) and issue a ret.f to remove the %ip and %cs.
(Actually you need to add 4 to the caller's %ip address to allow for the size difference between the int3 (hopefully 0xcc, not 0xcd 0x03) and the 5-byte call instruction.)
For 32bit 'the gap' happens naturally when building a 5 entry frame. Yes it is possible to build a 5 entry frame on top of the old 3 entry one, but why bother...
Presumably there is 'horrid' code to generate the gap in 64bit mode? (less horrid than 32bit, but still horrid?) Or does it copy the entire pt_regs into a local stack frame and use that for the iret?
On x86_64, the gap is only done for int3 and nothing else, thus it is much less horrid. That's because x86_64 has a sane pt_regs storage for all exceptions.
Well, in particular, it always loads %sp as part of the iret. So you can create a gap and the cpu will remove it for you.
In 64bit mode you could overwrite the %ss with the return address to the caller, restore %eax and %flags, push the function address and use ret.n to jump to the function, subtracting the right amount from %esp.
Actually that means you can do the following in both modes:

    if not emulated_call_address then
        pop %ax; iret
    else    # assume kernel<->kernel return
        push emulated_call_address
        push flags_saved_by_int3
        load %ax, return_address_from_iret
        add %ax, #4
        store %ax, first_stack_location_written_by_int3
        load %ax, value_saved_by_int3_entry
        popf
        ret.n
The ret.n discards everything from the %ax to the required return address. So 'n' is the size of the int3 frame, so 12 for i386 and 40 for amd64.
If the register restore (done just before this code) finished with 'add %sp, sizeof *pt_regs' then the emulated_call_address can be loaded in %ax from the other end of pt_regs.
This all reminds me of fixing up the faults that happen in kernel space when loading the user segment registers during 'return to user'.
David
On Tue, 7 May 2019 14:50:26 +0000 David Laight David.Laight@ACULAB.COM wrote:
From: Steven Rostedt
Sent: 07 May 2019 14:14 On Tue, 7 May 2019 12:57:15 +0000 David Laight David.Laight@ACULAB.COM wrote:
The 'user' (ie the kernel code that needs to emulate the call) doesn't write the data to the stack, just to some per-cpu location. (Actually it could be on the stack at the other end of pt-regs.) So you get to the 'register restore and iret' code with the stack unaltered. It is then a SMOP to replace the %flags saved by the int3 with the %ip saved by the int3, the %ip with the address of the function to call, restore the flags (push and popf) and issue a ret.f to remove the %ip and %cs.
How would you handle NMIs doing the same thing? Yes, the NMI handlers have breakpoints that will need emulated calls as well.
(Actually you need to add 4 to the caller's %ip address to allow for the size difference between the int3 (hopefully 0xcc, not 0xcd 0x03) and the 5-byte call instruction.)
For 32bit 'the gap' happens naturally when building a 5 entry frame. Yes it is possible to build a 5 entry frame on top of the old 3 entry one, but why bother...
Presumably there is 'horrid' code to generate the gap in 64bit mode? (less horrid than 32bit, but still horrid?) Or does it copy the entire pt_regs into a local stack frame and use that for the iret?
On x86_64, the gap is only done for int3 and nothing else, thus it is much less horrid. That's because x86_64 has a sane pt_regs storage for all exceptions.
Well, in particular, it always loads %sp as part of the iret. So you can create a gap and the cpu will remove it for you.
In 64bit mode you could overwrite the %ss with the return address to the caller, restore %eax and %flags, push the function address and use ret.n to jump to the function, subtracting the right amount from %esp.
Actually that means you can do the following in both modes:

    if not emulated_call_address then
        pop %ax; iret
    else    # assume kernel<->kernel return
        push emulated_call_address
        push flags_saved_by_int3
        load %ax, return_address_from_iret
        add %ax, #4
        store %ax, first_stack_location_written_by_int3
        load %ax, value_saved_by_int3_entry
        popf
        ret.n
The ret.n discards everything from the %ax to the required return address. So 'n' is the size of the int3 frame, so 12 for i386 and 40 for amd64.
If the register restore (done just before this code) finished with 'add %sp, sizeof *pt_regs' then the emulated_call_address can be loaded in %ax from the other end of pt_regs.
This all reminds me of fixing up the faults that happen in kernel space when loading the user segment registers during 'return to user'.
This all sounds much more complex and fragile than the proposed solution. Why would we do this over what is being proposed?
-- Steve
From: Steven Rostedt
Sent: 07 May 2019 15:57 On Tue, 7 May 2019 14:50:26 +0000 David Laight David.Laight@ACULAB.COM wrote:
From: Steven Rostedt
Sent: 07 May 2019 14:14 On Tue, 7 May 2019 12:57:15 +0000 David Laight David.Laight@ACULAB.COM wrote:
The 'user' (ie the kernel code that needs to emulate the call) doesn't write the data to the stack, just to some per-cpu location. (Actually it could be on the stack at the other end of pt-regs.) So you get to the 'register restore and iret' code with the stack unaltered. It is then a SMOP to replace the %flags saved by the int3 with the %ip saved by the int3, the %ip with the address of the function to call, restore the flags (push and popf) and issue a ret.f to remove the %ip and %cs.
How would you handle NMIs doing the same thing? Yes, the NMI handlers have breakpoints that will need emulated calls as well.
That means you'd have to use a field in the on-stack pt_regs for the 'address to call' rather than a per-cpu variable. Then it would all nest.
...
Actually that means you can do the following in both modes:

    if not emulated_call_address then
        pop %ax; iret
    else    # assume kernel<->kernel return
        push emulated_call_address
        push flags_saved_by_int3
        load %ax, return_address_from_iret
        add %ax, #4
        store %ax, first_stack_location_written_by_int3
        load %ax, value_saved_by_int3_entry
        popf
        ret.n
The ret.n discards everything from the %ax to the required return address. So 'n' is the size of the int3 frame, so 12 for i386 and 40 for amd64.
If the register restore (done just before this code) finished with 'add %sp, sizeof *pt_regs' then the emulated_call_address can be loaded in %ax from the other end of pt_regs.
...
This all sounds much more complex and fragile than the proposed solution. Why would we do this over what is being proposed?
It is all complex and fragile however you do it.
A problem I see with converting the 3-register trap frame to a 5-register one is that the entry code and exit code both have to know whether that conversion is necessary or was done.
AFAICT it is actually quite hard to tell from the stack which form it is. Although the %sp value might tell you, because %ss:%sp might only be pushed when a stack switch happens, so the kernel %sp will be a known value (for the switched-to stack).
The advantage of converting the frame is that, as pointed out earlier, it does let you have a pt_regs that always contains %ss:%sp.
David
On Tue, May 07, 2019 at 12:57:15PM +0000, David Laight wrote:
Only the INT3 thing needs 'the gap', but the far bigger change here is that kernel frames now have a complete pt_regs set and all sorts of horrible crap can go away.
I'm not doubting that generating the 'five register' interrupt stack frame for faults in kernel space makes life simpler; I'm just suggesting that the 'emulated call' can be done by emulating the 'iret' rather than generating a gap in the stack.
The thing you suggested doesn't actually work; INT3 can nest.
For 32bit 'the gap' happens naturally when building a 5 entry frame. Yes it is possible to build a 5 entry frame on top of the old 3 entry one, but why bother...
Presumably there is 'horrid' code to generate the gap in 64bit mode? (less horrid than 32bit, but still horrid?) Or does it copy the entire pt_regs into a local stack frame and use that for the iret?
It's in the patch you replied to; it is so small you might have overlooked it. It simply pushes another copy on top of what was already there.
I've just tried to parse the pseudo code for IRET in the intel docs. Does anyone find that readable?
No; it's abysmal.
I wonder if you can force 32bit mode to do a stack switch 'iret' by doing something like a far jump to a different %cs ?
I don't think that'll work.
On Tue, May 07, 2019 at 10:57:53AM +0200, Peter Zijlstra wrote:
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c index 9e4fa2484d10..28d8ba3b9add 100644 --- a/arch/x86/kernel/kprobes/core.c +++ b/arch/x86/kernel/kprobes/core.c @@ -731,29 +731,27 @@ asm( ".global kretprobe_trampoline\n" ".type kretprobe_trampoline, @function\n" "kretprobe_trampoline:\n" -#ifdef CONFIG_X86_64 /* We don't bother saving the ss register */ +#ifdef CONFIG_X86_64 " pushq %rsp\n" " pushfq\n" SAVE_REGS_STRING " movq %rsp, %rdi\n" " call trampoline_handler\n" /* Replace saved sp with true return address. */
- " movq %rax, 152(%rsp)\n"
- " movq %rax, 19*8(%rsp)\n" RESTORE_REGS_STRING " popfq\n"
#else
- " pushf\n"
- " pushl %esp\n"
- " pushfl\n" SAVE_REGS_STRING " movl %esp, %eax\n" " call trampoline_handler\n"
- /* Move flags to cs */
- " movl 56(%esp), %edx\n"
- " movl %edx, 52(%esp)\n"
- /* Replace saved flags with true return address. */
- " movl %eax, 56(%esp)\n"
- /* Replace saved sp with true return address. */
- " movl %eax, 15*4(%esp)\n" RESTORE_REGS_STRING
- " popf\n"
- " popfl\n"
#endif " ret\n" ".size kretprobe_trampoline, .-kretprobe_trampoline\n"
diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c index f14262952015..c1010207d036 100644 --- a/arch/x86/kernel/kprobes/opt.c +++ b/arch/x86/kernel/kprobes/opt.c @@ -115,14 +115,15 @@ asm ( "optprobe_template_call:\n" ASM_NOP5 /* Move flags to rsp */
" movq 144(%rsp), %rdx\n"
" movq %rdx, 152(%rsp)\n"
" movq 18*8(%rsp), %rdx\n"
" movq %rdx, 19*8(%rsp)\n" RESTORE_REGS_STRING /* Skip flags entry */ " addq $8, %rsp\n" " popfq\n"
#else /* CONFIG_X86_32 */
" pushf\n"
" pushl %esp\n"
" pushfl\n" SAVE_REGS_STRING " movl %esp, %edx\n" ".global optprobe_template_val\n"
@@ -131,9 +132,13 @@ asm ( ".global optprobe_template_call\n" "optprobe_template_call:\n" ASM_NOP5
/* Move flags into esp */
" movl 14*4(%esp), %edx\n"
" movl %edx, 15*4(%esp)\n" RESTORE_REGS_STRING
" addl $4, %esp\n" /* skip cs */
" popf\n"
/* Skip flags entry */
" addl $4, %esp\n"
" popfl\n"
#endif ".global optprobe_template_end\n" "optprobe_template_end:\n"
FWIW, both these trampolines assume a kprobe will not int3_emulate_{push/call}(), for both bitnesses.
But then; I'm thinking kprobes should be inspection only and not modify things. So that might just be good enough.
On Tue, 7 May 2019 11:27:31 +0200 Peter Zijlstra peterz@infradead.org wrote:
FWIW, both these trampolines assume a kprobe will not int3_emulate_{push/call}(), for both bitnesses.
But then; I'm thinking kprobes should be inspection only and not modify things. So that might just be good enough.
I believe there are kprobe calls that do modify things. Note, they can modify regs->ip. Kprobes sets the FTRACE_OPS_FL_IPMODIFY flag, thus they can never be put at the same location that is being live patched.
-- Steve
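For a concrete (and entirely hypothetical) illustration of the kind of modification being discussed: a kprobe pre_handler can redirect execution by rewriting regs->ip. The probed symbol and replacement function below are made up; only the mechanism matters.

#include <linux/kprobes.h>

static long replacement_func(void)		/* hypothetical stand-in */
{
	return 0;
}

static int redirect_pre(struct kprobe *p, struct pt_regs *regs)
{
	/* Divert execution away from the probed instruction. */
	instruction_pointer_set(regs, (unsigned long)replacement_func);

	/* Non-zero return: regs->ip was changed, skip single-stepping the original. */
	return 1;
}

static struct kprobe redirect_kp = {
	.symbol_name	= "some_probed_function",	/* hypothetical */
	.pre_handler	= redirect_pre,
};

Registering redirect_kp with register_kprobe() is all that is needed; note that nothing in this sketch touches regs->sp, which is exactly the property being relied on here.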
On Tue, May 07, 2019 at 08:27:16AM -0400, Steven Rostedt wrote:
On Tue, 7 May 2019 11:27:31 +0200 Peter Zijlstra peterz@infradead.org wrote:
FWIW, both these trampolines assume a kprobe will not int3_emulate_{push/call}(), for both bitnesses.
But then; I'm thinking kprobes should be inspection only and not modify things. So that might just be good enough.
I believe there are kprobe calls that do modify things. Note, they can modify regs->ip.
The kprobe pre_handler as used by kretprobes does, and that is indeed handled by the trampolines.
Kprobes sets the FTRACE_OPS_FL_IPMODIFY flag, thus they can never be put at the same location that is being live patched.
OK, so do we want to allow kprobes that also modify regs->sp ? Because then we need to change these trampolines a bit.
I'd prefer not to allow kprobes this.
On Tue, 7 May 2019 14:41:31 +0200 Peter Zijlstra peterz@infradead.org wrote:
Kprobes sets the FTRACE_OPS_FL_IPMODIFY flag, thus they can never be put at the same location that is being live patched.
OK, so do we want to allow kprobes that also modify regs->sp ? Because then we need to change these trampolines a bit.
I'd prefer not to allow kprobes this.
I believe no kprobe changes sp, because it would have had the same issues we are trying to solve now. And even though we are changing things to allow it, it's not a regression to keep kprobes from doing it.
-- Steve
On Tue, 7 May 2019 21:54 Steven Rostedt rostedt@goodmis.org wrote:
On Tue, 7 May 2019 14:41:31 +0200 Peter Zijlstra peterz@infradead.org wrote:
Kprobes sets the FTRACE_OPS_FL_IPMODIFY flag, thus they can never be put at the same location that is being live patched.
OK, so do we want to allow kprobes that also modify regs->sp ? Because then we need to change these trampolines a bit.
I'd prefer not to allow kprobes this.
I believe no kprobe changes sp, because it would have had the same issues we are trying to solve now. And even though we are changing things to allow it, it's not a regression to keep kprobes from doing it.
No, kprobes doesn't allow changing sp. At this moment we can't change "&regs->sp" since it is just a value :) A kprobes user (e.g. function fault-injection) will change regs->ip, which is why kprobes sets the IPMODIFY flag.
Thank you,
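For readers wondering why "&regs->sp" is only a value there: on x86_32 a kernel-mode trap frame has no saved sp/ss, which is exactly the wart the kernel_stack_pointer() helper (deleted by the patch later in this thread) exists to work around. Restated here with comments:

/* Condensed from the x86_32-only kernel_stack_pointer() that the series removes. */
unsigned long kernel_stack_pointer(struct pt_regs *regs)
{
	unsigned long context = (unsigned long)regs & ~(THREAD_SIZE - 1);
	unsigned long sp = (unsigned long)&regs->sp;	/* no saved sp for kernel traps */
	u32 *prev_esp;

	/* If &regs->sp still lies within this task's stack, that is the old %esp. */
	if (context == (sp & ~(THREAD_SIZE - 1)))
		return sp;

	/* Otherwise fall back to the previous stack recorded at the stack base. */
	prev_esp = (u32 *)context;
	if (*prev_esp)
		return (unsigned long)*prev_esp;

	return (unsigned long)regs;
}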
On Tue, May 07, 2019 at 10:57:53AM +0200, Peter Zijlstra wrote:
So this one boots with all of Steve's self-test code enabled.
The selftests need improvement; the 'ftrace regs test' passed even though that trampoline was buggered.
Yes, it's fairly huge, and it really should be multiple patches. But it does make a lot of the 32bit code a lot more like the 64bit code.
It also fixes a bunch of bugs in the various trampolines (notably the EBP frame pointer crud, which was wrong or missing). And all the weird 32bit exceptions in the unwinder can go. However, I still cannot get an unwind from trace_selftest_test_regs_func() for some reason. Josh?
 arch/x86/entry/entry_32.S            | 136 ++++++++++++++++++++++++++++-------
 arch/x86/entry/entry_64.S            |  14 +++-
 arch/x86/include/asm/ptrace.h        |   4 --
 arch/x86/include/asm/text-patching.h |  20 ++++++
 arch/x86/kernel/alternative.c        |  81 +++++++++++++++++++--
 arch/x86/kernel/ftrace.c             |  25 +++++--
 arch/x86/kernel/ftrace_32.S          |  85 +++++++++++++---------
 arch/x86/kernel/kprobes/common.h     |  36 +++++++---
 arch/x86/kernel/kprobes/core.c       |  27 +++----
 arch/x86/kernel/kprobes/opt.c        |  20 +++---
 arch/x86/kernel/ptrace.c             |  29 --------
 arch/x86/kernel/unwind_frame.c       |   8 ---
 kernel/trace/trace_selftest.c        |   3 +
 13 files changed, 343 insertions(+), 145 deletions(-)
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index 7b23431be5cb..183d0cf5c167 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -67,9 +67,20 @@ # define preempt_stop(clobbers) DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF #else # define preempt_stop(clobbers) -# define resume_kernel restore_all_kernel #endif
+.macro RETINT_PREEMPT +#ifdef CONFIG_PREEMPT + DISABLE_INTERRUPTS(CLBR_ANY) + cmpl $0, PER_CPU_VAR(__preempt_count) + jnz .Lend_@ + testl $X86_EFLAGS_IF, PT_EFLAGS(%esp) # interrupts off (exception path) ? + jz .Lend_@ + call preempt_schedule_irq +.Lend_@: +#endif +.endm + .macro TRACE_IRQS_IRET #ifdef CONFIG_TRACE_IRQFLAGS testl $X86_EFLAGS_IF, PT_EFLAGS(%esp) # interrupts off? @@ -203,8 +214,105 @@ .Lend_@: .endm
+#define CS_FROM_ENTRY_STACK (1 << 31) +#define CS_FROM_USER_CR3 (1 << 30) +#define CS_FROM_KERNEL (1 << 29) + +.macro FIXUP_FRAME + +#ifdef CONFIG_VM86 + testl $X86_EFLAGS_VM, 3*4(%esp) + jnz .Lfrom_usermode_no_fixup_@ +#endif + testl $SEGMENT_RPL_MASK, 2*4(%esp) + jnz .Lfrom_usermode_no_fixup_@ + + orl $CS_FROM_KERNEL, 2*4(%esp) + + /* + * When we're here from kernel mode; the (exception) stack looks like: + * + * 4*4(%esp) - <previous context> + * 3*4(%esp) - flags + * 2*4(%esp) - cs + * 1*4(%esp) - ip + * 0*4(%esp) - orig_eax + * + * Lets build a 5 entry IRET frame after that, such that struct pt_regs + * is complete and in particular regs->sp is correct. This gives us + * the original 4 enties as gap: + * + * 10*4(%esp) - <previous context> + * 9*4(%esp) - gap / flags + * 8*4(%esp) - gap / cs + * 7*4(%esp) - gap / ip + * 6*4(%esp) - gap / orig_eax + * 5*4(%esp) - ss + * 4*4(%esp) - sp + * 3*4(%esp) - flags + * 2*4(%esp) - cs + * 1*4(%esp) - ip + * 0*4(%esp) - orig_eax + */ + + pushl %ss # ss + pushl %esp # sp (points at ss) + addl $5*4, (%esp) # point sp back at the previous context + pushl 5*4(%esp) # flags + pushl 5*4(%esp) # cs + pushl 5*4(%esp) # ip + pushl 5*4(%esp) # orig_eax + +.Lfrom_usermode_no_fixup_@: +.endm + +.macro IRET_FRAME + + /* orig_eax is already POP'ed when we're here */ + + testl $CS_FROM_KERNEL, 1*4(%esp) + jz .Lfinished_frame_@ + + /* + * Reconstruct the 3 entry IRET frame right after the (modified) + * regs->sp without lowering %esp in between, such that an NMI in the + * middle doesn't scribble our stack. + */ + pushl %eax + pushl %ecx + movl 5*4(%esp), %eax # (modified) regs->sp + + movl 4*4(%esp), %ecx # flags + movl %ecx, -4(%eax) + + movl 3*4(%esp), %ecx # cs + andl $0x0000ffff, %ecx + movl %ecx, -8(%eax) + + movl 2*4(%esp), %ecx # ip + movl %ecx, -12(%eax) + + movl 1*4(%esp), %ecx # eax + movl %ecx, -16(%eax) + + popl %ecx + lea -16(%eax), %esp + popl %eax + +.Lfinished_frame_@: +.endm + .macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0 + cld + /* + * The high bits of the CS dword (__csh) are used for CS_FROM_*. + * Clear them in case hardware didn't do this for us. + */ + andl $(0x0000ffff), 2*4(%esp) + + FIXUP_FRAME + PUSH_GS pushl %fs pushl %es @@ -375,9 +483,6 @@ * switch to it before we do any copying. */
-#define CS_FROM_ENTRY_STACK (1 << 31) -#define CS_FROM_USER_CR3 (1 << 30) - .macro SWITCH_TO_KERNEL_STACK
ALTERNATIVE "", "jmp .Lend_@", X86_FEATURE_XENPV @@ -391,13 +496,6 @@ * that register for the time this macro runs */
- /* - * The high bits of the CS dword (__csh) are used for - * CS_FROM_ENTRY_STACK and CS_FROM_USER_CR3. Clear them in case - * hardware didn't do this for us. - */ - andl $(0x0000ffff), PT_CS(%esp) - /* Are we on the entry stack? Bail out if not! */ movl PER_CPU_VAR(cpu_entry_area), %ecx addl $CPU_ENTRY_AREA_entry_stack + SIZEOF_entry_stack, %ecx @@ -755,7 +853,7 @@ END(ret_from_fork) andl $SEGMENT_RPL_MASK, %eax #endif cmpl $USER_RPL, %eax - jb resume_kernel # not returning to v8086 or userspace + jb restore_all_kernel # not returning to v8086 or userspace
ENTRY(resume_userspace) DISABLE_INTERRUPTS(CLBR_ANY) @@ -765,18 +863,6 @@ ENTRY(resume_userspace) jmp restore_all END(ret_from_exception)
-#ifdef CONFIG_PREEMPT -ENTRY(resume_kernel) - DISABLE_INTERRUPTS(CLBR_ANY) - cmpl $0, PER_CPU_VAR(__preempt_count) - jnz restore_all_kernel - testl $X86_EFLAGS_IF, PT_EFLAGS(%esp) # interrupts off (exception path) ? - jz restore_all_kernel - call preempt_schedule_irq - jmp restore_all_kernel -END(resume_kernel) -#endif - GLOBAL(__begin_SYSENTER_singlestep_region) /* * All code from here through __end_SYSENTER_singlestep_region is subject @@ -1019,6 +1105,7 @@ ENTRY(entry_INT80_32) /* Restore user state */ RESTORE_REGS pop=4 # skip orig_eax/error_code .Lirq_return: + IRET_FRAME /* * ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization * when returning from IPI handler and when returning from @@ -1027,6 +1114,7 @@ ENTRY(entry_INT80_32) INTERRUPT_RETURN
restore_all_kernel: + RETINT_PREEMPT TRACE_IRQS_IRET PARANOID_EXIT_TO_KERNEL_MODE BUG_IF_WRONG_CR3 diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 20e45d9b4e15..268cd9affe04 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -878,7 +878,7 @@ apicinterrupt IRQ_WORK_VECTOR irq_work_interrupt smp_irq_work_interrupt * @paranoid == 2 is special: the stub will never switch stacks. This is for * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS. */ -.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0 +.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0 create_gap=0 ENTRY(\sym) UNWIND_HINT_IRET_REGS offset=\has_error_code*8
@@ -898,6 +898,16 @@ ENTRY(\sym) jnz .Lfrom_usermode_switch_stack_@ .endif
+ .if \create_gap == 1 + testb $3, CS-ORIG_RAX(%rsp) + jnz .Lfrom_usermode_no_gap_@ + .rept 6 + pushq 5*8(%rsp) + .endr + UNWIND_HINT_IRET_REGS offset=8 +.Lfrom_usermode_no_gap_@: + .endif + .if \paranoid call paranoid_entry .else @@ -1129,7 +1139,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \ #endif /* CONFIG_HYPERV */
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=IST_INDEX_DB ist_offset=DB_STACK_OFFSET -idtentry int3 do_int3 has_error_code=0 +idtentry int3 do_int3 has_error_code=0 create_gap=1 idtentry stack_segment do_stack_segment has_error_code=1
#ifdef CONFIG_XEN_PV diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h index 8a7fc0cca2d1..5ff42dc8b396 100644 --- a/arch/x86/include/asm/ptrace.h +++ b/arch/x86/include/asm/ptrace.h @@ -166,14 +166,10 @@ static inline bool user_64bit_mode(struct pt_regs *regs) #define compat_user_stack_pointer() current_pt_regs()->sp #endif
-#ifdef CONFIG_X86_32 -extern unsigned long kernel_stack_pointer(struct pt_regs *regs); -#else static inline unsigned long kernel_stack_pointer(struct pt_regs *regs) { return regs->sp; } -#endif
#define GET_IP(regs) ((regs)->ip) #define GET_FP(regs) ((regs)->bp) diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h index c90678fd391a..6aac6abf931e 100644 --- a/arch/x86/include/asm/text-patching.h +++ b/arch/x86/include/asm/text-patching.h @@ -42,4 +42,24 @@ extern int after_bootmem; extern __ro_after_init struct mm_struct *poking_mm; extern __ro_after_init unsigned long poking_addr;
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val) +{ + regs->sp -= sizeof(unsigned long); + *(unsigned long *)regs->sp = val; +} + +static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip) +{ + regs->ip = ip; +} + +#define INT3_INSN_SIZE 1 +#define CALL_INSN_SIZE 5 + +static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func) +{ + int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE); + int3_emulate_jmp(regs, func); +} + #endif /* _ASM_X86_TEXT_PATCHING_H */ diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 7b9b49dfc05a..8e1fafffb926 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -614,11 +614,83 @@ extern struct paravirt_patch_site __start_parainstructions[], __stop_parainstructions[]; #endif /* CONFIG_PARAVIRT */
+/* + * Self-test for the INT3 based CALL emulation code. + * + * This exercises int3_emulate_call() to make sure INT3 pt_regs are set up + * properly and that there is a stack gap between the INT3 frame and the + * previous context. Without this gap doing a virtual PUSH on the interrupted + * stack would corrupt the INT3 IRET frame. + * + * See entry_{32,64}.S for more details. + */ +static void __init int3_magic(unsigned int *ptr) +{ + *ptr = 1; +} + +extern __initdata unsigned long int3_selftest_ip; /* defined in asm below */ + +static int __init +int3_exception_notify(struct notifier_block *self, unsigned long val, void *data) +{ + struct die_args *args = data; + struct pt_regs *regs = args->regs; + + if (!regs || user_mode(regs)) + return NOTIFY_DONE; + + if (val != DIE_INT3) + return NOTIFY_DONE; + + if (regs->ip - INT3_INSN_SIZE != int3_selftest_ip) + return NOTIFY_DONE; + + int3_emulate_call(regs, (unsigned long)&int3_magic); + return NOTIFY_STOP; +} + +static void __init int3_selftest(void) +{ + static __initdata struct notifier_block int3_exception_nb = { + .notifier_call = int3_exception_notify, + .priority = INT_MAX-1, /* last */ + }; + unsigned int val = 0; + + BUG_ON(register_die_notifier(&int3_exception_nb)); + + /* + * Basically: int3_magic(&val); but really complicated :-) + * + * Stick the address of the INT3 instruction into int3_selftest_ip, + * then trigger the INT3, padded with NOPs to match a CALL instruction + * length. + */ + asm volatile ("1: int3; nop; nop; nop; nop\n\t" + ".pushsection .init.data,"aw"\n\t" + ".align " __ASM_SEL(4, 8) "\n\t" + ".type int3_selftest_ip, @object\n\t" + ".size int3_selftest_ip, " __ASM_SEL(4, 8) "\n\t" + "int3_selftest_ip:\n\t" + __ASM_SEL(.long, .quad) " 1b\n\t" + ".popsection\n\t" + : : __ASM_SEL_RAW(a, D) (&val) : "memory"); + + BUG_ON(val != 1); + + unregister_die_notifier(&int3_exception_nb); +} + void __init alternative_instructions(void) { - /* The patching is not fully atomic, so try to avoid local interruptions - that might execute the to be patched code. - Other CPUs are not running. */ + int3_selftest(); + + /* + * The patching is not fully atomic, so try to avoid local + * interruptions that might execute the to be patched code. + * Other CPUs are not running. + */ stop_nmi();
/* @@ -643,10 +715,11 @@ void __init alternative_instructions(void) _text, _etext); }
- if (!uniproc_patched || num_possible_cpus() == 1) + if (!uniproc_patched || num_possible_cpus() == 1) { free_init_pages("SMP alternatives", (unsigned long)__smp_locks, (unsigned long)__smp_locks_end); + } #endif
apply_paravirt(__parainstructions, __parainstructions_end); diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c index 0caf8122d680..a4eea7bad4a1 100644 --- a/arch/x86/kernel/ftrace.c +++ b/arch/x86/kernel/ftrace.c @@ -29,6 +29,7 @@ #include <asm/kprobes.h> #include <asm/ftrace.h> #include <asm/nops.h> +#include <asm/text-patching.h>
#ifdef CONFIG_DYNAMIC_FTRACE
@@ -231,6 +232,7 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr, }
static unsigned long ftrace_update_func; +static unsigned long ftrace_update_func_call;
static int update_ftrace_func(unsigned long ip, void *new) { @@ -259,6 +261,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func) unsigned char *new; int ret;
+ ftrace_update_func_call = (unsigned long)func; + new = ftrace_call_replace(ip, (unsigned long)func); ret = update_ftrace_func(ip, new);
@@ -294,13 +298,21 @@ int ftrace_int3_handler(struct pt_regs *regs) if (WARN_ON_ONCE(!regs)) return 0;
- ip = regs->ip - 1; - if (!ftrace_location(ip) && !is_ftrace_caller(ip)) - return 0; + ip = regs->ip - INT3_INSN_SIZE;
- regs->ip += MCOUNT_INSN_SIZE - 1; + if (ftrace_location(ip)) { + int3_emulate_call(regs, (unsigned long)ftrace_regs_caller); + return 1; + } else if (is_ftrace_caller(ip)) { + if (!ftrace_update_func_call) { + int3_emulate_jmp(regs, ip + CALL_INSN_SIZE); + return 1; + } + int3_emulate_call(regs, ftrace_update_func_call); + return 1; + }
- return 1; + return 0; } NOKPROBE_SYMBOL(ftrace_int3_handler);
@@ -865,6 +877,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
func = ftrace_ops_get_func(ops);
+ ftrace_update_func_call = (unsigned long)func; + /* Do a safe modify in case the trampoline is executing */ new = ftrace_call_replace(ip, (unsigned long)func); ret = update_ftrace_func(ip, new); @@ -966,6 +980,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func) { unsigned char *new;
+ ftrace_update_func_call = 0UL; new = ftrace_jmp_replace(ip, (unsigned long)func);
return update_ftrace_func(ip, new); diff --git a/arch/x86/kernel/ftrace_32.S b/arch/x86/kernel/ftrace_32.S index 4c8440de3355..6ff9911adbb7 100644 --- a/arch/x86/kernel/ftrace_32.S +++ b/arch/x86/kernel/ftrace_32.S @@ -9,6 +9,7 @@ #include <asm/export.h> #include <asm/ftrace.h> #include <asm/nospec-branch.h> +#include <asm/asm-offsets.h>
#ifdef CC_USING_FENTRY # define function_hook __fentry__ @@ -104,26 +105,38 @@ END(ftrace_caller)
ENTRY(ftrace_regs_caller) /* - * i386 does not save SS and ESP when coming from kernel. - * Instead, to get sp, ®s->sp is used (see ptrace.h). - * Unfortunately, that means eflags must be at the same location - * as the current return ip is. We move the return ip into the - * regs->ip location, and move flags into the return ip location. + * We're here from an mcount/fentry CALL, and the stack frame looks like: + * + * <previous context> + * RET-IP + * + * The purpose of this function is to call out in an emulated INT3 + * environment with a stack frame like: + * + * <previous context> + * gap / RET-IP + * gap + * gap + * gap + * pt_regs + * + * We do _NOT_ restore: ss, flags, cs, gs, fs, es, ds */ - pushl $__KERNEL_CS - pushl 4(%esp) /* Save the return ip */ - pushl $0 /* Load 0 into orig_ax */ + subl $3*4, %esp # RET-IP + 3 gaps + pushl %ss # ss + pushl %esp # points at ss + addl $5*4, (%esp) # make it point at <previous context> + pushfl # flags + pushl $__KERNEL_CS # cs + pushl 7*4(%esp) # ip <- RET-IP + pushl $0 # orig_eax + pushl %gs pushl %fs pushl %es pushl %ds - pushl %eax - - /* Get flags and place them into the return ip slot */ - pushf - popl %eax - movl %eax, 8*4(%esp)
+ pushl %eax pushl %ebp pushl %edi pushl %esi @@ -131,28 +144,36 @@ ENTRY(ftrace_regs_caller) pushl %ecx pushl %ebx
- movl 12*4(%esp), %eax /* Load ip (1st parameter) */ - subl $MCOUNT_INSN_SIZE, %eax /* Adjust ip */ + movl PT_EIP(%esp), %eax # 1st argument: IP + subl $MCOUNT_INSN_SIZE, %eax + #ifdef CC_USING_FENTRY - movl 15*4(%esp), %edx /* Load parent ip (2nd parameter) */ + movl 21*4(%esp), %edx # 2nd argument: parent ip #else - movl 0x4(%ebp), %edx /* Load parent ip (2nd parameter) */ + movl 1*4(%ebp), %edx # 2nd argument: parent ip #endif - movl function_trace_op, %ecx /* Save ftrace_pos in 3rd parameter */ - pushl %esp /* Save pt_regs as 4th parameter */ + +#ifdef CONFIG_FRAME_POINTER + movl %esp, %ebp + andl $0x7fffffff, %ebp +#endif + + movl function_trace_op, %ecx # 3rd argument: ftrace_pos + pushl %esp # 4th argument: pt_regs
GLOBAL(ftrace_regs_call) call ftrace_stub
- addl $4, %esp /* Skip pt_regs */ + addl $4, %esp # skip 4th argument
- /* restore flags */ - push 14*4(%esp) - popf + /* place IP below the new SP */ + movl PT_OLDESP(%esp), %eax + movl PT_EIP(%esp), %ecx + movl %ecx, -4(%eax)
- /* Move return ip back to its original location */ - movl 12*4(%esp), %eax - movl %eax, 14*4(%esp) + /* place EAX below that */ + movl PT_EAX(%esp), %ecx + movl %ecx, -8(%eax)
popl %ebx popl %ecx @@ -160,16 +181,12 @@ GLOBAL(ftrace_regs_call) popl %esi popl %edi popl %ebp - popl %eax - popl %ds - popl %es - popl %fs - popl %gs
- /* use lea to not affect flags */ - lea 3*4(%esp), %esp /* Skip orig_ax, ip and cs */ + lea -8(%eax), %esp + popl %eax
jmp .Lftrace_ret + #else /* ! CONFIG_DYNAMIC_FTRACE */
ENTRY(function_hook) diff --git a/arch/x86/kernel/kprobes/common.h b/arch/x86/kernel/kprobes/common.h index 2b949f4fd4d8..f54b5d666169 100644 --- a/arch/x86/kernel/kprobes/common.h +++ b/arch/x86/kernel/kprobes/common.h @@ -6,14 +6,15 @@
#include <asm/asm.h>
+#ifdef CONFIG_X86_64 + #ifdef CONFIG_FRAME_POINTER -# define SAVE_RBP_STRING " push %" _ASM_BP "\n" \ - " mov %" _ASM_SP ", %" _ASM_BP "\n" +#define ENCODE_FRAME_POINTER \ + " leaq 1(%rsp), %rbp\n" #else -# define SAVE_RBP_STRING " push %" _ASM_BP "\n" +#define ENCODE_FRAME_POINTER #endif
-#ifdef CONFIG_X86_64 #define SAVE_REGS_STRING \ /* Skip cs, ip, orig_ax. */ \ " subq $24, %rsp\n" \ @@ -27,11 +28,13 @@ " pushq %r10\n" \ " pushq %r11\n" \ " pushq %rbx\n" \ - SAVE_RBP_STRING \ + " pushq %rbp\n" \ " pushq %r12\n" \ " pushq %r13\n" \ " pushq %r14\n" \ - " pushq %r15\n" + " pushq %r15\n" \ + ENCODE_FRAME_POINTER + #define RESTORE_REGS_STRING \ " popq %r15\n" \ " popq %r14\n" \ @@ -51,19 +54,30 @@ /* Skip orig_ax, ip, cs */ \ " addq $24, %rsp\n" #else + +#ifdef CONFIG_FRAME_POINTER +#define ENCODE_FRAME_POINTER \ + " movl %esp, %ebp\n" \ + " andl $0x7fffffff, %ebp\n" +#else +#define ENCODE_FRAME_POINTER +#endif + #define SAVE_REGS_STRING \ /* Skip cs, ip, orig_ax and gs. */ \ - " subl $16, %esp\n" \ + " subl $4*4, %esp\n" \ " pushl %fs\n" \ " pushl %es\n" \ " pushl %ds\n" \ " pushl %eax\n" \ - SAVE_RBP_STRING \ + " pushl %ebp\n" \ " pushl %edi\n" \ " pushl %esi\n" \ " pushl %edx\n" \ " pushl %ecx\n" \ - " pushl %ebx\n" + " pushl %ebx\n" \ + ENCODE_FRAME_POINTER + #define RESTORE_REGS_STRING \ " popl %ebx\n" \ " popl %ecx\n" \ @@ -72,8 +86,8 @@ " popl %edi\n" \ " popl %ebp\n" \ " popl %eax\n" \ - /* Skip ds, es, fs, gs, orig_ax, and ip. Note: don't pop cs here*/\ - " addl $24, %esp\n" + /* Skip ds, es, fs, gs, orig_ax, ip, and cs. */\ + " addl $7*4, %esp\n" #endif
/* Ensure if the instruction can be boostable */ diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c index 9e4fa2484d10..28d8ba3b9add 100644 --- a/arch/x86/kernel/kprobes/core.c +++ b/arch/x86/kernel/kprobes/core.c @@ -731,29 +731,27 @@ asm( ".global kretprobe_trampoline\n" ".type kretprobe_trampoline, @function\n" "kretprobe_trampoline:\n" -#ifdef CONFIG_X86_64 /* We don't bother saving the ss register */ +#ifdef CONFIG_X86_64 " pushq %rsp\n" " pushfq\n" SAVE_REGS_STRING " movq %rsp, %rdi\n" " call trampoline_handler\n" /* Replace saved sp with true return address. */ - " movq %rax, 152(%rsp)\n" + " movq %rax, 19*8(%rsp)\n" RESTORE_REGS_STRING " popfq\n" #else - " pushf\n" + " pushl %esp\n" + " pushfl\n" SAVE_REGS_STRING " movl %esp, %eax\n" " call trampoline_handler\n" - /* Move flags to cs */ - " movl 56(%esp), %edx\n" - " movl %edx, 52(%esp)\n" - /* Replace saved flags with true return address. */ - " movl %eax, 56(%esp)\n" + /* Replace saved sp with true return address. */ + " movl %eax, 15*4(%esp)\n" RESTORE_REGS_STRING - " popf\n" + " popfl\n" #endif " ret\n" ".size kretprobe_trampoline, .-kretprobe_trampoline\n" @@ -794,16 +792,13 @@ __used __visible void *trampoline_handler(struct pt_regs *regs) INIT_HLIST_HEAD(&empty_rp); kretprobe_hash_lock(current, &head, &flags); /* fixup registers */ -#ifdef CONFIG_X86_64 regs->cs = __KERNEL_CS; - /* On x86-64, we use pt_regs->sp for return address holder. */ - frame_pointer = ®s->sp; -#else - regs->cs = __KERNEL_CS | get_kernel_rpl(); +#ifdef CONFIG_X86_32 + regs->cs |= get_kernel_rpl(); regs->gs = 0; - /* On x86-32, we use pt_regs->flags for return address holder. */ - frame_pointer = ®s->flags; #endif + /* We use pt_regs->sp for return address holder. */ + frame_pointer = ®s->sp; regs->ip = trampoline_address; regs->orig_ax = ~0UL;
diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c index f14262952015..c1010207d036 100644 --- a/arch/x86/kernel/kprobes/opt.c +++ b/arch/x86/kernel/kprobes/opt.c @@ -115,14 +115,15 @@ asm ( "optprobe_template_call:\n" ASM_NOP5 /* Move flags to rsp */ - " movq 144(%rsp), %rdx\n" - " movq %rdx, 152(%rsp)\n" + " movq 18*8(%rsp), %rdx\n" + " movq %rdx, 19*8(%rsp)\n" RESTORE_REGS_STRING /* Skip flags entry */ " addq $8, %rsp\n" " popfq\n" #else /* CONFIG_X86_32 */ - " pushf\n" + " pushl %esp\n" + " pushfl\n" SAVE_REGS_STRING " movl %esp, %edx\n" ".global optprobe_template_val\n" @@ -131,9 +132,13 @@ asm ( ".global optprobe_template_call\n" "optprobe_template_call:\n" ASM_NOP5 + /* Move flags into esp */ + " movl 14*4(%esp), %edx\n" + " movl %edx, 15*4(%esp)\n" RESTORE_REGS_STRING - " addl $4, %esp\n" /* skip cs */ - " popf\n" + /* Skip flags entry */ + " addl $4, %esp\n" + " popfl\n" #endif ".global optprobe_template_end\n" "optprobe_template_end:\n" @@ -165,10 +170,9 @@ optimized_callback(struct optimized_kprobe *op, struct pt_regs *regs) } else { struct kprobe_ctlblk *kcb = get_kprobe_ctlblk(); /* Save skipped registers */ -#ifdef CONFIG_X86_64 regs->cs = __KERNEL_CS; -#else - regs->cs = __KERNEL_CS | get_kernel_rpl(); +#ifdef CONFIG_X86_32 + regs->cs |= get_kernel_rpl(); regs->gs = 0; #endif regs->ip = (unsigned long)op->kp.addr + INT3_SIZE; diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 4b8ee05dd6ad..d13f892d2c47 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -153,35 +153,6 @@ static inline bool invalid_selector(u16 value)
#define FLAG_MASK FLAG_MASK_32
-/* - * X86_32 CPUs don't save ss and esp if the CPU is already in kernel mode - * when it traps. The previous stack will be directly underneath the saved - * registers, and 'sp/ss' won't even have been saved. Thus the '®s->sp'. - * - * Now, if the stack is empty, '®s->sp' is out of range. In this - * case we try to take the previous stack. To always return a non-null - * stack pointer we fall back to regs as stack if no previous stack - * exists. - * - * This is valid only for kernel mode traps. - */ -unsigned long kernel_stack_pointer(struct pt_regs *regs) -{ - unsigned long context = (unsigned long)regs & ~(THREAD_SIZE - 1); - unsigned long sp = (unsigned long)®s->sp; - u32 *prev_esp; - - if (context == (sp & ~(THREAD_SIZE - 1))) - return sp; - - prev_esp = (u32 *)(context); - if (*prev_esp) - return (unsigned long)*prev_esp; - - return (unsigned long)regs; -} -EXPORT_SYMBOL_GPL(kernel_stack_pointer); - static unsigned long *pt_regs_access(struct pt_regs *regs, unsigned long regno) { BUILD_BUG_ON(offsetof(struct pt_regs, bx) != 0); diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c index 9b9fd4826e7a..df51040d1689 100644 --- a/arch/x86/kernel/unwind_frame.c +++ b/arch/x86/kernel/unwind_frame.c @@ -71,10 +71,6 @@ static void unwind_dump(struct unwind_state *state)
static size_t regs_size(struct pt_regs *regs) { - /* x86_32 regs from kernel mode are two words shorter: */ - if (IS_ENABLED(CONFIG_X86_32) && !user_mode(regs)) - return sizeof(*regs) - 2*sizeof(long); - return sizeof(*regs); }
@@ -197,11 +193,7 @@ static struct pt_regs *decode_frame_pointer(unsigned long *bp) } #endif
-#ifdef CONFIG_X86_32 -#define KERNEL_REGS_SIZE (sizeof(struct pt_regs) - 2*sizeof(long)) -#else #define KERNEL_REGS_SIZE (sizeof(struct pt_regs)) -#endif
static bool update_stack_state(struct unwind_state *state, unsigned long *next_bp) diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c index 9d402e7fc949..acfd7a174337 100644 --- a/kernel/trace/trace_selftest.c +++ b/kernel/trace/trace_selftest.c @@ -548,6 +548,9 @@ static void trace_selftest_test_regs_func(unsigned long ip, struct ftrace_ops *op, struct pt_regs *pt_regs) { + + WARN(1, "whoomp!\n"); + if (pt_regs) trace_selftest_regs_stat = TRACE_SELFTEST_REGS_FOUND; else
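To make the new helpers easier to follow, here is a condensed restatement of what the ftrace_int3_handler() hunk above does with them (same logic, WARN/regs checks trimmed, comments added):

static int condensed_ftrace_int3_handler(struct pt_regs *regs)
{
	/* regs->ip points just past the INT3; step back to the patched site. */
	unsigned long ip = regs->ip - INT3_INSN_SIZE;

	if (ftrace_location(ip)) {
		/* A traced function: emulate a call to the ftrace regs caller. */
		int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
		return 1;
	} else if (is_ftrace_caller(ip)) {
		if (!ftrace_update_func_call) {
			/* No call target recorded (the site is becoming a jmp/nop): skip it. */
			int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
			return 1;
		}
		/* The call target is being switched: call the new target directly. */
		int3_emulate_call(regs, ftrace_update_func_call);
		return 1;
	}

	return 0;	/* not one of ours */
}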
On Thu, 2 May 2019 11:02:40 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
Indeed, the 32-bit case for same-RPL exceptions/iret is entirely different, and I'd forgotten about that.
And honestly, this makes the 32-bit case much worse. Now the entry stack modifications of int3 suddenly affect not just the entry, but every exit too.
This is _exactly_ the kind of subtle kernel entry/exit code I wanted us to avoid.
I just want to point out that I never got the trampoline version working on i386. I didn't have the time to debug why it would crash when stressed.
-- Steve
On Thu, May 02, 2019 at 11:02:40AM -0700, Linus Torvalds wrote:
On Thu, May 2, 2019 at 9:21 AM Peter Zijlstra peterz@infradead.org wrote:
TL;DR, on x86_32 kernel->kernel IRET frames are only 3 entries and do not include ESP/SS, so not only wasn't regs->sp setup, if you changed it it wouldn't be effective and corrupt random stack state.
Indeed, the 32-bit case for same-RPL exceptions/iret is entirely different, and I'd forgotten about that.
And honestly, this makes the 32-bit case much worse. Now the entry stack modifications of int3 suddenly affect not just the entry, but every exit too.
This is _exactly_ the kind of subtle kernel entry/exit code I wanted us to avoid.
I actually love this patch (absent the bugs). This is already something that has been sorely needed for years.
The "struct pt_regs is incomplete on x86-32" thing is a monstrosity which has long been a source of confusion and bugs. Sure, this patch adds some complexity to the entry code, but on the other hand it actually makes it possible to use pt_regs sanely: regs->sp is no longer uninitialized. So a class of (very non-obvious) bugs is eliminated.
I don't think it would make sense to make this change for int3 only, because the benefits are global.
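Concretely, the ptrace.h hunk in the patch above reduces the x86_32 special case to the trivial accessor that 64-bit always had, which is what makes regs->sp trustworthy:

/* Valid on 32-bit kernel-mode frames too, now that they are full 5-entry frames. */
static inline unsigned long kernel_stack_pointer(struct pt_regs *regs)
{
	return regs->sp;
}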