Add bpf trampoline support for arm64. Most of the logic is the same as on x86.
Tested on raspberry pi 4b and qemu with KASLR disabled (avoid long jump), result:

#9  /1  bpf_cookie/kprobe:OK
#9  /2  bpf_cookie/multi_kprobe_link_api:FAIL
#9  /3  bpf_cookie/multi_kprobe_attach_api:FAIL
#9  /4  bpf_cookie/uprobe:OK
#9  /5  bpf_cookie/tracepoint:OK
#9  /6  bpf_cookie/perf_event:OK
#9  /7  bpf_cookie/trampoline:OK
#9  /8  bpf_cookie/lsm:OK
#9      bpf_cookie:FAIL
#18 /1  bpf_tcp_ca/dctcp:OK
#18 /2  bpf_tcp_ca/cubic:OK
#18 /3  bpf_tcp_ca/invalid_license:OK
#18 /4  bpf_tcp_ca/dctcp_fallback:OK
#18 /5  bpf_tcp_ca/rel_setsockopt:OK
#18     bpf_tcp_ca:OK
#51 /1  dummy_st_ops/dummy_st_ops_attach:OK
#51 /2  dummy_st_ops/dummy_init_ret_value:OK
#51 /3  dummy_st_ops/dummy_init_ptr_arg:OK
#51 /4  dummy_st_ops/dummy_multiple_args:OK
#51     dummy_st_ops:OK
#55     fentry_fexit:OK
#56     fentry_test:OK
#57 /1  fexit_bpf2bpf/target_no_callees:OK
#57 /2  fexit_bpf2bpf/target_yes_callees:OK
#57 /3  fexit_bpf2bpf/func_replace:OK
#57 /4  fexit_bpf2bpf/func_replace_verify:OK
#57 /5  fexit_bpf2bpf/func_sockmap_update:OK
#57 /6  fexit_bpf2bpf/func_replace_return_code:OK
#57 /7  fexit_bpf2bpf/func_map_prog_compatibility:OK
#57 /8  fexit_bpf2bpf/func_replace_multi:OK
#57 /9  fexit_bpf2bpf/fmod_ret_freplace:OK
#57     fexit_bpf2bpf:OK
#58     fexit_sleep:OK
#59     fexit_stress:OK
#60     fexit_test:OK
#67     get_func_args_test:OK
#68     get_func_ip_test:OK
#104    modify_return:OK
#237    xdp_bpf2bpf:OK
bpf_cookie/multi_kprobe_link_api and bpf_cookie/multi_kprobe_attach_api failed due to lack of multi_kprobe on arm64.
v5:
- As Alexei suggested, remove is_valid_bpf_tramp_flags()
v4: https://lore.kernel.org/bpf/20220517071838.3366093-1-xukuohai@huawei.com/
- Run the test cases on raspberry pi 4b
- Rebase and add cookie to trampoline
- As Steve suggested, move trace_direct_tramp() back to entry-ftrace.S to
  avoid messing up generic code with architecture specific code
- As Jakub suggested, merge patch 4 and patch 5 of v3 to provide full
  function in one patch
- As Mark suggested, add a comment for the use of
  aarch64_insn_patch_text_nosync()
- Do not generate trampoline for long jump to avoid triggering ftrace_bug
- Round stack size to multiples of 16B to avoid SPAlignmentFault
- Use callee saved register x20 to reduce the use of mov_i64
- Add missing BTI J instructions
- Trivial spelling and code style fixes
v3: https://lore.kernel.org/bpf/20220424154028.1698685-1-xukuohai@huawei.com/
- Append test results for bpf_tcp_ca, dummy_st_ops, fexit_bpf2bpf,
  xdp_bpf2bpf
- Support to poke bpf progs
- Fix return value of arch_prepare_bpf_trampoline() to the total number of
  bytes instead of number of instructions
- Do not check whether CONFIG_DYNAMIC_FTRACE_WITH_REGS is enabled in
  arch_prepare_bpf_trampoline, since the trampoline may be hooked to a bpf
  prog
- Restrict bpf_arch_text_poke() to poke bpf text only, as kernel functions
  are poked by ftrace
- Rewrite trace_direct_tramp() in inline assembly in trace_selftest.c to
  avoid messing entry-ftrace.S
- Isolate arch_ftrace_set_direct_caller() with macro
  CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS to avoid compile error when
  this macro is disabled
- Some trivial code style fixes
v2: https://lore.kernel.org/bpf/20220414162220.1985095-1-xukuohai@huawei.com/
- Add Song's ACK
- Change the multi-line comment in is_valid_bpf_tramp_flags() into net
  style (patch 3)
- Fix a deadloop issue in ftrace selftest (patch 2)
- Replace pt_regs->x0 with pt_regs->orig_x0 in patch 1 commit message
- Replace "bpf trampoline" with "custom trampoline" in patch 1, as ftrace
  direct call is not only used by bpf trampoline
v1: https://lore.kernel.org/bpf/20220413054959.1053668-1-xukuohai@huawei.com/
Xu Kuohai (6):
  arm64: ftrace: Add ftrace direct call support
  ftrace: Fix deadloop caused by direct call in ftrace selftest
  bpf: Remove is_valid_bpf_tramp_flags()
  bpf, arm64: Implement bpf_arch_text_poke() for arm64
  bpf, arm64: bpf trampoline for arm64
  selftests/bpf: Fix trivial typo in fentry_fexit.c
 arch/arm64/Kconfig                            |   2 +
 arch/arm64/include/asm/ftrace.h               |  22 +
 arch/arm64/kernel/asm-offsets.c               |   1 +
 arch/arm64/kernel/entry-ftrace.S              |  28 +-
 arch/arm64/net/bpf_jit.h                      |   1 +
 arch/arm64/net/bpf_jit_comp.c                 | 523 +++++++++++++++++-
 arch/x86/net/bpf_jit_comp.c                   |  20 -
 kernel/bpf/bpf_struct_ops.c                   |   3 +
 kernel/bpf/trampoline.c                       |   3 +
 kernel/trace/trace_selftest.c                 |   2 +
 .../selftests/bpf/prog_tests/fentry_fexit.c   |   4 +-
 11 files changed, 570 insertions(+), 39 deletions(-)
Add ftrace direct support for arm64.
1. When there is only a custom trampoline, replace the fentry nop with a jump instruction that jumps directly to the custom trampoline.

2. When the ftrace trampoline and a custom trampoline coexist, jump from fentry to the ftrace trampoline first, then jump to the custom trampoline when the ftrace trampoline exits. The currently unused register pt_regs->orig_x0 is used as an intermediary for jumping from the ftrace trampoline to the custom trampoline.
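For context, a direct call is consumed through the existing register_ftrace_direct() API. The sketch below is only an illustration modelled on samples/ftrace/ftrace-direct.c; my_tramp is a hypothetical arch-specific assembly stub (not part of this patch) that would have to preserve the argument registers and return to the patched function:

#include <linux/module.h>
#include <linux/ftrace.h>
#include <linux/mm.h>

/* Hypothetical assembly stub, defined elsewhere; see the x86 sample
 * in samples/ftrace/ftrace-direct.c for what such a stub looks like. */
extern void my_tramp(void);

static int __init direct_call_example_init(void)
{
	/* Point the fentry of handle_mm_fault() at my_tramp. */
	return register_ftrace_direct((unsigned long)handle_mm_fault,
				      (unsigned long)my_tramp);
}

static void __exit direct_call_example_exit(void)
{
	unregister_ftrace_direct((unsigned long)handle_mm_fault,
				 (unsigned long)my_tramp);
}

module_init(direct_call_example_init);
module_exit(direct_call_example_exit);
MODULE_LICENSE("GPL");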
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Acked-by: Song Liu <songliubraving@fb.com>
---
 arch/arm64/Kconfig               |  2 ++
 arch/arm64/include/asm/ftrace.h  | 12 ++++++++++++
 arch/arm64/kernel/asm-offsets.c  |  1 +
 arch/arm64/kernel/entry-ftrace.S | 18 +++++++++++++++---
 4 files changed, 30 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 57c4c995965f..81cc330daafc 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -177,6 +177,8 @@ config ARM64
 	select HAVE_DYNAMIC_FTRACE
 	select HAVE_DYNAMIC_FTRACE_WITH_REGS \
 		if $(cc-option,-fpatchable-function-entry=2)
+	select HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS \
+		if DYNAMIC_FTRACE_WITH_REGS
 	select FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY \
 		if DYNAMIC_FTRACE_WITH_REGS
 	select HAVE_EFFICIENT_UNALIGNED_ACCESS
diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
index 1494cfa8639b..14a35a5df0a1 100644
--- a/arch/arm64/include/asm/ftrace.h
+++ b/arch/arm64/include/asm/ftrace.h
@@ -78,6 +78,18 @@ static inline unsigned long ftrace_call_adjust(unsigned long addr)
 	return addr;
 }
 
+#ifdef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
+static inline void arch_ftrace_set_direct_caller(struct pt_regs *regs,
+						 unsigned long addr)
+{
+	/*
+	 * Place custom trampoline address in regs->orig_x0 to let ftrace
+	 * trampoline jump to it.
+	 */
+	regs->orig_x0 = addr;
+}
+#endif /* CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS */
+
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
 struct dyn_ftrace;
 int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec);
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 1197e7679882..b1ed0bf01c59 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -80,6 +80,7 @@ int main(void)
 	DEFINE(S_SDEI_TTBR1, offsetof(struct pt_regs, sdei_ttbr1));
 	DEFINE(S_PMR_SAVE, offsetof(struct pt_regs, pmr_save));
 	DEFINE(S_STACKFRAME, offsetof(struct pt_regs, stackframe));
+	DEFINE(S_ORIG_X0, offsetof(struct pt_regs, orig_x0));
 	DEFINE(PT_REGS_SIZE, sizeof(struct pt_regs));
 	BLANK();
 #ifdef CONFIG_COMPAT
diff --git a/arch/arm64/kernel/entry-ftrace.S b/arch/arm64/kernel/entry-ftrace.S
index e535480a4069..dfe62c55e3a2 100644
--- a/arch/arm64/kernel/entry-ftrace.S
+++ b/arch/arm64/kernel/entry-ftrace.S
@@ -60,6 +60,9 @@
 	str	x29, [sp, #S_FP]
 	.endif
 
+	/* Set orig_x0 to zero */
+	str	xzr, [sp, #S_ORIG_X0]
+
 	/* Save the callsite's SP and LR */
 	add	x10, sp, #(PT_REGS_SIZE + 16)
 	stp	x9, x10, [sp, #S_LR]
@@ -119,12 +122,21 @@ ftrace_common_return:
 	/* Restore the callsite's FP, LR, PC */
 	ldr	x29, [sp, #S_FP]
 	ldr	x30, [sp, #S_LR]
-	ldr	x9, [sp, #S_PC]
-
+	ldr	x10, [sp, #S_PC]
+
+	ldr	x11, [sp, #S_ORIG_X0]
+	cbz	x11, 1f
+	/* Set x9 to parent ip before jump to custom trampoline */
+	mov	x9, x30
+	/* Set lr to self ip */
+	ldr	x30, [sp, #S_PC]
+	/* Set x10 (used for return address) to custom trampoline */
+	mov	x10, x11
+1:
 	/* Restore the callsite's SP */
 	add	sp, sp, #PT_REGS_SIZE + 16
 
-	ret	x9
+	ret	x10
 SYM_CODE_END(ftrace_common)
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
On Wed, May 18, 2022 at 3:53 PM Xu Kuohai xukuohai@huawei.com wrote:
Add ftrace direct support for arm64.
When there is custom trampoline only, replace the fentry nop to a jump instruction that jumps directly to the custom trampoline.
When ftrace trampoline and custom trampoline coexist, jump from fentry to ftrace trampoline first, then jump to custom trampoline when ftrace trampoline exits. The current unused register pt_regs->orig_x0 is used as an intermediary for jumping from ftrace trampoline to custom trampoline.
Signed-off-by: Xu Kuohai xukuohai@huawei.com Acked-by: Song Liu songliubraving@fb.com
Acked-by: KP Singh kpsingh@kernel.org
On Wed, May 18, 2022 at 09:16:33AM -0400, Xu Kuohai wrote:
Add ftrace direct support for arm64.
When there is custom trampoline only, replace the fentry nop to a jump instruction that jumps directly to the custom trampoline.
When ftrace trampoline and custom trampoline coexist, jump from fentry to ftrace trampoline first, then jump to custom trampoline when ftrace trampoline exits. The current unused register pt_regs->orig_x0 is used as an intermediary for jumping from ftrace trampoline to custom trampoline.
For those of us not all that familiar with BPF, can you explain *why* you want this? The above explains what the patch implements, but not why that's useful.
e.g. is this just to avoid the overhead of the ops list processing in the regular ftrace code, or is the custom trampoline there to allow you to do something special?
There is another patch series on the list from some of your colleagues which uses dynamic trampolines to try to avoid that ops list overhead, and it's not clear to me whether these are trying to solve the largely same problem or something different. That other thread is at:
https://lore.kernel.org/linux-arm-kernel/20220316100132.244849-1-bobo.shaobo...
... and I've added the relevant parties to CC here, since there doesn't seem to be any overlap in the CC lists of the two threads.
In that other thread I've suggested a general approach we could follow at:
https://lore.kernel.org/linux-arm-kernel/YmGF%2FOpIhAF8YeVq@lakrids/
As noted in that thread, I have a few concerns which equally apply here:
* Due to the limited range of BL instructions, it's not always possible to patch an ftrace call-site to branch to an arbitrary trampoline. The way this works for ftrace today relies upon knowing the set of trampolines at compile-time, and allocating module PLTs for those, and that approach cannot work reliably for dynamically allocated trampolines.
I'd strongly prefer to avoid custom trampolines unless they're strictly necessary for functional reasons, so that we can have this work reliably and consistently.
* If this is mostly about avoiding the ops list processing overhead, I believe we can implement some custom ops support more generally in ftrace which would still use a common trampoline but could directly call into those custom ops. I would strongly prefer this over custom trampolines.
* I'm looking to minimize the set of regs ftrace saves, and never save a full pt_regs, since today we (incompletely) fill that with bogus values and cannot acquire some state reliably (e.g. PSTATE). I'd like to avoid usage of pt_regs unless necessary, and I don't want to add additional reliance upon that structure.
Signed-off-by: Xu Kuohai xukuohai@huawei.com Acked-by: Song Liu songliubraving@fb.com
arch/arm64/Kconfig | 2 ++ arch/arm64/include/asm/ftrace.h | 12 ++++++++++++ arch/arm64/kernel/asm-offsets.c | 1 + arch/arm64/kernel/entry-ftrace.S | 18 +++++++++++++++--- 4 files changed, 30 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 57c4c995965f..81cc330daafc 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -177,6 +177,8 @@ config ARM64 select HAVE_DYNAMIC_FTRACE select HAVE_DYNAMIC_FTRACE_WITH_REGS \ if $(cc-option,-fpatchable-function-entry=2)
+	select HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS \
+		if DYNAMIC_FTRACE_WITH_REGS
 	select FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY \
 		if DYNAMIC_FTRACE_WITH_REGS
 	select HAVE_EFFICIENT_UNALIGNED_ACCESS
diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h index 1494cfa8639b..14a35a5df0a1 100644 --- a/arch/arm64/include/asm/ftrace.h +++ b/arch/arm64/include/asm/ftrace.h @@ -78,6 +78,18 @@ static inline unsigned long ftrace_call_adjust(unsigned long addr) return addr; } +#ifdef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS +static inline void arch_ftrace_set_direct_caller(struct pt_regs *regs,
unsigned long addr)
+{
- /*
* Place custom trampoline address in regs->orig_x0 to let ftrace
* trampoline jump to it.
*/
- regs->orig_x0 = addr;
+} +#endif /* CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS */
Please, let's not abuse pt_regs::orig_x0 for this. That's at best unnecessarily confusing, and if we really need a field to place a value like this it implies we should add an ftrace-specific structure to hold the ftrace-specific context information.
Thanks, Mark.
#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS struct dyn_ftrace; int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec); diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c index 1197e7679882..b1ed0bf01c59 100644 --- a/arch/arm64/kernel/asm-offsets.c +++ b/arch/arm64/kernel/asm-offsets.c @@ -80,6 +80,7 @@ int main(void) DEFINE(S_SDEI_TTBR1, offsetof(struct pt_regs, sdei_ttbr1)); DEFINE(S_PMR_SAVE, offsetof(struct pt_regs, pmr_save)); DEFINE(S_STACKFRAME, offsetof(struct pt_regs, stackframe));
- DEFINE(S_ORIG_X0, offsetof(struct pt_regs, orig_x0)); DEFINE(PT_REGS_SIZE, sizeof(struct pt_regs)); BLANK();
#ifdef CONFIG_COMPAT diff --git a/arch/arm64/kernel/entry-ftrace.S b/arch/arm64/kernel/entry-ftrace.S index e535480a4069..dfe62c55e3a2 100644 --- a/arch/arm64/kernel/entry-ftrace.S +++ b/arch/arm64/kernel/entry-ftrace.S @@ -60,6 +60,9 @@ str x29, [sp, #S_FP] .endif
- /* Set orig_x0 to zero */
- str xzr, [sp, #S_ORIG_X0]
- /* Save the callsite's SP and LR */ add x10, sp, #(PT_REGS_SIZE + 16) stp x9, x10, [sp, #S_LR]
@@ -119,12 +122,21 @@ ftrace_common_return: /* Restore the callsite's FP, LR, PC */ ldr x29, [sp, #S_FP] ldr x30, [sp, #S_LR]
- ldr x9, [sp, #S_PC]
- ldr x10, [sp, #S_PC]
- ldr x11, [sp, #S_ORIG_X0]
- cbz x11, 1f
- /* Set x9 to parent ip before jump to custom trampoline */
- mov x9, x30
- /* Set lr to self ip */
- ldr x30, [sp, #S_PC]
- /* Set x10 (used for return address) to custom trampoline */
- mov x10, x11
+1: /* Restore the callsite's SP */ add sp, sp, #PT_REGS_SIZE + 16
- ret x9
- ret x10
SYM_CODE_END(ftrace_common)
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
2.30.2
On 5/25/2022 9:38 PM, Mark Rutland wrote:
On Wed, May 18, 2022 at 09:16:33AM -0400, Xu Kuohai wrote:
Add ftrace direct support for arm64.
When there is custom trampoline only, replace the fentry nop to a jump instruction that jumps directly to the custom trampoline.
When ftrace trampoline and custom trampoline coexist, jump from fentry to ftrace trampoline first, then jump to custom trampoline when ftrace trampoline exits. The current unused register pt_regs->orig_x0 is used as an intermediary for jumping from ftrace trampoline to custom trampoline.
For those of us not all that familiar with BPF, can you explain *why* you want this? The above explains what the patch implements, but not why that's useful.
e.g. is this just to avoid the overhead of the ops list processing in the regular ftrace code, or is the custom trampoline there to allow you to do something special?
IIUC, ftrace direct call was designed to *remove* the unnecessary overhead of saving regs completely [1][2].
[1] https://lore.kernel.org/all/20191022175052.frjzlnjjfwwfov64@ast-mbp.dhcp.the... [2] https://lore.kernel.org/all/20191108212834.594904349@goodmis.org/
This patch itself is just a variant of [3].
[3] https://lore.kernel.org/all/20191108213450.891579507@goodmis.org/
There is another patch series on the list from some of your colleagues which uses dynamic trampolines to try to avoid that ops list overhead, and it's not clear to me whether these are trying to solve the largely same problem or something different. That other thread is at:
https://lore.kernel.org/linux-arm-kernel/20220316100132.244849-1-bobo.shaobo...
... and I've added the relevant parties to CC here, since there doesn't seem to be any overlap in the CC lists of the two threads.
We're not working to solve the same problem. The trampoline introduced in this series helps us to monitor kernel function or another bpf prog with bpf, and also helps us to use bpf prog like a normal kernel function pointer.
In that other thread I've suggested a general approach we could follow at: https://lore.kernel.org/linux-arm-kernel/YmGF%2FOpIhAF8YeVq@lakrids/
Is it possible for a kernel function to need a long jump to the common trampoline when we get a huge kernel image?
As noted in that thread, I have a few concerns which equally apply here:
- Due to the limited range of BL instructions, it's not always possible to patch an ftrace call-site to branch to an arbitrary trampoline. The way this works for ftrace today relies upon knowingthe set of trampolines at compile-time, and allocating module PLTs for those, and that approach cannot work reliably for dynanically allocated trampolines.
Currently patch 5 returns -ENOTSUPP when a long jump is detected, so no bpf trampoline is constructed for an out-of-range patch site:
	if (is_long_jump(orig_call, image))
		return -ENOTSUPP;
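(For reference, the check being described boils down to the reach of an arm64 BL/B instruction, which encodes a signed 26-bit word offset, i.e. +/-128MiB. Below is a minimal sketch of such a helper, not the series' actual code:)

#include <stdbool.h>
#include <stdint.h>

#define SZ_128M	(128 * 1024 * 1024L)

/* Sketch only: does 'target' lie outside the +/-128MiB reach of a
 * BL/B instruction placed at 'ip'? */
static bool is_long_jump(void *ip, void *target)
{
	long offset = (long)target - (long)ip;

	return offset < -SZ_128M || offset >= SZ_128M;
}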
I'd strongly prefer to avoid custom tramplines unless they're strictly necessary for functional reasons, so that we can have this work reliably and consistently.
bpf trampoline is needed by bpf itself, not to replace ftrace trampolines.
- If this is mostly about avoiding the ops list processing overhead, I
beleive
we can implement some custom ops support more generally in ftrace which would still use a common trampoline but could directly call into those custom ops. I would strongly prefer this over custom trampolines.
- I'm looking to minimize the set of regs ftrace saves, and never save a full pt_regs, since today we (incompletely) fill that with bogus values and cannot acquire some state reliably (e.g. PSTATE). I'd like to avoid usage of pt_regs unless necessary, and I don't want to add additional reliance upon that structure.
Even if such a common trampoline is used, a bpf trampoline is still necessary since we need to construct custom instructions to implement bpf functions, for example, to implement a kernel function pointer with a bpf prog.
Signed-off-by: Xu Kuohai xukuohai@huawei.com Acked-by: Song Liu songliubraving@fb.com
arch/arm64/Kconfig | 2 ++ arch/arm64/include/asm/ftrace.h | 12 ++++++++++++ arch/arm64/kernel/asm-offsets.c | 1 + arch/arm64/kernel/entry-ftrace.S | 18 +++++++++++++++--- 4 files changed, 30 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 57c4c995965f..81cc330daafc 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -177,6 +177,8 @@ config ARM64 select HAVE_DYNAMIC_FTRACE select HAVE_DYNAMIC_FTRACE_WITH_REGS \ if $(cc-option,-fpatchable-function-entry=2)
+	select HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS \
+		if DYNAMIC_FTRACE_WITH_REGS
 	select FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY \
 		if DYNAMIC_FTRACE_WITH_REGS
 	select HAVE_EFFICIENT_UNALIGNED_ACCESS
diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h index 1494cfa8639b..14a35a5df0a1 100644 --- a/arch/arm64/include/asm/ftrace.h +++ b/arch/arm64/include/asm/ftrace.h @@ -78,6 +78,18 @@ static inline unsigned long ftrace_call_adjust(unsigned long addr) return addr; } +#ifdef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS +static inline void arch_ftrace_set_direct_caller(struct pt_regs *regs,
unsigned long addr)
+{
- /*
* Place custom trampoline address in regs->orig_x0 to let ftrace
* trampoline jump to it.
*/
- regs->orig_x0 = addr;
+} +#endif /* CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS */
Please, let's not abuse pt_regs::orig_x0 for this. That's at best unnecessarily confusing, and if we really need a field to place a value like this it implies we should add an ftrace-specific structure to hold the ftrace-specific context information.
Sorry for the confusion; this was modelled on the x86 approach:
https://lore.kernel.org/all/20191108213450.891579507@goodmis.org/
Thanks, Mark.
#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS struct dyn_ftrace; int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec); diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c index 1197e7679882..b1ed0bf01c59 100644 --- a/arch/arm64/kernel/asm-offsets.c +++ b/arch/arm64/kernel/asm-offsets.c @@ -80,6 +80,7 @@ int main(void) DEFINE(S_SDEI_TTBR1, offsetof(struct pt_regs, sdei_ttbr1)); DEFINE(S_PMR_SAVE, offsetof(struct pt_regs, pmr_save)); DEFINE(S_STACKFRAME, offsetof(struct pt_regs, stackframe));
- DEFINE(S_ORIG_X0, offsetof(struct pt_regs, orig_x0)); DEFINE(PT_REGS_SIZE, sizeof(struct pt_regs)); BLANK();
#ifdef CONFIG_COMPAT diff --git a/arch/arm64/kernel/entry-ftrace.S b/arch/arm64/kernel/entry-ftrace.S index e535480a4069..dfe62c55e3a2 100644 --- a/arch/arm64/kernel/entry-ftrace.S +++ b/arch/arm64/kernel/entry-ftrace.S @@ -60,6 +60,9 @@ str x29, [sp, #S_FP] .endif
- /* Set orig_x0 to zero */
- str xzr, [sp, #S_ORIG_X0]
- /* Save the callsite's SP and LR */ add x10, sp, #(PT_REGS_SIZE + 16) stp x9, x10, [sp, #S_LR]
@@ -119,12 +122,21 @@ ftrace_common_return: /* Restore the callsite's FP, LR, PC */ ldr x29, [sp, #S_FP] ldr x30, [sp, #S_LR]
- ldr x9, [sp, #S_PC]
- ldr x10, [sp, #S_PC]
- ldr x11, [sp, #S_ORIG_X0]
- cbz x11, 1f
- /* Set x9 to parent ip before jump to custom trampoline */
- mov x9, x30
- /* Set lr to self ip */
- ldr x30, [sp, #S_PC]
- /* Set x10 (used for return address) to custom trampoline */
- mov x10, x11
+1: /* Restore the callsite's SP */ add sp, sp, #PT_REGS_SIZE + 16
- ret x9
- ret x10
SYM_CODE_END(ftrace_common)
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
2.30.2
On Thu, May 26, 2022 at 05:45:03PM +0800, Xu Kuohai wrote:
On 5/25/2022 9:38 PM, Mark Rutland wrote:
On Wed, May 18, 2022 at 09:16:33AM -0400, Xu Kuohai wrote:
Add ftrace direct support for arm64.
When there is custom trampoline only, replace the fentry nop to a jump instruction that jumps directly to the custom trampoline.
When ftrace trampoline and custom trampoline coexist, jump from fentry to ftrace trampoline first, then jump to custom trampoline when ftrace trampoline exits. The current unused register pt_regs->orig_x0 is used as an intermediary for jumping from ftrace trampoline to custom trampoline.
For those of us not all that familiar with BPF, can you explain *why* you want this? The above explains what the patch implements, but not why that's useful.
e.g. is this just to avoid the overhead of the ops list processing in the regular ftrace code, or is the custom trampoline there to allow you to do something special?
IIUC, ftrace direct call was designed to *remove* the unnecessary overhead of saving regs completely [1][2].
Ok. My plan is to get rid of most of the register saving generally, so I think that aspect can be solved without direct calls.
[1] https://lore.kernel.org/all/20191022175052.frjzlnjjfwwfov64@ast-mbp.dhcp.the... [2] https://lore.kernel.org/all/20191108212834.594904349@goodmis.org/
This patch itself is just a variant of [3].
[3] https://lore.kernel.org/all/20191108213450.891579507@goodmis.org/
There is another patch series on the list from some of your colleagues which uses dynamic trampolines to try to avoid that ops list overhead, and it's not clear to me whether these are trying to solve the largely same problem or something different. That other thread is at:
https://lore.kernel.org/linux-arm-kernel/20220316100132.244849-1-bobo.shaobo...
... and I've added the relevant parties to CC here, since there doesn't seem to be any overlap in the CC lists of the two threads.
We're not working to solve the same problem. The trampoline introduced in this series helps us to monitor kernel function or another bpf prog with bpf, and also helps us to use bpf prog like a normal kernel function pointer.
Ok, but why is it necessary to have a special trampoline?
Is that *just* to avoid overhead, or do you need to do something special that the regular trampoline won't do?
In that other thread I've suggested a general approach we could follow at: https://lore.kernel.org/linux-arm-kernel/YmGF%2FOpIhAF8YeVq@lakrids/
Is it possible for a kernel function to take a long jump to common trampoline when we get a huge kernel image?
It is possible, but only where the kernel Image itself is massive and the .text section exceeds 128MiB, at which point other things break anyway. Practically speaking, this doesn't happen for production kernels, or reasonable test kernels.
I've been meaning to add some logic to detect this at boot time and disable ftrace (or at build time), since live patching would also be broken in that case.
As noted in that thread, I have a few concerns which equally apply here:
- Due to the limited range of BL instructions, it's not always possible to patch an ftrace call-site to branch to an arbitrary trampoline. The way this works for ftrace today relies upon knowingthe set of trampolines at compile-time, and allocating module PLTs for those, and that approach cannot work reliably for dynanically allocated trampolines.
Currently patch 5 returns -ENOTSUPP when long jump is detected, so no bpf trampoline is constructed for out of range patch-site:
if (is_long_jump(orig_call, image)) return -ENOTSUPP;
Sure, my point is that in practice that means that (from the user's PoV) this may randomly fail to work, and I'd like something that we can ensure works consistently.
I'd strongly prefer to avoid custom tramplines unless they're strictly necessary for functional reasons, so that we can have this work reliably and consistently.
bpf trampoline is needed by bpf itself, not to replace ftrace trampolines.
As above, can you please let me know *why* specifically it is needed? Why can't we invoke the BPF code through the usual ops mechanism?
Is that to avoid overhead, or are there other functional reasons you need a special trampoline?
- If this is mostly about avoiding the ops list processing overhead, I
beleive
we can implement some custom ops support more generally in ftrace which would still use a common trampoline but could directly call into those custom ops. I would strongly prefer this over custom trampolines.
- I'm looking to minimize the set of regs ftrace saves, and never save a full pt_regs, since today we (incompletely) fill that with bogus values and cannot acquire some state reliably (e.g. PSTATE). I'd like to avoid usage of pt_regs unless necessary, and I don't want to add additional reliance upon that structure.
Even if such a common trampoline is used, bpf trampoline is still necessary since we need to construct custom instructions to implement bpf functions, for example, to implement kernel function pointer with a bpf prog.
Sorry, but I'm struggling to understand this. What specifically do you need to do that means this can't use the same calling convention as the regular ops function pointers?
Thanks, Mark.
On 5/26/2022 6:06 PM, Mark Rutland wrote:
On Thu, May 26, 2022 at 05:45:03PM +0800, Xu Kuohai wrote:
On 5/25/2022 9:38 PM, Mark Rutland wrote:
On Wed, May 18, 2022 at 09:16:33AM -0400, Xu Kuohai wrote:
Add ftrace direct support for arm64.
When there is custom trampoline only, replace the fentry nop to a jump instruction that jumps directly to the custom trampoline.
When ftrace trampoline and custom trampoline coexist, jump from fentry to ftrace trampoline first, then jump to custom trampoline when ftrace trampoline exits. The current unused register pt_regs->orig_x0 is used as an intermediary for jumping from ftrace trampoline to custom trampoline.
For those of us not all that familiar with BPF, can you explain *why* you want this? The above explains what the patch implements, but not why that's useful.
e.g. is this just to avoid the overhead of the ops list processing in the regular ftrace code, or is the custom trampoline there to allow you to do something special?
IIUC, ftrace direct call was designed to *remove* the unnecessary overhead of saving regs completely [1][2].
Ok. My plan is to get rid of most of the register saving generally, so I think that aspect can be solved without direct calls.
Looking forward to your new solution.
[1] https://lore.kernel.org/all/20191022175052.frjzlnjjfwwfov64@ast-mbp.dhcp.the... [2] https://lore.kernel.org/all/20191108212834.594904349@goodmis.org/
This patch itself is just a variant of [3].
[3] https://lore.kernel.org/all/20191108213450.891579507@goodmis.org/
There is another patch series on the list from some of your colleagues which uses dynamic trampolines to try to avoid that ops list overhead, and it's not clear to me whether these are trying to solve the largely same problem or something different. That other thread is at:
https://lore.kernel.org/linux-arm-kernel/20220316100132.244849-1-bobo.shaobo...
... and I've added the relevant parties to CC here, since there doesn't seem to be any overlap in the CC lists of the two threads.
We're not working to solve the same problem. The trampoline introduced in this series helps us to monitor kernel function or another bpf prog with bpf, and also helps us to use bpf prog like a normal kernel function pointer.
Ok, but why is it necessary to have a special trampoline?
Is that *just* to avoid overhead, or do you need to do something special that the regular trampoline won't do?
Sorry for not explaining the problem. The main bpf prog accepts only a single argument 'ctx' in r1, so to allow kernel code to call a bpf prog transparently, we need a trampoline to convert the native calling convention into the BPF calling convention [1].
[1] https://lore.kernel.org/bpf/20191114185720.1641606-5-ast@kernel.org/
For example,
SEC("struct_ops/dctcp_state") void BPF_PROG(dctcp_state, struct sock *sk, __u8 new_state) { // do something }
The above bpf prog will be compiled to something like this:
dctcp_state:
	r2 = *(u64 *)(r1 + 8)	// new_state
	r1 = *(u64 *)(r1 + 0)	// sk
	...
It accepts only one argument 'ctx' in r1, and loads the actual arguments 'sk' and 'new_state' from r1 + 0 and r1 + 8, respectively. So before calling this prog, we need to construct 'ctx' and store its address in r1.
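To make the conversion concrete, here is a C-level illustration of what the generated trampoline does for the dctcp_state example above. This is only a sketch; the real trampoline is emitted as arm64 instructions by the JIT:

#include <stdint.h>

typedef uint64_t u64;

/* The attached bpf prog sees a single 'ctx' pointer in r1. */
typedef u64 (*bpf_func_t)(const u64 *ctx);

/* Illustration only: pack the native arguments into a ctx array and
 * hand its address to the prog, mimicking the JITed trampoline. */
static u64 call_bpf_prog(bpf_func_t prog, void *sk, u64 new_state)
{
	u64 ctx[2];

	ctx[0] = (u64)(uintptr_t)sk;	/* read by the prog at ctx + 0 */
	ctx[1] = new_state;		/* read by the prog at ctx + 8 */

	return prog(ctx);		/* &ctx[0] ends up in r1 */
}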
In that other thread I've suggested a general approach we could follow at: https://lore.kernel.org/linux-arm-kernel/YmGF%2FOpIhAF8YeVq@lakrids/
Is it possible for a kernel function to take a long jump to common trampoline when we get a huge kernel image?
It is possible, but only where the kernel Image itself is massive and the .text section exceeeds 128MiB, at which point other things break anyway. Practically speaking, this doesn't happen for production kernels, or reasonable test kernels.
So even for normal kernel functions, we need some way to construct and tear down long jumps atomically and safely.
I've been meaning to add some logic to detect this at boot time and idsable ftrace (or at build time), since live patching would also be broken in that case.
As noted in that thread, I have a few concerns which equally apply here:
- Due to the limited range of BL instructions, it's not always possible to patch an ftrace call-site to branch to an arbitrary trampoline. The way this works for ftrace today relies upon knowingthe set of trampolines at compile-time, and allocating module PLTs for those, and that approach cannot work reliably for dynanically allocated trampolines.
Currently patch 5 returns -ENOTSUPP when long jump is detected, so no bpf trampoline is constructed for out of range patch-site:
if (is_long_jump(orig_call, image)) return -ENOTSUPP;
Sure, my point is that in practice that means that (from the user's PoV) this may randomly fail to work, and I'd like something that we can ensure works consistently.
OK, should I suspend this work until you finish refactoring ftrace?
I'd strongly prefer to avoid custom tramplines unless they're strictly necessary for functional reasons, so that we can have this work reliably and consistently.
bpf trampoline is needed by bpf itself, not to replace ftrace trampolines.
As above, can you please let me know *why* specifically it is needed? Why can't we invoke the BPF code through the usual ops mechanism?
Is that to avoid overhead, or are there other functional reasons you need a special trampoline?
- If this is mostly about avoiding the ops list processing overhead, I
beleive
we can implement some custom ops support more generally in ftrace which would still use a common trampoline but could directly call into those custom ops. I would strongly prefer this over custom trampolines.
- I'm looking to minimize the set of regs ftrace saves, and never save a full pt_regs, since today we (incompletely) fill that with bogus values and cannot acquire some state reliably (e.g. PSTATE). I'd like to avoid usage of pt_regs unless necessary, and I don't want to add additional reliance upon that structure.
Even if such a common trampoline is used, bpf trampoline is still necessary since we need to construct custom instructions to implement bpf functions, for example, to implement kernel function pointer with a bpf prog.
Sorry, but I'm struggling to understand this. What specifically do you need to do that means this can't use the same calling convention as the regular ops function pointers?
Thanks,
Mark.
On Thu, May 26, 2022 at 10:48:05PM +0800, Xu Kuohai wrote:
On 5/26/2022 6:06 PM, Mark Rutland wrote:
On Thu, May 26, 2022 at 05:45:03PM +0800, Xu Kuohai wrote:
On 5/25/2022 9:38 PM, Mark Rutland wrote:
On Wed, May 18, 2022 at 09:16:33AM -0400, Xu Kuohai wrote:
Add ftrace direct support for arm64.
When there is custom trampoline only, replace the fentry nop to a jump instruction that jumps directly to the custom trampoline.
When ftrace trampoline and custom trampoline coexist, jump from fentry to ftrace trampoline first, then jump to custom trampoline when ftrace trampoline exits. The current unused register pt_regs->orig_x0 is used as an intermediary for jumping from ftrace trampoline to custom trampoline.
For those of us not all that familiar with BPF, can you explain *why* you want this? The above explains what the patch implements, but not why that's useful.
e.g. is this just to avoid the overhead of the ops list processing in the regular ftrace code, or is the custom trampoline there to allow you to do something special?
IIUC, ftrace direct call was designed to *remove* the unnecessary overhead of saving regs completely [1][2].
Ok. My plan is to get rid of most of the register saving generally, so I think that aspect can be solved without direct calls.
Looking forward to your new solution.
For the register saving rework, I have a WIP branch on my kernel.org repo:
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/... git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git arm64/ftrace/minimal-regs
I'm working on that at the moment along with a per-callsite ops implementation that would avoid most of the need for custom trampolines (and work with branch range limitations):
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/... git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git arm64/ftrace/per-callsite-ops
[1] https://lore.kernel.org/all/20191022175052.frjzlnjjfwwfov64@ast-mbp.dhcp.the... [2] https://lore.kernel.org/all/20191108212834.594904349@goodmis.org/
This patch itself is just a variant of [3].
[3] https://lore.kernel.org/all/20191108213450.891579507@goodmis.org/
There is another patch series on the list from some of your colleagues which uses dynamic trampolines to try to avoid that ops list overhead, and it's not clear to me whether these are trying to solve the largely same problem or something different. That other thread is at:
https://lore.kernel.org/linux-arm-kernel/20220316100132.244849-1-bobo.shaobo...
... and I've added the relevant parties to CC here, since there doesn't seem to be any overlap in the CC lists of the two threads.
We're not working to solve the same problem. The trampoline introduced in this series helps us to monitor kernel function or another bpf prog with bpf, and also helps us to use bpf prog like a normal kernel function pointer.
Ok, but why is it necessary to have a special trampoline?
Is that *just* to avoid overhead, or do you need to do something special that the regular trampoline won't do?
Sorry for not explaining the problem. The main bpf prog accepts only a single argument 'ctx' in r1, so to allow kernel code to call bpf prog transparently, we need a trampoline to convert native calling convention into BPF calling convention [1].
[1] https://lore.kernel.org/bpf/20191114185720.1641606-5-ast@kernel.org/
Thanks for the pointer; I'll go page that in.
For example,
SEC("struct_ops/dctcp_state") void BPF_PROG(dctcp_state, struct sock *sk, __u8 new_state) { // do something }
The above bpf prog will be compiled to something like this:
dctcp_state: r2 = *(u64 *)(r1 + 8) // new_state r1 = *(u64 *)(r1 + 0) // sk ...
It accepts only one argument 'ctx' in r1, and loads the actual arugment 'sk' and 'new_state' from r1 + 0 and r1 + 8, resepectively. So before calling this prog, we need to construct 'ctx' and store its address to r1.
In that other thread I've suggested a general approach we could follow at: https://lore.kernel.org/linux-arm-kernel/YmGF%2FOpIhAF8YeVq@lakrids/
Is it possible for a kernel function to take a long jump to common trampoline when we get a huge kernel image?
It is possible, but only where the kernel Image itself is massive and the .text section exceeeds 128MiB, at which point other things break anyway. Practically speaking, this doesn't happen for production kernels, or reasonable test kernels.
So even for normal kernel functions, we need some way to construct and destruct long jumps atomically and safely.
My point was that case is unrealistic for production kernels, and utterly broken anyway (and as below I intend to make ftrace detect this and mark itself as broken).
FWIW, an allmodconfig kernel built with GCC 12.1.0 has a ~30MB .text segment, so for realistic kernels we have plenty of headroom for normal functions to reach the in-kernel trampoline.
I've been meaning to add some logic to detect this at boot time and idsable ftrace (or at build time), since live patching would also be broken in that case.
As noted in that thread, I have a few concerns which equally apply here:
- Due to the limited range of BL instructions, it's not always possible to patch an ftrace call-site to branch to an arbitrary trampoline. The way this works for ftrace today relies upon knowingthe set of trampolines at compile-time, and allocating module PLTs for those, and that approach cannot work reliably for dynanically allocated trampolines.
Currently patch 5 returns -ENOTSUPP when long jump is detected, so no bpf trampoline is constructed for out of range patch-site:
if (is_long_jump(orig_call, image)) return -ENOTSUPP;
Sure, my point is that in practice that means that (from the user's PoV) this may randomly fail to work, and I'd like something that we can ensure works consistently.
OK, should I suspend this work until you finish refactoring ftrace?
Yes; I'd appreciate if we could hold on this for a bit.
I think with some ground work we can avoid most of the painful edge cases and might be able to avoid the need for custom trampolines.
Thanks, Mark.
On 6/7/2022 12:35 AM, Mark Rutland wrote:
On Thu, May 26, 2022 at 10:48:05PM +0800, Xu Kuohai wrote:
On 5/26/2022 6:06 PM, Mark Rutland wrote:
On Thu, May 26, 2022 at 05:45:03PM +0800, Xu Kuohai wrote:
On 5/25/2022 9:38 PM, Mark Rutland wrote:
On Wed, May 18, 2022 at 09:16:33AM -0400, Xu Kuohai wrote:
Add ftrace direct support for arm64.
When there is custom trampoline only, replace the fentry nop to a jump instruction that jumps directly to the custom trampoline.
When ftrace trampoline and custom trampoline coexist, jump from fentry to ftrace trampoline first, then jump to custom trampoline when ftrace trampoline exits. The current unused register pt_regs->orig_x0 is used as an intermediary for jumping from ftrace trampoline to custom trampoline.
For those of us not all that familiar with BPF, can you explain *why* you want this? The above explains what the patch implements, but not why that's useful.
e.g. is this just to avoid the overhead of the ops list processing in the regular ftrace code, or is the custom trampoline there to allow you to do something special?
IIUC, ftrace direct call was designed to *remove* the unnecessary overhead of saving regs completely [1][2].
Ok. My plan is to get rid of most of the register saving generally, so I think that aspect can be solved without direct calls.
Looking forward to your new solution.
For the register saving rework, I have a WIP branch on my kernel.org repo:
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/... git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git arm64/ftrace/minimal-regs
I'm working on that at the moment along with a per-callsite ops implementaiton that would avoid most of the need for custom trampolines (and work with branch range limitations):
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/... git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git arm64/ftrace/per-callsite-ops
[1] https://lore.kernel.org/all/20191022175052.frjzlnjjfwwfov64@ast-mbp.dhcp.the... [2] https://lore.kernel.org/all/20191108212834.594904349@goodmis.org/
This patch itself is just a variant of [3].
[3] https://lore.kernel.org/all/20191108213450.891579507@goodmis.org/
There is another patch series on the list from some of your colleagues which uses dynamic trampolines to try to avoid that ops list overhead, and it's not clear to me whether these are trying to solve the largely same problem or something different. That other thread is at:
https://lore.kernel.org/linux-arm-kernel/20220316100132.244849-1-bobo.shaobo...
... and I've added the relevant parties to CC here, since there doesn't seem to be any overlap in the CC lists of the two threads.
We're not working to solve the same problem. The trampoline introduced in this series helps us to monitor kernel function or another bpf prog with bpf, and also helps us to use bpf prog like a normal kernel function pointer.
Ok, but why is it necessary to have a special trampoline?
Is that *just* to avoid overhead, or do you need to do something special that the regular trampoline won't do?
Sorry for not explaining the problem. The main bpf prog accepts only a single argument 'ctx' in r1, so to allow kernel code to call bpf prog transparently, we need a trampoline to convert native calling convention into BPF calling convention [1].
[1] https://lore.kernel.org/bpf/20191114185720.1641606-5-ast@kernel.org/
Thanks for the pointer; I'll go page that in.
For example,
SEC("struct_ops/dctcp_state") void BPF_PROG(dctcp_state, struct sock *sk, __u8 new_state) { // do something }
The above bpf prog will be compiled to something like this:
dctcp_state: r2 = *(u64 *)(r1 + 8) // new_state r1 = *(u64 *)(r1 + 0) // sk ...
It accepts only one argument 'ctx' in r1, and loads the actual arugment 'sk' and 'new_state' from r1 + 0 and r1 + 8, resepectively. So before calling this prog, we need to construct 'ctx' and store its address to r1.
In that other thread I've suggested a general approach we could follow at: https://lore.kernel.org/linux-arm-kernel/YmGF%2FOpIhAF8YeVq@lakrids/
Is it possible for a kernel function to take a long jump to common trampoline when we get a huge kernel image?
It is possible, but only where the kernel Image itself is massive and the .text section exceeeds 128MiB, at which point other things break anyway. Practically speaking, this doesn't happen for production kernels, or reasonable test kernels.
So even for normal kernel functions, we need some way to construct and destruct long jumps atomically and safely.
My point was that case is unrealistic for production kernels, and utterly broken anyway (and as below I intend to make ftrace detect this and mark itself as broken).
FWIW, an allmodconfig kernel built with GCC 12.1.0 has a ~30MB .text segment, so for realistic kernels we have plenty of headroom for normal functions to reach the in-kernel trampoline.
I've been meaning to add some logic to detect this at boot time and idsable ftrace (or at build time), since live patching would also be broken in that case.
As noted in that thread, I have a few concerns which equally apply here:
- Due to the limited range of BL instructions, it's not always possible to patch an ftrace call-site to branch to an arbitrary trampoline. The way this works for ftrace today relies upon knowingthe set of trampolines at compile-time, and allocating module PLTs for those, and that approach cannot work reliably for dynanically allocated trampolines.
Currently patch 5 returns -ENOTSUPP when long jump is detected, so no bpf trampoline is constructed for out of range patch-site:
if (is_long_jump(orig_call, image)) return -ENOTSUPP;
Sure, my point is that in practice that means that (from the user's PoV) this may randomly fail to work, and I'd like something that we can ensure works consistently.
OK, should I suspend this work until you finish refactoring ftrace?
Yes; I'd appreciate if we could hold on this for a bit.
I think with some ground work we can avoid most of the painful edge cases and might be able to avoid the need for custom trampolines.
I've read your WIP code, but unfortunately I didn't find any mechanism to replace the bpf trampoline in it, sorry.
It looks like the bpf trampoline and ftrace work can be done at the same time. I think for now we can just attach the bpf trampoline to bpf progs. Once your ftrace work is done, we can add support for attaching the bpf trampoline to regular kernel functions. Is this OK?
Thanks, Mark.
On Thu, Jun 9, 2022 at 6:27 AM Xu Kuohai xukuohai@huawei.com wrote:
On 6/7/2022 12:35 AM, Mark Rutland wrote:
On Thu, May 26, 2022 at 10:48:05PM +0800, Xu Kuohai wrote:
On 5/26/2022 6:06 PM, Mark Rutland wrote:
On Thu, May 26, 2022 at 05:45:03PM +0800, Xu Kuohai wrote:
On 5/25/2022 9:38 PM, Mark Rutland wrote:
On Wed, May 18, 2022 at 09:16:33AM -0400, Xu Kuohai wrote: > As noted in that thread, I have a few concerns which equally apply here:
- Due to the limited range of BL instructions, it's not always possible to patch an ftrace call-site to branch to an arbitrary trampoline. The way this works for ftrace today relies upon knowingthe set of trampolines at compile-time, and allocating module PLTs for those, and that approach cannot work reliably for dynanically allocated trampolines.
Currently patch 5 returns -ENOTSUPP when long jump is detected, so no bpf trampoline is constructed for out of range patch-site:
if (is_long_jump(orig_call, image)) return -ENOTSUPP;
Sure, my point is that in practice that means that (from the user's PoV) this may randomly fail to work, and I'd like something that we can ensure works consistently.
OK, should I suspend this work until you finish refactoring ftrace?
Yes; I'd appreciate if we could hold on this for a bit.
I think with some ground work we can avoid most of the painful edge cases and might be able to avoid the need for custom trampolines.
I'v read your WIP code, but unfortunately I didn't find any mechanism to replace bpf trampoline in your code, sorry.
It looks like bpf trampoline and ftrace works can be done at the same time. I think for now we can just attach bpf trampoline to bpf prog. Once your ftrace work is done, we can add support for attaching bpf trampoline to regular kernel function. Is this OK?
Hey Mark and Xu! :)
I'm interested in this feature too and would be happy to help.
I've been trying to understand what you both have in mind to figure out a way forward, please correct me if I got anything wrong! :)
It looks like, currently, there are three places where an indirection to BPF is technically possible. Chronologically these are:
- the function's patchsite (currently there are 2 nops, this could become 4 nops with Mark's series on per call-site ops)
- the ftrace ops (currently called by iterating over a global list but could be called more directly with Mark's series on per-call-site ops or by dynamically generated branches with Wang's series on dynamic trampolines)
- a ftrace trampoline tail call (currently, this is after restoring a full pt_regs but this could become an args only restoration with Mark's series on DYNAMIC_FTRACE_WITH_ARGS)
If we first consider the situation when only a BPF program is attached to
a kernel function:

- Using the patchsite for indirection (proposed by Xu, same as on x86)
  Pros:
  - We have BPF trampolines anyway because they are required for
    orthogonal features such as calling BPF programs as functions, so
    jumping into that existing JITed code is straightforward
  - This has the minimum overhead (eg: these trampolines only save the
    actual number of args used by the function in ctx and avoid indirect
    calls)
  Cons:
  - If the BPF trampoline is JITed outside BL's limits, attachment can
    randomly fail

- Using a ftrace op for indirection (proposed by Mark)
  Pros:
  - BPF doesn't need to care about BL's range, ftrace_caller will be in
    range
  Cons:
  - The ftrace trampoline would first save all args in an ftrace_regs
    only for the BPF op to then re-save them in a BPF ctx array (as per
    BPF calling convention) so we'd effectively have to do the work of
    saving args twice
  - BPF currently uses DYNAMIC_FTRACE_WITH_DIRECT_CALLS APIs. Either
    arm64 should implement DIRECT_CALLS with... an indirect call :)
    (that is, the arch_ftrace_set_direct_caller op would turn back its
    ftrace_regs into arguments for the BPF trampoline) or BPF would need
    to use a different ftrace API just on arm64 (to define new ops,
    which, unless if they would be dynamically JITed, wouldn't be as
    performant as the existing BPF trampolines)

- Using a ftrace trampoline tail call for indirection (not discussed yet
  iiuc)
  Pros:
  - BPF also doesn't need to care about BL's range
  - This also leverages the existing BPF trampolines
  Cons:
  - This also does the work of saving/restoring arguments twice
  - DYNAMIC_FTRACE_WITH_DIRECT_CALLS depends on DYNAMIC_FTRACE_WITH_REGS
    now although in practice the registers kept by
    DYNAMIC_FTRACE_WITH_ARGS should be enough to call BPF trampolines

If we consider the situation when both ftrace ops and BPF programs are
attached to a kernel function:

- Using the patchsite for indirection can't solve this
- Using a ftrace op for indirection (proposed by Mark) or using a ftrace trampoline tail call as an indirection (proposed by Xu, same as on x86) have the same pros & cons as in the BPF only situation except that this time we pay the cost of registers saving twice for good reasons (we need args in both ftrace_regs and the BPF ctx array formats anyway)
Unless I'm missing something, it sounds like the following approach would
work:
- Always patch patchsites with calls to ftrace trampolines (within BL
  ranges)
- Always go through ops and have arch_ftrace_set_direct_caller set
  ftrace_regs->direct_call (instead of pt_regs->orig_x0 in this patch)
- If ftrace_regs->direct_call != 0 at the end of the ftrace trampoline,
  tail call it
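For illustration, a rough C sketch of the second step above. This is purely hypothetical: it assumes a new 'direct_call' field in arm64's struct ftrace_regs and a trampoline exit path that tail-calls it when non-zero, neither of which exists today:

/* Assumed shape, for illustration only: the current arm64 ftrace_regs
 * has no 'direct_call' member; it is the field proposed above. */
struct ftrace_regs {
	unsigned long regs[9];		/* registers saved by ftrace_caller */
	unsigned long direct_call;	/* proposed: BPF trampoline address */
	/* ... */
};

static inline void arch_ftrace_set_direct_caller(struct ftrace_regs *fregs,
						 unsigned long addr)
{
	/*
	 * Record the direct trampoline address here; the common ftrace
	 * trampoline's exit path would tail-call it when non-zero.
	 */
	fregs->direct_call = addr;
}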
Once Mark's series on DYNAMIC_FTRACE_WITH_ARGS is merged, we would need to have DYNAMIC_FTRACE_WITH_DIRECT_CALLS depend on DYNAMIC_FTRACE_WITH_REGS || DYNAMIC_FTRACE_WITH_ARGS. BPF trampolines (the only users of this API now) only care about args to the attachment point anyway, so I think this would work transparently?
Once Mark's series on per-callsite ops is merged, the second step (going through ops) would be significantly faster in the situation where only one program is used, therefore one arch_ftrace_set_direct_caller op.
Once Wang's series on dynamic trampolines is merged, the second step (going through ops) would also be significantly faster in the case when multiple ops are attached.
What are your thoughts? If this sounds somewhat sane, I'm happy to help out with the implementation as well :)
Thanks! Florent
On 8/10/2022 1:03 AM, Florent Revest wrote:
On Thu, Jun 9, 2022 at 6:27 AM Xu Kuohai xukuohai@huawei.com wrote:
On 6/7/2022 12:35 AM, Mark Rutland wrote:
On Thu, May 26, 2022 at 10:48:05PM +0800, Xu Kuohai wrote:
On 5/26/2022 6:06 PM, Mark Rutland wrote:
On Thu, May 26, 2022 at 05:45:03PM +0800, Xu Kuohai wrote:
On 5/25/2022 9:38 PM, Mark Rutland wrote: > On Wed, May 18, 2022 at 09:16:33AM -0400, Xu Kuohai wrote: >> As noted in that thread, I have a few concerns which equally apply here: > > * Due to the limited range of BL instructions, it's not always possible to > patch an ftrace call-site to branch to an arbitrary trampoline. The way this > works for ftrace today relies upon knowingthe set of trampolines at > compile-time, and allocating module PLTs for those, and that approach cannot > work reliably for dynanically allocated trampolines.
Currently patch 5 returns -ENOTSUPP when long jump is detected, so no bpf trampoline is constructed for out of range patch-site:
if (is_long_jump(orig_call, image)) return -ENOTSUPP;
Sure, my point is that in practice that means that (from the user's PoV) this may randomly fail to work, and I'd like something that we can ensure works consistently.
OK, should I suspend this work until you finish refactoring ftrace?
Yes; I'd appreciate if we could hold on this for a bit.
I think with some ground work we can avoid most of the painful edge cases and might be able to avoid the need for custom trampolines.
I'v read your WIP code, but unfortunately I didn't find any mechanism to replace bpf trampoline in your code, sorry.
It looks like bpf trampoline and ftrace works can be done at the same time. I think for now we can just attach bpf trampoline to bpf prog. Once your ftrace work is done, we can add support for attaching bpf trampoline to regular kernel function. Is this OK?
Hey Mark and Xu! :)
I'm interested in this feature too and would be happy to help.
I've been trying to understand what you both have in mind to figure out a way forward, please correct me if I got anything wrong! :)
It looks like, currently, there are three places where an indirection to BPF is technically possible. Chronologically these are:
the function's patchsite (currently there are 2 nops, this could become 4 nops with Mark's series on per call-site ops)
the ftrace ops (currently called by iterating over a global list but could be called more directly with Mark's series on per-call-site ops or by dynamically generated branches with Wang's series on dynamic trampolines)
a ftrace trampoline tail call (currently, this is after restoring a full pt_regs but this could become an args only restoration with Mark's series on DYNAMIC_FTRACE_WITH_ARGS)
If we first consider the situation when only a BPF program is attached to a kernel function:
Using the patchsite for indirection (proposed by Xu, same as on x86) Pros:
- We have BPF trampolines anyway because they are required for orthogonal features such as calling BPF programs as functions, so jumping into that existing JITed code is straightforward
- This has the minimum overhead (eg: these trampolines only save the actual number of args used by the function in ctx and avoid indirect calls)
Cons:
- If the BPF trampoline is JITed outside BL's limits, attachment can randomly fail
Using a ftrace op for indirection (proposed by Mark) Pros:
- BPF doesn't need to care about BL's range, ftrace_caller will be in range
Cons:
- The ftrace trampoline would first save all args in an ftrace_regs only for the BPF op to then re-save them in a BPF ctx array (as per BPF calling convention) so we'd effectively have to do the work of saving args twice
- BPF currently uses DYNAMIC_FTRACE_WITH_DIRECT_CALLS APIs. Either arm64 should implement DIRECT_CALLS with... an indirect call :) (that is, the arch_ftrace_set_direct_caller op would turn back its ftrace_regs into arguments for the BPF trampoline) or BPF would need to use a different ftrace API just on arm64 (to define new ops, which, unless if they would be dynamically JITed, wouldn't be as performant as the existing BPF trampolines)
Using a ftrace trampoline tail call for indirection (not discussed yet iiuc) Pros:
- BPF also doesn't need to care about BL's range
- This also leverages the existing BPF trampolines
Cons:
- This also does the work of saving/restoring arguments twice
- DYNAMIC_FTRACE_WITH_DIRECT_CALLS depends on DYNAMIC_FTRACE_WITH_REGS now although in practice the registers kept by DYNAMIC_FTRACE_WITH_ARGS should be enough to call BPF trampolines
If we consider the situation when both ftrace ops and BPF programs are attached to a kernel function:
Using the patchsite for indirection can't solve this
Using a ftrace op for indirection (proposed by Mark) or using a ftrace trampoline tail call as an indirection (proposed by Xu, same as on x86) have the same pros & cons as in the BPF only situation except that this time we pay the cost of registers saving twice for good reasons (we need args in both ftrace_regs and the BPF ctx array formats anyway)
Unless I'm missing something, it sounds like the following approach would work:
- Always patch patchsites with calls to ftrace trampolines (within BL ranges)
- Always go through ops and have arch_ftrace_set_direct_caller set ftrace_regs->direct_call (instead of pt_regs->orig_x0 in this patch)
- If ftrace_regs->direct_call != 0 at the end of the ftrace trampoline, tail call it (a rough sketch of these last two steps follows this list)
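To make the last two steps concrete, here is a minimal C sketch of the proposed indirection. It is only an illustration: the field name direct_call and the helper arch_ftrace_set_direct_caller come from the discussion above, while the struct layout and the sketch_* names are made up and do not match any kernel code.

#include <stdint.h>

/* Stand-in for ftrace_regs; only what the sketch needs. */
struct sketch_ftrace_regs {
	uintptr_t args[8];	/* saved argument registers */
	uintptr_t direct_call;	/* 0 means "no direct caller requested" */
};

/* Step 2: the BPF direct-call op records the BPF trampoline address. */
static void sketch_set_direct_caller(struct sketch_ftrace_regs *fregs,
				     uintptr_t bpf_trampoline)
{
	fregs->direct_call = bpf_trampoline;
}

/* Step 3: at the end of the ftrace trampoline (assembly in reality),
 * tail call the recorded address instead of returning to the patched
 * function, if one was set.
 */
static void sketch_ftrace_trampoline_exit(struct sketch_ftrace_regs *fregs)
{
	if (fregs->direct_call)
		((void (*)(void))fregs->direct_call)();
	/* else: return to the traced function as usual */
}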
Once Mark's series on DYNAMIC_FTRACE_WITH_ARGS is merged, we would need to have DYNAMIC_FTRACE_WITH_DIRECT_CALLS depend on DYNAMIC_FTRACE_WITH_REGS || DYNAMIC_FTRACE_WITH_ARGS. BPF trampolines (the only users of this API now) only care about args to the attachment point anyway, so I think this would work transparently?
Once Mark's series on per-callsite ops is merged, the second step (going through ops) would be significantly faster in the situation where only one program is used, therefore one arch_ftrace_set_direct_caller op.
Once Wang's series on dynamic trampolines is merged, the second step (going through ops) would also be significantly faster in the case when multiple ops are attached.
What are your thoughts? If this sounds somewhat sane, I'm happy to help out with the implementation as well :)
Hi Florent,
I'm struggling with how to attach a bpf trampoline to regular kernel functions, and I think your suggestion is fine. Thanks for the help!
Thanks! Florent.
After direct call is enabled for arm64, ftrace selftest enters a dead loop:
<trace_selftest_dynamic_test_func>:
00 bti c
01 mov x9, x30
02 bl <trace_direct_tramp>  ---------->  <trace_direct_tramp>: ret
                                         (lr/x30 is 03, return to 03)
03 mov w0, #0x0  <---------------------------------+
   |                                               |
   |                  dead loop!                   |
   |                                               |
04 ret  ---- lr/x30 is still 03, go back to 03 ----+
The reason is that when the direct caller trace_direct_tramp() returns to the patched function trace_selftest_dynamic_test_func(), lr is still the address after the instrumented instruction in the patched function, so when the patched function exits, it returns to itself!
To fix this issue, we need to restore lr before trace_direct_tramp() exits, so rewrite a dedicated trace_direct_tramp() for arm64.
Reported-by: Li Huafei lihuafei1@huawei.com Signed-off-by: Xu Kuohai xukuohai@huawei.com --- arch/arm64/include/asm/ftrace.h | 10 ++++++++++ arch/arm64/kernel/entry-ftrace.S | 10 ++++++++++ kernel/trace/trace_selftest.c | 2 ++ 3 files changed, 22 insertions(+)
diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h index 14a35a5df0a1..6f6b184e72fb 100644 --- a/arch/arm64/include/asm/ftrace.h +++ b/arch/arm64/include/asm/ftrace.h @@ -126,6 +126,16 @@ static inline bool arch_syscall_match_sym_name(const char *sym, */ return !strcmp(sym + 8, name); } + +#ifdef CONFIG_FTRACE_SELFTEST +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS + +#define trace_direct_tramp trace_direct_tramp +extern void trace_direct_tramp(void); + +#endif /* CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS */ +#endif /* CONFIG_FTRACE_SELFTEST */ + #endif /* ifndef __ASSEMBLY__ */
#endif /* __ASM_FTRACE_H */ diff --git a/arch/arm64/kernel/entry-ftrace.S b/arch/arm64/kernel/entry-ftrace.S index dfe62c55e3a2..a47e87d4d3dd 100644 --- a/arch/arm64/kernel/entry-ftrace.S +++ b/arch/arm64/kernel/entry-ftrace.S @@ -357,3 +357,13 @@ SYM_CODE_START(return_to_handler) ret SYM_CODE_END(return_to_handler) #endif /* CONFIG_FUNCTION_GRAPH_TRACER */ + +#ifdef CONFIG_FTRACE_SELFTEST +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS +SYM_FUNC_START(trace_direct_tramp) + mov x10, x30 + mov x30, x9 + ret x10 +SYM_FUNC_END(trace_direct_tramp) +#endif /* CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS */ +#endif /* CONFIG_FTRACE_SELFTEST */ diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c index abcadbe933bb..e7ccd0d10c39 100644 --- a/kernel/trace/trace_selftest.c +++ b/kernel/trace/trace_selftest.c @@ -785,8 +785,10 @@ static struct fgraph_ops fgraph_ops __initdata = { };
#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS +#ifndef trace_direct_tramp noinline __noclone static void trace_direct_tramp(void) { } #endif +#endif
/* * Pretty much the same than for the function tracer from which the selftest
On Wed, May 18, 2022 at 09:16:34AM -0400, Xu Kuohai wrote:
After direct call is enabled for arm64, ftrace selftest enters a dead loop:
IIUC this means that patch 1 alone is broken, and presumably this patch should have been part of it?
<trace_selftest_dynamic_test_func>: 00 bti c 01 mov x9, x30 <trace_direct_tramp>: 02 bl <trace_direct_tramp> ----------> ret | lr/x30 is 03, return to 03 | 03 mov w0, #0x0 <-----------------------------| | | | dead loop! | | | 04 ret ---- lr/x30 is still 03, go back to 03 ----|
The reason is that when the direct caller trace_direct_tramp() returns to the patched function trace_selftest_dynamic_test_func(), lr is still the address after the instrumented instruction in the patched function, so when the patched function exits, it returns to itself!
To fix this issue, we need to restore lr before trace_direct_tramp() exits, so rewrite a dedicated trace_direct_tramp() for arm64.
As mentioned on patch 1 I'd prefer we solved this through indirection, which would avoid the need for this and would make things more robust generally by keeping the unusual calling convention private to the patch-site and regular trampoline.
Thanks, Mark.
On 5/25/2022 9:43 PM, Mark Rutland wrote:
On Wed, May 18, 2022 at 09:16:34AM -0400, Xu Kuohai wrote:
After direct call is enabled for arm64, ftrace selftest enters a dead loop:
IIUC this means that patch 1 alone is broken, and presumably this patch should have been part of it?
No, patch 1 is not broken. This patch fixes a bug in the selftest trampoline, not a bug in patch 1.
<trace_selftest_dynamic_test_func>: 00 bti c 01 mov x9, x30 <trace_direct_tramp>: 02 bl <trace_direct_tramp> ----------> ret | lr/x30 is 03, return to 03 | 03 mov w0, #0x0 <-----------------------------| | | | dead loop! | | | 04 ret ---- lr/x30 is still 03, go back to 03 ----|
The reason is that when the direct caller trace_direct_tramp() returns to the patched function trace_selftest_dynamic_test_func(), lr is still the address after the instrumented instruction in the patched function, so when the patched function exits, it returns to itself!
To fix this issue, we need to restore lr before trace_direct_tramp() exits, so rewrite a dedicated trace_direct_tramp() for arm64.
As mentioned on patch 1 I'd prefer we solved this through indirection, which would avoid the need for this and would make things more robust generally by keeping the unusual calling convention private to the patch-site and regular trampoline.
IIUC, we still need to restore x30 before returning from the trampoline even with indirection, so this bug would still be there.
BPF_TRAMP_F_XXX flags are not used by user code and are almost constant at compile time, so run-time validation is a bit overkill. Remove is_valid_bpf_tramp_flags() and add some usage comments.
Signed-off-by: Xu Kuohai xukuohai@huawei.com Acked-by: Song Liu songliubraving@fb.com --- arch/x86/net/bpf_jit_comp.c | 20 -------------------- kernel/bpf/bpf_struct_ops.c | 3 +++ kernel/bpf/trampoline.c | 3 +++ 3 files changed, 6 insertions(+), 20 deletions(-)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index a2b6d197c226..7698ef3b4821 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -1922,23 +1922,6 @@ static int invoke_bpf_mod_ret(const struct btf_func_model *m, u8 **pprog, return 0; }
-static bool is_valid_bpf_tramp_flags(unsigned int flags) -{ - if ((flags & BPF_TRAMP_F_RESTORE_REGS) && - (flags & BPF_TRAMP_F_SKIP_FRAME)) - return false; - - /* - * BPF_TRAMP_F_RET_FENTRY_RET is only used by bpf_struct_ops, - * and it must be used alone. - */ - if ((flags & BPF_TRAMP_F_RET_FENTRY_RET) && - (flags & ~BPF_TRAMP_F_RET_FENTRY_RET)) - return false; - - return true; -} - /* Example: * __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev); * its 'struct btf_func_model' will be nr_args=2 @@ -2017,9 +2000,6 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i if (nr_args > 6) return -ENOTSUPP;
- if (!is_valid_bpf_tramp_flags(flags)) - return -EINVAL; - /* Generated trampoline stack layout: * * RBP + 8 [ return address ] diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index d9a3c9207240..0572cc5aeb28 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -341,6 +341,9 @@ int bpf_struct_ops_prepare_trampoline(struct bpf_tramp_links *tlinks,
tlinks[BPF_TRAMP_FENTRY].links[0] = link; tlinks[BPF_TRAMP_FENTRY].nr_links = 1; + /* BPF_TRAMP_F_RET_FENTRY_RET is only used by bpf_struct_ops, + * and it must be used alone. + */ flags = model->ret_size > 0 ? BPF_TRAMP_F_RET_FENTRY_RET : 0; return arch_prepare_bpf_trampoline(NULL, image, image_end, model, flags, tlinks, NULL); diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index 93c7675f0c9e..bd3f2e673874 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -358,6 +358,9 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr)
if (tlinks[BPF_TRAMP_FEXIT].nr_links || tlinks[BPF_TRAMP_MODIFY_RETURN].nr_links) + /* NOTE: BPF_TRAMP_F_RESTORE_REGS and BPF_TRAMP_F_SKIP_FRAME + * should not be set together. + */ flags = BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_SKIP_FRAME;
if (ip_arg)
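For reference, a compact sketch of how the two in-tree callers touched by this patch pick their flags; this is a simplification, not the exact kernel code, and the flag bit values are illustrative only.

/* Flag values are illustrative only. */
#define BPF_TRAMP_F_RESTORE_REGS	(1U << 0)
#define BPF_TRAMP_F_CALL_ORIG		(1U << 1)
#define BPF_TRAMP_F_SKIP_FRAME		(1U << 2)
#define BPF_TRAMP_F_RET_FENTRY_RET	(1U << 4)

/* Roughly what bpf_trampoline_update() does: SKIP_FRAME is only ever
 * combined with CALL_ORIG, never with RESTORE_REGS.
 */
static unsigned int sketch_trampoline_update_flags(int nr_fexit_or_fmod_ret)
{
	return nr_fexit_or_fmod_ret ?
	       (BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_SKIP_FRAME) : 0;
}

/* Roughly what bpf_struct_ops_prepare_trampoline() does: RET_FENTRY_RET
 * is used alone.
 */
static unsigned int sketch_struct_ops_flags(int ret_size)
{
	return ret_size > 0 ? BPF_TRAMP_F_RET_FENTRY_RET : 0;
}

Since both call sites build their flags from constants like these, every combination that can reach arch_prepare_bpf_trampoline() is known at build time, which is why the runtime check buys little.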
On Wed, May 18, 2022 at 09:16:35AM -0400, Xu Kuohai wrote:
BPF_TRAM_F_XXX flags are not used by user code and are almost constant at compile time, so run time validation is a bit overkill. Remove is_valid_bpf_tramp_flags() and add some usage comments.
Signed-off-by: Xu Kuohai xukuohai@huawei.com Acked-by: Song Liu songliubraving@fb.com
Am I right in thinking this is independent of the arm64-specific bits, and could be taken on its own now?
Mark.
On 5/25/2022 9:45 PM, Mark Rutland wrote:
On Wed, May 18, 2022 at 09:16:35AM -0400, Xu Kuohai wrote:
BPF_TRAM_F_XXX flags are not used by user code and are almost constant at compile time, so run time validation is a bit overkill. Remove is_valid_bpf_tramp_flags() and add some usage comments.
Signed-off-by: Xu Kuohai xukuohai@huawei.com Acked-by: Song Liu songliubraving@fb.com
Am I right in thinking this is independent of the arm64-specific bits, and could be taken on its own now?
Currently is_valid_bpf_tramp_flags() is defined in the x86 code and called before the bpf trampoline is constructed. The check logic is independent of the architecture code, so we would also need to call this function on arm64. But as Alexei pointed out, the check is not required, so it's better to remove it before adding bpf trampoline support to arm64.
On Thu, May 26, 2022 at 05:45:25PM +0800, Xu Kuohai wrote:
On 5/25/2022 9:45 PM, Mark Rutland wrote:
On Wed, May 18, 2022 at 09:16:35AM -0400, Xu Kuohai wrote:
BPF_TRAM_F_XXX flags are not used by user code and are almost constant at compile time, so run time validation is a bit overkill. Remove is_valid_bpf_tramp_flags() and add some usage comments.
Signed-off-by: Xu Kuohai xukuohai@huawei.com Acked-by: Song Liu songliubraving@fb.com
Am I right in thinking this is independent of the arm64-specific bits, and could be taken on its own now?
Currenly is_valid_bpf_tramp_flags() is defined in x86 and called before bpf trampoline is constructed. The check logic is irrelevant to the architecture code. So we also need to call this function on arm64. But as Alexei pointed out, the check is not requried, so it's better to remove it before adding bpf trampoline to arm64.
Cool. So this patch could be merged now, even if the rest of the series needs more work?
Thanks, Mark.
On 5/26/2022 6:12 PM, Mark Rutland wrote:
On Thu, May 26, 2022 at 05:45:25PM +0800, Xu Kuohai wrote:
On 5/25/2022 9:45 PM, Mark Rutland wrote:
On Wed, May 18, 2022 at 09:16:35AM -0400, Xu Kuohai wrote:
BPF_TRAM_F_XXX flags are not used by user code and are almost constant at compile time, so run time validation is a bit overkill. Remove is_valid_bpf_tramp_flags() and add some usage comments.
Signed-off-by: Xu Kuohai xukuohai@huawei.com Acked-by: Song Liu songliubraving@fb.com
Am I right in thinking this is independent of the arm64-specific bits, and could be taken on its own now?
Currenly is_valid_bpf_tramp_flags() is defined in x86 and called before bpf trampoline is constructed. The check logic is irrelevant to the architecture code. So we also need to call this function on arm64. But as Alexei pointed out, the check is not requried, so it's better to remove it before adding bpf trampoline to arm64.
Cool. So this patch could be merged now, even if the rest of the series needs more work?
I agree, thanks.
Implement bpf_arch_text_poke() for arm64, so that bpf trampoline code can use it to replace a nop with a jump, or replace a jump with a nop.
Signed-off-by: Xu Kuohai xukuohai@huawei.com Acked-by: Song Liu songliubraving@fb.com Reviewed-by: Jakub Sitnicki jakub@cloudflare.com --- arch/arm64/net/bpf_jit.h | 1 + arch/arm64/net/bpf_jit_comp.c | 107 +++++++++++++++++++++++++++++++--- 2 files changed, 99 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/net/bpf_jit.h b/arch/arm64/net/bpf_jit.h index 194c95ccc1cf..1c4b0075a3e2 100644 --- a/arch/arm64/net/bpf_jit.h +++ b/arch/arm64/net/bpf_jit.h @@ -270,6 +270,7 @@ #define A64_BTI_C A64_HINT(AARCH64_INSN_HINT_BTIC) #define A64_BTI_J A64_HINT(AARCH64_INSN_HINT_BTIJ) #define A64_BTI_JC A64_HINT(AARCH64_INSN_HINT_BTIJC) +#define A64_NOP A64_HINT(AARCH64_INSN_HINT_NOP)
/* DMB */ #define A64_DMB_ISH aarch64_insn_gen_dmb(AARCH64_INSN_MB_ISH) diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c index 8ab4035dea27..5ce6ed5f42a1 100644 --- a/arch/arm64/net/bpf_jit_comp.c +++ b/arch/arm64/net/bpf_jit_comp.c @@ -9,6 +9,7 @@
#include <linux/bitfield.h> #include <linux/bpf.h> +#include <linux/memory.h> #include <linux/filter.h> #include <linux/printk.h> #include <linux/slab.h> @@ -18,6 +19,7 @@ #include <asm/cacheflush.h> #include <asm/debug-monitors.h> #include <asm/insn.h> +#include <asm/patching.h> #include <asm/set_memory.h>
#include "bpf_jit.h" @@ -235,13 +237,13 @@ static bool is_lsi_offset(int offset, int scale) return true; }
+#define BTI_INSNS (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) ? 1 : 0) +#define PAC_INSNS (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL) ? 1 : 0) + /* Tail call offset to jump into */ -#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) || \ - IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL) -#define PROLOGUE_OFFSET 9 -#else -#define PROLOGUE_OFFSET 8 -#endif +#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 8) +/* Offset of nop instruction in bpf prog entry to be poked */ +#define POKE_OFFSET (BTI_INSNS + 1)
static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf) { @@ -279,12 +281,15 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf) * */
+ if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)) + emit(A64_BTI_C, ctx); + + emit(A64_MOV(1, A64_R(9), A64_LR), ctx); + emit(A64_NOP, ctx); + /* Sign lr */ if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL)) emit(A64_PACIASP, ctx); - /* BTI landing pad */ - else if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)) - emit(A64_BTI_C, ctx);
/* Save FP and LR registers to stay align with ARM64 AAPCS */ emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx); @@ -1529,3 +1534,87 @@ void bpf_jit_free_exec(void *addr) { return vfree(addr); } + +static int gen_branch_or_nop(enum aarch64_insn_branch_type type, void *ip, + void *addr, u32 *insn) +{ + if (!addr) + *insn = aarch64_insn_gen_nop(); + else + *insn = aarch64_insn_gen_branch_imm((unsigned long)ip, + (unsigned long)addr, + type); + + return *insn != AARCH64_BREAK_FAULT ? 0 : -EFAULT; +} + +int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type, + void *old_addr, void *new_addr) +{ + int ret; + u32 old_insn; + u32 new_insn; + u32 replaced; + unsigned long offset = ~0UL; + enum aarch64_insn_branch_type branch_type; + char namebuf[KSYM_NAME_LEN]; + + if (!__bpf_address_lookup((unsigned long)ip, NULL, &offset, namebuf)) + /* Only poking bpf text is supported. Since kernel function + * entry is set up by ftrace, we reply on ftrace to poke kernel + * functions. + */ + return -EINVAL; + + /* bpf entry */ + if (offset == 0UL) + /* skip to the nop instruction in bpf prog entry: + * bti c // if BTI enabled + * mov x9, x30 + * nop + */ + ip = ip + POKE_OFFSET * AARCH64_INSN_SIZE; + + if (poke_type == BPF_MOD_CALL) + branch_type = AARCH64_INSN_BRANCH_LINK; + else + branch_type = AARCH64_INSN_BRANCH_NOLINK; + + if (gen_branch_or_nop(branch_type, ip, old_addr, &old_insn) < 0) + return -EFAULT; + + if (gen_branch_or_nop(branch_type, ip, new_addr, &new_insn) < 0) + return -EFAULT; + + mutex_lock(&text_mutex); + if (aarch64_insn_read(ip, &replaced)) { + ret = -EFAULT; + goto out; + } + + if (replaced != old_insn) { + ret = -EFAULT; + goto out; + } + + /* We call aarch64_insn_patch_text_nosync() to replace instruction + * atomically, so no other CPUs will fetch a half-new and half-old + * instruction. But there is chance that another CPU fetches the old + * instruction after bpf_arch_text_poke() finishes, that is, different + * CPUs may execute different versions of instructions at the same + * time before the icache is synchronized by hardware. + * + * 1. when a new trampoline is attached, it is not an issue for + * different CPUs to jump to different trampolines temporarily. + * + * 2. when an old trampoline is freed, we should wait for all other + * CPUs to exit the trampoline and make sure the trampoline is no + * longer reachable, since bpf_tramp_image_put() function already + * uses percpu_ref and rcu task to do the sync, no need to call the + * sync interface here. + */ + ret = aarch64_insn_patch_text_nosync(ip, new_insn); +out: + mutex_unlock(&text_mutex); + return ret; +}
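As a usage illustration, here is a hedged sketch of how a caller might use the function above to flip the poked nop in a prog's prologue between a nop and "bl <trampoline>". The wrapper names are invented; only bpf_arch_text_poke()'s signature and the BPF_MOD_CALL semantics come from the patch.

/* Attach: the nop in the prog prologue becomes "bl <trampoline>". */
static int sketch_attach_tramp(void *prog_entry, void *tramp)
{
	return bpf_arch_text_poke(prog_entry, BPF_MOD_CALL, NULL, tramp);
}

/* Replace: the currently installed branch target must match old_tramp,
 * otherwise the poke fails with -EFAULT.
 */
static int sketch_replace_tramp(void *prog_entry, void *old_tramp, void *new_tramp)
{
	return bpf_arch_text_poke(prog_entry, BPF_MOD_CALL, old_tramp, new_tramp);
}

/* Detach: "bl <trampoline>" becomes a nop again. */
static int sketch_detach_tramp(void *prog_entry, void *tramp)
{
	return bpf_arch_text_poke(prog_entry, BPF_MOD_CALL, tramp, NULL);
}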
On Wed, May 18, 2022 at 3:54 PM Xu Kuohai xukuohai@huawei.com wrote:
Impelment bpf_arch_text_poke() for arm64, so bpf trampoline code can use it to replace nop with jump, or replace jump with nop.
Signed-off-by: Xu Kuohai xukuohai@huawei.com Acked-by: Song Liu songliubraving@fb.com Reviewed-by: Jakub Sitnicki jakub@cloudflare.com
Reviewed-by: KP Singh kpsingh@kernel.org
On Wed, May 18, 2022 at 09:16:36AM -0400, Xu Kuohai wrote:
Impelment bpf_arch_text_poke() for arm64, so bpf trampoline code can use it to replace nop with jump, or replace jump with nop.
Signed-off-by: Xu Kuohai xukuohai@huawei.com Acked-by: Song Liu songliubraving@fb.com Reviewed-by: Jakub Sitnicki jakub@cloudflare.com
arch/arm64/net/bpf_jit.h | 1 + arch/arm64/net/bpf_jit_comp.c | 107 +++++++++++++++++++++++++++++++--- 2 files changed, 99 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/net/bpf_jit.h b/arch/arm64/net/bpf_jit.h index 194c95ccc1cf..1c4b0075a3e2 100644 --- a/arch/arm64/net/bpf_jit.h +++ b/arch/arm64/net/bpf_jit.h @@ -270,6 +270,7 @@ #define A64_BTI_C A64_HINT(AARCH64_INSN_HINT_BTIC) #define A64_BTI_J A64_HINT(AARCH64_INSN_HINT_BTIJ) #define A64_BTI_JC A64_HINT(AARCH64_INSN_HINT_BTIJC) +#define A64_NOP A64_HINT(AARCH64_INSN_HINT_NOP) /* DMB */ #define A64_DMB_ISH aarch64_insn_gen_dmb(AARCH64_INSN_MB_ISH) diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c index 8ab4035dea27..5ce6ed5f42a1 100644 --- a/arch/arm64/net/bpf_jit_comp.c +++ b/arch/arm64/net/bpf_jit_comp.c @@ -9,6 +9,7 @@ #include <linux/bitfield.h> #include <linux/bpf.h> +#include <linux/memory.h> #include <linux/filter.h> #include <linux/printk.h> #include <linux/slab.h> @@ -18,6 +19,7 @@ #include <asm/cacheflush.h> #include <asm/debug-monitors.h> #include <asm/insn.h> +#include <asm/patching.h> #include <asm/set_memory.h> #include "bpf_jit.h" @@ -235,13 +237,13 @@ static bool is_lsi_offset(int offset, int scale) return true; } +#define BTI_INSNS (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) ? 1 : 0) +#define PAC_INSNS (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL) ? 1 : 0)
/* Tail call offset to jump into */ -#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) || \
- IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL)
-#define PROLOGUE_OFFSET 9 -#else -#define PROLOGUE_OFFSET 8 -#endif +#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 8) +/* Offset of nop instruction in bpf prog entry to be poked */ +#define POKE_OFFSET (BTI_INSNS + 1) static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf) { @@ -279,12 +281,15 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf) * */
+	if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
+		emit(A64_BTI_C, ctx);
+
+	emit(A64_MOV(1, A64_R(9), A64_LR), ctx);
+	emit(A64_NOP, ctx);
I take it the idea is to make this the same as the regular ftrace patch-site sequence, so that this can call the same trampoline(s) ?
If so, we need some commentary to that effect, and we need some comments in the ftrace code explaining that this needs to be kept in-sync.
	/* Sign lr */
	if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL))
		emit(A64_PACIASP, ctx);
-	/* BTI landing pad */
-	else if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
-		emit(A64_BTI_C, ctx);
/* Save FP and LR registers to stay align with ARM64 AAPCS */ emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx); @@ -1529,3 +1534,87 @@ void bpf_jit_free_exec(void *addr) { return vfree(addr); }
+static int gen_branch_or_nop(enum aarch64_insn_branch_type type, void *ip,
+			     void *addr, u32 *insn)
+{
+	if (!addr)
+		*insn = aarch64_insn_gen_nop();
+	else
+		*insn = aarch64_insn_gen_branch_imm((unsigned long)ip,
+						    (unsigned long)addr,
+						    type);
+
+	return *insn != AARCH64_BREAK_FAULT ? 0 : -EFAULT;
+}
+int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
+		       void *old_addr, void *new_addr)
+{
+	int ret;
+	u32 old_insn;
+	u32 new_insn;
+	u32 replaced;
+	unsigned long offset = ~0UL;
+	enum aarch64_insn_branch_type branch_type;
+	char namebuf[KSYM_NAME_LEN];
+
+	if (!__bpf_address_lookup((unsigned long)ip, NULL, &offset, namebuf))
+		/* Only poking bpf text is supported. Since kernel function
+		 * entry is set up by ftrace, we reply on ftrace to poke kernel
+		 * functions.
+		 */
+		return -EINVAL;
+
+	/* bpf entry */
+	if (offset == 0UL)
+		/* skip to the nop instruction in bpf prog entry:
+		 * bti c // if BTI enabled
+		 * mov x9, x30
+		 * nop
+		 */
+		ip = ip + POKE_OFFSET * AARCH64_INSN_SIZE;
When is offset non-zero? Is this ever called to patch other instructions, and could this ever be used to try to patch the BTI specifically?
I strongly suspect we need a higher-level API to say "poke the patchable callsite in the prologue", rather than assuming that offset 0 always means that, or it'll be *very* easy for this to go wrong.
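For illustration only, such a higher-level helper could look roughly like the sketch below. Nothing like this exists in the series; the enum, the wrapper name and the BPF_POKE_* constants are hypothetical, and it assumes bpf_arch_text_poke() would no longer apply the implicit offset == 0 adjustment itself. POKE_OFFSET and AARCH64_INSN_SIZE are from the patch above.

/* Hypothetical wrapper: callers say "poke the patchable site in the
 * prologue" explicitly instead of passing the prog entry and relying on
 * the implicit skip.
 */
enum sketch_bpf_poke_site {
	BPF_POKE_PROLOGUE,	/* the nop after [bti c,] mov x9, x30 */
	BPF_POKE_EXACT,		/* poke exactly the given address */
};

static int sketch_bpf_poke(void *ip, enum sketch_bpf_poke_site site,
			   enum bpf_text_poke_type poke_type,
			   void *old_addr, void *new_addr)
{
	if (site == BPF_POKE_PROLOGUE)
		ip = (char *)ip + POKE_OFFSET * AARCH64_INSN_SIZE;

	return bpf_arch_text_poke(ip, poke_type, old_addr, new_addr);
}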
+	if (poke_type == BPF_MOD_CALL)
+		branch_type = AARCH64_INSN_BRANCH_LINK;
+	else
+		branch_type = AARCH64_INSN_BRANCH_NOLINK;
When is poke_type *not* BPF_MOD_CALL?
I assume that means BPF also uses this for non-ftrace reasons?
+	if (gen_branch_or_nop(branch_type, ip, old_addr, &old_insn) < 0)
+		return -EFAULT;
+
+	if (gen_branch_or_nop(branch_type, ip, new_addr, &new_insn) < 0)
+		return -EFAULT;
+
+	mutex_lock(&text_mutex);
+	if (aarch64_insn_read(ip, &replaced)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	if (replaced != old_insn) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	/* We call aarch64_insn_patch_text_nosync() to replace instruction
+	 * atomically, so no other CPUs will fetch a half-new and half-old
+	 * instruction. But there is chance that another CPU fetches the old
+	 * instruction after bpf_arch_text_poke() finishes, that is, different
+	 * CPUs may execute different versions of instructions at the same
+	 * time before the icache is synchronized by hardware.
+	 *
+	 * 1. when a new trampoline is attached, it is not an issue for
+	 *    different CPUs to jump to different trampolines temporarily.
+	 *
+	 * 2. when an old trampoline is freed, we should wait for all other
+	 *    CPUs to exit the trampoline and make sure the trampoline is no
+	 *    longer reachable, since bpf_tramp_image_put() function already
+	 *    uses percpu_ref and rcu task to do the sync, no need to call the
+	 *    sync interface here.
+	 */
How is RCU used for that? It's not clear to me how that works for PREEMPT_RCU (which is the usual configuration for arm64), since we can easily be in a preemptible context, outside of an RCU read side critical section, yet call into a trampoline.
I know that for livepatching we need to use stacktracing to ensure we've finished using code we'd like to free, and I can't immediately see how you can avoid that here. I'm suspicious that there's still a race where threads can enter the trampoline and it can be subsequently freed.
For ftrace today we get away with entering the existing trampolines when not intended because those are statically allocated, and the race is caught when acquiring the ops inside the ftrace core code. This case is different because the CPU can fetch the instruction and execute that at any time, without any RCU involvement.
Can you give more details on how the scheme described above works? How *exactly* do you ensure that threads which have entered the trampoline (and may have been immediately preempted by an interrupt) have returned? Which RCU mechanism are you using?
If you can point me at where this is implemented I'm happy to take a look.
Thanks, Mark.
On 5/25/2022 10:10 PM, Mark Rutland wrote:
On Wed, May 18, 2022 at 09:16:36AM -0400, Xu Kuohai wrote:
Impelment bpf_arch_text_poke() for arm64, so bpf trampoline code can use it to replace nop with jump, or replace jump with nop.
Signed-off-by: Xu Kuohai xukuohai@huawei.com Acked-by: Song Liu songliubraving@fb.com Reviewed-by: Jakub Sitnicki jakub@cloudflare.com
arch/arm64/net/bpf_jit.h | 1 + arch/arm64/net/bpf_jit_comp.c | 107 +++++++++++++++++++++++++++++++--- 2 files changed, 99 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/net/bpf_jit.h b/arch/arm64/net/bpf_jit.h index 194c95ccc1cf..1c4b0075a3e2 100644 --- a/arch/arm64/net/bpf_jit.h +++ b/arch/arm64/net/bpf_jit.h @@ -270,6 +270,7 @@ #define A64_BTI_C A64_HINT(AARCH64_INSN_HINT_BTIC) #define A64_BTI_J A64_HINT(AARCH64_INSN_HINT_BTIJ) #define A64_BTI_JC A64_HINT(AARCH64_INSN_HINT_BTIJC) +#define A64_NOP A64_HINT(AARCH64_INSN_HINT_NOP) /* DMB */ #define A64_DMB_ISH aarch64_insn_gen_dmb(AARCH64_INSN_MB_ISH) diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c index 8ab4035dea27..5ce6ed5f42a1 100644 --- a/arch/arm64/net/bpf_jit_comp.c +++ b/arch/arm64/net/bpf_jit_comp.c @@ -9,6 +9,7 @@ #include <linux/bitfield.h> #include <linux/bpf.h> +#include <linux/memory.h> #include <linux/filter.h> #include <linux/printk.h> #include <linux/slab.h> @@ -18,6 +19,7 @@ #include <asm/cacheflush.h> #include <asm/debug-monitors.h> #include <asm/insn.h> +#include <asm/patching.h> #include <asm/set_memory.h> #include "bpf_jit.h" @@ -235,13 +237,13 @@ static bool is_lsi_offset(int offset, int scale) return true; } +#define BTI_INSNS (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) ? 1 : 0) +#define PAC_INSNS (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL) ? 1 : 0)
/* Tail call offset to jump into */ -#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) || \
- IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL)
-#define PROLOGUE_OFFSET 9 -#else -#define PROLOGUE_OFFSET 8 -#endif +#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 8) +/* Offset of nop instruction in bpf prog entry to be poked */ +#define POKE_OFFSET (BTI_INSNS + 1) static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf) { @@ -279,12 +281,15 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf) * */
- if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
emit(A64_BTI_C, ctx);
- emit(A64_MOV(1, A64_R(9), A64_LR), ctx);
- emit(A64_NOP, ctx);
I take it the idea is to make this the same as the regular ftrace patch-site sequence, so that this can call the same trampoline(s) ?
Yes, we can attach a bpf trampoline to a bpf prog.
If so, we need some commentary to that effect, and we need some comments in the ftrace code explaining that this needs to be kept in-sync.
This is patched by bpf_arch_text_poke(), not ftrace.
- /* Sign lr */ if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL)) emit(A64_PACIASP, ctx);
- /* BTI landing pad */
- else if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
emit(A64_BTI_C, ctx);
/* Save FP and LR registers to stay align with ARM64 AAPCS */ emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx); @@ -1529,3 +1534,87 @@ void bpf_jit_free_exec(void *addr) { return vfree(addr); }
+static int gen_branch_or_nop(enum aarch64_insn_branch_type type, void *ip,
void *addr, u32 *insn)
+{
- if (!addr)
*insn = aarch64_insn_gen_nop();
- else
*insn = aarch64_insn_gen_branch_imm((unsigned long)ip,
(unsigned long)addr,
type);
- return *insn != AARCH64_BREAK_FAULT ? 0 : -EFAULT;
+}
+int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
void *old_addr, void *new_addr)
+{
- int ret;
- u32 old_insn;
- u32 new_insn;
- u32 replaced;
- unsigned long offset = ~0UL;
- enum aarch64_insn_branch_type branch_type;
- char namebuf[KSYM_NAME_LEN];
- if (!__bpf_address_lookup((unsigned long)ip, NULL, &offset, namebuf))
/* Only poking bpf text is supported. Since kernel function
* entry is set up by ftrace, we reply on ftrace to poke kernel
* functions.
*/
return -EINVAL;
- /* bpf entry */
- if (offset == 0UL)
/* skip to the nop instruction in bpf prog entry:
* bti c // if BTI enabled
* mov x9, x30
* nop
*/
ip = ip + POKE_OFFSET * AARCH64_INSN_SIZE;
When is offset non-zero? is this ever called to patch other instructions, and could this ever be used to try to patch the BTI specifically?
bpf_arch_text_poke() is also called to patch other instructions, for example, bpf_tramp_image_put() calls this to skip calling fexit bpf progs:
int err = bpf_arch_text_poke(im->ip_after_call, BPF_MOD_JUMP, NULL, im->ip_epilogue);
Before this is called, a bpf trampoline looks like:
[...]
ip_after_call:  nop             // to be patched
                bl <fexit prog>
ip_epilogue:    bti j
[...]

After:

[...]
ip_after_call:  b <ip_epilogue> // patched
                bl <fexit prog>
ip_epilogue:    bti j
[...]
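In code form, the teardown-time poke described above is roughly the following (a sketch; the call itself and the ip_after_call/ip_epilogue fields are quoted from the discussion, the wrapper is made up):

/* Flip the nop at ip_after_call into a branch over the fexit calls. */
static int sketch_skip_fexit_progs(struct bpf_tramp_image *im)
{
	return bpf_arch_text_poke(im->ip_after_call, BPF_MOD_JUMP,
				  NULL /* currently a nop */, im->ip_epilogue);
}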
I strongly suspect we need a higher-level API to say "poke the patchable callsite in the prologue", rather than assuming that offset 0 always means that, or it'll be *very* easy for this to go wrong.
Ah, bpf_arch_text_poke() only patches bpf progs, and the patch-site in the bpf prog prologue is constructed specifically for bpf_arch_text_poke(), so we always know the patch-site offset. There are no compiler-generated instructions here, so it doesn't seem to be a problem.
+	if (poke_type == BPF_MOD_CALL)
+		branch_type = AARCH64_INSN_BRANCH_LINK;
+	else
+		branch_type = AARCH64_INSN_BRANCH_NOLINK;
When is poke_type *not* BPF_MOD_CALL?
The bpf_tramp_image_put() example above uses BPF_MOD_JUMP.
I assume that means BPF also uses this for non-ftrace reasons?
This function is NOT used for the ftrace patch-site. It's only used to patch bpf images.
- if (gen_branch_or_nop(branch_type, ip, old_addr, &old_insn) < 0)
return -EFAULT;
- if (gen_branch_or_nop(branch_type, ip, new_addr, &new_insn) < 0)
return -EFAULT;
- mutex_lock(&text_mutex);
- if (aarch64_insn_read(ip, &replaced)) {
ret = -EFAULT;
goto out;
- }
- if (replaced != old_insn) {
ret = -EFAULT;
goto out;
- }
- /* We call aarch64_insn_patch_text_nosync() to replace instruction
* atomically, so no other CPUs will fetch a half-new and half-old
* instruction. But there is chance that another CPU fetches the old
* instruction after bpf_arch_text_poke() finishes, that is, different
* CPUs may execute different versions of instructions at the same
* time before the icache is synchronized by hardware.
*
* 1. when a new trampoline is attached, it is not an issue for
* different CPUs to jump to different trampolines temporarily.
*
* 2. when an old trampoline is freed, we should wait for all other
* CPUs to exit the trampoline and make sure the trampoline is no
* longer reachable, since bpf_tramp_image_put() function already
* uses percpu_ref and rcu task to do the sync, no need to call the
* sync interface here.
*/
How is RCU used for that? It's not clear to me how that works for PREEMPT_RCU (which is the usual configuration for arm64), since we can easily be in a preemptible context, outside of an RCU read side critical section, yet call into a trampoline.
I know that for livepatching we need to use stacktracing to ensure we've finished using code we'd like to free, and I can't immediately see how you can avoid that here. I'm suspicious that there's still a race where threads can enter the trampoline and it can be subsequently freed.
For ftrace today we get away with entering the existing trampolines when not intended because those are statically allocated, and the race is caught when acquiring the ops inside the ftrace core code. This case is different because the CPU can fetch the instruction and execute that at any time, without any RCU involvement.
Can you give more details on how the scheme described above works? How *exactly*` do you ensure that threads which have entered the trampoline (and may have been immediately preempted by an interrupt) have returned? Which RCU mechanism are you using?
If you can point me at where this is implemented I'm happy to take a look.
IIUC, a task RCU critical section ends at a voluntary context switch. Since no voluntary context switch occurs in a prologue, once all task RCU critical sections have ended we can be sure no one is still running in the prologue [1].
For a bpf trampoline the scenario is similar, except that it may sleep, so a reference count is increased when entering the trampoline and decreased when exiting it; we can then wait for the reference count to drop to zero to make sure no one is left in the sleepable region [2].
[1] https://lore.kernel.org/all/20140804192017.GA18337@linux.vnet.ibm.com/ [2] https://lore.kernel.org/bpf/20210316210007.38949-1-alexei.starovoitov@gmail....
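A conceptual sketch of that ordering (not the actual bpf_tramp_image_put() code; apart from synchronize_rcu_tasks(), the names are invented for illustration):

struct sketch_tramp_image { int dummy; /* percpu_ref, image memory, ... */ };

static void sketch_detach_from_callers(struct sketch_tramp_image *im) { /* poke callers back to nop */ }
static void sketch_wait_for_ref_drain(struct sketch_tramp_image *im) { /* kill percpu_ref, wait for zero */ }
static void sketch_free_image(struct sketch_tramp_image *im) { /* free the image memory */ }

static void sketch_tramp_image_teardown(struct sketch_tramp_image *im)
{
	/* 1. Patch all callers first so no new task can enter the old
	 *    trampoline.
	 */
	sketch_detach_from_callers(im);

	/* 2. Wait out tasks already in the non-sleepable part: an RCU-tasks
	 *    grace period only completes after every task has done a
	 *    voluntary context switch, which cannot happen inside the
	 *    trampoline prologue.
	 */
	synchronize_rcu_tasks();

	/* 3. Wait out sleepable sections: the trampoline takes a reference
	 *    on entry and drops it on exit, so free only after the count
	 *    drains to zero.
	 */
	sketch_wait_for_ref_drain(im);

	sketch_free_image(im);
}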
On Thu, May 26, 2022 at 05:45:30PM +0800, Xu Kuohai wrote:
On 5/25/2022 10:10 PM, Mark Rutland wrote:
On Wed, May 18, 2022 at 09:16:36AM -0400, Xu Kuohai wrote:
Impelment bpf_arch_text_poke() for arm64, so bpf trampoline code can use it to replace nop with jump, or replace jump with nop.
Signed-off-by: Xu Kuohai xukuohai@huawei.com Acked-by: Song Liu songliubraving@fb.com Reviewed-by: Jakub Sitnicki jakub@cloudflare.com
arch/arm64/net/bpf_jit.h | 1 + arch/arm64/net/bpf_jit_comp.c | 107 +++++++++++++++++++++++++++++++--- 2 files changed, 99 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/net/bpf_jit.h b/arch/arm64/net/bpf_jit.h index 194c95ccc1cf..1c4b0075a3e2 100644 --- a/arch/arm64/net/bpf_jit.h +++ b/arch/arm64/net/bpf_jit.h @@ -270,6 +270,7 @@ #define A64_BTI_C A64_HINT(AARCH64_INSN_HINT_BTIC) #define A64_BTI_J A64_HINT(AARCH64_INSN_HINT_BTIJ) #define A64_BTI_JC A64_HINT(AARCH64_INSN_HINT_BTIJC) +#define A64_NOP A64_HINT(AARCH64_INSN_HINT_NOP) /* DMB */ #define A64_DMB_ISH aarch64_insn_gen_dmb(AARCH64_INSN_MB_ISH) diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c index 8ab4035dea27..5ce6ed5f42a1 100644 --- a/arch/arm64/net/bpf_jit_comp.c +++ b/arch/arm64/net/bpf_jit_comp.c @@ -9,6 +9,7 @@ #include <linux/bitfield.h> #include <linux/bpf.h> +#include <linux/memory.h> #include <linux/filter.h> #include <linux/printk.h> #include <linux/slab.h> @@ -18,6 +19,7 @@ #include <asm/cacheflush.h> #include <asm/debug-monitors.h> #include <asm/insn.h> +#include <asm/patching.h> #include <asm/set_memory.h> #include "bpf_jit.h" @@ -235,13 +237,13 @@ static bool is_lsi_offset(int offset, int scale) return true; } +#define BTI_INSNS (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) ? 1 : 0) +#define PAC_INSNS (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL) ? 1 : 0)
/* Tail call offset to jump into */ -#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) || \
- IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL)
-#define PROLOGUE_OFFSET 9 -#else -#define PROLOGUE_OFFSET 8 -#endif +#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 8) +/* Offset of nop instruction in bpf prog entry to be poked */ +#define POKE_OFFSET (BTI_INSNS + 1) static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf) { @@ -279,12 +281,15 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf) * */
- if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
emit(A64_BTI_C, ctx);
- emit(A64_MOV(1, A64_R(9), A64_LR), ctx);
- emit(A64_NOP, ctx);
I take it the idea is to make this the same as the regular ftrace patch-site sequence, so that this can call the same trampoline(s) ?
Yes, we can attach a bpf trampoline to bpf prog.
Just to check, is the BPF trampoline *only* attached to BPF programs, or could that be attached to a regular ftrace patch-site?
I had assumed that the point of adding direct call support was so that this could be called from regular ftrace patch-sites, but your replies below on how the trampoline is protected imply that's not the case.
If so, we need some commentary to that effect, and we need some comments in the ftrace code explaining that this needs to be kept in-sync.
This is patched by bpf_arch_text_poke(), not ftrace.
I understood that, but if the idea is that the instruction sequence must match, then we need to make that clear. Otherwise changes to one side may break the other unexpectedly.
- /* Sign lr */ if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL)) emit(A64_PACIASP, ctx);
- /* BTI landing pad */
- else if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
emit(A64_BTI_C, ctx);
/* Save FP and LR registers to stay align with ARM64 AAPCS */ emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx); @@ -1529,3 +1534,87 @@ void bpf_jit_free_exec(void *addr) { return vfree(addr); }
+static int gen_branch_or_nop(enum aarch64_insn_branch_type type, void *ip,
void *addr, u32 *insn)
+{
- if (!addr)
*insn = aarch64_insn_gen_nop();
- else
*insn = aarch64_insn_gen_branch_imm((unsigned long)ip,
(unsigned long)addr,
type);
- return *insn != AARCH64_BREAK_FAULT ? 0 : -EFAULT;
+}
+int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
void *old_addr, void *new_addr)
+{
- int ret;
- u32 old_insn;
- u32 new_insn;
- u32 replaced;
- unsigned long offset = ~0UL;
- enum aarch64_insn_branch_type branch_type;
- char namebuf[KSYM_NAME_LEN];
- if (!__bpf_address_lookup((unsigned long)ip, NULL, &offset, namebuf))
/* Only poking bpf text is supported. Since kernel function
* entry is set up by ftrace, we reply on ftrace to poke kernel
* functions.
*/
return -EINVAL;
- /* bpf entry */
- if (offset == 0UL)
/* skip to the nop instruction in bpf prog entry:
* bti c // if BTI enabled
* mov x9, x30
* nop
*/
ip = ip + POKE_OFFSET * AARCH64_INSN_SIZE;
When is offset non-zero? is this ever called to patch other instructions, and could this ever be used to try to patch the BTI specifically?
bpf_arch_text_poke() is also called to patch other instructions, for example, bpf_tramp_image_put() calls this to skip calling fexit bpf progs:
int err = bpf_arch_text_poke(im->ip_after_call, BPF_MOD_JUMP, NULL, im->ip_epilogue);
Before this is called, a bpf trampoline looks like:
[...] ip_after_call: nop // to be patched bl <fexit prog> ip_epilogue: bti j [...]
After: [...] ip_after_call: b <ip_epilogue> // patched bl <fexit prog> ip_epilogue: bti j [...]
I strongly suspect we need a higher-level API to say "poke the patchable callsite in the prologue", rather than assuming that offset 0 always means that, or it'll be *very* easy for this to go wrong.
Ah, bpf_arch_text_poke() only patches bpf progs, and the patch-site in bpf prog prologue is constructed for bpf_arch_text_poke(), so we always know the patch-site offset. There's no compiler generated instruction here, so it seems to be not a problem.
I understand all of that. My point is that if anyone ever wants to use bpf_arch_text_poke() to poke the *actual* first instruction, this will go wrong.
I see that x86 does the same thing to skip ENDBR, so if this is just a fragile API, and people are expected to not do that, then fine. It's just confusing.
- if (poke_type == BPF_MOD_CALL)
branch_type = AARCH64_INSN_BRANCH_LINK;
- else
branch_type = AARCH64_INSN_BRANCH_NOLINK;
When is poke_type *not* BPF_MOD_CALL?>
The bpf_tramp_image_put() example above uses BPF_MOD_JUMP.
I assume that means BPF also uses this for non-ftrace reasons?
This function is NOT used for ftrace patch-site. It's only used to patch bpf image.
I understand that this isn't ftrace; I meant the patch-site we introduced in the BPF program, which is compatible with the patch-site sequence that ftrace uses.
- if (gen_branch_or_nop(branch_type, ip, old_addr, &old_insn) < 0)
return -EFAULT;
- if (gen_branch_or_nop(branch_type, ip, new_addr, &new_insn) < 0)
return -EFAULT;
- mutex_lock(&text_mutex);
- if (aarch64_insn_read(ip, &replaced)) {
ret = -EFAULT;
goto out;
- }
- if (replaced != old_insn) {
ret = -EFAULT;
goto out;
- }
- /* We call aarch64_insn_patch_text_nosync() to replace instruction
* atomically, so no other CPUs will fetch a half-new and half-old
* instruction. But there is chance that another CPU fetches the old
* instruction after bpf_arch_text_poke() finishes, that is, different
* CPUs may execute different versions of instructions at the same
* time before the icache is synchronized by hardware.
*
* 1. when a new trampoline is attached, it is not an issue for
* different CPUs to jump to different trampolines temporarily.
*
* 2. when an old trampoline is freed, we should wait for all other
* CPUs to exit the trampoline and make sure the trampoline is no
* longer reachable, since bpf_tramp_image_put() function already
* uses percpu_ref and rcu task to do the sync, no need to call the
* sync interface here.
*/
How is RCU used for that? It's not clear to me how that works for PREEMPT_RCU (which is the usual configuration for arm64), since we can easily be in a preemptible context, outside of an RCU read side critical section, yet call into a trampoline.
I know that for livepatching we need to use stacktracing to ensure we've finished using code we'd like to free, and I can't immediately see how you can avoid that here. I'm suspicious that there's still a race where threads can enter the trampoline and it can be subsequently freed.
For ftrace today we get away with entering the existing trampolines when not intended because those are statically allocated, and the race is caught when acquiring the ops inside the ftrace core code. This case is different because the CPU can fetch the instruction and execute that at any time, without any RCU involvement.
Can you give more details on how the scheme described above works? How *exactly*` do you ensure that threads which have entered the trampoline (and may have been immediately preempted by an interrupt) have returned? Which RCU mechanism are you using?
If you can point me at where this is implemented I'm happy to take a look.
IIUC, task rcu's critical section ends at a voluntary context switch, since no volutary context switch occurs in a progoluge, when a task rcu's critical section ends, we can ensure no one is running in the prologue [1].
For bpf trampoline, the scenario is similar, except that it may sleep, so a reference count is increased when entering the trampoline and decreased when exiting the trampoline, so we can wait for the reference count to become zero to make sure there is no one in the sleepable region [2].
[1] https://lore.kernel.org/all/20140804192017.GA18337@linux.vnet.ibm.com/ [2] https://lore.kernel.org/bpf/20210316210007.38949-1-alexei.starovoitov@gmail....
Ok. Am I correct in understanding that this means the BPF trampoline is only ever attached to BPF programs and those programs handle this around calling the BPF trampoline?
Thanks, Mark.
On 5/26/2022 6:34 PM, Mark Rutland wrote:
On Thu, May 26, 2022 at 05:45:30PM +0800, Xu Kuohai wrote:
On 5/25/2022 10:10 PM, Mark Rutland wrote:
On Wed, May 18, 2022 at 09:16:36AM -0400, Xu Kuohai wrote:
Implement bpf_arch_text_poke() for arm64, so bpf trampoline code can use it to replace a nop with a jump, or replace a jump with a nop.
Signed-off-by: Xu Kuohai xukuohai@huawei.com Acked-by: Song Liu songliubraving@fb.com Reviewed-by: Jakub Sitnicki jakub@cloudflare.com
arch/arm64/net/bpf_jit.h | 1 + arch/arm64/net/bpf_jit_comp.c | 107 +++++++++++++++++++++++++++++++--- 2 files changed, 99 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/net/bpf_jit.h b/arch/arm64/net/bpf_jit.h index 194c95ccc1cf..1c4b0075a3e2 100644 --- a/arch/arm64/net/bpf_jit.h +++ b/arch/arm64/net/bpf_jit.h @@ -270,6 +270,7 @@ #define A64_BTI_C A64_HINT(AARCH64_INSN_HINT_BTIC) #define A64_BTI_J A64_HINT(AARCH64_INSN_HINT_BTIJ) #define A64_BTI_JC A64_HINT(AARCH64_INSN_HINT_BTIJC) +#define A64_NOP A64_HINT(AARCH64_INSN_HINT_NOP) /* DMB */ #define A64_DMB_ISH aarch64_insn_gen_dmb(AARCH64_INSN_MB_ISH) diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c index 8ab4035dea27..5ce6ed5f42a1 100644 --- a/arch/arm64/net/bpf_jit_comp.c +++ b/arch/arm64/net/bpf_jit_comp.c @@ -9,6 +9,7 @@ #include <linux/bitfield.h> #include <linux/bpf.h> +#include <linux/memory.h> #include <linux/filter.h> #include <linux/printk.h> #include <linux/slab.h> @@ -18,6 +19,7 @@ #include <asm/cacheflush.h> #include <asm/debug-monitors.h> #include <asm/insn.h> +#include <asm/patching.h> #include <asm/set_memory.h> #include "bpf_jit.h" @@ -235,13 +237,13 @@ static bool is_lsi_offset(int offset, int scale) return true; } +#define BTI_INSNS (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) ? 1 : 0) +#define PAC_INSNS (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL) ? 1 : 0)
/* Tail call offset to jump into */ -#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) || \
- IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL)
-#define PROLOGUE_OFFSET 9 -#else -#define PROLOGUE_OFFSET 8 -#endif +#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 8) +/* Offset of nop instruction in bpf prog entry to be poked */ +#define POKE_OFFSET (BTI_INSNS + 1) static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf) { @@ -279,12 +281,15 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf) * */
- if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
emit(A64_BTI_C, ctx);
- emit(A64_MOV(1, A64_R(9), A64_LR), ctx);
- emit(A64_NOP, ctx);
I take it the idea is to make this the same as the regular ftrace patch-site sequence, so that this can call the same trampoline(s)?
Yes, we can attach a bpf trampoline to a bpf prog.
Just to check, is the BPF trampoline *only* attached to BPF programs, or could that be attached to a regular ftrace patch-site?
I had assumed that the point of adding direct call support was so that this could be called from regular ftrace patch sites, but your replies below on how the trampoline is protected imply that's not the case.
Sorry for the confusion. bpf trampoline could be attached to a regular ftrace patch-site. In this scenario the patch-site is patched by ftrace.
If so, we need some commentary to that effect, and we need some comments in the ftrace code explaining that this needs to be kept in-sync.
This is patched by bpf_arch_text_poke(), not ftrace.
I understood that, but if the idea is that the instruction sequence must match, then we need to make that clear. Otherwise changes to one side may break the other unexpectedly.
Yes, it's better if these two match. But I don't think it's a must, since the bpf prologue is patched by bpf_arch_text_poke(), while a regular kernel function prologue is patched by ftrace.
- /* Sign lr */ if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL)) emit(A64_PACIASP, ctx);
- /* BTI landing pad */
- else if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
emit(A64_BTI_C, ctx);
/* Save FP and LR registers to stay align with ARM64 AAPCS */ emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx); @@ -1529,3 +1534,87 @@ void bpf_jit_free_exec(void *addr) { return vfree(addr); }
+static int gen_branch_or_nop(enum aarch64_insn_branch_type type, void *ip,
+			     void *addr, u32 *insn)
+{
+	if (!addr)
+		*insn = aarch64_insn_gen_nop();
+	else
+		*insn = aarch64_insn_gen_branch_imm((unsigned long)ip,
+						    (unsigned long)addr,
+						    type);
+
+	return *insn != AARCH64_BREAK_FAULT ? 0 : -EFAULT;
+}
+
+int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
+		       void *old_addr, void *new_addr)
+{
+	int ret;
+	u32 old_insn;
+	u32 new_insn;
+	u32 replaced;
+	unsigned long offset = ~0UL;
+	enum aarch64_insn_branch_type branch_type;
+	char namebuf[KSYM_NAME_LEN];
+
+	if (!__bpf_address_lookup((unsigned long)ip, NULL, &offset, namebuf))
+		/* Only poking bpf text is supported. Since kernel function
+		 * entry is set up by ftrace, we rely on ftrace to poke kernel
+		 * functions.
+		 */
+		return -EINVAL;
+
+	/* bpf entry */
+	if (offset == 0UL)
+		/* skip to the nop instruction in bpf prog entry:
+		 * bti c	// if BTI enabled
+		 * mov x9, x30
+		 * nop
+		 */
+		ip = ip + POKE_OFFSET * AARCH64_INSN_SIZE;
When is offset non-zero? Is this ever called to patch other instructions, and could this ever be used to try to patch the BTI specifically?
bpf_arch_text_poke() is also called to patch other instructions, for example, bpf_tramp_image_put() calls this to skip calling fexit bpf progs:
int err = bpf_arch_text_poke(im->ip_after_call, BPF_MOD_JUMP, NULL, im->ip_epilogue);
Before this is called, a bpf trampoline looks like:
	[...]
ip_after_call:
	nop			// to be patched
	bl <fexit prog>
ip_epilogue:
	bti j
	[...]
After:
	[...]
ip_after_call:
	b <ip_epilogue>		// patched
	bl <fexit prog>
ip_epilogue:
	bti j
	[...]
I strongly suspect we need a higher-level API to say "poke the patchable callsite in the prologue", rather than assuming that offset 0 always means that, or it'll be *very* easy for this to go wrong.
Ah, bpf_arch_text_poke() only patches bpf progs, and the patch-site in the bpf prog prologue is constructed for bpf_arch_text_poke(), so we always know the patch-site offset. There's no compiler-generated instruction here, so it doesn't seem to be a problem.
I understand all of that. My point is that if anyone ever wants to use bpf_arch_text_poke() to poke the *actual* first instruction, this will go wrong.
I see that x86 does the same thing to skip ENDBR, so if this is just a fragile API, and people are expected to not do that, then fine. It's just confusing.
+	if (poke_type == BPF_MOD_CALL)
+		branch_type = AARCH64_INSN_BRANCH_LINK;
+	else
+		branch_type = AARCH64_INSN_BRANCH_NOLINK;
When is poke_type *not* BPF_MOD_CALL?
The bpf_tramp_image_put() example above uses BPF_MOD_JUMP.
I assume that means BPF also uses this for non-ftrace reasons?
This function is NOT used for ftrace patch-site. It's only used to patch bpf image.
I understand that this isn't ftrace; I meant for the patch site we introduced in the BPF program that is compatible with the patch-site sequence that ftrace uses.
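For reference, a minimal sketch of the two kinds of poke discussed here. It is illustrative only: "prog_entry" and "tramp" are assumed to be the BPF prog's entry address and the trampoline address, while the second call mirrors the bpf_tramp_image_put() usage quoted above.

#include <linux/bpf.h>

static int poke_examples(void *prog_entry, void *tramp,
			 struct bpf_tramp_image *im)
{
	int err;

	/* Attach a trampoline at the patch site in a BPF prog's
	 * prologue: the nop after "mov x9, x30" becomes "bl <tramp>",
	 * so BPF_MOD_CALL (branch with link) is used.
	 */
	err = bpf_arch_text_poke(prog_entry, BPF_MOD_CALL, NULL, tramp);
	if (err)
		return err;

	/* Skip the fexit progs when tearing down a trampoline image,
	 * as in bpf_tramp_image_put(): the nop at ip_after_call becomes
	 * "b <ip_epilogue>", so BPF_MOD_JUMP (plain branch) is used.
	 */
	return bpf_arch_text_poke(im->ip_after_call, BPF_MOD_JUMP,
				  NULL, im->ip_epilogue);
}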
+	if (gen_branch_or_nop(branch_type, ip, old_addr, &old_insn) < 0)
+		return -EFAULT;
+
+	if (gen_branch_or_nop(branch_type, ip, new_addr, &new_insn) < 0)
+		return -EFAULT;
+
+	mutex_lock(&text_mutex);
+	if (aarch64_insn_read(ip, &replaced)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	if (replaced != old_insn) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	/* We call aarch64_insn_patch_text_nosync() to replace instruction
+	 * atomically, so no other CPUs will fetch a half-new and half-old
+	 * instruction. But there is a chance that another CPU fetches the old
+	 * instruction after bpf_arch_text_poke() finishes, that is, different
+	 * CPUs may execute different versions of instructions at the same
+	 * time before the icache is synchronized by hardware.
+	 *
+	 * 1. when a new trampoline is attached, it is not an issue for
+	 *    different CPUs to jump to different trampolines temporarily.
+	 *
+	 * 2. when an old trampoline is freed, we should wait for all other
+	 *    CPUs to exit the trampoline and make sure the trampoline is no
+	 *    longer reachable, since bpf_tramp_image_put() function already
+	 *    uses percpu_ref and rcu task to do the sync, no need to call the
+	 *    sync interface here.
+	 */
How is RCU used for that? It's not clear to me how that works for PREEMPT_RCU (which is the usual configuration for arm64), since we can easily be in a preemptible context, outside of an RCU read side critical section, yet call into a trampoline.
I know that for livepatching we need to use stacktracing to ensure we've finished using code we'd like to free, and I can't immediately see how you can avoid that here. I'm suspicious that there's still a race where threads can enter the trampoline and it can be subsequently freed.
For ftrace today we get away with entering the existing trampolines when not intended because those are statically allocated, and the race is caught when acquiring the ops inside the ftrace core code. This case is different because the CPU can fetch the instruction and execute that at any time, without any RCU involvement.
Can you give more details on how the scheme described above works? How *exactly* do you ensure that threads which have entered the trampoline (and may have been immediately preempted by an interrupt) have returned? Which RCU mechanism are you using?
If you can point me at where this is implemented I'm happy to take a look.
IIUC, a task RCU critical section ends at a voluntary context switch. Since no voluntary context switch occurs in the prologue, once a task RCU critical section ends, we can ensure no one is still running in the prologue [1].
For the bpf trampoline, the scenario is similar, except that it may sleep: a reference count is increased when entering the trampoline and decreased when exiting it, so we can wait for the reference count to drop to zero to make sure no one is left in the sleepable region [2].
[1] https://lore.kernel.org/all/20140804192017.GA18337@linux.vnet.ibm.com/ [2] https://lore.kernel.org/bpf/20210316210007.38949-1-alexei.starovoitov@gmail....
Ok. Am I correct in understanding that this means the BPF trampoline is only ever attached to BPF programs and those programs handle this around calling the BPF trampoline?
A bpf trampoline could be attached to a regular function. This scenario is handled by the same mechanism.
Thanks, Mark.
Add bpf trampoline support for arm64. Most of the logic is the same as x86.
Tested on raspberry pi 4b and qemu with KASLR disabled (avoid long jump), result: #9 /1 bpf_cookie/kprobe:OK #9 /2 bpf_cookie/multi_kprobe_link_api:FAIL #9 /3 bpf_cookie/multi_kprobe_attach_api:FAIL #9 /4 bpf_cookie/uprobe:OK #9 /5 bpf_cookie/tracepoint:OK #9 /6 bpf_cookie/perf_event:OK #9 /7 bpf_cookie/trampoline:OK #9 /8 bpf_cookie/lsm:OK #9 bpf_cookie:FAIL #18 /1 bpf_tcp_ca/dctcp:OK #18 /2 bpf_tcp_ca/cubic:OK #18 /3 bpf_tcp_ca/invalid_license:OK #18 /4 bpf_tcp_ca/dctcp_fallback:OK #18 /5 bpf_tcp_ca/rel_setsockopt:OK #18 bpf_tcp_ca:OK #51 /1 dummy_st_ops/dummy_st_ops_attach:OK #51 /2 dummy_st_ops/dummy_init_ret_value:OK #51 /3 dummy_st_ops/dummy_init_ptr_arg:OK #51 /4 dummy_st_ops/dummy_multiple_args:OK #51 dummy_st_ops:OK #55 fentry_fexit:OK #56 fentry_test:OK #57 /1 fexit_bpf2bpf/target_no_callees:OK #57 /2 fexit_bpf2bpf/target_yes_callees:OK #57 /3 fexit_bpf2bpf/func_replace:OK #57 /4 fexit_bpf2bpf/func_replace_verify:OK #57 /5 fexit_bpf2bpf/func_sockmap_update:OK #57 /6 fexit_bpf2bpf/func_replace_return_code:OK #57 /7 fexit_bpf2bpf/func_map_prog_compatibility:OK #57 /8 fexit_bpf2bpf/func_replace_multi:OK #57 /9 fexit_bpf2bpf/fmod_ret_freplace:OK #57 fexit_bpf2bpf:OK #58 fexit_sleep:OK #59 fexit_stress:OK #60 fexit_test:OK #67 get_func_args_test:OK #68 get_func_ip_test:OK #104 modify_return:OK #237 xdp_bpf2bpf:OK
bpf_cookie/multi_kprobe_link_api and bpf_cookie/multi_kprobe_attach_api failed due to lack of multi_kprobe on arm64.
Signed-off-by: Xu Kuohai xukuohai@huawei.com Acked-by: Song Liu songliubraving@fb.com --- arch/arm64/net/bpf_jit_comp.c | 420 +++++++++++++++++++++++++++++++++- 1 file changed, 413 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c index 5ce6ed5f42a1..59cb96dc4335 100644 --- a/arch/arm64/net/bpf_jit_comp.c +++ b/arch/arm64/net/bpf_jit_comp.c @@ -142,6 +142,12 @@ static inline void emit_a64_mov_i64(const int reg, const u64 val, } }
+static inline void emit_bti(u32 insn, struct jit_ctx *ctx) +{ + if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)) + emit(insn, ctx); +} + /* * Kernel addresses in the vmalloc space use at most 48 bits, and the * remaining bits are guaranteed to be 0x1. So we can compose the address @@ -161,6 +167,14 @@ static inline void emit_addr_mov_i64(const int reg, const u64 val, } }
+static inline void emit_call(u64 target, struct jit_ctx *ctx) +{ + u8 tmp = bpf2a64[TMP_REG_1]; + + emit_addr_mov_i64(tmp, target, ctx); + emit(A64_BLR(tmp), ctx); +} + static inline int bpf2a64_offset(int bpf_insn, int off, const struct jit_ctx *ctx) { @@ -281,8 +295,7 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf) * */
- if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)) - emit(A64_BTI_C, ctx); + emit_bti(A64_BTI_C, ctx);
emit(A64_MOV(1, A64_R(9), A64_LR), ctx); emit(A64_NOP, ctx); @@ -316,8 +329,7 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf) }
/* BTI landing pad for the tail call, done with a BR */ - if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)) - emit(A64_BTI_J, ctx); + emit_bti(A64_BTI_J, ctx); }
emit(A64_SUB_I(1, fpb, fp, ctx->fpb_offset), ctx); @@ -995,8 +1007,7 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx, &func_addr, &func_addr_fixed); if (ret < 0) return ret; - emit_addr_mov_i64(tmp, func_addr, ctx); - emit(A64_BLR(tmp), ctx); + emit_call(func_addr, ctx); emit(A64_MOV(1, r0, A64_R(0)), ctx); break; } @@ -1340,6 +1351,13 @@ static int validate_code(struct jit_ctx *ctx) if (a64_insn == AARCH64_BREAK_FAULT) return -1; } + return 0; +} + +static int validate_ctx(struct jit_ctx *ctx) +{ + if (validate_code(ctx)) + return -1;
if (WARN_ON_ONCE(ctx->exentry_idx != ctx->prog->aux->num_exentries)) return -1; @@ -1464,7 +1482,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) build_epilogue(&ctx);
/* 3. Extra pass to validate JITed code. */ - if (validate_code(&ctx)) { + if (validate_ctx(&ctx)) { bpf_jit_binary_free(header); prog = orig_prog; goto out_off; @@ -1535,6 +1553,394 @@ void bpf_jit_free_exec(void *addr) return vfree(addr); }
+static void invoke_bpf_prog(struct jit_ctx *ctx, struct bpf_tramp_link *l, + int args_off, int retval_off, int run_ctx_off, + bool save_ret) +{ + u32 *branch; + u64 enter_prog; + u64 exit_prog; + u8 tmp = bpf2a64[TMP_REG_1]; + u8 r0 = bpf2a64[BPF_REG_0]; + struct bpf_prog *p = l->link.prog; + int cookie_off = offsetof(struct bpf_tramp_run_ctx, bpf_cookie); + + if (p->aux->sleepable) { + enter_prog = (u64)__bpf_prog_enter_sleepable; + exit_prog = (u64)__bpf_prog_exit_sleepable; + } else { + enter_prog = (u64)__bpf_prog_enter; + exit_prog = (u64)__bpf_prog_exit; + } + + if (l->cookie == 0) { + /* if cookie is zero, one instruction is enough to store it */ + emit(A64_STR64I(A64_ZR, A64_SP, run_ctx_off + cookie_off), ctx); + } else { + emit_a64_mov_i64(tmp, l->cookie, ctx); + emit(A64_STR64I(tmp, A64_SP, run_ctx_off + cookie_off), ctx); + } + + /* save p to callee saved register x19 to avoid loading p with mov_i64 + * each time. + */ + emit_addr_mov_i64(A64_R(19), (const u64)p, ctx); + + /* arg1: prog */ + emit(A64_MOV(1, A64_R(0), A64_R(19)), ctx); + /* arg2: &run_ctx */ + emit(A64_ADD_I(1, A64_R(1), A64_SP, run_ctx_off), ctx); + + emit_call(enter_prog, ctx); + + /* if (__bpf_prog_enter(prog) == 0) + * goto skip_exec_of_prog; + */ + branch = ctx->image + ctx->idx; + emit(A64_NOP, ctx); + + /* save return value to callee saved register x20 */ + emit(A64_MOV(1, A64_R(20), r0), ctx); + + emit(A64_ADD_I(1, A64_R(0), A64_SP, args_off), ctx); + if (!p->jited) + emit_addr_mov_i64(A64_R(1), (const u64)p->insnsi, ctx); + + emit_call((const u64)p->bpf_func, ctx); + + /* store return value */ + if (save_ret) + emit(A64_STR64I(r0, A64_SP, retval_off), ctx); + + if (ctx->image) { + int offset = &ctx->image[ctx->idx] - branch; + *branch = A64_CBZ(1, A64_R(0), offset); + } + emit_bti(A64_BTI_J, ctx); + + /* arg1: prog */ + emit(A64_MOV(1, A64_R(0), A64_R(19)), ctx); + /* arg2: start time */ + emit(A64_MOV(1, A64_R(1), A64_R(20)), ctx); + /* arg3: &run_ctx */ + emit(A64_ADD_I(1, A64_R(2), A64_SP, run_ctx_off), ctx); + + emit_call(exit_prog, ctx); +} + +static void invoke_bpf_mod_ret(struct jit_ctx *ctx, struct bpf_tramp_links *tl, + int args_off, int retval_off, int run_ctx_off, + u32 **branches) +{ + int i; + + /* The first fmod_ret program will receive a garbage return value. + * Set this to 0 to avoid confusing the program. + */ + emit(A64_STR64I(A64_ZR, A64_SP, retval_off), ctx); + for (i = 0; i < tl->nr_links; i++) { + invoke_bpf_prog(ctx, tl->links[i], args_off, retval_off, + run_ctx_off, true); + /* if (*(u64 *)(sp + retval_off) != 0) + * goto do_fexit; + */ + emit(A64_LDR64I(A64_R(10), A64_SP, retval_off), ctx); + /* Save the location of branch, and generate a nop. + * This nop will be replaced with a cbnz later. + */ + branches[i] = ctx->image + ctx->idx; + emit(A64_NOP, ctx); + } +} + +static void save_args(struct jit_ctx *ctx, int args_off, int nargs) +{ + int i; + + for (i = 0; i < nargs; i++) { + emit(A64_STR64I(i, A64_SP, args_off), ctx); + args_off += 8; + } +} + +static void restore_args(struct jit_ctx *ctx, int args_off, int nargs) +{ + int i; + + for (i = 0; i < nargs; i++) { + emit(A64_LDR64I(i, A64_SP, args_off), ctx); + args_off += 8; + } +} + +/* Based on the x86's implementation of arch_prepare_bpf_trampoline(). 
+ * + * bpf prog and function entry before bpf trampoline hooked: + * mov x9, x30 + * nop + * + * bpf prog and function entry after bpf trampoline hooked: + * mov x9, x30 + * bl <bpf_trampoline> + * + */ +static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im, + struct bpf_tramp_links *tlinks, void *orig_call, + int nargs, u32 flags) +{ + int i; + int stack_size; + int retaddr_off; + int regs_off; + int retval_off; + int args_off; + int nargs_off; + int ip_off; + int run_ctx_off; + struct bpf_tramp_links *fentry = &tlinks[BPF_TRAMP_FENTRY]; + struct bpf_tramp_links *fexit = &tlinks[BPF_TRAMP_FEXIT]; + struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN]; + bool save_ret; + u32 **branches = NULL; + + /* trampoline stack layout: + * [ parent ip ] + * [ FP ] + * SP + retaddr_off [ self ip ] + * [ FP ] + * + * [ padding ] align SP to multiples of 16 + * + * [ x20 ] callee saved reg x20 + * sp + regs_off [ x19 ] callee saved reg x19 + * + * SP + retval_off [ return value ] BPF_TRAMP_F_CALL_ORIG or + * BPF_TRAMP_F_RET_FENTRY_RET + * + * [ argN ] + * [ ... ] + * sp + args_off [ arg1 ] + * + * SP + nargs_off [ args count ] + * + * SP + ip_off [ traced function ] BPF_TRAMP_F_IP_ARG flag + * + * SP + run_ctx_off [ bpf_tramp_run_ctx ] + */ + + stack_size = 0; + run_ctx_off = stack_size; + /* room for bpf_tramp_run_ctx */ + stack_size += round_up(sizeof(struct bpf_tramp_run_ctx), 8); + + ip_off = stack_size; + /* room for IP address argument */ + if (flags & BPF_TRAMP_F_IP_ARG) + stack_size += 8; + + nargs_off = stack_size; + /* room for args count */ + stack_size += 8; + + args_off = stack_size; + /* room for args */ + stack_size += nargs * 8; + + /* room for return value */ + retval_off = stack_size; + save_ret = flags & (BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_RET_FENTRY_RET); + if (save_ret) + stack_size += 8; + + /* room for callee saved registers, currently x19 and x20 are used */ + regs_off = stack_size; + stack_size += 16; + + /* round up to multiples of 16 to avoid SPAlignmentFault */ + stack_size = round_up(stack_size, 16); + + /* return address locates above FP */ + retaddr_off = stack_size + 8; + + emit_bti(A64_BTI_C, ctx); + + /* frame for parent function */ + emit(A64_PUSH(A64_FP, A64_R(9), A64_SP), ctx); + emit(A64_MOV(1, A64_FP, A64_SP), ctx); + + /* frame for patched function */ + emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx); + emit(A64_MOV(1, A64_FP, A64_SP), ctx); + + /* allocate stack space */ + emit(A64_SUB_I(1, A64_SP, A64_SP, stack_size), ctx); + + if (flags & BPF_TRAMP_F_IP_ARG) { + /* save ip address of the traced function */ + emit_addr_mov_i64(A64_R(10), (const u64)orig_call, ctx); + emit(A64_STR64I(A64_R(10), A64_SP, ip_off), ctx); + } + + /* save args count*/ + emit(A64_MOVZ(1, A64_R(10), nargs, 0), ctx); + emit(A64_STR64I(A64_R(10), A64_SP, nargs_off), ctx); + + /* save args */ + save_args(ctx, args_off, nargs); + + /* save callee saved registers */ + emit(A64_STR64I(A64_R(19), A64_SP, regs_off), ctx); + emit(A64_STR64I(A64_R(20), A64_SP, regs_off + 8), ctx); + + if (flags & BPF_TRAMP_F_CALL_ORIG) { + emit_addr_mov_i64(A64_R(0), (const u64)im, ctx); + emit_call((const u64)__bpf_tramp_enter, ctx); + } + + for (i = 0; i < fentry->nr_links; i++) + invoke_bpf_prog(ctx, fentry->links[i], args_off, + retval_off, run_ctx_off, + flags & BPF_TRAMP_F_RET_FENTRY_RET); + + if (fmod_ret->nr_links) { + branches = kcalloc(fmod_ret->nr_links, sizeof(u32 *), + GFP_KERNEL); + if (!branches) + return -ENOMEM; + + invoke_bpf_mod_ret(ctx, fmod_ret, args_off, 
retval_off, + run_ctx_off, branches); + } + + if (flags & BPF_TRAMP_F_CALL_ORIG) { + restore_args(ctx, args_off, nargs); + /* call original func */ + emit(A64_LDR64I(A64_R(10), A64_SP, retaddr_off), ctx); + emit(A64_BLR(A64_R(10)), ctx); + /* store return value */ + emit(A64_STR64I(A64_R(0), A64_SP, retval_off), ctx); + /* reserve a nop for bpf_tramp_image_put */ + im->ip_after_call = ctx->image + ctx->idx; + emit(A64_NOP, ctx); + } + + /* update the branches saved in invoke_bpf_mod_ret with cbnz */ + for (i = 0; i < fmod_ret->nr_links && ctx->image != NULL; i++) { + int offset = &ctx->image[ctx->idx] - branches[i]; + *branches[i] = A64_CBNZ(1, A64_R(10), offset); + } + + if (fmod_ret->nr_links) + emit_bti(A64_BTI_J, ctx); + + for (i = 0; i < fexit->nr_links; i++) + invoke_bpf_prog(ctx, fexit->links[i], args_off, retval_off, + run_ctx_off, false); + + if (flags & BPF_TRAMP_F_RESTORE_REGS) + restore_args(ctx, args_off, nargs); + + if (flags & BPF_TRAMP_F_CALL_ORIG) { + im->ip_epilogue = ctx->image + ctx->idx; + emit_bti(A64_BTI_J, ctx); + emit_addr_mov_i64(A64_R(0), (const u64)im, ctx); + emit_call((const u64)__bpf_tramp_exit, ctx); + } + + /* restore callee saved register x19 and x20 */ + emit(A64_LDR64I(A64_R(19), A64_SP, regs_off), ctx); + emit(A64_LDR64I(A64_R(20), A64_SP, regs_off + 8), ctx); + + if (save_ret) + emit(A64_LDR64I(A64_R(0), A64_SP, retval_off), ctx); + + /* reset SP */ + emit(A64_MOV(1, A64_SP, A64_FP), ctx); + + /* pop frames */ + emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx); + emit(A64_POP(A64_FP, A64_R(9), A64_SP), ctx); + + if (flags & BPF_TRAMP_F_SKIP_FRAME) { + /* skip patched function, return to parent */ + emit(A64_MOV(1, A64_LR, A64_R(9)), ctx); + emit(A64_RET(A64_R(9)), ctx); + } else { + /* return to patched function */ + emit(A64_MOV(1, A64_R(10), A64_LR), ctx); + emit(A64_MOV(1, A64_LR, A64_R(9)), ctx); + emit(A64_RET(A64_R(10)), ctx); + } + + if (ctx->image) + bpf_flush_icache(ctx->image, ctx->image + ctx->idx); + + kfree(branches); + + return ctx->idx; +} + +static inline bool is_long_jump(void *ip, void *target) +{ + long offset; + + /* when ip == NULL, the trampoline address is used by bpf_struct_ops + * as a normal kernel function pointer, which can be always jumped to, + * so don't treat it as a long jump. + */ + if (ip == NULL) + return false; + + offset = (long)target - (long)ip; + return offset < -SZ_128M || offset >= SZ_128M; +} + +int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, + void *image_end, const struct btf_func_model *m, + u32 flags, struct bpf_tramp_links *tlinks, + void *orig_call) +{ + int ret; + int nargs = m->nr_args; + int max_insns = ((long)image_end - (long)image) / AARCH64_INSN_SIZE; + struct jit_ctx ctx = { + .image = NULL, + .idx = 0 + }; + + /* Return error since the long jump is not implemented. Otherwise, + * ftrace_bug will be triggered when the fentry is patched by ftrace, + * making ftrace no longer work properly. 
+ */ + if (is_long_jump(orig_call, image)) + return -ENOTSUPP; + + /* the first 8 arguments are passed by registers */ + if (nargs > 8) + return -ENOTSUPP; + + ret = prepare_trampoline(&ctx, im, tlinks, orig_call, nargs, flags); + if (ret < 0) + return ret; + + if (ret > max_insns) + return -EFBIG; + + ctx.image = image; + ctx.idx = 0; + + jit_fill_hole(image, (unsigned int)(image_end - image)); + ret = prepare_trampoline(&ctx, im, tlinks, orig_call, nargs, flags); + + if (ret > 0 && validate_code(&ctx) < 0) + ret = -EINVAL; + + if (ret > 0) + ret *= AARCH64_INSN_SIZE; + + return ret; +} + static int gen_branch_or_nop(enum aarch64_insn_branch_type type, void *ip, void *addr, u32 *insn) {
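A short note on the is_long_jump() check at the end of the patch: arm64 b/bl instructions encode a signed 26-bit immediate in units of 4-byte instructions, so a direct branch can only reach +/-128MiB from the branch itself. The sketch below just spells out the same range check with that constant written out; it restates the patch's logic and is not additional kernel code.

#include <stdbool.h>

/* arm64 b/bl: signed 26-bit immediate, scaled by the 4-byte instruction
 * size, i.e. +/- 2^25 * 4 = +/- 2^27 bytes = +/- 128MiB (the kernel's
 * SZ_128M).
 */
#define BL_RANGE	(1L << 27)

static bool within_bl_range(void *ip, void *target)
{
	long offset = (long)target - (long)ip;

	return offset >= -BL_RANGE && offset < BL_RANGE;
}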
On Wed, May 18, 2022 at 6:54 AM Xu Kuohai xukuohai@huawei.com wrote:
Add bpf trampoline support for arm64. Most of the logic is the same as x86.
Tested on raspberry pi 4b and qemu with KASLR disabled (avoid long jump), result: #9 /1 bpf_cookie/kprobe:OK #9 /2 bpf_cookie/multi_kprobe_link_api:FAIL #9 /3 bpf_cookie/multi_kprobe_attach_api:FAIL #9 /4 bpf_cookie/uprobe:OK #9 /5 bpf_cookie/tracepoint:OK #9 /6 bpf_cookie/perf_event:OK #9 /7 bpf_cookie/trampoline:OK #9 /8 bpf_cookie/lsm:OK #9 bpf_cookie:FAIL #18 /1 bpf_tcp_ca/dctcp:OK #18 /2 bpf_tcp_ca/cubic:OK #18 /3 bpf_tcp_ca/invalid_license:OK #18 /4 bpf_tcp_ca/dctcp_fallback:OK #18 /5 bpf_tcp_ca/rel_setsockopt:OK #18 bpf_tcp_ca:OK #51 /1 dummy_st_ops/dummy_st_ops_attach:OK #51 /2 dummy_st_ops/dummy_init_ret_value:OK #51 /3 dummy_st_ops/dummy_init_ptr_arg:OK #51 /4 dummy_st_ops/dummy_multiple_args:OK #51 dummy_st_ops:OK #55 fentry_fexit:OK #56 fentry_test:OK #57 /1 fexit_bpf2bpf/target_no_callees:OK #57 /2 fexit_bpf2bpf/target_yes_callees:OK #57 /3 fexit_bpf2bpf/func_replace:OK #57 /4 fexit_bpf2bpf/func_replace_verify:OK #57 /5 fexit_bpf2bpf/func_sockmap_update:OK #57 /6 fexit_bpf2bpf/func_replace_return_code:OK #57 /7 fexit_bpf2bpf/func_map_prog_compatibility:OK #57 /8 fexit_bpf2bpf/func_replace_multi:OK #57 /9 fexit_bpf2bpf/fmod_ret_freplace:OK #57 fexit_bpf2bpf:OK #58 fexit_sleep:OK #59 fexit_stress:OK #60 fexit_test:OK #67 get_func_args_test:OK #68 get_func_ip_test:OK #104 modify_return:OK #237 xdp_bpf2bpf:OK
bpf_cookie/multi_kprobe_link_api and bpf_cookie/multi_kprobe_attach_api failed due to lack of multi_kprobe on arm64.
Signed-off-by: Xu Kuohai xukuohai@huawei.com Acked-by: Song Liu songliubraving@fb.com
Catalin, Will, Mark,
could you please ack this patch to confirm that you don't mind us taking this set through bpf-next?
On Fri, May 20, 2022 at 02:18:20PM -0700, Alexei Starovoitov wrote:
On Wed, May 18, 2022 at 6:54 AM Xu Kuohai xukuohai@huawei.com wrote:
Add bpf trampoline support for arm64. Most of the logic is the same as x86.
Tested on raspberry pi 4b and qemu with KASLR disabled (avoid long jump), result: #9 /1 bpf_cookie/kprobe:OK #9 /2 bpf_cookie/multi_kprobe_link_api:FAIL #9 /3 bpf_cookie/multi_kprobe_attach_api:FAIL #9 /4 bpf_cookie/uprobe:OK #9 /5 bpf_cookie/tracepoint:OK #9 /6 bpf_cookie/perf_event:OK #9 /7 bpf_cookie/trampoline:OK #9 /8 bpf_cookie/lsm:OK #9 bpf_cookie:FAIL #18 /1 bpf_tcp_ca/dctcp:OK #18 /2 bpf_tcp_ca/cubic:OK #18 /3 bpf_tcp_ca/invalid_license:OK #18 /4 bpf_tcp_ca/dctcp_fallback:OK #18 /5 bpf_tcp_ca/rel_setsockopt:OK #18 bpf_tcp_ca:OK #51 /1 dummy_st_ops/dummy_st_ops_attach:OK #51 /2 dummy_st_ops/dummy_init_ret_value:OK #51 /3 dummy_st_ops/dummy_init_ptr_arg:OK #51 /4 dummy_st_ops/dummy_multiple_args:OK #51 dummy_st_ops:OK #55 fentry_fexit:OK #56 fentry_test:OK #57 /1 fexit_bpf2bpf/target_no_callees:OK #57 /2 fexit_bpf2bpf/target_yes_callees:OK #57 /3 fexit_bpf2bpf/func_replace:OK #57 /4 fexit_bpf2bpf/func_replace_verify:OK #57 /5 fexit_bpf2bpf/func_sockmap_update:OK #57 /6 fexit_bpf2bpf/func_replace_return_code:OK #57 /7 fexit_bpf2bpf/func_map_prog_compatibility:OK #57 /8 fexit_bpf2bpf/func_replace_multi:OK #57 /9 fexit_bpf2bpf/fmod_ret_freplace:OK #57 fexit_bpf2bpf:OK #58 fexit_sleep:OK #59 fexit_stress:OK #60 fexit_test:OK #67 get_func_args_test:OK #68 get_func_ip_test:OK #104 modify_return:OK #237 xdp_bpf2bpf:OK
bpf_cookie/multi_kprobe_link_api and bpf_cookie/multi_kprobe_attach_api failed due to lack of multi_kprobe on arm64.
Signed-off-by: Xu Kuohai xukuohai@huawei.com Acked-by: Song Liu songliubraving@fb.com
Catalin, Will, Mark,
could you please ack this patch to confirm that you don't mind us taking this set through bpf-next?
This is on my queue of things to review alongside some other ftrace and kprobes patches; I'll try to get that out of the way this week.
From a quick glance I'm not too keen on the change to the ftrace trampoline, as to get rid of some existing unsoundness I'd really wanted to move that entirely away from using regs (and had a sketch for how to handle different op handlers). I'd discussed that with Steven and Masami in another thread:
https://lore.kernel.org/linux-arm-kernel/YnJUTuOIX9YoJq23@FVFF77S0Q05N/
I'll see if it's possible to make this all work together. It's not entirely clear to me how the FTRACE_DIRECT bits are supposed to play with dynamically allocated trampolines, and we might need to take a step back and reconsider.
Thanks, Mark.
On Wed, May 18, 2022 at 3:54 PM Xu Kuohai xukuohai@huawei.com wrote:
Add bpf trampoline support for arm64. Most of the logic is the same as x86.
Tested on raspberry pi 4b and qemu with KASLR disabled (avoid long jump), result: #9 /1 bpf_cookie/kprobe:OK #9 /2 bpf_cookie/multi_kprobe_link_api:FAIL #9 /3 bpf_cookie/multi_kprobe_attach_api:FAIL #9 /4 bpf_cookie/uprobe:OK #9 /5 bpf_cookie/tracepoint:OK #9 /6 bpf_cookie/perf_event:OK #9 /7 bpf_cookie/trampoline:OK #9 /8 bpf_cookie/lsm:OK #9 bpf_cookie:FAIL #18 /1 bpf_tcp_ca/dctcp:OK #18 /2 bpf_tcp_ca/cubic:OK #18 /3 bpf_tcp_ca/invalid_license:OK #18 /4 bpf_tcp_ca/dctcp_fallback:OK #18 /5 bpf_tcp_ca/rel_setsockopt:OK #18 bpf_tcp_ca:OK #51 /1 dummy_st_ops/dummy_st_ops_attach:OK #51 /2 dummy_st_ops/dummy_init_ret_value:OK #51 /3 dummy_st_ops/dummy_init_ptr_arg:OK #51 /4 dummy_st_ops/dummy_multiple_args:OK #51 dummy_st_ops:OK #55 fentry_fexit:OK #56 fentry_test:OK #57 /1 fexit_bpf2bpf/target_no_callees:OK #57 /2 fexit_bpf2bpf/target_yes_callees:OK #57 /3 fexit_bpf2bpf/func_replace:OK #57 /4 fexit_bpf2bpf/func_replace_verify:OK #57 /5 fexit_bpf2bpf/func_sockmap_update:OK #57 /6 fexit_bpf2bpf/func_replace_return_code:OK #57 /7 fexit_bpf2bpf/func_map_prog_compatibility:OK #57 /8 fexit_bpf2bpf/func_replace_multi:OK #57 /9 fexit_bpf2bpf/fmod_ret_freplace:OK #57 fexit_bpf2bpf:OK #58 fexit_sleep:OK #59 fexit_stress:OK #60 fexit_test:OK #67 get_func_args_test:OK #68 get_func_ip_test:OK #104 modify_return:OK #237 xdp_bpf2bpf:OK
bpf_cookie/multi_kprobe_link_api and bpf_cookie/multi_kprobe_attach_api failed due to lack of multi_kprobe on arm64.
Signed-off-by: Xu Kuohai xukuohai@huawei.com Acked-by: Song Liu songliubraving@fb.com
Acked-by: KP Singh kpsingh@kernel.org
Thanks! This is exciting.
The "ipv6" word in assertion message should be "fentry_fexit".
Signed-off-by: Xu Kuohai xukuohai@huawei.com Acked-by: Song Liu songliubraving@fb.com --- tools/testing/selftests/bpf/prog_tests/fentry_fexit.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/fentry_fexit.c b/tools/testing/selftests/bpf/prog_tests/fentry_fexit.c index 130f5b82d2e6..e3c139bde46e 100644 --- a/tools/testing/selftests/bpf/prog_tests/fentry_fexit.c +++ b/tools/testing/selftests/bpf/prog_tests/fentry_fexit.c @@ -28,8 +28,8 @@ void test_fentry_fexit(void)
prog_fd = fexit_skel->progs.test1.prog_fd; err = bpf_prog_test_run_opts(prog_fd, &topts); - ASSERT_OK(err, "ipv6 test_run"); - ASSERT_OK(topts.retval, "ipv6 test retval"); + ASSERT_OK(err, "fentry_fexit test_run"); + ASSERT_OK(topts.retval, "fentry_fexit test retval");
fentry_res = (__u64 *)fentry_skel->bss; fexit_res = (__u64 *)fexit_skel->bss;