There are four patches here, but only one of them actually does anything. The first patch fixes a BPF selftests build failure on my machine and has already been sent to the list separately. The next three are staged so that the refactoring that changes no functionality is split out from the patch that is the whole point of the series: two cleanups and then the idea itself.
Maybe this is an odd thing to say in a cover letter, but I'm not actually sure this patch set is a good idea. The issue of extra moves after calls came up as I was reviewing some unrelated performance optimizations to the RISC-V BPF JIT. I figured I'd take a whack at performing the optimization in the context of the arm64 port just to get a breath of fresh air, and I'm not convinced I like the results.
That said, I think I would accept something like this for the RISC-V port, because we're already doing a multi-pass optimization for shrinking function addresses, so it's not as much extra complexity over there. If we do that we should probably start pulling some of this code into the shared BPF compiler, but we're also opening the doors to more complicated BPF JIT optimizations. Given that the BPF JIT appears to have been designed explicitly to be simple/fast as opposed to performing complex optimizations, I'm not sure this is a sane way to move forward.
I figured I'd send the patch set out as more of a question than anything else. Specifically:
* How should I go about measuring the performance of these sorts of optimizations? I'd like to balance the time it takes to run the JIT with the time spent executing the program, but I don't have any feel for what real BPF programs look like or have any benchmark suite to run. Is there something out there this should be benchmarked against? (I'd also like to know how to run those benchmarks on the RISC-V port.)
* Is this the sort of thing that makes sense in a BPF JIT? I guess I've just realized I turned "review this patch" into a way bigger rabbit hole than I really want to go down...
I worked on top of 5.4 for these, but trivially different versions of the patches applied to Linus' master a few days ago when I tried. Let me know if those aren't sane places to start from; I'm new to both arm64 and BPF, so I might be a bit lost.
[PATCH 1/4] selftests/bpf: Elide a check for LLVM versions that can't
[PATCH 2/4] arm64: bpf: Convert bpf2a64 to a function
[PATCH 3/4] arm64: bpf: Split the read and write halves of dst
[PATCH 4/4] arm64: bpf: Elide some moves to a0 after calls
The current stable LLVM BPF backend fails to compile the BPF selftests due to a compiler bug. The bug has been fixed in trunk, but that fix hasn't landed in the binary packages I'm using yet (Fedora arm64). Without this workaround the tests don't compile for me.
This patch triggers a preprocessor warning on LLVM versions that definitely have the bug. The check may be conservative (i.e., I'm not sure if 9.1 will have the fix), but it should at least make the current set of stable releases work together.
See https://reviews.llvm.org/D69438 for more information on the fix. I obtained the workaround from https://lore.kernel.org/linux-kselftest/aed8eda7-df20-069b-ea14-f06628984566...
Fixes: 20a9ad2e7136 ("selftests/bpf: add CO-RE relocs array tests")
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
---
 .../testing/selftests/bpf/progs/test_core_reloc_arrays.c | 8 ++++++++
 1 file changed, 8 insertions(+)
diff --git a/tools/testing/selftests/bpf/progs/test_core_reloc_arrays.c b/tools/testing/selftests/bpf/progs/test_core_reloc_arrays.c
index bf67f0fdf743..c9a3e0585a84 100644
--- a/tools/testing/selftests/bpf/progs/test_core_reloc_arrays.c
+++ b/tools/testing/selftests/bpf/progs/test_core_reloc_arrays.c
@@ -40,15 +40,23 @@ int test_core_arrays(void *ctx)
 	/* in->a[2] */
 	if (BPF_CORE_READ(&out->a2, &in->a[2]))
 		return 1;
+#if defined(__clang__) && (__clang_major__ < 10) && (__clang_minor__ < 1)
+# warning "clang 9.0 SEGVs on multidimensional arrays, see https://reviews.llvm.org/D69438"
+#else
 	/* in->b[1][2][3] */
 	if (BPF_CORE_READ(&out->b123, &in->b[1][2][3]))
 		return 1;
+#endif
 	/* in->c[1].c */
 	if (BPF_CORE_READ(&out->c1c, &in->c[1].c))
 		return 1;
+#if defined(__clang__) && (__clang_major__ < 10) && (__clang_minor__ < 1)
+# warning "clang 9.0 SEGVs on multidimensional arrays, see https://reviews.llvm.org/D69438"
+#else
 	/* in->d[0][0].d */
 	if (BPF_CORE_READ(&out->d00d, &in->d[0][0].d))
 		return 1;
+#endif

 	return 0;
 }
On Mon, Jan 27, 2020 at 6:14 PM 'Palmer Dabbelt' via Clang Built Linux clang-built-linux@googlegroups.com wrote:
The current stable LLVM BPF backend fails to compile the BPF selftests due to a compiler bug. The bug has been fixed in trunk, but that fix hasn't landed in the binary packages I'm using yet (Fedora arm64). Without this workaround the tests don't compile for me.
This patch triggers a preprocessor warning on LLVM versions that definitely have the bug. The check may be conservative (i.e., I'm not sure if 9.1 will have the fix), but it should at least make the current set of stable releases work together.
Do older versions of clang still work? Should there be a lower bound?
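For illustration only (this is not part of the posted patch): if a lower bound were wanted, the guard could target the known-bad release specifically rather than everything older than 9.1, along the lines of:

#if defined(__clang__) && (__clang_major__ == 9) && (__clang_minor__ == 0)
# warning "clang 9.0 SEGVs on multidimensional arrays, see https://reviews.llvm.org/D69438"
#else
	/* in->b[1][2][3] */
	if (BPF_CORE_READ(&out->b123, &in->b[1][2][3]))
		return 1;
#endif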
See https://reviews.llvm.org/D69438 for more information on the fix. I obtained the workaround from https://lore.kernel.org/linux-kselftest/aed8eda7-df20-069b-ea14-f06628984566...
Fixes: 20a9ad2e7136 ("selftests/bpf: add CO-RE relocs array tests")
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
This patch is intended to change no functionality; it just allows me to more cleanly add dynamic register mapping later.
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
---
 arch/arm64/net/bpf_jit_comp.c | 53 +++++++++++++++++++----------------
 1 file changed, 29 insertions(+), 24 deletions(-)
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index cdc79de0c794..8eee68705056 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -25,7 +25,7 @@
 #define TMP_REG_3 (MAX_BPF_JIT_REG + 3)

 /* Map BPF registers to A64 registers */
-static const int bpf2a64[] = {
+static const int bpf2a64_default[] = {
 	/* return value from in-kernel function, and exit value from eBPF */
 	[BPF_REG_0] = A64_R(7),
 	/* arguments from eBPF program to in-kernel function */
@@ -60,6 +60,11 @@ struct jit_ctx {
 	u32 stack_size;
 };

+static inline int bpf2a64(struct jit_ctx *ctx, int bpf_reg)
+{
+	return bpf2a64_default[bpf_reg];
+}
+
 static inline void emit(const u32 insn, struct jit_ctx *ctx)
 {
 	if (ctx->image != NULL)
@@ -176,12 +181,12 @@ static inline int epilogue_offset(const struct jit_ctx *ctx)
 static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
 {
 	const struct bpf_prog *prog = ctx->prog;
-	const u8 r6 = bpf2a64[BPF_REG_6];
-	const u8 r7 = bpf2a64[BPF_REG_7];
-	const u8 r8 = bpf2a64[BPF_REG_8];
-	const u8 r9 = bpf2a64[BPF_REG_9];
-	const u8 fp = bpf2a64[BPF_REG_FP];
-	const u8 tcc = bpf2a64[TCALL_CNT];
+	const u8 r6 = bpf2a64(ctx, BPF_REG_6);
+	const u8 r7 = bpf2a64(ctx, BPF_REG_7);
+	const u8 r8 = bpf2a64(ctx, BPF_REG_8);
+	const u8 r9 = bpf2a64(ctx, BPF_REG_9);
+	const u8 fp = bpf2a64(ctx, BPF_REG_FP);
+	const u8 tcc = bpf2a64(ctx, TCALL_CNT);
 	const int idx0 = ctx->idx;
 	int cur_offset;

@@ -243,12 +248,12 @@ static int out_offset = -1; /* initialized on the first pass of build_body() */
 static int emit_bpf_tail_call(struct jit_ctx *ctx)
 {
 	/* bpf_tail_call(void *prog_ctx, struct bpf_array *array, u64 index) */
-	const u8 r2 = bpf2a64[BPF_REG_2];
-	const u8 r3 = bpf2a64[BPF_REG_3];
+	const u8 r2 = bpf2a64(ctx, BPF_REG_2);
+	const u8 r3 = bpf2a64(ctx, BPF_REG_3);

-	const u8 tmp = bpf2a64[TMP_REG_1];
-	const u8 prg = bpf2a64[TMP_REG_2];
-	const u8 tcc = bpf2a64[TCALL_CNT];
+	const u8 tmp = bpf2a64(ctx, TMP_REG_1);
+	const u8 prg = bpf2a64(ctx, TMP_REG_2);
+	const u8 tcc = bpf2a64(ctx, TCALL_CNT);
 	const int idx0 = ctx->idx;
 #define cur_offset (ctx->idx - idx0)
 #define jmp_offset (out_offset - (cur_offset))
@@ -307,12 +312,12 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)

 static void build_epilogue(struct jit_ctx *ctx)
 {
-	const u8 r0 = bpf2a64[BPF_REG_0];
-	const u8 r6 = bpf2a64[BPF_REG_6];
-	const u8 r7 = bpf2a64[BPF_REG_7];
-	const u8 r8 = bpf2a64[BPF_REG_8];
-	const u8 r9 = bpf2a64[BPF_REG_9];
-	const u8 fp = bpf2a64[BPF_REG_FP];
+	const u8 r0 = bpf2a64(ctx, BPF_REG_0);
+	const u8 r6 = bpf2a64(ctx, BPF_REG_6);
+	const u8 r7 = bpf2a64(ctx, BPF_REG_7);
+	const u8 r8 = bpf2a64(ctx, BPF_REG_8);
+	const u8 r9 = bpf2a64(ctx, BPF_REG_9);
+	const u8 fp = bpf2a64(ctx, BPF_REG_FP);

 	/* We're done with BPF stack */
 	emit(A64_ADD_I(1, A64_SP, A64_SP, ctx->stack_size), ctx);
@@ -343,11 +348,11 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 		      bool extra_pass)
 {
 	const u8 code = insn->code;
-	const u8 dst = bpf2a64[insn->dst_reg];
-	const u8 src = bpf2a64[insn->src_reg];
-	const u8 tmp = bpf2a64[TMP_REG_1];
-	const u8 tmp2 = bpf2a64[TMP_REG_2];
-	const u8 tmp3 = bpf2a64[TMP_REG_3];
+	const u8 dst = bpf2a64(ctx, insn->dst_reg);
+	const u8 src = bpf2a64(ctx, insn->src_reg);
+	const u8 tmp = bpf2a64(ctx, TMP_REG_1);
+	const u8 tmp2 = bpf2a64(ctx, TMP_REG_2);
+	const u8 tmp3 = bpf2a64(ctx, TMP_REG_3);
 	const s16 off = insn->off;
 	const s32 imm = insn->imm;
 	const int i = insn - ctx->prog->insnsi;
@@ -634,7 +639,7 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 	/* function call */
 	case BPF_JMP | BPF_CALL:
 	{
-		const u8 r0 = bpf2a64[BPF_REG_0];
+		const u8 r0 = bpf2a64(ctx, BPF_REG_0);
 		bool func_addr_fixed;
 		u64 func_addr;
 		int ret;
This patch is intended to change no functionality; it just allows me to do register renaming later.
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
---
 arch/arm64/net/bpf_jit_comp.c | 107 +++++++++++++++++-----------------
 1 file changed, 54 insertions(+), 53 deletions(-)
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 8eee68705056..fba5b1b00cd7 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -348,7 +348,8 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 		      bool extra_pass)
 {
 	const u8 code = insn->code;
-	const u8 dst = bpf2a64(ctx, insn->dst_reg);
+	const u8 dstw = bpf2a64(ctx, insn->dst_reg);
+	const u8 dstr = bpf2a64(ctx, insn->dst_reg);
 	const u8 src = bpf2a64(ctx, insn->src_reg);
 	const u8 tmp = bpf2a64(ctx, TMP_REG_1);
 	const u8 tmp2 = bpf2a64(ctx, TMP_REG_2);
@@ -377,32 +378,32 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 	/* dst = src */
 	case BPF_ALU | BPF_MOV | BPF_X:
 	case BPF_ALU64 | BPF_MOV | BPF_X:
-		emit(A64_MOV(is64, dst, src), ctx);
+		emit(A64_MOV(is64, dstw, src), ctx);
 		break;
 	/* dst = dst OP src */
 	case BPF_ALU | BPF_ADD | BPF_X:
 	case BPF_ALU64 | BPF_ADD | BPF_X:
-		emit(A64_ADD(is64, dst, dst, src), ctx);
+		emit(A64_ADD(is64, dstw, dstr, src), ctx);
 		break;
 	case BPF_ALU | BPF_SUB | BPF_X:
 	case BPF_ALU64 | BPF_SUB | BPF_X:
-		emit(A64_SUB(is64, dst, dst, src), ctx);
+		emit(A64_SUB(is64, dstw, dstr, src), ctx);
 		break;
 	case BPF_ALU | BPF_AND | BPF_X:
 	case BPF_ALU64 | BPF_AND | BPF_X:
-		emit(A64_AND(is64, dst, dst, src), ctx);
+		emit(A64_AND(is64, dstw, dstr, src), ctx);
 		break;
 	case BPF_ALU | BPF_OR | BPF_X:
 	case BPF_ALU64 | BPF_OR | BPF_X:
-		emit(A64_ORR(is64, dst, dst, src), ctx);
+		emit(A64_ORR(is64, dstw, dstr, src), ctx);
 		break;
 	case BPF_ALU | BPF_XOR | BPF_X:
 	case BPF_ALU64 | BPF_XOR | BPF_X:
-		emit(A64_EOR(is64, dst, dst, src), ctx);
+		emit(A64_EOR(is64, dstw, dstr, src), ctx);
 		break;
 	case BPF_ALU | BPF_MUL | BPF_X:
 	case BPF_ALU64 | BPF_MUL | BPF_X:
-		emit(A64_MUL(is64, dst, dst, src), ctx);
+		emit(A64_MUL(is64, dstw, dstr, src), ctx);
 		break;
 	case BPF_ALU | BPF_DIV | BPF_X:
 	case BPF_ALU64 | BPF_DIV | BPF_X:
@@ -410,30 +411,30 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 	case BPF_ALU64 | BPF_MOD | BPF_X:
 		switch (BPF_OP(code)) {
 		case BPF_DIV:
-			emit(A64_UDIV(is64, dst, dst, src), ctx);
+			emit(A64_UDIV(is64, dstw, dstr, src), ctx);
 			break;
 		case BPF_MOD:
-			emit(A64_UDIV(is64, tmp, dst, src), ctx);
-			emit(A64_MSUB(is64, dst, dst, tmp, src), ctx);
+			emit(A64_UDIV(is64, tmp, dstr, src), ctx);
+			emit(A64_MSUB(is64, dstw, dstr, tmp, src), ctx);
 			break;
 		}
 		break;
 	case BPF_ALU | BPF_LSH | BPF_X:
 	case BPF_ALU64 | BPF_LSH | BPF_X:
-		emit(A64_LSLV(is64, dst, dst, src), ctx);
+		emit(A64_LSLV(is64, dstw, dstr, src), ctx);
 		break;
 	case BPF_ALU | BPF_RSH | BPF_X:
 	case BPF_ALU64 | BPF_RSH | BPF_X:
-		emit(A64_LSRV(is64, dst, dst, src), ctx);
+		emit(A64_LSRV(is64, dstw, dstr, src), ctx);
 		break;
 	case BPF_ALU | BPF_ARSH | BPF_X:
 	case BPF_ALU64 | BPF_ARSH | BPF_X:
-		emit(A64_ASRV(is64, dst, dst, src), ctx);
+		emit(A64_ASRV(is64, dstw, dstr, src), ctx);
 		break;
 	/* dst = -dst */
 	case BPF_ALU | BPF_NEG:
 	case BPF_ALU64 | BPF_NEG:
-		emit(A64_NEG(is64, dst, dst), ctx);
+		emit(A64_NEG(is64, dstw, dstr), ctx);
 		break;
 	/* dst = BSWAP##imm(dst) */
 	case BPF_ALU | BPF_END | BPF_FROM_LE:
@@ -447,16 +448,16 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 #endif
 		switch (imm) {
 		case 16:
-			emit(A64_REV16(is64, dst, dst), ctx);
+			emit(A64_REV16(is64, dstw, dstr), ctx);
 			/* zero-extend 16 bits into 64 bits */
-			emit(A64_UXTH(is64, dst, dst), ctx);
+			emit(A64_UXTH(is64, dstw, dstr), ctx);
 			break;
 		case 32:
-			emit(A64_REV32(is64, dst, dst), ctx);
+			emit(A64_REV32(is64, dstw, dstr), ctx);
 			/* upper 32 bits already cleared */
 			break;
 		case 64:
-			emit(A64_REV64(dst, dst), ctx);
+			emit(A64_REV64(dstw, dstr), ctx);
 			break;
 		}
 		break;
@@ -464,11 +465,11 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 		switch (imm) {
 		case 16:
 			/* zero-extend 16 bits into 64 bits */
-			emit(A64_UXTH(is64, dst, dst), ctx);
+			emit(A64_UXTH(is64, dstw, dstr), ctx);
 			break;
 		case 32:
 			/* zero-extend 32 bits into 64 bits */
-			emit(A64_UXTW(is64, dst, dst), ctx);
+			emit(A64_UXTW(is64, dstw, dstr), ctx);
 			break;
 		case 64:
 			/* nop */
@@ -478,61 +479,61 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 	/* dst = imm */
 	case BPF_ALU | BPF_MOV | BPF_K:
 	case BPF_ALU64 | BPF_MOV | BPF_K:
-		emit_a64_mov_i(is64, dst, imm, ctx);
+		emit_a64_mov_i(is64, dstw, imm, ctx);
 		break;
 	/* dst = dst OP imm */
 	case BPF_ALU | BPF_ADD | BPF_K:
 	case BPF_ALU64 | BPF_ADD | BPF_K:
 		emit_a64_mov_i(is64, tmp, imm, ctx);
-		emit(A64_ADD(is64, dst, dst, tmp), ctx);
+		emit(A64_ADD(is64, dstw, dstr, tmp), ctx);
 		break;
 	case BPF_ALU | BPF_SUB | BPF_K:
 	case BPF_ALU64 | BPF_SUB | BPF_K:
 		emit_a64_mov_i(is64, tmp, imm, ctx);
-		emit(A64_SUB(is64, dst, dst, tmp), ctx);
+		emit(A64_SUB(is64, dstw, dstr, tmp), ctx);
 		break;
 	case BPF_ALU | BPF_AND | BPF_K:
 	case BPF_ALU64 | BPF_AND | BPF_K:
 		emit_a64_mov_i(is64, tmp, imm, ctx);
-		emit(A64_AND(is64, dst, dst, tmp), ctx);
+		emit(A64_AND(is64, dstw, dstr, tmp), ctx);
 		break;
 	case BPF_ALU | BPF_OR | BPF_K:
 	case BPF_ALU64 | BPF_OR | BPF_K:
 		emit_a64_mov_i(is64, tmp, imm, ctx);
-		emit(A64_ORR(is64, dst, dst, tmp), ctx);
+		emit(A64_ORR(is64, dstw, dstr, tmp), ctx);
 		break;
 	case BPF_ALU | BPF_XOR | BPF_K:
 	case BPF_ALU64 | BPF_XOR | BPF_K:
 		emit_a64_mov_i(is64, tmp, imm, ctx);
-		emit(A64_EOR(is64, dst, dst, tmp), ctx);
+		emit(A64_EOR(is64, dstw, dstr, tmp), ctx);
 		break;
 	case BPF_ALU | BPF_MUL | BPF_K:
 	case BPF_ALU64 | BPF_MUL | BPF_K:
 		emit_a64_mov_i(is64, tmp, imm, ctx);
-		emit(A64_MUL(is64, dst, dst, tmp), ctx);
+		emit(A64_MUL(is64, dstw, dstr, tmp), ctx);
 		break;
 	case BPF_ALU | BPF_DIV | BPF_K:
 	case BPF_ALU64 | BPF_DIV | BPF_K:
 		emit_a64_mov_i(is64, tmp, imm, ctx);
-		emit(A64_UDIV(is64, dst, dst, tmp), ctx);
+		emit(A64_UDIV(is64, dstw, dstr, tmp), ctx);
 		break;
 	case BPF_ALU | BPF_MOD | BPF_K:
 	case BPF_ALU64 | BPF_MOD | BPF_K:
 		emit_a64_mov_i(is64, tmp2, imm, ctx);
-		emit(A64_UDIV(is64, tmp, dst, tmp2), ctx);
-		emit(A64_MSUB(is64, dst, dst, tmp, tmp2), ctx);
+		emit(A64_UDIV(is64, tmp, dstr, tmp2), ctx);
+		emit(A64_MSUB(is64, dstw, dstr, tmp, tmp2), ctx);
 		break;
 	case BPF_ALU | BPF_LSH | BPF_K:
 	case BPF_ALU64 | BPF_LSH | BPF_K:
-		emit(A64_LSL(is64, dst, dst, imm), ctx);
+		emit(A64_LSL(is64, dstw, dstr, imm), ctx);
 		break;
 	case BPF_ALU | BPF_RSH | BPF_K:
 	case BPF_ALU64 | BPF_RSH | BPF_K:
-		emit(A64_LSR(is64, dst, dst, imm), ctx);
+		emit(A64_LSR(is64, dstw, dstr, imm), ctx);
 		break;
 	case BPF_ALU | BPF_ARSH | BPF_K:
 	case BPF_ALU64 | BPF_ARSH | BPF_K:
-		emit(A64_ASR(is64, dst, dst, imm), ctx);
+		emit(A64_ASR(is64, dstw, dstr, imm), ctx);
 		break;

 	/* JUMP off */
@@ -562,7 +563,7 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 	case BPF_JMP32 | BPF_JSLT | BPF_X:
 	case BPF_JMP32 | BPF_JSGE | BPF_X:
 	case BPF_JMP32 | BPF_JSLE | BPF_X:
-		emit(A64_CMP(is64, dst, src), ctx);
+		emit(A64_CMP(is64, dstr, src), ctx);
 emit_cond_jmp:
 		jmp_offset = bpf2a64_offset(i + off, i, ctx);
 		check_imm19(jmp_offset);
@@ -605,7 +606,7 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 		break;
 	case BPF_JMP | BPF_JSET | BPF_X:
 	case BPF_JMP32 | BPF_JSET | BPF_X:
-		emit(A64_TST(is64, dst, src), ctx);
+		emit(A64_TST(is64, dstr, src), ctx);
 		goto emit_cond_jmp;
 	/* IF (dst COND imm) JUMP off */
 	case BPF_JMP | BPF_JEQ | BPF_K:
@@ -629,12 +630,12 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 	case BPF_JMP32 | BPF_JSGE | BPF_K:
 	case BPF_JMP32 | BPF_JSLE | BPF_K:
 		emit_a64_mov_i(is64, tmp, imm, ctx);
-		emit(A64_CMP(is64, dst, tmp), ctx);
+		emit(A64_CMP(is64, dstr, tmp), ctx);
 		goto emit_cond_jmp;
 	case BPF_JMP | BPF_JSET | BPF_K:
 	case BPF_JMP32 | BPF_JSET | BPF_K:
 		emit_a64_mov_i(is64, tmp, imm, ctx);
-		emit(A64_TST(is64, dst, tmp), ctx);
+		emit(A64_TST(is64, dstr, tmp), ctx);
 		goto emit_cond_jmp;
 	/* function call */
 	case BPF_JMP | BPF_CALL:
@@ -676,7 +677,7 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 		u64 imm64;

 		imm64 = (u64)insn1.imm << 32 | (u32)imm;
-		emit_a64_mov_i64(dst, imm64, ctx);
+		emit_a64_mov_i64(dstw, imm64, ctx);

 		return 1;
 	}
@@ -689,16 +690,16 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 		emit_a64_mov_i(1, tmp, off, ctx);
 		switch (BPF_SIZE(code)) {
 		case BPF_W:
-			emit(A64_LDR32(dst, src, tmp), ctx);
+			emit(A64_LDR32(dstw, src, tmp), ctx);
 			break;
 		case BPF_H:
-			emit(A64_LDRH(dst, src, tmp), ctx);
+			emit(A64_LDRH(dstw, src, tmp), ctx);
 			break;
 		case BPF_B:
-			emit(A64_LDRB(dst, src, tmp), ctx);
+			emit(A64_LDRB(dstw, src, tmp), ctx);
 			break;
 		case BPF_DW:
-			emit(A64_LDR64(dst, src, tmp), ctx);
+			emit(A64_LDR64(dstw, src, tmp), ctx);
 			break;
 		}
 		break;
@@ -713,16 +714,16 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 		emit_a64_mov_i(1, tmp, imm, ctx);
 		switch (BPF_SIZE(code)) {
 		case BPF_W:
-			emit(A64_STR32(tmp, dst, tmp2), ctx);
+			emit(A64_STR32(tmp, dstr, tmp2), ctx);
 			break;
 		case BPF_H:
-			emit(A64_STRH(tmp, dst, tmp2), ctx);
+			emit(A64_STRH(tmp, dstr, tmp2), ctx);
 			break;
 		case BPF_B:
-			emit(A64_STRB(tmp, dst, tmp2), ctx);
+			emit(A64_STRB(tmp, dstr, tmp2), ctx);
 			break;
 		case BPF_DW:
-			emit(A64_STR64(tmp, dst, tmp2), ctx);
+			emit(A64_STR64(tmp, dstr, tmp2), ctx);
 			break;
 		}
 		break;
@@ -735,16 +736,16 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 		emit_a64_mov_i(1, tmp, off, ctx);
 		switch (BPF_SIZE(code)) {
 		case BPF_W:
-			emit(A64_STR32(src, dst, tmp), ctx);
+			emit(A64_STR32(src, dstr, tmp), ctx);
 			break;
 		case BPF_H:
-			emit(A64_STRH(src, dst, tmp), ctx);
+			emit(A64_STRH(src, dstr, tmp), ctx);
 			break;
 		case BPF_B:
-			emit(A64_STRB(src, dst, tmp), ctx);
+			emit(A64_STRB(src, dstr, tmp), ctx);
 			break;
 		case BPF_DW:
-			emit(A64_STR64(src, dst, tmp), ctx);
+			emit(A64_STR64(src, dstr, tmp), ctx);
 			break;
 		}
 		break;
@@ -754,10 +755,10 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 	/* STX XADD: lock *(u64 *)(dst + off) += src */
 	case BPF_STX | BPF_XADD | BPF_DW:
 		if (!off) {
-			reg = dst;
+			reg = dstr;
 		} else {
 			emit_a64_mov_i(1, tmp, off, ctx);
-			emit(A64_ADD(1, tmp, tmp, dst), ctx);
+			emit(A64_ADD(1, tmp, tmp, dstr), ctx);
 			reg = tmp;
 		}
 		if (cpus_have_cap(ARM64_HAS_LSE_ATOMICS)) {
On arm64, the BPF function ABI doesn't match the C function ABI. Specifically, arm64 encodes calls as `a0 = f(a0, a1, ...)` while BPF encodes calls as `BPF_REG_0 = f(BPF_REG_1, BPF_REG_2, ...)`. This discrepancy results in function calls being encoded as a two-instruction sequence that first does a C ABI call and then moves the return register into the right place. This results in one extra instruction for every function call.
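For reference, this is roughly what the JIT emits for BPF_JMP | BPF_CALL today (paraphrasing the existing build_insn() code; r0 here is BPF_REG_0's home register, x7, in the default map):

	emit_addr_mov_i64(tmp, func_addr, ctx);	/* materialize the helper's address */
	emit(A64_BLR(tmp), ctx);		/* C ABI call, result lands in x0 */
	emit(A64_MOV(1, r0, A64_R(0)), ctx);	/* the extra move: x0 -> x7 (BPF_REG_0) */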
This patch adds an optimization to the arm64 BPF JIT backend that aims to avoid some of these moves.
I've done no benchmarking to determine if this is correct. I ran the BPF selftests before and after the change on arm64 in QEMU and found that I had a single failure both before and after. I'm not at all confident this code actually works as it's my first time doing anything with both ARM64 and BPF and I didn't even open the documentation for either of these. I was particularly surprised that the code didn't fail any tests -- I was kind of assuming this would fail the tests, get put on the backburner, sit long enough for me to stop caring, and then get deleted.
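To make the intended effect concrete, here's a hand-worked (unverified) sketch of what build_insn() ends up emitting for the BPF pair `call helper; r0 += 2`, written as the corresponding emit() calls (x7 is BPF_REG_0's slot in the default map, x0 is the C return register):

	/* Before this patch: */
	emit(A64_BLR(tmp), ctx);			/* call, result in x0 */
	emit(A64_MOV(1, A64_R(7), A64_R(0)), ctx);	/* x7 = x0 */
	emit_a64_mov_i(1, tmp, 2, ctx);
	emit(A64_ADD(1, A64_R(7), A64_R(7), tmp), ctx);

	/* After: BPF_REG_0 stays in x0 until REG_0 or REG_1 is written, so the
	 * ALU op reads dstr = x0 and writes dstw = x7, folding the move away. */
	emit(A64_BLR(tmp), ctx);
	emit_a64_mov_i(1, tmp, 2, ctx);
	emit(A64_ADD(1, A64_R(7), A64_R(0), tmp), ctx);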
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
---
 arch/arm64/net/bpf_jit_comp.c | 71 +++++++++++++++++++++++++++++++++--
 1 file changed, 68 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index fba5b1b00cd7..48d900cc7258 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -58,10 +58,14 @@ struct jit_ctx {
 	int *offset;
 	__le32 *image;
 	u32 stack_size;
+	int reg0_in_reg1;
 };

 static inline int bpf2a64(struct jit_ctx *ctx, int bpf_reg)
 {
+	if (ctx->reg0_in_reg1 && bpf_reg == BPF_REG_0)
+		bpf_reg = BPF_REG_1;
+
 	return bpf2a64_default[bpf_reg];
 }

@@ -338,6 +342,47 @@ static void build_epilogue(struct jit_ctx *ctx)
 	emit(A64_RET(A64_LR), ctx);
 }

+static int dead_register(const struct jit_ctx *ctx, int offset, int bpf_reg)
+{
+	const struct bpf_prog *prog = ctx->prog;
+	int i;
+
+	for (i = offset; i < prog->len; ++i) {
+		const struct bpf_insn *insn = &prog->insnsi[i];
+		const u8 code = insn->code;
+		const u8 bpf_dst = insn->dst_reg;
+		const u8 bpf_src = insn->src_reg;
+		const int writes_dst = !((code & BPF_ST) || (code & BPF_STX)
+					 || (code & BPF_JMP32) || (code & BPF_JMP));
+		const int reads_dst = !((code & BPF_LD));
+		const int reads_src = true;
+
+		/* Calls are a bit special in that they clobber a bunch of regisers. */
+		if ((code & (BPF_JMP | BPF_CALL)) || (code & (BPF_JMP | BPF_TAIL_CALL)))
+			if ((bpf_reg >= BPF_REG_0) && (bpf_reg <= BPF_REG_5))
+				return false;
+
+		/* Registers that are read before they're written are alive.
+		 * Most opcodes are of the form DST = DEST op SRC, but there
+		 * are some exceptions.*/
+		if (bpf_src == bpf_reg && reads_src)
+			return false;
+
+		if (bpf_dst == bpf_reg && reads_dst)
+			return false;
+
+		if (bpf_dst == bpf_reg && writes_dst)
+			return true;
+
+		/* Most BPF instructions are 8 bits long, but some ar 16 bits
+		 * long. */
+		if (code & (BPF_LD | BPF_IMM | BPF_DW))
+			++i;
+	}
+
+	return true;
+}
+
 /* JITs an eBPF instruction.
  * Returns:
  * 0 - successfully JITed an 8-byte eBPF instruction.
@@ -348,7 +393,7 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 		      bool extra_pass)
 {
 	const u8 code = insn->code;
-	const u8 dstw = bpf2a64(ctx, insn->dst_reg);
+	u8 dstw;
 	const u8 dstr = bpf2a64(ctx, insn->dst_reg);
 	const u8 src = bpf2a64(ctx, insn->src_reg);
 	const u8 tmp = bpf2a64(ctx, TMP_REG_1);
@@ -374,6 +419,27 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 #define check_imm19(imm) check_imm(19, imm)
 #define check_imm26(imm) check_imm(26, imm)

+	/* Handle BPF_REG_0, which may be in the wrong place because the ARM64
+	 * ABI doesn't match the BPF ABI for function calls. */
+	if (ctx->reg0_in_reg1) {
+		/* If we're writing BPF_REG_0 then we don't need to do any
+		 * extra work to get the registers back in their correct
+		 * locations. */
+		if (insn->dst_reg == BPF_REG_0)
+			ctx->reg0_in_reg1 = false;
+
+		/* If we're writing to BPF_REG_1 then we need to save BPF_REG_0
+		 * into the correct location if it's still alive, as otherwise
+		 * it will be clobbered. */
+		if (insn->dst_reg == BPF_REG_1) {
+			if (!dead_register(ctx, off + 1, BPF_REG_0))
+				emit(A64_MOV(1, A64_R(7), A64_R(0)), ctx);
+			ctx->reg0_in_reg1 = false;
+		}
+	}
+
+	dstw = bpf2a64(ctx, insn->dst_reg);
+
 	switch (code) {
 	/* dst = src */
 	case BPF_ALU | BPF_MOV | BPF_X:
@@ -640,7 +706,6 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 	/* function call */
 	case BPF_JMP | BPF_CALL:
 	{
-		const u8 r0 = bpf2a64(ctx, BPF_REG_0);
 		bool func_addr_fixed;
 		u64 func_addr;
 		int ret;
@@ -651,7 +716,7 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 			return ret;
 		emit_addr_mov_i64(tmp, func_addr, ctx);
 		emit(A64_BLR(tmp), ctx);
-		emit(A64_MOV(1, r0, A64_R(0)), ctx);
+		ctx->reg0_in_reg1 = true;
 		break;
 	}
 	/* tail call */
On Tue, 28 Jan 2020 at 03:15, Palmer Dabbelt palmerdabbelt@google.com wrote:
On arm64, the BPF function ABI doesn't match the C function ABI. Specifically, arm64 encodes calls as `a0 = f(a0, a1, ...)` while BPF encodes calls as `BPF_REG_0 = f(BPF_REG_1, BPF_REG_2, ...)`. This discrepancy results in function calls being encoded as a two-instruction sequence that first does a C ABI call and then moves the return register into the right place. This results in one extra instruction for every function call.
It's a lot of extra work for one reg-to-reg move, but it always annoyed me in the RISC-V JIT. :-) So, if it *can* be avoided, why not.
[...]
+static int dead_register(const struct jit_ctx *ctx, int offset, int bpf_reg)
Given that a lot of archs (RISC-V, arm?, MIPS?) might benefit from this, it would be nice if it could be made generic (it already is pretty much), and moved to kernel/bpf.
+{
const struct bpf_prog *prog = ctx->prog;
int i;
for (i = offset; i < prog->len; ++i) {
const struct bpf_insn *insn = &prog->insnsi[i];
const u8 code = insn->code;
const u8 bpf_dst = insn->dst_reg;
const u8 bpf_src = insn->src_reg;
const int writes_dst = !((code & BPF_ST) || (code & BPF_STX)
|| (code & BPF_JMP32) || (code & BPF_JMP));
const int reads_dst = !((code & BPF_LD));
const int reads_src = true;
/* Calls are a bit special in that they clobber a bunch of regisers. */
if ((code & (BPF_JMP | BPF_CALL)) || (code & (BPF_JMP | BPF_TAIL_CALL)))
if ((bpf_reg >= BPF_REG_0) && (bpf_reg <= BPF_REG_5))
return false;
/* Registers that are read before they're written are alive.
* Most opcodes are of the form DST = DEST op SRC, but there
* are some exceptions.*/
if (bpf_src == bpf_reg && reads_src)
return false;
if (bpf_dst == bpf_reg && reads_dst)
return false;
if (bpf_dst == bpf_reg && writes_dst)
return true;
/* Most BPF instructions are 8 bits long, but some ar 16 bits
* long. */
A bunch of spelling errors above.
Cheers, Björn
On Mon, Jan 27, 2020 at 06:11:45PM -0800, Palmer Dabbelt wrote:
- /* Handle BPF_REG_0, which may be in the wrong place because the ARM64
* ABI doesn't match the BPF ABI for function calls. */
- if (ctx->reg0_in_reg1) {
/* If we're writing BPF_REG_0 then we don't need to do any
* extra work to get the registers back in their correct
* locations. */
if (insn->dst_reg == BPF_REG_0)
ctx->reg0_in_reg1 = false;
/* If we're writing to BPF_REG_1 then we need to save BPF_REG_0
* into the correct location if it's still alive, as otherwise
* it will be clobbered. */
if (insn->dst_reg == BPF_REG_1) {
if (!dead_register(ctx, off + 1, BPF_REG_0))
emit(A64_MOV(1, A64_R(7), A64_R(0)), ctx);
ctx->reg0_in_reg1 = false;
}
- }
I'm not sure this is correct, since it processes the insns as linear code, but there could be jumps into the middle. The logic should be following the control flow of the program. The verifier is a better place to do such analysis. I don't see how JITs can do it on their own.
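To make the concern concrete, here is a hand-written example (for illustration only; it hasn't been run through the verifier or the JIT) where the linear handling of reg0_in_reg1 goes wrong because the write to R0 sits on only one path:

	BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1, 0),	/* some value from ctx */
	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_get_prandom_u32),
	BPF_JMP_IMM(BPF_JEQ, BPF_REG_6, 0, 2),		/* branch the linear walk ignores */
	BPF_MOV64_IMM(BPF_REG_0, 0),			/* fall-through path overwrites R0 */
	BPF_JMP_A(1),
	BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 1),		/* branch-taken path reads the call's R0 */
	BPF_EXIT_INSN(),

Walking forward from the call, the JIT sees the write to R0 at insn 3 and drops the deferred x0 -> x7 move, but when the branch at insn 2 is taken, insn 5 reads R0 while x7 was never updated with the call's result.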
On Tue, 28 Jan 2020 at 03:14, Palmer Dabbelt palmerdabbelt@google.com wrote:
There are four patches here, but only one of them actually does anything. The first patch fixes a BPF selftests build failure on my machine and has already been sent to the list separately. The next three are staged so that the refactoring that changes no functionality is split out from the patch that is the whole point of the series: two cleanups and then the idea itself.
Maybe this is an odd thing to say in a cover letter, but I'm not actually sure this patch set is a good idea. The issue of extra moves after calls came up as I was reviewing some unrelated performance optimizations to the RISC-V BPF JIT. I figured I'd take a whack at performing the optimization in the context of the arm64 port just to get a breath of fresh air, and I'm not convinced I like the results.
That said, I think I would accept something like this for the RISC-V port, because we're already doing a multi-pass optimization for shrinking function addresses, so it's not as much extra complexity over there. If we do that we should probably start pulling some of this code into the shared BPF compiler, but we're also opening the doors to more complicated BPF JIT optimizations. Given that the BPF JIT appears to have been designed explicitly to be simple/fast as opposed to performing complex optimizations, I'm not sure this is a sane way to move forward.
Obviously I can only speak for myself and the RISC-V JIT, but given that we already have opened the door for more advanced translations (branch relaxation e.g.), I think that this makes sense. At the same time we don't want to go all JVM on the JITs. :-P
I figured I'd send the patch set out as more of a question than anything else. Specifically:
- How should I go about measuring the performance of these sorts of optimizations? I'd like to balance the time it takes to run the JIT with the time spent executing the program, but I don't have any feel for what real BPF programs look like or have any benchmark suite to run. Is there something out there this should be benchmarked against? (I'd also like to know how to run those benchmarks on the RISC-V port.)
If you run the selftests 'test_progs' with -v it'll measure/print the execution time of the programs. I'd say *most* BPF program invokes a helper (via call). It would be interesting to see, for say the selftests, how often the optimization can be performed.
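One cheap way to get that number would be a counter in the JIT itself; a rough fragment sketch against this series (the field name is hypothetical, nothing like it exists today):

	/* in struct jit_ctx, alongside reg0_in_reg1 */
	int deferred_moves;

	/* in the BPF_JMP | BPF_CALL case, where the mov used to be emitted */
	ctx->reg0_in_reg1 = true;
	ctx->deferred_moves++;

	/* once, at the end of bpf_int_jit_compile() */
	pr_debug("bpf_jit: deferred %d moves over %d insns\n",
		 ctx.deferred_moves, prog->len);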
- Is this the sort of thing that makes sense in a BPF JIT? I guess I've just realized I turned "review this patch" into a way bigger rabbit hole than I really want to go down...
I'd say 'yes'. My hunch, based on the workloads I've seen, is that BPF programs are usually loaded and then resident for a long time. So, the JIT time is not super critical. The FB/Cilium folks can definitely provide a better sample point than my hunch. ;-)
Björn
Björn Töpel wrote:
On Tue, 28 Jan 2020 at 03:14, Palmer Dabbelt palmerdabbelt@google.com wrote:
There are four patches here, but only one of them actually does anything. The first patch fixes a BPF selftests build failure on my machine and has already been sent to the list separately. The next three are staged so that the refactoring that changes no functionality is split out from the patch that is the whole point of the series: two cleanups and then the idea itself.
Maybe this is an odd thing to say in a cover letter, but I'm not actually sure this patch set is a good idea. The issue of extra moves after calls came up as I was reviewing some unrelated performance optimizations to the RISC-V BPF JIT. I figured I'd take a whack at performing the optimization in the context of the arm64 port just to get a breath of fresh air, and I'm not convinced I like the results.
That said, I think I would accept something like this for the RISC-V port, because we're already doing a multi-pass optimization for shrinking function addresses, so it's not as much extra complexity over there. If we do that we should probably start pulling some of this code into the shared BPF compiler, but we're also opening the doors to more complicated BPF JIT optimizations. Given that the BPF JIT appears to have been designed explicitly to be simple/fast as opposed to performing complex optimizations, I'm not sure this is a sane way to move forward.
Obviously I can only speak for myself and the RISC-V JIT, but given that we already have opened the door for more advanced translations (branch relaxation e.g.), I think that this makes sense. At the same time we don't want to go all JVM on the JITs. :-P
I'm not against it, although if we start to go this route I would want some way to quantify how we are increasing/decreasing load times.
I figured I'd send the patch set out as more of a question than anything else. Specifically:
- How should I go about measuring the performance of these sorts of optimizations? I'd like to balance the time it takes to run the JIT with the time spent executing the program, but I don't have any feel for what real BPF programs look like or have any benchmark suite to run. Is there something out there this should be benchmarked against? (I'd also like to know how to run those benchmarks on the RISC-V port.)
If you run the selftests 'test_progs' with -v it'll measure/print the execution time of the programs. I'd say *most* BPF program invokes a helper (via call). It would be interesting to see, for say the selftests, how often the optimization can be performed.
- Is this the sort of thing that makes sense in a BPF JIT? I guess I've just realized I turned "review this patch" into a way bigger rabbit hole than I really want to go down...
I'd say 'yes'. My hunch, based on the workloads I've seen, is that BPF programs are usually loaded and then resident for a long time. So, the JIT time is not super critical. The FB/Cilium folks can definitely provide a better sample point than my hunch. ;-)
In our case the JIT time can be relevant because we are effectively holding up a kubernetes pod load waiting for programs to load. However, we can probably work-around it by doing more aggressive dynamic linking now that this is starting to land.
It would be interesting to have a test to measure load time in selftests or selftests/benchmark/ perhaps. We have some of these out of tree we could push in I think if there is interest.
Björn
On Tue, 04 Feb 2020 12:33:13 PST (-0800), john.fastabend@gmail.com wrote:
Björn Töpel wrote:
On Tue, 28 Jan 2020 at 03:14, Palmer Dabbelt palmerdabbelt@google.com wrote:
There are four patches here, but only one of them actually does anything. The first patch fixes a BPF selftests build failure on my machine and has already been sent to the list separately. The next three are staged so that the refactoring that changes no functionality is split out from the patch that is the whole point of the series: two cleanups and then the idea itself.
Maybe this is an odd thing to say in a cover letter, but I'm not actually sure this patch set is a good idea. The issue of extra moves after calls came up as I was reviewing some unrelated performance optimizations to the RISC-V BPF JIT. I figured I'd take a whack at performing the optimization in the context of the arm64 port just to get a breath of fresh air, and I'm not convinced I like the results.
That said, I think I would accept something like this for the RISC-V port, because we're already doing a multi-pass optimization for shrinking function addresses, so it's not as much extra complexity over there. If we do that we should probably start pulling some of this code into the shared BPF compiler, but we're also opening the doors to more complicated BPF JIT optimizations. Given that the BPF JIT appears to have been designed explicitly to be simple/fast as opposed to performing complex optimizations, I'm not sure this is a sane way to move forward.
Obviously I can only speak for myself and the RISC-V JIT, but given that we already have opened the door for more advanced translations (branch relaxation e.g.), I think that this makes sense. At the same time we don't want to go all JVM on the JITs. :-P
I'm not against it, although if we start to go this route I would want some way to quantify how we are increasing/decreasing load times.
I figured I'd send the patch set out as more of a question than anything else. Specifically:
- How should I go about measuring the performance of these sorts of optimizations? I'd like to balance the time it takes to run the JIT with the time spent executing the program, but I don't have any feel for what real BPF programs look like or have any benchmark suite to run. Is there something out there this should be benchmarked against? (I'd also like to know how to run those benchmarks on the RISC-V port.)
If you run the selftests 'test_progs' with -v it'll measure/print the execution time of the programs. I'd say *most* BPF program invokes a helper (via call). It would be interesting to see, for say the selftests, how often the optimization can be performed.
- Is this the sort of thing that makes sense in a BPF JIT? I guess I've just realized I turned "review this patch" into a way bigger rabbit hole than I really want to go down...
I'd say 'yes'. My hunch, based on the workloads I've seen, is that BPF programs are usually loaded and then resident for a long time. So, the JIT time is not super critical. The FB/Cilium folks can definitely provide a better sample point than my hunch. ;-)
In our case the JIT time can be relevant because we are effectively holding up a kubernetes pod load waiting for programs to load. However, we can probably work-around it by doing more aggressive dynamic linking now that this is starting to land.
It would be interesting to have a test to measure load time in selftests or selftests/benchmark/ perhaps. We have some of these out of tree we could push in I think if there is interest.
I'd be interested in some sort of benchmark suite for BPF. Something like selftests/bpf/benchmarks/ seems like a reasonable place to me.
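As a starting point, a load-time benchmark could be as simple as timing bpf_object__load(); a rough sketch using libbpf (the object file name is a placeholder, and nothing like this exists in selftests today):

#include <bpf/libbpf.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec start, end;
	struct bpf_object *obj;

	obj = bpf_object__open("test_prog.o");	/* placeholder object file */
	if (libbpf_get_error(obj))
		return 1;

	clock_gettime(CLOCK_MONOTONIC, &start);
	if (bpf_object__load(obj))		/* verification + JIT happen here */
		return 1;
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("load took %ld us\n",
	       (end.tv_sec - start.tv_sec) * 1000000 +
	       (end.tv_nsec - start.tv_nsec) / 1000);

	bpf_object__close(obj);
	return 0;
}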
Björn