Re: arm64: bpf: Elide some moves to a0 after calls

4 Feb 2020

      Björn Töpel wrote:
...
On Tue, 28 Jan 2020 at 03:14, Palmer Dabbelt <palmerdabbelt@google.com> wrote:
...
There's four patches here, but only one of them actually does anything.  The
first patch fixes a BPF selftests build failure on my machine and has already
been sent to the list separately.  The next three are staged so that the
refactorings that don't change any functionality are split out from the change
that is the whole point of them: two cleanups and then the idea.
Maybe this is an odd thing to say in a cover letter, but I'm not actually sure
this patch set is a good idea.  The issue of extra moves after calls came up as
I was reviewing some unrelated performance optimizations to the RISC-V BPF JIT.
I figured I'd take a whack at performing the optimization in the context of the
arm64 port just to get a breath of fresh air, and I'm not convinced I like the
results.
That said, I think I would accept something like this for the RISC-V port
because we're already doing a multi-pass optimization for shrinking function
addresses so it's not as much extra complexity over there.  If we do that we
should probably start pulling some of this code into the shared BPF compiler,
but we're also opening the doors to more complicated BPF JIT optimizations.
Given that the BPF JIT appears to have been designed explicitly to be
simple/fast as opposed to perform complex optimization, I'm not sure this is a
sane way to move forward.

Obviously I can only speak for myself and the RISC-V JIT, but given
that we already have opened the door for more advanced translations
(branch relaxation e.g.), I think that this makes sense. At the same
time we don't want to go all JVM on the JITs. :-P

I'm not against it, although if we start to go this route I would want some
way to quantify how we are increasing/decreasing load times.
...
...
I figured I'd send the patch set out as more of a question than anything else.
Specifically:

How should I go about measuring the performance of this sort of
optimization?  I'd like to balance the time it takes to run the JIT with the
time spent executing the program, but I don't have any feel for what real BPF
programs look like or have any benchmark suite to run.  Is there something
out there this should be benchmarked against?  (I'd also like to know that to
run those benchmarks on the RISC-V port.)

If you run the selftests 'test_progs' with -v it'll measure/print the
execution time of the programs. I'd say *most* BPF programs invoke a
helper (via call). It would be interesting to see, for say the
selftests, how often the optimization can be performed.
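
One rough way to frame the JIT-time versus run-time balance asked about above
is a break-even count: how many times a program has to run before the extra
time spent in the JIT is paid back. The sketch below is only an illustration;
break_even_runs() is a hypothetical helper and the numbers are made up, not
measurements from any JIT.

```python
import math


def break_even_runs(extra_jit_ns: float, saved_ns_per_run: float) -> float:
    """Return how many program runs it takes to pay back the extra JIT time.

    extra_jit_ns:     additional time the JIT spends on the optimization
    saved_ns_per_run: execution time saved on each run of the program
    """
    if saved_ns_per_run <= 0:
        # No per-run saving means the extra JIT time is never recovered.
        return math.inf
    return extra_jit_ns / saved_ns_per_run


# Made-up numbers, purely for illustration.
runs = break_even_runs(extra_jit_ns=50_000.0, saved_ns_per_run=2.0)
print(f"break-even after {runs:.0f} runs")
```

For a program that stays resident and runs per packet or per syscall, the
break-even point would likely be reached quickly; for a program that is loaded
on a latency-sensitive path and run only a few times, it may never be.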
...

Is this the sort of thing that makes sense in a BPF JIT?  I guess I've just
realized I turned "review this patch" into a way bigger rabbit hole than I
really want to go down...

I'd say 'yes'. My hunch, from the workloads I've seen, is that BPF programs
are usually loaded and then resident for a long time. So, the JIT time is
not super critical. The FB/Cilium folks can definitely provide a
better sample point than my hunch. ;-)

In our case the JIT time can be relevant because we are effectively holding
up a Kubernetes pod load waiting for programs to load. However, we can
probably work around it by doing more aggressive dynamic linking now that
this is starting to land.
It would be interesting to have a test to measure load time in selftests
or selftests/benchmark/ perhaps. We have some of these out of tree we
could push in I think if there is interest.
...
Björn
