On Tue, May 5, 2020 at 6:37 AM Daniel Borkmann daniel@iogearbox.net wrote:
On 5/5/20 1:48 AM, Alexei Starovoitov wrote:
On Sat, May 02, 2020 at 01:48:51AM +0200, Daniel Borkmann wrote:
On 4/27/20 11:45 AM, Lorenz Bauer wrote:
On Sun, 26 Apr 2020 at 18:33, Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
[...]
+/* Linux packet pointers are either aligned to NET_IP_ALIGN (aka 2 bytes),
- or not aligned if the arch supports efficient unaligned access.
- Since the verifier ensures that eBPF packet accesses follow these rules,
- we can tell LLVM to emit code as if we always had a larger alignment.
- It will yell at us if we end up on a platform where this is not valid.
- */
+typedef uint8_t *net_ptr __attribute__((align_value(8)));
Wow. I didn't know about this attribute. I wonder whether it can help Daniel's memcpy hack.
Yes, I think so.
Just for some more context [0]. I think the problem is a bit more complex in general. Generally, _any_ kind of pointer to some data (except for the stack) is currently treated as byte-by-byte copy from __builtin_memcpy() and other similarly available __builtin_*() helpers on BPF backend since the backend cannot make any assumptions about the data's alignment and whether unaligned access from the underlying arch is ok & efficient (the latter the verifier does judge for us however). So it's definitely not just limited to xdp->data. There is also the issue that while access to any non-stack data can be unaligned, access to the stack however cannot. I've discussed a while back with Yonghong about potential solutions. One would be to add a small patch to the BPF backend to enable __builtin_*() helpers to allow for unaligned access which could then be opt-ed in e.g. via -mattr from llc for the case when we know that the compiled program only runs on archs with efficient unaligned access anyway. However, this still potentially breaks with the BPF stack for the case when objects are, for example, larger than size 8 but with a natural alignment smaller than 8 where __builtin_memcpy() would then decide to emit dw-typed load/stores. But for these cases could then be annotated via __aligned(8) on stack. So this is basically what we do right now as a generic workaround in Cilium [0], meaning, our own memcpy/memset with optimal number of instructions and __aligned(8) where needed; most of the time this __aligned(8) is not needed, so it's really just a few places, and we also have a cocci scripts to catch these during development if needed. Anyway, real thing would be to allow the BPF stack for unaligned access as well and then BPF backend could nicely solve this in a native way w/o any workarounds, but that is tbd.
Thanks, Daniel
[0] https://github.com/cilium/cilium/blob/master/bpf/include/bpf/builtins.h
Daniel, do you mind adding such memcpy to libbpf ?
We could do that, yeah. Is there a way from BPF C code when compiling with clang to get to the actual underlying architecture (x86-64, arm64, ppc, etc) when compiling with `-target bpf` so that we can always fall back to __builtin_*() for those where verifier would bail out on unaligned access? Keep in mind the __bpf_memcpy() and __bpf_memzero() from [0] are fully compile time resolved and I did the implementation for sizes of 1,2,4,..., 96 where the latter is in two' increments, so no odd buffer sizes as we don't need them in our code. If someone does hit such an odd case, then I'm currently throwing a compilation error via __throw_build_bug(). Latter is a nice way to be able to guarantee that a switch/case or if condition is never hit during compilation time. It resolves to __builtin_trap() which is not implemented in the BPF backend and therefore yells to the developer when built into the code (this has a nice property which wouldn't work with BUILD_BUG_ON() for example). Anyway, what I'm saying is that either we'd need the full thing with all sizes or document that unsupported size would be hit when __builtin_trap() assertion is seen.
I think it would be fine to simply document it. Most structures have at least one 'int' and don't have 'packed', so they are multiple of 4 typically. Multiple of 2 limitation should be acceptable for most cases.