On Sun, 26 Apr 2020 at 18:33, Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
On Fri, Apr 24, 2020 at 07:55:55PM +0100, Lorenz Bauer wrote:
cls_redirect is a TC clsact based replacement for the glb-redirect iptables module available at [1]. It enables what GitHub calls "second chance" flows [2], an idea also proposed by the Beamer paper [3]. In contrast to glb-redirect, it also supports migrating UDP flows, as long as connected sockets are used. cls_redirect is in production at Cloudflare, as part of our own L4 load balancer.
This is awesome. Thank you for contributing! Applied.
There are two major distinctions from glb-redirect: first, cls_redirect relies on receiving encapsulated packets directly from a router. This is because we don't have access to the neighbour tables from BPF, yet. See
Let's make it available then :)
Yes, I've been dragging my feet on this. It seems like the fib_lookup helper does almost what we want, but I need to experiment with it. In the best case, we'd have a helper that does the following (see the bpf_fib_lookup() sketch below for what exists today):
* Try and find a neighbour
* Return it if one exists
* Otherwise, asynchronously trigger neighbour discovery (if it makes sense)
* And return the default gateway
I should probably start a new thread about this on the list.
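For context, here is a rough sketch of what a TC program can already do with the existing bpf_fib_lookup() helper. The helper, its flags and return codes are existing kernel API, but try_redirect_via_fib() and its surrounding logic are made up for illustration; the missing piece relative to the wish-list above is triggering neighbour discovery and falling back to the default gateway:

#include <stddef.h>
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical helper for a TC program: the caller is assumed to have
 * filled in fib (family, source/destination address, ifindex).
 */
static int try_redirect_via_fib(struct __sk_buff *skb,
				struct bpf_fib_lookup *fib)
{
	long ret = bpf_fib_lookup(skb, fib, sizeof(*fib), 0);

	switch (ret) {
	case BPF_FIB_LKUP_RET_SUCCESS:
		/* Route and neighbour exist: fib->smac/fib->dmac are filled
		 * in, so rewrite the Ethernet header and redirect.
		 */
		bpf_skb_store_bytes(skb, offsetof(struct ethhdr, h_dest),
				    fib->dmac, ETH_ALEN, 0);
		bpf_skb_store_bytes(skb, offsetof(struct ethhdr, h_source),
				    fib->smac, ETH_ALEN, 0);
		return bpf_redirect(fib->ifindex, 0);
	case BPF_FIB_LKUP_RET_NO_NEIGH:
		/* A route exists, but there is no neighbour entry. Today the
		 * program can only hand the packet back to the stack; it
		 * cannot kick off neighbour discovery or learn the gateway
		 * address itself.
		 */
		return TC_ACT_OK;
	default:
		return TC_ACT_SHOT;
	}
}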
The code base started its life on v4.19, so there are most likely still holdovers from old workarounds. In no particular order:
* The function buf_off is required to defeat a clang optimization that leads to the verifier rejecting the program due to pointer arithmetic in the wrong order.
* The function pkt_parse_ipv6 is force inlined, because it would otherwise be rejected due to returning a pointer to stack memory.
* The functions fill_tuple and classify_tcp contain kludges, because we've run out of function arguments.
* The logic in general is rather nested, due to verifier restrictions. I think this is either because the verifier loses track of constants on the stack, or because it can't track enum-like variables.
Have you tried undoing these workarounds to check the latest verifier? If they're still needed we certainly have to improve the verifier.
+#include "test_cls_redirect.skel.h"
Do you use skeleton internally as well or was it just for selftests? ;)
Only for selftests :) Our internal tooling is all Go based. skeleton is really nice though, so I'll make sure to steal some ideas!
+_Static_assert(
sizeof(flow_ports_t) !=
offsetofend(struct bpf_sock_tuple, ipv4.dport) -
offsetof(struct bpf_sock_tuple, ipv4.sport) - 1,
"flow_ports_t must match sport and dport in struct bpf_sock_tuple");
Nice. I didn't realize clang supports it. Of course it should.
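As an aside for readers: per the assertion's message string, the point is that the two ports can then be moved into the tuple as one unit. A sketch of the presumed usage; flow_ports_t's exact definition and tuple_set_ports() are paraphrased here, not copied from the patch:

#include <stdint.h>
#include <linux/bpf.h>

/* Two big-endian ports, laid out exactly like sport/dport inside
 * struct bpf_sock_tuple.
 */
typedef struct {
	uint16_t src;
	uint16_t dst;
} flow_ports_t;

/* With matching layouts, both ports can be copied with one 4-byte move. */
static void tuple_set_ports(struct bpf_sock_tuple *tuple,
			    const flow_ports_t *ports)
{
	__builtin_memcpy(&tuple->ipv4.sport, ports, sizeof(*ports));
}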
+/* Linux packet pointers are either aligned to NET_IP_ALIGN (aka 2 bytes),
 * or not aligned if the arch supports efficient unaligned access.
 * Since the verifier ensures that eBPF packet accesses follow these rules,
 * we can tell LLVM to emit code as if we always had a larger alignment.
 * It will yell at us if we end up on a platform where this is not valid.
 */
+typedef uint8_t *net_ptr __attribute__((align_value(8)));
Wow. I didn't know about this attribute. I wonder whether it can help Daniel's memcpy hack.
Yes, I think so.
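A tiny illustration (not from the patch; aligned8_ptr and load_u64 are made-up names) of what the attribute communicates to clang:

#include <stdint.h>

/* Promise clang that pointers of this type are 8-byte aligned. */
typedef const uint8_t *aligned8_ptr __attribute__((align_value(8)));

static inline uint64_t load_u64(aligned8_ptr p)
{
	uint64_t v;

	/* Written as a memcpy for portability; with the alignment promise
	 * clang is free to lower it to a single aligned 8-byte load
	 * instead of byte-wise accesses.
	 */
	__builtin_memcpy(&v, p, sizeof(v));
	return v;
}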
+typedef struct buf {
struct __sk_buff *skb;
net_ptr head;
/* NB: tail mustn't have alignment other than 1, otherwise
* LLVM will go and eliminate code, e.g. when checking packet lengths.
*/
uint8_t *const tail;
+} buf_t;
+static size_t buf_off(const buf_t *buf)
+{
/* Clang seems to optimize constructs like
* a - b + c
* if c is known:
* r? = c
* r? -= b
* r? += a
*
* This is a problem if a and b are packet pointers,
* since the verifier allows subtracting two pointers to
* get a scalar, but not a scalar and a pointer.
*
* Use inline asm to break this optimization.
*/
size_t off = (size_t)buf->head;
asm("%0 -= %1" : "+r"(off) : "r"(buf->skb->data));
return off;
+}
We need to look into this one. This part is not gated by allow_ptr_leaks:

	if (dst_reg == off_reg) {
		/* scalar -= pointer. Creates an unknown scalar */
		verbose(env, "R%d tried to subtract pointer from scalar\n", dst);
		return -EACCES;
	}

Hopefully not too hard to fix.
+static bool pkt_skip_ipv6_extension_headers(buf_t *pkt,
const struct ipv6hdr *ipv6,
uint8_t *upper_proto,
bool *is_fragment)
+{
/* We understand five extension headers.
* https://tools.ietf.org/html/rfc8200#section-4.1 states that all
* headers should occur once, except Destination Options, which may
* occur twice. Hence we give up after 6 headers.
*/
struct {
uint8_t next;
uint8_t len;
} exthdr = {
.next = ipv6->nexthdr,
};
*is_fragment = false;
+#pragma clang loop unroll(full)
for (int i = 0; i < 6; i++) {
unroll is still needed even with bounded loop support in the verifier?
I've just tried removing this on bpf-next. pkt_ipv4_checksum works without the pragma, pkt_skip_ipv6_extension_headers fails with the following message:
libbpf: load bpf program failed: Invalid argument
libbpf: -- BEGIN DUMP LOG ---
libbpf:
476: for (int i = 0; i < 6; i++) {
477: switch (exthdr.next) {
back-edge from insn 476 to 477
processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
libbpf: -- END LOG --
libbpf: failed to load program 'classifier/cls_redirect'
libbpf: failed to load object 'test_cls_redirect'
libbpf: failed to load BPF skeleton 'test_cls_redirect': -4007
test_cls_redirect:FAIL:385
#10 cls_redirect:FAIL
Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED
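To make the discussion concrete, here is a simplified, userspace-flavoured sketch of the loop's structure (names are made up; fragment and mobility headers are omitted). The trip count is bounded by six, but whether an iteration continues depends on packet data, and the resulting back-edge is what the log above complains about:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <netinet/in.h>		/* IPPROTO_* */

static bool skip_ipv6_ext_headers(const uint8_t *buf, size_t len,
				  uint8_t first_nexthdr, uint8_t *upper_proto)
{
	struct {
		uint8_t next;
		uint8_t len;
	} exthdr = {
		.next = first_nexthdr,
	};
	size_t off = 0;

	for (int i = 0; i < 6; i++) {
		switch (exthdr.next) {
		case IPPROTO_HOPOPTS:
		case IPPROTO_ROUTING:
		case IPPROTO_DSTOPTS:
			if (off + sizeof(exthdr) > len)
				return false;
			/* Read this extension header's next-header and
			 * length fields ...
			 */
			memcpy(&exthdr, buf + off, sizeof(exthdr));
			/* ... and skip it. The length is in 8-octet units
			 * and excludes the first 8 octets.
			 */
			off += (exthdr.len + 1) * 8;
			break;
		default:
			/* Not an extension header: this is the upper-layer
			 * protocol.
			 */
			*upper_proto = exthdr.next;
			return true;
		}
	}

	/* More than six extension headers: give up. */
	return false;
}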
+/* This function has to be inlined, because the verifier otherwise rejects it
 * due to returning a pointer to the stack. This is technically correct, since
 * scratch is allocated on the stack. However, this usage should be safe since
 * it's the caller's stack after all.
 */
Interesting. The verifier can recognize that a ptr to stack can be safe in some cases. When I added that check I didn't think the code could be as tricky as this :)
+static verdict_t process_ipv4(buf_t *pkt, metrics_t *metrics)
+{
switch (ipv4->protocol) {
case IPPROTO_ICMP:
return process_icmpv4(pkt, metrics);
case IPPROTO_TCP:
return process_tcp(pkt, ipv4, sizeof(*ipv4), metrics);
case IPPROTO_UDP:
return process_udp(pkt, ipv4, sizeof(*ipv4), metrics);
default:
metrics->errors_total_unknown_l4_proto++;
return INVALID;
}
+}
+static verdict_t process_ipv6(buf_t *pkt, metrics_t *metrics)
if (is_fragment) {
metrics->errors_total_fragmented_ip++;
return INVALID;
}
switch (l4_proto) {
case IPPROTO_ICMPV6:
return process_icmpv6(pkt, metrics);
case IPPROTO_TCP:
return process_tcp(pkt, ipv6, sizeof(*ipv6), metrics);
case IPPROTO_UDP:
return process_udp(pkt, ipv6, sizeof(*ipv6), metrics);
default:
metrics->errors_total_unknown_l4_proto++;
return INVALID;
}
+}
+SEC("classifier/cls_redirect") +int cls_redirect(struct __sk_buff *skb) +{
...
verdict_t verdict;
switch (encap->gue.proto_ctype) {
case IPPROTO_IPIP:
verdict = process_ipv4(&pkt, metrics);
break;
case IPPROTO_IPV6:
verdict = process_ipv6(&pkt, metrics);
break;
The code flow looks pretty clean. I didn't find the copy-paste of ipv[46] -> tcp/udp you were mentioning. So that old issue is now gone?
The biggest offenders are fill_tuple (which is purely a hack) as well as classify_tcp. I'd really like to call classify_tcp from cls_redirect(), using saved packet pointers and lengths. Right now the function is called starting from process_ipv4 and process_ipv6, which means all of those arguments have to be passed down somehow.
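Purely to illustrate the refactoring idea (everything below is hypothetical: classify_tcp(), verdict_t, metrics_t and INVALID come from the patch, but the struct, field names and classify_tcp_from_ctx() are made up), the saved packet positions and lengths could be gathered into a single struct on cls_redirect()'s stack, so that the BPF-to-BPF call stays within the five-argument limit:

/* Hypothetical context filled in by process_ipv4()/process_ipv6() and
 * consumed by cls_redirect() itself.
 */
struct classify_ctx {
	struct __sk_buff *skb;
	uint64_t inner_l3_off;	/* offset of the inner IPv4/IPv6 header */
	uint64_t inner_l3_len;	/* IPv4: ihl * 4, IPv6: sizeof(struct ipv6hdr) */
	uint64_t inner_l4_off;	/* offset of the inner TCP header */
	bool is_ipv6;
};

/* Sketch of a classify_tcp() variant that takes one pointer instead of
 * separate header pointers and lengths.
 */
static verdict_t classify_tcp_from_ctx(const struct classify_ctx *ctx,
				       metrics_t *metrics)
{
	/* ... rebuild the struct bpf_sock_tuple from ctx->skb at
	 * ctx->inner_l3_off / ctx->inner_l4_off and look up the socket,
	 * as classify_tcp() does today ...
	 */
	return INVALID; /* placeholder */
}

Offsets rather than raw packet pointers are used in the sketch, since packet pointers are invalidated whenever a helper modifies the packet anyway.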