=== Context ===
In the context of a middlebox, fragmented packets are tricky to handle. The full 5-tuple of a packet is often only available in the first fragment, which makes enforcing consistent policy difficult. There are really only two stateless options, neither of which is very nice (a rough sketch of option 1 follows the list):
1. Enforce policy on first fragment and accept all subsequent fragments. This works but may let in certain attacks or allow data exfiltration.
2. Enforce policy on first fragment and drop all subsequent fragments. This does not really work b/c some protocols may rely on fragmentation. For example, DNS may rely on oversized UDP packets for large responses.
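To make the mechanics concrete, here is a minimal, illustrative sketch of option 1 as a TC program. It is not part of the patchset; apply_policy() stands in for a hypothetical 5-tuple check:

/* Illustrative sketch of stateless option 1 (IPv4 only).
 * apply_policy() is a hypothetical placeholder for a 5-tuple check.
 */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

static __always_inline int apply_policy(struct __sk_buff *skb)
{
	/* 5-tuple policy would go here; always accept in this sketch */
	return TC_ACT_OK;
}

SEC("tc")
int stateless_frag_policy(struct __sk_buff *skb)
{
	void *data_end = (void *)(long)skb->data_end;
	void *data = (void *)(long)skb->data;
	struct iphdr *iph = data + sizeof(struct ethhdr);

	if (skb->protocol != bpf_htons(ETH_P_IP))
		return TC_ACT_OK;
	if ((void *)(iph + 1) > data_end)
		return TC_ACT_SHOT;

	/* Non-first fragment: the L4 header (and thus the full 5-tuple)
	 * is not present, so option 1 just accepts it blindly.
	 */
	if (bpf_ntohs(iph->frag_off) & 0x1FFF)
		return TC_ACT_OK;

	return apply_policy(skb);
}

char _license[] SEC("license") = "GPL";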
So stateful tracking is the only sane option. RFC 8900 [0] calls this out as well in section 6.3:
Middleboxes [...] should process IP fragments in a manner that is consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes must maintain state in order to achieve this goal.
=== BPF related bits ===
However, when policy is enforced through BPF, the prog is run before the kernel reassembles fragmented packets. This leaves BPF developers in an awkward place: implement reassembly (possibly poorly) or use a stateless method as described above.
Fortunately, the kernel has robust support for fragmented IP packets. This patchset wraps the existing defragmentation facilities in kfuncs so that BPF progs running on middleboxes can reassemble fragmented packets before applying policy.
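For illustration, the basic usage pattern, condensed from the defrag selftest later in this series, looks roughly like the following. The return-value semantics are assumptions here: 0 means the skb now holds the complete datagram, non-zero means the fragment was consumed/queued or an error occurred.

/* Rough usage sketch, condensed from the defrag selftest later in the
 * series.  Assumed return semantics: 0 means the skb now holds a
 * complete datagram; non-zero means this fragment was consumed/queued
 * (or an error occurred), so there is nothing left to police.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define BPF_F_CURRENT_NETNS (-1)
#define ETH_P_IP 0x0800
#define IP_MF 0x2000
#define IP_OFFSET 0x1FFF
#define TC_ACT_OK 0
#define TC_ACT_SHOT 2

int bpf_ip_check_defrag(struct __sk_buff *ctx, u64 netns) __ksym;

SEC("tc")
int defrag_then_police(struct __sk_buff *skb)
{
	void *data_end = (void *)(long)skb->data_end;
	void *data = (void *)(long)skb->data;
	struct iphdr *iph = data + sizeof(struct ethhdr);

	if (skb->protocol != bpf_htons(ETH_P_IP))
		return TC_ACT_OK;
	if ((void *)(iph + 1) > data_end)
		return TC_ACT_SHOT;

	if (bpf_ntohs(iph->frag_off) & (IP_MF | IP_OFFSET)) {
		if (bpf_ip_check_defrag(skb, BPF_F_CURRENT_NETNS))
			return TC_ACT_SHOT;

		/* The kfunc invalidates packet pointers: reload them
		 * before touching the (now reassembled) datagram.
		 */
		data_end = (void *)(long)skb->data_end;
		data = (void *)(long)skb->data;
		iph = data + sizeof(struct ethhdr);
		if ((void *)(iph + 1) > data_end)
			return TC_ACT_SHOT;
	}

	/* ...apply 5-tuple policy on the full datagram here... */
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";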
=== Patchset details ===
This patchset is (hopefully) relatively straightforward from a BPF perspective. One thing I'd like to call out is the skb_copy()ing of the prog skb. I did this to maintain the invariant that the ctx remains valid after the prog has run. This is relevant b/c ip_defrag() and ip_check_defrag() may consume the skb if the skb is a fragment.
Originally I did play around with teaching the verifier about kfuncs that may consume the ctx and disallowing ctx accesses in ret != 0 branches. It worked ok, but it seemed too complex to modify the surrounding assumptions about ctx validity.
[0]: https://datatracker.ietf.org/doc/html/rfc8900
===
Changes from v1:
* Add support for ipv6 defragmentation
Daniel Xu (8):
  ip: frags: Return actual error codes from ip_check_defrag()
  bpf: verifier: Support KF_CHANGES_PKT flag
  bpf, net, frags: Add bpf_ip_check_defrag() kfunc
  net: ipv6: Factor ipv6_frag_rcv() to take netns and user
  bpf: net: ipv6: Add bpf_ipv6_frag_rcv() kfunc
  bpf: selftests: Support not connecting client socket
  bpf: selftests: Support custom type and proto for client sockets
  bpf: selftests: Add defrag selftests
 Documentation/bpf/kfuncs.rst                  |   7 +
 drivers/net/macvlan.c                         |   2 +-
 include/linux/btf.h                           |   1 +
 include/net/ip.h                              |  11 +
 include/net/ipv6.h                            |   1 +
 include/net/ipv6_frag.h                       |   1 +
 include/net/transp_v6.h                       |   1 +
 kernel/bpf/verifier.c                         |   8 +
 net/ipv4/Makefile                             |   1 +
 net/ipv4/ip_fragment.c                        |  15 +-
 net/ipv4/ip_fragment_bpf.c                    |  98 ++++++
 net/ipv6/Makefile                             |   1 +
 net/ipv6/af_inet6.c                           |   4 +
 net/ipv6/reassembly.c                         |  16 +-
 net/ipv6/reassembly_bpf.c                     | 143 ++++++++
 net/packet/af_packet.c                        |   2 +-
 tools/testing/selftests/bpf/Makefile          |   3 +-
 .../selftests/bpf/generate_udp_fragments.py   |  90 +++++
 .../selftests/bpf/ip_check_defrag_frags.h     |  57 +++
 tools/testing/selftests/bpf/network_helpers.c |  26 +-
 tools/testing/selftests/bpf/network_helpers.h |   3 +
 .../bpf/prog_tests/ip_check_defrag.c          | 327 ++++++++++++++++++
 .../selftests/bpf/progs/bpf_tracing_net.h     |   1 +
 .../selftests/bpf/progs/ip_check_defrag.c     | 133 +++++++
 24 files changed, 931 insertions(+), 21 deletions(-)
 create mode 100644 net/ipv4/ip_fragment_bpf.c
 create mode 100644 net/ipv6/reassembly_bpf.c
 create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
 create mode 100644 tools/testing/selftests/bpf/ip_check_defrag_frags.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
 create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c
For connectionless protocols or raw sockets we do not want to actually connect() to the server.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 tools/testing/selftests/bpf/network_helpers.c | 5 +++--
 tools/testing/selftests/bpf/network_helpers.h | 1 +
 2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index 01de33191226..24f5efebc7dd 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -301,8 +301,9 @@ int connect_to_fd_opts(int server_fd, const struct network_helper_opts *opts)
 			       strlen(opts->cc) + 1))
 			goto error_close;
-	if (connect_fd_to_addr(fd, &addr, addrlen, opts->must_fail))
-		goto error_close;
+	if (!opts->noconnect)
+		if (connect_fd_to_addr(fd, &addr, addrlen, opts->must_fail))
+			goto error_close;
return fd;
diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index f882c691b790..8be04cd76d8b 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -21,6 +21,7 @@ struct network_helper_opts {
 	const char *cc;
 	int timeout_ms;
 	bool must_fail;
+	bool noconnect;
 };
/* ipv4 test vector */
Extend connect_to_fd_opts() to take optional type and protocol parameters for the client socket. These parameters are useful when opening a raw socket to send IP fragments.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 tools/testing/selftests/bpf/network_helpers.c | 21 +++++++++++++------
 tools/testing/selftests/bpf/network_helpers.h |  2 ++
 2 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index 24f5efebc7dd..4f9ba90b1b7e 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -270,14 +270,23 @@ int connect_to_fd_opts(int server_fd, const struct network_helper_opts *opts)
 		opts = &default_opts;
 	optlen = sizeof(type);
-	if (getsockopt(server_fd, SOL_SOCKET, SO_TYPE, &type, &optlen)) {
-		log_err("getsockopt(SOL_TYPE)");
-		return -1;
+
+	if (opts->type) {
+		type = opts->type;
+	} else {
+		if (getsockopt(server_fd, SOL_SOCKET, SO_TYPE, &type, &optlen)) {
+			log_err("getsockopt(SOL_TYPE)");
+			return -1;
+		}
 	}
-	if (getsockopt(server_fd, SOL_SOCKET, SO_PROTOCOL, &protocol, &optlen)) {
-		log_err("getsockopt(SOL_PROTOCOL)");
-		return -1;
+	if (opts->proto) {
+		protocol = opts->proto;
+	} else {
+		if (getsockopt(server_fd, SOL_SOCKET, SO_PROTOCOL, &protocol, &optlen)) {
+			log_err("getsockopt(SOL_PROTOCOL)");
+			return -1;
+		}
 	}
 	addrlen = sizeof(addr);
diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index 8be04cd76d8b..7119804ea79b 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -22,6 +22,8 @@ struct network_helper_opts {
 	int timeout_ms;
 	bool must_fail;
 	bool noconnect;
+	int type;
+	int proto;
 };
/* ipv4 test vector */
These selftests test two major scenarios: that BPF-based defragmentation can be performed successfully, and that packet pointers are invalidated after calls to the kfunc. The logic is similar for both ipv4 and ipv6.
In the first scenario, we create a UDP client and a UDP echo server. The server side is fairly straightforward: we attach the prog and simply echo back the message.
On the client side, we send fragmented packets to the server and expect the reassembled message back.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 tools/testing/selftests/bpf/Makefile              |   3 +-
 .../selftests/bpf/generate_udp_fragments.py       |  90 +++++
 .../selftests/bpf/ip_check_defrag_frags.h         |  57 +++
 .../bpf/prog_tests/ip_check_defrag.c              | 327 ++++++++++++++++++
 .../selftests/bpf/progs/bpf_tracing_net.h         |   1 +
 .../selftests/bpf/progs/ip_check_defrag.c         | 133 +++++++
 6 files changed, 610 insertions(+), 1 deletion(-)
 create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
 create mode 100644 tools/testing/selftests/bpf/ip_check_defrag_frags.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
 create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index b677dcd0b77a..979af1611139 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -558,7 +558,8 @@ TRUNNER_BPF_PROGS_DIR := progs TRUNNER_EXTRA_SOURCES := test_progs.c cgroup_helpers.c trace_helpers.c \ network_helpers.c testing_helpers.c \ btf_helpers.c flow_dissector_load.h \ - cap_helpers.c test_loader.c xsk.c + cap_helpers.c test_loader.c xsk.c \ + ip_check_defrag_frags.h TRUNNER_EXTRA_FILES := $(OUTPUT)/urandom_read $(OUTPUT)/bpf_testmod.ko \ $(OUTPUT)/liburandom_read.so \ $(OUTPUT)/xdp_synproxy \ diff --git a/tools/testing/selftests/bpf/generate_udp_fragments.py b/tools/testing/selftests/bpf/generate_udp_fragments.py new file mode 100755 index 000000000000..2b8a1187991c --- /dev/null +++ b/tools/testing/selftests/bpf/generate_udp_fragments.py @@ -0,0 +1,90 @@ +#!/bin/env python3 +# SPDX-License-Identifier: GPL-2.0 + +""" +This script helps generate fragmented UDP packets. + +While it is technically possible to dynamically generate +fragmented packets in C, it is much harder to read and write +said code. `scapy` is relatively industry standard and really +easy to read / write. + +So we choose to write this script that generates a valid C +header. Rerun script and commit generated file after any +modifications. +""" + +import argparse +import os + +from scapy.all import * + + +# These constants must stay in sync with `ip_check_defrag.c` +VETH1_ADDR = "172.16.1.200" +VETH0_ADDR6 = "fc00::100" +VETH1_ADDR6 = "fc00::200" +CLIENT_PORT = 48878 +SERVER_PORT = 48879 +MAGIC_MESSAGE = "THIS IS THE ORIGINAL MESSAGE, PLEASE REASSEMBLE ME" + + +def print_header(f): + f.write("// SPDX-License-Identifier: GPL-2.0\n") + f.write("/* DO NOT EDIT -- this file is generated */\n") + f.write("\n") + f.write("#ifndef _IP_CHECK_DEFRAG_FRAGS_H\n") + f.write("#define _IP_CHECK_DEFRAG_FRAGS_H\n") + f.write("\n") + f.write("#include <stdint.h>\n") + f.write("\n") + + +def print_frags(f, frags, v6): + for idx, frag in enumerate(frags): + # 10 bytes per line to keep width in check + chunks = [frag[i : i + 10] for i in range(0, len(frag), 10)] + chunks_fmted = [", ".join([str(hex(b)) for b in chunk]) for chunk in chunks] + suffix = "6" if v6 else "" + + f.write(f"static uint8_t frag{suffix}_{idx}[] = {{\n") + for chunk in chunks_fmted: + f.write(f"\t{chunk},\n") + f.write(f"}};\n") + + +def print_trailer(f): + f.write("\n") + f.write("#endif /* _IP_CHECK_DEFRAG_FRAGS_H */\n") + + +def main(f): + # srcip of 0 is filled in by IP_HDRINCL + sip = "0.0.0.0" + sip6 = VETH0_ADDR6 + dip = VETH1_ADDR + dip6 = VETH1_ADDR6 + sport = CLIENT_PORT + dport = SERVER_PORT + payload = MAGIC_MESSAGE.encode() + + # Disable UDPv4 checksums to keep code simpler + pkt = IP(src=sip,dst=dip) / UDP(sport=sport,dport=dport,chksum=0) / Raw(load=payload) + # UDPv6 requires a checksum + # Also pin the ipv6 fragment header ID, otherwise it's a random value + pkt6 = IPv6(src=sip6,dst=dip6) / IPv6ExtHdrFragment(id=0xBEEF) / UDP(sport=sport,dport=dport) / Raw(load=payload) + + frags = [f.build() for f in pkt.fragment(24)] + frags6 = [f.build() for f in fragment6(pkt6, 72)] + + print_header(f) + print_frags(f, frags, False) + print_frags(f, frags6, True) + print_trailer(f) + + +if __name__ == "__main__": + dir = os.path.dirname(os.path.realpath(__file__)) + header = f"{dir}/ip_check_defrag_frags.h" + with open(header, "w") as f: + main(f) diff --git a/tools/testing/selftests/bpf/ip_check_defrag_frags.h 
b/tools/testing/selftests/bpf/ip_check_defrag_frags.h new file mode 100644 index 000000000000..70ab7e9fa22b --- /dev/null +++ b/tools/testing/selftests/bpf/ip_check_defrag_frags.h @@ -0,0 +1,57 @@ +// SPDX-License-Identifier: GPL-2.0 +/* DO NOT EDIT -- this file is generated */ + +#ifndef _IP_CHECK_DEFRAG_FRAGS_H +#define _IP_CHECK_DEFRAG_FRAGS_H + +#include <stdint.h> + +static uint8_t frag_0[] = { + 0x45, 0x0, 0x0, 0x2c, 0x0, 0x1, 0x20, 0x0, 0x40, 0x11, + 0xac, 0xe8, 0x0, 0x0, 0x0, 0x0, 0xac, 0x10, 0x1, 0xc8, + 0xbe, 0xee, 0xbe, 0xef, 0x0, 0x3a, 0x0, 0x0, 0x54, 0x48, + 0x49, 0x53, 0x20, 0x49, 0x53, 0x20, 0x54, 0x48, 0x45, 0x20, + 0x4f, 0x52, 0x49, 0x47, +}; +static uint8_t frag_1[] = { + 0x45, 0x0, 0x0, 0x2c, 0x0, 0x1, 0x20, 0x3, 0x40, 0x11, + 0xac, 0xe5, 0x0, 0x0, 0x0, 0x0, 0xac, 0x10, 0x1, 0xc8, + 0x49, 0x4e, 0x41, 0x4c, 0x20, 0x4d, 0x45, 0x53, 0x53, 0x41, + 0x47, 0x45, 0x2c, 0x20, 0x50, 0x4c, 0x45, 0x41, 0x53, 0x45, + 0x20, 0x52, 0x45, 0x41, +}; +static uint8_t frag_2[] = { + 0x45, 0x0, 0x0, 0x1e, 0x0, 0x1, 0x0, 0x6, 0x40, 0x11, + 0xcc, 0xf0, 0x0, 0x0, 0x0, 0x0, 0xac, 0x10, 0x1, 0xc8, + 0x53, 0x53, 0x45, 0x4d, 0x42, 0x4c, 0x45, 0x20, 0x4d, 0x45, +}; +static uint8_t frag6_0[] = { + 0x60, 0x0, 0x0, 0x0, 0x0, 0x20, 0x2c, 0x40, 0xfc, 0x0, + 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, + 0x0, 0x0, 0x1, 0x0, 0xfc, 0x0, 0x0, 0x0, 0x0, 0x0, + 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0, + 0x11, 0x0, 0x0, 0x1, 0x0, 0x0, 0xbe, 0xef, 0xbe, 0xee, + 0xbe, 0xef, 0x0, 0x3a, 0xd0, 0xf8, 0x54, 0x48, 0x49, 0x53, + 0x20, 0x49, 0x53, 0x20, 0x54, 0x48, 0x45, 0x20, 0x4f, 0x52, + 0x49, 0x47, +}; +static uint8_t frag6_1[] = { + 0x60, 0x0, 0x0, 0x0, 0x0, 0x20, 0x2c, 0x40, 0xfc, 0x0, + 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, + 0x0, 0x0, 0x1, 0x0, 0xfc, 0x0, 0x0, 0x0, 0x0, 0x0, + 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0, + 0x11, 0x0, 0x0, 0x19, 0x0, 0x0, 0xbe, 0xef, 0x49, 0x4e, + 0x41, 0x4c, 0x20, 0x4d, 0x45, 0x53, 0x53, 0x41, 0x47, 0x45, + 0x2c, 0x20, 0x50, 0x4c, 0x45, 0x41, 0x53, 0x45, 0x20, 0x52, + 0x45, 0x41, +}; +static uint8_t frag6_2[] = { + 0x60, 0x0, 0x0, 0x0, 0x0, 0x12, 0x2c, 0x40, 0xfc, 0x0, + 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, + 0x0, 0x0, 0x1, 0x0, 0xfc, 0x0, 0x0, 0x0, 0x0, 0x0, + 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0, + 0x11, 0x0, 0x0, 0x30, 0x0, 0x0, 0xbe, 0xef, 0x53, 0x53, + 0x45, 0x4d, 0x42, 0x4c, 0x45, 0x20, 0x4d, 0x45, +}; + +#endif /* _IP_CHECK_DEFRAG_FRAGS_H */ diff --git a/tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c b/tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c new file mode 100644 index 000000000000..c79c4096aab4 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c @@ -0,0 +1,327 @@ +// SPDX-License-Identifier: GPL-2.0 +#include <test_progs.h> +#include <net/if.h> +#include <network_helpers.h> +#include "ip_check_defrag.skel.h" +#include "ip_check_defrag_frags.h" + +/* + * This selftest spins up a client and an echo server, each in their own + * network namespace. The server will receive fragmented messages which + * the attached BPF prog should reassemble. We verify that reassembly + * occurred by checking the original (fragmented) message is received + * in whole. 
+ * + * Topology: + * ========= + * NS0 | NS1 + * | + * client | server + * ---------- | ---------- + * | veth0 | --------- | veth1 | + * ---------- peer ---------- + * | + * | with bpf + */ + +#define NS0 "defrag_ns0" +#define NS1 "defrag_ns1" +#define VETH0 "veth0" +#define VETH1 "veth1" +#define VETH0_ADDR "172.16.1.100" +#define VETH0_ADDR6 "fc00::100" +/* The following constants must stay in sync with `generate_udp_fragments.py` */ +#define VETH1_ADDR "172.16.1.200" +#define VETH1_ADDR6 "fc00::200" +#define CLIENT_PORT 48878 +#define SERVER_PORT 48879 +#define MAGIC_MESSAGE "THIS IS THE ORIGINAL MESSAGE, PLEASE REASSEMBLE ME" + +static char log_buf[1024 * 1024]; + +static int setup_topology(bool ipv6) +{ + bool veth0_up; + bool veth1_up; + int i; + + SYS(fail, "ip netns add " NS0); + SYS(fail, "ip netns add " NS1); + SYS(fail, "ip link add " VETH0 " netns " NS0 " type veth peer name " VETH1 " netns " NS1); + if (ipv6) { + SYS(fail, "ip -6 -net " NS0 " addr add " VETH0_ADDR6 "/64 dev " VETH0 " nodad"); + SYS(fail, "ip -6 -net " NS1 " addr add " VETH1_ADDR6 "/64 dev " VETH1 " nodad"); + } else { + SYS(fail, "ip -net " NS0 " addr add " VETH0_ADDR "/24 dev " VETH0); + SYS(fail, "ip -net " NS1 " addr add " VETH1_ADDR "/24 dev " VETH1); + } + SYS(fail, "ip -net " NS0 " link set dev " VETH0 " up"); + SYS(fail, "ip -net " NS1 " link set dev " VETH1 " up"); + + /* Wait for up to 5s for links to come up */ + for (i = 0; i < 50; ++i) { + veth0_up = !system("ip -net " NS0 " link show " VETH0 " | grep 'state UP'"); + veth1_up = !system("ip -net " NS1 " link show " VETH1 " | grep 'state UP'"); + if (veth0_up && veth1_up) + break; + usleep(100000); + } + + if (!ASSERT_TRUE((veth0_up && veth1_up), "ifaces up")) + goto fail; + + return 0; +fail: + return -1; +} + +static void cleanup_topology(void) +{ + SYS_NOFAIL("test -f /var/run/netns/" NS0 " && ip netns delete " NS0); + SYS_NOFAIL("test -f /var/run/netns/" NS1 " && ip netns delete " NS1); +} + +static int attach(struct ip_check_defrag *skel) +{ + LIBBPF_OPTS(bpf_tc_hook, tc_hook, + .attach_point = BPF_TC_INGRESS); + LIBBPF_OPTS(bpf_tc_opts, tc_attach, + .prog_fd = bpf_program__fd(skel->progs.defrag)); + struct nstoken *nstoken; + int err = -1; + + nstoken = open_netns(NS1); + + tc_hook.ifindex = if_nametoindex(VETH1); + if (!ASSERT_OK(bpf_tc_hook_create(&tc_hook), "bpf_tc_hook_create")) + goto out; + + if (!ASSERT_OK(bpf_tc_attach(&tc_hook, &tc_attach), "bpf_tc_attach")) + goto out; + + err = 0; +out: + close_netns(nstoken); + return err; +} + +static int send_frags(int client) +{ + struct sockaddr_storage saddr; + struct sockaddr *saddr_p; + socklen_t saddr_len; + int err; + + saddr_p = (struct sockaddr *)&saddr; + err = make_sockaddr(AF_INET, VETH1_ADDR, SERVER_PORT, &saddr, &saddr_len); + if (!ASSERT_OK(err, "make_sockaddr")) + return -1; + + err = sendto(client, frag_0, sizeof(frag_0), 0, saddr_p, saddr_len); + if (!ASSERT_GE(err, 0, "sendto frag_0")) + return -1; + + err = sendto(client, frag_1, sizeof(frag_1), 0, saddr_p, saddr_len); + if (!ASSERT_GE(err, 0, "sendto frag_1")) + return -1; + + err = sendto(client, frag_2, sizeof(frag_2), 0, saddr_p, saddr_len); + if (!ASSERT_GE(err, 0, "sendto frag_2")) + return -1; + + return 0; +} + +static int send_frags6(int client) +{ + struct sockaddr_storage saddr; + struct sockaddr *saddr_p; + socklen_t saddr_len; + int err; + + saddr_p = (struct sockaddr *)&saddr; + /* Port needs to be set to 0 for raw ipv6 socket for some reason */ + err = make_sockaddr(AF_INET6, VETH1_ADDR6, 0, &saddr, &saddr_len); 
+ if (!ASSERT_OK(err, "make_sockaddr")) + return -1; + + err = sendto(client, frag6_0, sizeof(frag6_0), 0, saddr_p, saddr_len); + if (!ASSERT_GE(err, 0, "sendto frag6_0")) + return -1; + + err = sendto(client, frag6_1, sizeof(frag6_1), 0, saddr_p, saddr_len); + if (!ASSERT_GE(err, 0, "sendto frag6_1")) + return -1; + + err = sendto(client, frag6_2, sizeof(frag6_2), 0, saddr_p, saddr_len); + if (!ASSERT_GE(err, 0, "sendto frag6_2")) + return -1; + + return 0; +} + +void test_bpf_ip_check_defrag_ok(bool ipv6) +{ + struct network_helper_opts rx_opts = { + .timeout_ms = 1000, + .noconnect = true, + }; + struct network_helper_opts tx_ops = { + .timeout_ms = 1000, + .type = SOCK_RAW, + .proto = IPPROTO_RAW, + .noconnect = true, + }; + struct sockaddr_storage caddr; + struct ip_check_defrag *skel; + struct nstoken *nstoken; + int client_tx_fd = -1; + int client_rx_fd = -1; + socklen_t caddr_len; + int srv_fd = -1; + char buf[1024]; + int len, err; + + skel = ip_check_defrag__open_and_load(); + if (!ASSERT_OK_PTR(skel, "skel_open")) + return; + + if (!ASSERT_OK(setup_topology(ipv6), "setup_topology")) + goto out; + + if (!ASSERT_OK(attach(skel), "attach")) + goto out; + + /* Start server in ns1 */ + nstoken = open_netns(NS1); + if (!ASSERT_OK_PTR(nstoken, "setns ns1")) + goto out; + srv_fd = start_server(ipv6 ? AF_INET6 : AF_INET, SOCK_DGRAM, NULL, SERVER_PORT, 0); + close_netns(nstoken); + if (!ASSERT_GE(srv_fd, 0, "start_server")) + goto out; + + /* Open tx raw socket in ns0 */ + nstoken = open_netns(NS0); + if (!ASSERT_OK_PTR(nstoken, "setns ns0")) + goto out; + client_tx_fd = connect_to_fd_opts(srv_fd, &tx_ops); + close_netns(nstoken); + if (!ASSERT_GE(client_tx_fd, 0, "connect_to_fd_opts")) + goto out; + + /* Open rx socket in ns0 */ + nstoken = open_netns(NS0); + if (!ASSERT_OK_PTR(nstoken, "setns ns0")) + goto out; + client_rx_fd = connect_to_fd_opts(srv_fd, &rx_opts); + close_netns(nstoken); + if (!ASSERT_GE(client_rx_fd, 0, "connect_to_fd_opts")) + goto out; + + /* Bind rx socket to a premeditated port */ + memset(&caddr, 0, sizeof(caddr)); + nstoken = open_netns(NS0); + if (!ASSERT_OK_PTR(nstoken, "setns ns0")) + goto out; + if (ipv6) { + struct sockaddr_in6 *c = (struct sockaddr_in6 *)&caddr; + + c->sin6_family = AF_INET6; + inet_pton(AF_INET6, VETH0_ADDR6, &c->sin6_addr); + c->sin6_port = htons(CLIENT_PORT); + err = bind(client_rx_fd, (struct sockaddr *)c, sizeof(*c)); + } else { + struct sockaddr_in *c = (struct sockaddr_in *)&caddr; + + c->sin_family = AF_INET; + inet_pton(AF_INET, VETH0_ADDR, &c->sin_addr); + c->sin_port = htons(CLIENT_PORT); + err = bind(client_rx_fd, (struct sockaddr *)c, sizeof(*c)); + } + close_netns(nstoken); + if (!ASSERT_OK(err, "bind")) + goto out; + + /* Send message in fragments */ + if (ipv6) { + if (!ASSERT_OK(send_frags6(client_tx_fd), "send_frags6")) + goto out; + } else { + if (!ASSERT_OK(send_frags(client_tx_fd), "send_frags")) + goto out; + } + + if (!ASSERT_EQ(skel->bss->frags_seen, 3, "frags_seen")) + goto out; + + if (!ASSERT_FALSE(skel->data->is_final_frag, "is_final_frag")) + goto out; + + /* Receive reassembled msg on server and echo back to client */ + len = recvfrom(srv_fd, buf, sizeof(buf), 0, (struct sockaddr *)&caddr, &caddr_len); + if (!ASSERT_GE(len, 0, "server recvfrom")) + goto out; + len = sendto(srv_fd, buf, len, 0, (struct sockaddr *)&caddr, caddr_len); + if (!ASSERT_GE(len, 0, "server sendto")) + goto out; + + /* Expect reassembed message to be echoed back */ + len = recvfrom(client_rx_fd, buf, sizeof(buf), 0, NULL, NULL); + if 
(!ASSERT_EQ(len, sizeof(MAGIC_MESSAGE) - 1, "client short read")) + goto out; + +out: + if (client_rx_fd != -1) + close(client_rx_fd); + if (client_tx_fd != -1) + close(client_tx_fd); + if (srv_fd != -1) + close(srv_fd); + cleanup_topology(); + ip_check_defrag__destroy(skel); +} + +void test_bpf_ip_check_defrag_fail(void) +{ + const char *err_msg = "invalid mem access 'scalar'"; + LIBBPF_OPTS(bpf_object_open_opts, opts, + .kernel_log_buf = log_buf, + .kernel_log_size = sizeof(log_buf), + .kernel_log_level = 1); + struct ip_check_defrag *skel; + struct bpf_program *prog; + int err; + + skel = ip_check_defrag__open_opts(&opts); + if (!ASSERT_OK_PTR(skel, "ip_check_defrag__open_opts")) + return; + + prog = bpf_object__find_program_by_name(skel->obj, "defrag_fail"); + if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name")) + goto out; + + bpf_program__set_autoload(prog, true); + + err = ip_check_defrag__load(skel); + if (!ASSERT_ERR(err, "ip_check_defrag__load must fail")) + goto out; + + if (!ASSERT_OK_PTR(strstr(log_buf, err_msg), "expected error message")) { + fprintf(stderr, "Expected: %s\n", err_msg); + fprintf(stderr, "Verifier: %s\n", log_buf); + } + +out: + ip_check_defrag__destroy(skel); +} + +void test_bpf_ip_check_defrag(void) +{ + if (test__start_subtest("ok-v4")) + test_bpf_ip_check_defrag_ok(false); + if (test__start_subtest("ok-v6")) + test_bpf_ip_check_defrag_ok(true); + if (test__start_subtest("fail")) + test_bpf_ip_check_defrag_fail(); +} diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h index cfed4df490f3..fde688b8af16 100644 --- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h +++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h @@ -26,6 +26,7 @@ #define IPV6_AUTOFLOWLABEL 70
#define TC_ACT_UNSPEC (-1) +#define TC_ACT_OK 0 #define TC_ACT_SHOT 2
#define SOL_TCP 6 diff --git a/tools/testing/selftests/bpf/progs/ip_check_defrag.c b/tools/testing/selftests/bpf/progs/ip_check_defrag.c new file mode 100644 index 000000000000..5978fd2dd479 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/ip_check_defrag.c @@ -0,0 +1,133 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include "vmlinux.h" +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_endian.h> +#include "bpf_tracing_net.h" + +#define BPF_F_CURRENT_NETNS (-1) +#define ETH_P_IP 0x0800 +#define ETH_P_IPV6 0x86DD +#define IP_DF 0x4000 +#define IP_MF 0x2000 +#define IP_OFFSET 0x1FFF +#define NEXTHDR_FRAGMENT 44 +#define ctx_ptr(field) (void *)(long)(field) + +int bpf_ip_check_defrag(struct __sk_buff *ctx, u64 netns) __ksym; +int bpf_ipv6_frag_rcv(struct __sk_buff *ctx, u64 netns) __ksym; + +volatile int frags_seen = 0; +volatile bool is_final_frag = true; + +static bool is_frag_v4(struct iphdr *iph) +{ + int offset; + int flags; + + offset = bpf_ntohs(iph->frag_off); + flags = offset & ~IP_OFFSET; + offset &= IP_OFFSET; + offset <<= 3; + + return (flags & IP_MF) || offset; +} + +static bool is_frag_v6(struct ipv6hdr *ip6h) +{ + /* Simplifying assumption that there are no extension headers + * between fixed header and fragmentation header. This assumption + * is only valid in this test case. It saves us the hassle of + * searching all potential extension headers. + */ + return ip6h->nexthdr == NEXTHDR_FRAGMENT; +} + +static int defrag_v4(struct __sk_buff *skb) +{ + void *data_end = ctx_ptr(skb->data_end); + void *data = ctx_ptr(skb->data); + struct iphdr *iph; + + iph = data + sizeof(struct ethhdr); + if (iph + 1 > data_end) + return TC_ACT_SHOT; + + if (!is_frag_v4(iph)) + return TC_ACT_OK; + + frags_seen++; + if (bpf_ip_check_defrag(skb, BPF_F_CURRENT_NETNS)) + return TC_ACT_SHOT; + + data_end = ctx_ptr(skb->data_end); + data = ctx_ptr(skb->data); + iph = data + sizeof(struct ethhdr); + if (iph + 1 > data_end) + return TC_ACT_SHOT; + is_final_frag = is_frag_v4(iph); + + return TC_ACT_OK; +} + +static int defrag_v6(struct __sk_buff *skb) +{ + void *data_end = ctx_ptr(skb->data_end); + void *data = ctx_ptr(skb->data); + struct ipv6hdr *ip6h; + + ip6h = data + sizeof(struct ethhdr); + if (ip6h + 1 > data_end) + return TC_ACT_SHOT; + + if (!is_frag_v6(ip6h)) + return TC_ACT_OK; + + frags_seen++; + if (bpf_ipv6_frag_rcv(skb, BPF_F_CURRENT_NETNS)) + return TC_ACT_SHOT; + + data_end = ctx_ptr(skb->data_end); + data = ctx_ptr(skb->data); + ip6h = data + sizeof(struct ethhdr); + if (ip6h + 1 > data_end) + return TC_ACT_SHOT; + is_final_frag = is_frag_v6(ip6h); + + return TC_ACT_OK; +} + +SEC("tc") +int defrag(struct __sk_buff *skb) +{ + switch (bpf_ntohs(skb->protocol)) { + case ETH_P_IP: + return defrag_v4(skb); + case ETH_P_IPV6: + return defrag_v6(skb); + default: + return TC_ACT_OK; + } +} + +SEC("?tc") +int defrag_fail(struct __sk_buff *skb) +{ + void *data_end = ctx_ptr(skb->data_end); + void *data = ctx_ptr(skb->data); + struct iphdr *iph; + + if (skb->protocol != bpf_htons(ETH_P_IP)) + return TC_ACT_OK; + + iph = data + sizeof(struct ethhdr); + if (iph + 1 > data_end) + return TC_ACT_SHOT; + + if (bpf_ip_check_defrag(skb, BPF_F_CURRENT_NETNS)) + return TC_ACT_SHOT; + + /* Boom. Must revalidate pkt ptrs */ + return iph->ttl ? TC_ACT_OK : TC_ACT_SHOT; +} + +char _license[] SEC("license") = "GPL";
On 27/02/2023 19:51, Daniel Xu wrote:
However, when policy is enforced through BPF, the prog is run before the kernel reassembles fragmented packets. This leaves BPF developers in an awkward place: implement reassembly (possibly poorly) or use a stateless method as described above.
Just out of curiosity - what stops BPF progs using the middle ground of stateful validation? I'm thinking of something like:

First-frag: run the usual checks on L4 headers etc, if we PASS then save IPID and maybe expected next frag-offset into a map. But don't try to stash the packet contents anywhere for later reassembly, just PASS it.

Subsequent frags: look up the IPID in the map. If we find it, validate and update the frag-offset in the map; if this is the last fragment then delete the map entry. If the frag-offset was bogus or the IPID wasn't found in the map, DROP; otherwise PASS.

(If re-ordering is prevalent then use something more sophisticated than just expected next frag-offset, but the principle is the same. And of course you might want to put in timers for expiry etc.)

So this avoids the need to stash the packet data and modify/consume SKBs, because you're not actually doing reassembly; the down-side is that the BPF program can't so easily make decisions about the application-layer contents of the fragmented datagram, but for the common case (we just care about the 5-tuple) it's simple enough.

But I haven't actually tried it, so maybe there's some obvious reason why it can't work this way.
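For concreteness, the per-datagram state such a validator might keep could look roughly like the sketch below. This is purely illustrative; the map type, key layout, sizing, and expiry field are all assumptions rather than anything from the patchset or this thread.

/* Illustrative-only sketch of the stateful *validation* idea above:
 * no reassembly, just remember which datagrams had their first
 * fragment approved and which offset we expect next.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct frag_key {
	__be32 saddr;
	__be32 daddr;
	__be16 id;	/* iph->id */
	__u8 proto;
	__u8 pad;
};

struct frag_state {
	__u16 next_off;	/* expected next fragment offset, in 8-byte units */
	__u64 expires;	/* hypothetical expiry timestamp */
};

struct {
	__uint(type, BPF_MAP_TYPE_LRU_HASH);
	__uint(max_entries, 65536);	/* sizing is a guess */
	__type(key, struct frag_key);
	__type(value, struct frag_state);
} approved_frags SEC(".maps");

/* Non-first fragment: PASS only if the first fragment already passed
 * policy and this offset is the one we expected next.
 */
static __always_inline bool frag_allowed(struct frag_key *key, __u16 off,
					 __u16 payload_len, bool last_frag)
{
	struct frag_state *st = bpf_map_lookup_elem(&approved_frags, key);

	if (!st || st->next_off != off)
		return false;	/* unknown datagram or bogus offset */

	if (last_frag)
		bpf_map_delete_elem(&approved_frags, key);
	else
		st->next_off = off + payload_len / 8;
	return true;
}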
-ed
Hi Ed,
Thanks for giving this a look.
On Mon, Feb 27, 2023 at 08:38:41PM +0000, Edward Cree wrote:
On 27/02/2023 19:51, Daniel Xu wrote:
However, when policy is enforced through BPF, the prog is run before the kernel reassembles fragmented packets. This leaves BPF developers in an awkward place: implement reassembly (possibly poorly) or use a stateless method as described above.
Just out of curiosity - what stops BPF progs using the middle ground of stateful validation? I'm thinking of something like: First-frag: run the usual checks on L4 headers etc, if we PASS then save IPID and maybe expected next frag-offset into a map. But don't try to stash the packet contents anywhere for later reassembly, just PASS it. Subsequent frags: look up the IPID in the map. If we find it, validate and update the frag-offset in the map; if this is the last fragment then delete the map entry. If the frag-offset was bogus or the IPID wasn't found in the map, DROP; otherwise PASS. (If re-ordering is prevalent then use something more sophisticated than just expected next frag-offset, but the principle is the same. And of course you might want to put in timers for expiry etc.) So this avoids the need to stash the packet data and modify/consume SKBs, because you're not actually doing reassembly; the down-side is that the BPF program can't so easily make decisions about the application-layer contents of the fragmented datagram, but for the common case (we just care about the 5-tuple) it's simple enough. But I haven't actually tried it, so maybe there's some obvious reason why it can't work this way.
I don't believe full L4 headers are required in the first fragment. Sufficiently sneaky attackers can, I think, send a byte at a time to subvert your proposed algorithm. Storing skb data seems inevitable here. Someone can correct me if I'm wrong.
Reordering like you mentioned is another attack vector. Perhaps there are more sophisticated semi-stateful algorithms that can solve the problem, but it leads me to my next point.
A semi-stateful method like you are proposing is concerning to me from a reliability and correctness standpoint. Such a method can suffer from impedance mismatches with the rest of the system. For example, whatever map sizes you choose should probably be aligned with the conntrack sysctl values, otherwise you may get some very interesting and unexpected pkt drops. I think Cilium had a talk about debugging a related conntrack issue in the same vein a while ago. Furthermore, the debugging and troubleshooting facilities will be different (counters, logs, etc).
Unless someone has had lots of experience writing an IP stack from the ground up, I suspect there are quite a few more unknown-unknowns here. What I find valuable about this patch series is that we can leverage the well-understood and battle-hardened kernel facilities, and so avoid all the correctness and security issues that the kernel has spent 20+ years fixing. It also makes it trivial for the next person that comes along to do the right thing.
Hopefully this all makes sense.
Thanks, Daniel
On 27/02/2023 22:04, Daniel Xu wrote:
I don't believe full L4 headers are required in the first fragment. Sufficiently sneaky attackers can, I think, send a byte at a time to subvert your proposed algorithm. Storing skb data seems inevitable here. Someone can correct me if I'm wrong here.
My thinking was that legitimate traffic would never do this and thus if your first fragment doesn't have enough data to make a determination then you just DROP the packet.
What I find valuable about this patch series is that we can leverage the well understood and battle hardened kernel facilities. So avoid all the correctness and security issues that the kernel has spent 20+ years fixing.
I can certainly see the argument here. I guess it's a question of are you more worried about the DoS from tricking the validator into thinking good fragments are bad (the reverse is irrelevant because if you can trick a validator into thinking your bad fragment belongs to a previously seen good packet, then you can equally trick a reassembler into stitching your bad fragment into that packet), or are you more worried about the DoS from tying lots of memory down in the reassembly cache.

Even with reordering handling, a data structure to record which ranges of a packet have been seen takes much less memory than storing the complete fragment bodies. (Just a simple bitmap of 8-byte blocks — the resolution of iph->frag_off — reduces size by a factor of 64, not counting all the overhead of a struct sk_buff for each fragment in the queue. Or you could re-use the rbtree-based code from the reassembler, just with a freshly allocated node containing only offset & length, instead of the whole SKB.)

And having a BPF helper effectively consume the skb is awkward, as you noted; someone is likely to decide that skb_copy() is too slow, try to add ctx invalidation, and thereby create a whole new swathe of potential correctness and security issues.

Plus, imagine trying to support this in a hardware-offload XDP device. They'd have to reimplement the entire frag cache, which is a much bigger attack surface than just a frag validator, and they couldn't leverage the battle-hardened kernel implementation.
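As a back-of-the-envelope illustration of the bitmap sizing mentioned above: covering a maximal 64 KiB datagram at 8-byte granularity takes 8192 bits, i.e. roughly 1 KiB of state per in-flight datagram. A hypothetical record might be:

/* Hypothetical "ranges seen" record for the bitmap idea: one bit per
 * 8-byte block of a 65536-byte datagram = 8192 bits = 1 KiB, versus
 * buffering the fragment payloads themselves.
 */
#include <stdint.h>

struct seen_ranges {
	uint64_t blocks[8192 / 64];	/* 128 * 8 bytes = 1024 bytes */
};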
And make it trivial for the next person that comes along to do the right thing.
Fwiw the validator approach could *also* be a helper, it doesn't have to be something the BPF developer writes for themselves.
But if after thinking about the possibility you still prefer your way, I won't try to stop you — I just wanted to ensure it had been considered.
-ed
Hi Ed,
Had some trouble with email yesterday (forgot to renew domain registration) and this reply might not have made it out. Apologies if it's a repost.
On Mon, Feb 27, 2023 at 10:58:47PM +0000, Edward Cree wrote:
On 27/02/2023 22:04, Daniel Xu wrote:
I don't believe full L4 headers are required in the first fragment. Sufficiently sneaky attackers can, I think, send a byte at a time to subvert your proposed algorithm. Storing skb data seems inevitable here. Someone can correct me if I'm wrong here.
My thinking was that legitimate traffic would never do this and thus if your first fragment doesn't have enough data to make a determination then you just DROP the packet.
Right, that would be practical. I had some discussion with coworkers and the other option on the table is to drop all fragments. At least for us in the cloud, fragments are heavily frowned upon (where are they not..) anyways.
What I find valuable about this patch series is that we can leverage the well understood and battle hardened kernel facilities. So avoid all the correctness and security issues that the kernel has spent 20+ years fixing.
I can certainly see the argument here. I guess it's a question of are you more worried about the DoS from tricking the validator into thinking good fragments are bad (the reverse is irrelevant because if you can trick a validator into thinking your bad fragment belongs to a previously seen good packet, then you can equally trick a reassembler into stitching your bad fragment into that packet), or are you more worried about the DoS from tying lots of memory down in the reassembly cache.
Equal balance of concerns on my side. Ideally there is no dropping of valid packets and DoS is very hard to achieve.
Even with reordering handling, a data structure to record which ranges of a packet have been seen takes much less memory than storing the complete fragment bodies. (Just a simple bitmap of 8-byte blocks — the resolution of iph->frag_off — reduces size by a factor of 64, not counting all the overhead of a struct sk_buff for each fragment in the queue. Or you could re-use the rbtree-based code from the reassembler, just with a freshly allocated node containing only offset & length, instead of the whole SKB.)
Yeah, now that you say that, it doesn't sound too bad on the space side. But I do wonder -- how much code and complexity is that going to be? For example, I think ipv6 frags have a 60s reassembly timeout, which adds more stuff to consider. And probably even more things I've already forgotten.
B/c at least on the kernel side, this series is 80% code for tests. And the kfunc wrappers are not very invasive at all. Plus it's wrapping infra that hasn't changed much for decades.
And having a BPF helper effectively consume the skb is awkward, as you noted; someone is likely to decide that skb_copy() is too slow, try to add ctx invalidation, and thereby create a whole new swathe of potential correctness and security issues.
Yep. I did try that. While the verifier bits weren't too tricky, there are a lot of infra concerns to solve:
* https://github.com/danobi/linux/commit/35a66af8d54cca647b0adfc7c1da7105d2603... * https://github.com/danobi/linux/commit/e8c86ea75e2ca8f0631632d54ef7633813087... * https://github.com/danobi/linux/commit/972bcf769f41fbfa7f84ce00faf06b5b57bc6...
But FWIW, fragmented packets are kind of a corner case anyways. I don't think it would be reasonable to expect high perf when fragmented packets are in play.
Plus, imagine trying to support this in a hardware-offload XDP device. They'd have to reimplement the entire frag cache, which is a much bigger attack surface than just a frag validator, and they couldn't leverage the battle-hardened kernel implementation.
Hmm, well this helper is restricted to TC progs for now. I don't quite see a path to enabling it for XDP, as there would have to be, at a minimum, quite a few allocations to handle frags. So I'm not sure XDP is a factor at the moment.
And make it trivial for the next person that comes along to do the right thing.
Fwiw the validator approach could *also* be a helper, it doesn't have to be something the BPF developer writes for themselves.
But if after thinking about the possibility you still prefer your way, I won't try to stop you — I just wanted to ensure it had been considered.
Thank you for the discussion. The thought had come to mind originally, but I shied away after seeing some of the reassembly details. Would be interested in hearing more from other folks.
Thanks, Daniel
On Mon, Feb 27, 2023 at 12:51:02PM -0700, Daniel Xu wrote:
=== Context ===
In the context of a middlebox, fragmented packets are tricky to handle. The full 5-tuple of a packet is often only available in the first fragment, which makes enforcing consistent policy difficult. There are really only two stateless options, neither of which is very nice:
1. Enforce policy on first fragment and accept all subsequent fragments. This works but may let in certain attacks or allow data exfiltration.
2. Enforce policy on first fragment and drop all subsequent fragments. This does not really work b/c some protocols may rely on fragmentation. For example, DNS may rely on oversized UDP packets for large responses.
So stateful tracking is the only sane option. RFC 8900 [0] calls this out as well in section 6.3:
Middleboxes [...] should process IP fragments in a manner that is consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes must maintain state in order to achieve this goal.
=== BPF related bits ===
However, when policy is enforced through BPF, the prog is run before the kernel reassembles fragmented packets. This leaves BPF developers in an awkward place: implement reassembly (possibly poorly) or use a stateless method as described above.
Fortunately, the kernel has robust support for fragmented IP packets. This patchset wraps the existing defragmentation facilities in kfuncs so that BPF progs running on middleboxes can reassemble fragmented packets before applying policy.
=== Patchset details ===
This patchset is (hopefully) relatively straightforward from a BPF perspective. One thing I'd like to call out is the skb_copy()ing of the prog skb. I did this to maintain the invariant that the ctx remains valid after the prog has run. This is relevant b/c ip_defrag() and ip_check_defrag() may consume the skb if the skb is a fragment.
Instead of doing all that with an extra skb copy, can you hook the bpf prog after the networking stack has already handled ip defrag? What kind of middlebox are you building? Why does it have to run at the TC layer?