TCP SYN/ACK packets of connections from processes/sockets outside a cgroup on the same host are not received by the cgroup's installed cgroup_skb filters.
There were two BPF cgroup_skb programs attached to a cgroup named "my_cgroup".
SEC("cgroup_skb/ingress")
int ingress(struct __sk_buff *skb)
{
	/* .... process skb ... */
	return 1;
}

SEC("cgroup_skb/egress")
int egress(struct __sk_buff *skb)
{
	/* .... process skb ... */
	return 1;
}
We discovered that when running the command "nc -6 -l 8000" in "my_cgroup" and connecting to it from outside of "my_cgroup" with the command "nc -6 localhost 8000", the egress filter did not detect the SYN/ACK packet. However, we did observe the SYN/ACK packet at the ingress filter when connecting from a socket in "my_cgroup" to a socket outside of it.
We came across BPF_CGROUP_RUN_PROG_INET_EGRESS(). This macro is responsible for calling BPF programs that are attached to the egress hook of a cgroup and it skips programs if the sending socket is not the owner of the skb. Specifically, in our situation, the SYN/ACK skb is owned by a struct request_sock instance, but the sending socket is the listener socket we use to receive incoming connections. The request_sock is created to manage an incoming connection.
Earlier versions of this series removed the check outright. This version instead compares the full socket of the skb's owner against the full socket of the sending socket, which allows the filters to receive SYN/ACK packets while keeping the ownership check in place.
A new self-test has also been added to ensure that cgroup_skb filters receive all signaling packets, including SYN, SYN/ACK, ACK, FIN, and FIN/ACK.
Changes from v3:
- Check SKB ownership against the full socket instead of just removing the check.
- Address the issue raised by Yonghong.
- Put more details down in the commit message.
Changes from v2:
- Remove redundant blank lines.
Changes from v1:
- Check the number of observed packets instead of just sleeping.
- Use ASSERT_XXX() instead of CHECK().
[v1] https://lore.kernel.org/all/20230612191641.441774-1-kuifeng@meta.com/
[v2] https://lore.kernel.org/all/20230617052756.640916-2-kuifeng@meta.com/
[v3] https://lore.kernel.org/all/20230620171409.166001-1-kuifeng@meta.com/
Kui-Feng Lee (2):
  net: bpf: Check SKB ownership against full socket.
  selftests/bpf: Verify that the cgroup_skb filters receive expected
    packets.

 include/linux/bpf-cgroup.h                    |   4 +-
 tools/testing/selftests/bpf/cgroup_helpers.c  |  12 +
 tools/testing/selftests/bpf/cgroup_helpers.h  |   1 +
 tools/testing/selftests/bpf/cgroup_tcp_skb.h  |  35 ++
 .../selftests/bpf/prog_tests/cgroup_tcp_skb.c | 402 ++++++++++++++++++
 .../selftests/bpf/progs/cgroup_tcp_skb.c      | 382 +++++++++++++++++
 6 files changed, 834 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/cgroup_tcp_skb.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_tcp_skb.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_tcp_skb.c
Check the ownership of an SKB against full sockets instead of request_sock.
The filters were called only if an SKB was owned by the sock through which it was sent out. In other words, skb->sk had to point to the sock the SKB was leaving through. As a result, the filters missed SYN/ACK SKBs, which are owned by a request_sock but sent through the listener sock, that is, the socket listening for incoming connections.
However, the listener socket is also the full socket of the request socket. We should use the full socket as the owner socket of an SKB instead.
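To make the difference between the two checks concrete, here is a minimal userspace sketch. It is not kernel code: struct model_sock, struct model_skb, to_full_sk(), old_check() and new_check() are hypothetical stand-ins for struct sock, struct sk_buff, sk_to_full_sk() and the two forms of the condition, and the cgroup_bpf_enabled()/cgroup_bpf_sock_enabled() parts are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sock / struct request_sock. */
struct model_sock {
	bool fullsock;               /* false for a request_sock-like object */
	struct model_sock *listener; /* full socket behind a request sock */
};

/* Hypothetical stand-in for struct sk_buff; only the owner matters here. */
struct model_skb {
	struct model_sock *sk;
};

/* Models sk_to_full_sk(): a request sock maps to its listener. */
static struct model_sock *to_full_sk(struct model_sock *sk)
{
	if (sk && !sk->fullsock)
		return sk->listener;
	return sk;
}

/* Old condition: the sending sock must be the owner of the skb itself. */
static bool old_check(const struct model_sock *sk, const struct model_skb *skb)
{
	return sk && sk == skb->sk;
}

/* New condition: compare the full sockets of both sides instead. */
static bool new_check(struct model_sock *sk, const struct model_skb *skb)
{
	struct model_sock *full = to_full_sk(sk);

	return full && full->fullsock && full == to_full_sk(skb->sk);
}
```

With a SYN/ACK skb owned by a request sock whose listener is the sending sock, old_check() rejects the skb while new_check() accepts it. An unrelated full socket is still rejected by new_check().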
What is the ownership check for?
================================
BPF_CGROUP_RUN_PROG_INET_EGRESS() checked sk == skb->sk to ensure the ownership of an SKB. Alexei referred to a mailing list conversation that took place a few years ago. [1] In that conversation, Daniel Borkmann stated that:
Wouldn't that mean however, when you go through stacked devices that you'd run the same eBPF cgroup program for skb->sk multiple times?
According to what Daniel said, the ownership check mentioned earlier presumably prevents multiple calls of egress filters caused by an skb.
A test that reproduces this scenario shows that the BPF cgroup egress programs can be called multiple times for one SKB if this ownership check is not there. So, we cannot just remove this check.
Test Stacked Devices
====================
We use L2TP to build an environment of stacked devices. L2TP (Layer 2 Tunneling Protocol) is a tunneling protocol used to support virtual private networks (VPNs). It relays encapsulated packets, for example in UDP, to its peer by using a socket.
Using L2TP, packets are first sent through the IP stack and should then arrive at an L2TP device. The device will expand its SKB header to encapsulate the packet. The SKB will be sent back to the IP stack using the socket that was made for the L2TP session. After that, the routing process will occur once more, but this time for a new destination.
We changed tools/testing/selftests/net/l2tp.sh to set up a test environment using L2TP. The run_ping() function in l2tp.sh is where the main change occurred.
run_ping()
{
	local desc="$1"

	sleep 10
	run_cmd host-1 ${ping6} -s 227 -c 4 -i 10 -I fc00:101::1 fc00:101::2
	log_test $? 0 "IPv6 route through L2TP tunnel ${desc}"
	sleep 10
}
The test uses L2TP devices to send PING messages. These messages carry a payload of 227 bytes, which serves as a label to distinguish them from other traffic. This is not an ideal solution, but it works.
During the execution of the test script, bpftrace was attached to ip6_finish_output() and l2tp_xmit_skb().
bpftrace -e '
kfunc:ip6_finish_output {
	time("%H:%M:%S: ");
	printf("ip6_finish_output skb=%p skb->len=%d cgroup=%p sk=%p skb->sk=%p\n",
	       args->skb, args->skb->len, args->sk->sk_cgrp_data.cgroup,
	       args->sk, args->skb->sk);
}
kfunc:l2tp_xmit_skb {
	time("%H:%M:%S: ");
	printf("l2tp_xmit_skb skb=%p sk=%p\n", args->skb,
	       args->session->tunnel->sock);
}'
The following is part of the output messages printed by bpftrace.
16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=275 cgroup=0xffff88810741f800 sk=0xffff888105f3b900 skb->sk=0xffff888105f3b900
16:35:20: l2tp_xmit_skb skb=0xffff888103d8e600 sk=0xffff888103dd6300
16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=337 cgroup=0xffff88810741f800 sk=0xffff888103dd6300 skb->sk=0xffff888105f3b900
16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=337 cgroup=(nil) sk=(nil) skb->sk=(nil)
16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=275 cgroup=0xffffffff837741d0 sk=0xffff888101fe0000 skb->sk=0xffff888101fe0000
16:35:20: l2tp_xmit_skb skb=0xffff888103d8e000 sk=0xffff888103483180
16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=337 cgroup=0xffff88810741f800 sk=0xffff888103483180 skb->sk=0xffff888101fe0000
16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=337 cgroup=(nil) sk=(nil) skb->sk=(nil)
The first four entries describe a PING message that was sent using the ping command, whereas the following four entries describe the response. The trace shows that one skb is sent through multiple sockets, including the socket used by the L2TP session.
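The skb->len values in the trace can be cross-checked against the 227-byte label. The sketch below assumes only the fixed 40-byte IPv6 header and the 8-byte ICMPv6 echo header; ping6_skb_len() and the constant names are hypothetical, and the 62-byte growth after l2tp_xmit_skb() is taken from the trace as observed, not decomposed.

```c
#include <assert.h>

/* Fixed header sizes; the names are illustrative only. */
enum {
	IPV6_HDR_LEN = 40,        /* fixed IPv6 header */
	ICMPV6_ECHO_HDR_LEN = 8,  /* ICMPv6 echo request header */
};

/* Length of an ICMPv6 echo skb at the IPv6 output path for a given payload. */
static int ping6_skb_len(int payload)
{
	return IPV6_HDR_LEN + ICMPV6_ECHO_HDR_LEN + payload;
}
```

For a payload of 227 bytes this gives 275, the length logged before encapsulation. The length logged after encapsulation, 337, is 62 bytes larger.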
Based on this information, it seems that the ownership check is designed to avoid multiple calls of egress filters caused by a single skb.
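The effect of the check on the trace above can be sketched by replaying the three ip6_finish_output() calls logged for skb 0xffff888103d8e600. This is a userspace model with hypothetical names (struct call, filter_runs()); it assumes that both sockets involved are full sockets, so converting to the full socket does not change the pointers, and it ignores whether a program is actually attached.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One ip6_finish_output() call: the sending sock and the skb owner. */
struct call {
	uint64_t sk;
	uint64_t skb_sk;
};

/* Values copied from the bpftrace output for skb 0xffff888103d8e600. */
static const struct call trace[] = {
	{ 0xffff888105f3b900ULL, 0xffff888105f3b900ULL }, /* ping socket */
	{ 0xffff888103dd6300ULL, 0xffff888105f3b900ULL }, /* L2TP tunnel socket */
	{ 0, 0 },                                         /* no socket */
};

/* Count how often the egress filters would run for the logged calls. */
static int filter_runs(const struct call *c, int n, bool ownership_check)
{
	int count = 0;

	for (int i = 0; i < n; i++) {
		if (!c[i].sk)
			continue;	/* the macro requires a non-NULL sk */
		if (ownership_check && c[i].sk != c[i].skb_sk)
			continue;	/* sender does not own the skb */
		count++;
	}
	return count;
}
```

With the ownership check the filters run once for this skb. Without it they would run twice, once for the ping socket and once for the tunnel socket.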
[1] https://lore.kernel.org/all/58193E9D.7040201@iogearbox.net/

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 include/linux/bpf-cgroup.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 57e9e109257e..8506690dbb9c 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -199,9 +199,9 @@ static inline bool cgroup_bpf_sock_enabled(struct sock *sk,
 #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb)			       \
 ({									       \
 	int __ret = 0;							       \
-	if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk && sk == skb->sk) { \
+	if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) {		       \
 		typeof(sk) __sk = sk_to_full_sk(sk);			       \
-		if (sk_fullsock(__sk) &&				       \
+		if (sk_fullsock(__sk) && __sk == skb_to_full_sk(skb) &&	       \
 		    cgroup_bpf_sock_enabled(__sk, CGROUP_INET_EGRESS))	       \
 			__ret = __cgroup_bpf_run_filter_skb(__sk, skb,	       \
 						      CGROUP_INET_EGRESS); \
On 6/24/23 3:45 AM, Kui-Feng Lee wrote:
[...]
Thanks for this investigation and adding these details to the commit log, this is very useful to keep for the archive. I did some minor formatting and pushed it out to bpf-next.
This test case includes four scenarios:
1. Connect to the server from outside the cgroup and close the connection from outside the cgroup.
2. Connect to the server from outside the cgroup and close the connection from inside the cgroup.
3. Connect to the server from inside the cgroup and close the connection from outside the cgroup.
4. Connect to the server from inside the cgroup and close the connection from inside the cgroup.
The test case is to verify that cgroup_skb/{egress, ingress} filters receive expected packets including SYN, SYN/ACK, ACK, FIN, and FIN/ACK.
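As an illustration of what the filters check in scenario 1 (server in the cgroup, connection closed by the client), the following is a condensed userspace sketch of the state tracking. It is not the selftest code: struct pkt, struct tracker, step(), track() and replay() are hypothetical names, only the states needed for this scenario are modeled, and the packet sequence is an assumed typical exchange for one write of "hello".

```c
#include <assert.h>
#include <stdbool.h>

enum state { INIT, CLOSED, SYN_RECV_SENDING_SYN_ACK, SYN_RECV, ESTABLISHED,
	     CLOSE_WAIT_SENDING_ACK, CLOSE_WAIT, LAST_ACK };

struct pkt { bool egress, syn, ack, fin; };
struct tracker { enum state state; int unexpected; };

/* Assumed packets seen by the cgroup filters for a passive close. */
static const struct pkt passive_close[] = {
	{ 0, 1, 0, 0 },	/* in:  SYN */
	{ 1, 1, 1, 0 },	/* out: SYN/ACK */
	{ 0, 0, 1, 0 },	/* in:  ACK */
	{ 0, 0, 1, 0 },	/* in:  data */
	{ 1, 0, 1, 0 },	/* out: ACK of data */
	{ 0, 0, 1, 1 },	/* in:  FIN/ACK */
	{ 1, 0, 1, 0 },	/* out: ACK of FIN */
	{ 1, 0, 1, 1 },	/* out: FIN/ACK */
	{ 0, 0, 1, 0 },	/* in:  ACK of FIN */
};

/* Advance to next when the packet is the expected one, else count it. */
static void step(struct tracker *t, bool ok, enum state next)
{
	if (ok)
		t->state = next;
	else
		t->unexpected++;
}

static void track(struct tracker *t, struct pkt p)
{
	bool pure_ack = p.ack && !p.syn && !p.fin;

	switch (t->state) {
	case INIT:
		step(t, !p.egress && p.syn && !p.ack && !p.fin,
		     SYN_RECV_SENDING_SYN_ACK);
		break;
	case SYN_RECV_SENDING_SYN_ACK:
		step(t, p.egress && p.syn && p.ack && !p.fin, SYN_RECV);
		break;
	case SYN_RECV:
		step(t, !p.egress && pure_ack, ESTABLISHED);
		break;
	case ESTABLISHED:	/* data and ACKs are ignored until a FIN */
		if (!p.egress && p.fin)
			t->state = CLOSE_WAIT_SENDING_ACK;
		break;
	case CLOSE_WAIT_SENDING_ACK:
		step(t, p.egress && pure_ack, CLOSE_WAIT);
		break;
	case CLOSE_WAIT:
		step(t, p.egress && p.fin, LAST_ACK);
		break;
	case LAST_ACK:
		step(t, !p.egress && pure_ack, CLOSED);
		break;
	default:
		t->unexpected++;
	}
}

/* Feed n packets through a fresh tracker and return its final state. */
static struct tracker replay(const struct pkt *pkts, int n)
{
	struct tracker t = { INIT, 0 };

	for (int i = 0; i < n; i++)
		track(&t, pkts[i]);
	return t;
}
```

Replaying the full sequence ends in CLOSED with no unexpected packets. If the egress SYN/ACK is never delivered to the filter, the tracker stays in SYN_RECV_SENDING_SYN_ACK and the following packets are counted as unexpected.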
Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 tools/testing/selftests/bpf/cgroup_helpers.c  |  12 +
 tools/testing/selftests/bpf/cgroup_helpers.h  |   1 +
 tools/testing/selftests/bpf/cgroup_tcp_skb.h  |  35 ++
 .../selftests/bpf/prog_tests/cgroup_tcp_skb.c | 402 ++++++++++++++++++
 .../selftests/bpf/progs/cgroup_tcp_skb.c      | 382 +++++++++++++++++
 5 files changed, 832 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/cgroup_tcp_skb.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_tcp_skb.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_tcp_skb.c

diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/selftests/bpf/cgroup_helpers.c
index 9e95b37a7dff..2caee8423ee0 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.c
+++ b/tools/testing/selftests/bpf/cgroup_helpers.c
@@ -277,6 +277,18 @@ int join_cgroup(const char *relative_path)
 	return join_cgroup_from_top(cgroup_path);
 }
 
+/**
+ * join_root_cgroup() - Join the root cgroup
+ *
+ * This function joins the root cgroup.
+ *
+ * On success, it returns 0, otherwise on failure it returns 1.
+ */
+int join_root_cgroup(void)
+{
+	return join_cgroup_from_top(CGROUP_MOUNT_PATH);
+}
+
 /**
  * join_parent_cgroup() - Join a cgroup in the parent process workdir
  * @relative_path: The cgroup path, relative to parent process workdir, to join
diff --git a/tools/testing/selftests/bpf/cgroup_helpers.h b/tools/testing/selftests/bpf/cgroup_helpers.h
index f099a166c94d..5c2cb9c8b546 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.h
+++ b/tools/testing/selftests/bpf/cgroup_helpers.h
@@ -22,6 +22,7 @@ void remove_cgroup(const char *relative_path);
 unsigned long long get_cgroup_id(const char *relative_path);
 
 int join_cgroup(const char *relative_path);
+int join_root_cgroup(void);
 int join_parent_cgroup(const char *relative_path);
 
 int setup_cgroup_environment(void);
diff --git a/tools/testing/selftests/bpf/cgroup_tcp_skb.h b/tools/testing/selftests/bpf/cgroup_tcp_skb.h
new file mode 100644
index 000000000000..7f6b24f102fb
--- /dev/null
+++ b/tools/testing/selftests/bpf/cgroup_tcp_skb.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
+
+/* Define states of a socket to tracking messages sending to and from the
+ * socket.
+ *
+ * These states are based on rfc9293 with some modifications to support
+ * tracking of messages sent out from a socket. For example, when a SYN is
+ * received, a new socket is transiting to the SYN_RECV state defined in
+ * rfc9293. But, we put it in SYN_RECV_SENDING_SYN_ACK state and when
+ * SYN-ACK is sent out, it moves to SYN_RECV state. With this modification,
+ * we can track the message sent out from a socket.
+ */
+
+#ifndef __CGROUP_TCP_SKB_H__
+#define __CGROUP_TCP_SKB_H__
+
+enum {
+	INIT,
+	CLOSED,
+	SYN_SENT,
+	SYN_RECV_SENDING_SYN_ACK,
+	SYN_RECV,
+	ESTABLISHED,
+	FIN_WAIT1,
+	FIN_WAIT2,
+	CLOSE_WAIT_SENDING_ACK,
+	CLOSE_WAIT,
+	CLOSING,
+	LAST_ACK,
+	TIME_WAIT_SENDING_ACK,
+	TIME_WAIT,
+};
+
+#endif /* __CGROUP_TCP_SKB_H__ */
diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_tcp_skb.c b/tools/testing/selftests/bpf/prog_tests/cgroup_tcp_skb.c
new file mode 100644
index 000000000000..95bab61a1e57
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/cgroup_tcp_skb.c
@@ -0,0 +1,402 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Facebook */
+#include <test_progs.h>
+#include <linux/in6.h>
+#include <sys/socket.h>
+#include <sched.h>
+#include <unistd.h>
+#include "cgroup_helpers.h"
+#include "testing_helpers.h"
+#include "cgroup_tcp_skb.skel.h"
+#include "cgroup_tcp_skb.h"
+
+#define CGROUP_TCP_SKB_PATH "/test_cgroup_tcp_skb"
+
+static int install_filters(int cgroup_fd,
+			   struct bpf_link **egress_link,
+			   struct bpf_link **ingress_link,
+			   struct bpf_program *egress_prog,
+			   struct bpf_program *ingress_prog,
+			   struct cgroup_tcp_skb *skel)
+{
+	/* Prepare filters */
+	skel->bss->g_sock_state = 0;
+	skel->bss->g_unexpected = 0;
+	*egress_link =
+		bpf_program__attach_cgroup(egress_prog,
+					   cgroup_fd);
+	if (!ASSERT_OK_PTR(egress_link, "egress_link"))
+		return -1;
+	*ingress_link =
+		bpf_program__attach_cgroup(ingress_prog,
+					   cgroup_fd);
+	if (!ASSERT_OK_PTR(ingress_link, "ingress_link"))
+		return -1;
+
+	return 0;
+}
+
+static void uninstall_filters(struct bpf_link **egress_link,
+			      struct bpf_link **ingress_link)
+{
+	bpf_link__destroy(*egress_link);
+	*egress_link = NULL;
+	bpf_link__destroy(*ingress_link);
+	*ingress_link = NULL;
+}
+
+static int create_client_sock_v6(void)
+{
+	int fd;
+
+	fd = socket(AF_INET6, SOCK_STREAM, 0);
+	if (fd < 0) {
+		perror("socket");
+		return -1;
+	}
+
+	return fd;
+}
+
+static int create_server_sock_v6(void)
+{
+	struct sockaddr_in6 addr = {
+		.sin6_family = AF_INET6,
+		.sin6_port = htons(0),
+		.sin6_addr = IN6ADDR_LOOPBACK_INIT,
+	};
+	int fd, err;
+
+	fd = socket(AF_INET6, SOCK_STREAM, 0);
+	if (fd < 0) {
+		perror("socket");
+		return -1;
+	}
+
+	err = bind(fd, (struct sockaddr *)&addr, sizeof(addr));
+	if (err < 0) {
+		perror("bind");
+		return -1;
+	}
+
+	err = listen(fd, 1);
+	if (err < 0) {
+		perror("listen");
+		return -1;
+	}
+
+	return fd;
+}
+
+static int get_sock_port_v6(int fd)
+{
+	struct sockaddr_in6 addr;
+	socklen_t len;
+	int err;
+
+	len = sizeof(addr);
+	err = getsockname(fd, (struct sockaddr *)&addr, &len);
+	if (err < 0) {
+		perror("getsockname");
+		return -1;
+	}
+
+	return ntohs(addr.sin6_port);
+}
+
+static int connect_client_server_v6(int client_fd, int listen_fd)
+{
+	struct sockaddr_in6 addr = {
+		.sin6_family = AF_INET6,
+		.sin6_addr = IN6ADDR_LOOPBACK_INIT,
+	};
+	int err;
+
+	addr.sin6_port = htons(get_sock_port_v6(listen_fd));
+	if (addr.sin6_port < 0)
+		return -1;
+
+	err = connect(client_fd, (struct sockaddr *)&addr, sizeof(addr));
+	if (err < 0) {
+		perror("connect");
+		return -1;
+	}
+
+	return 0;
+}
+
+/* Connect to the server in a cgroup from the outside of the cgroup. */
+static int talk_to_cgroup(int *client_fd, int *listen_fd, int *service_fd,
+			  struct cgroup_tcp_skb *skel)
+{
+	int err, cp;
+	char buf[5];
+
+	/* Create client & server socket */
+	err = join_root_cgroup();
+	if (!ASSERT_OK(err, "join_root_cgroup"))
+		return -1;
+	*client_fd = create_client_sock_v6();
+	if (!ASSERT_GE(*client_fd, 0, "client_fd"))
+		return -1;
+	err = join_cgroup(CGROUP_TCP_SKB_PATH);
+	if (!ASSERT_OK(err, "join_cgroup"))
+		return -1;
+	*listen_fd = create_server_sock_v6();
+	if (!ASSERT_GE(*listen_fd, 0, "listen_fd"))
+		return -1;
+	skel->bss->g_sock_port = get_sock_port_v6(*listen_fd);
+
+	/* Connect client to server */
+	err = connect_client_server_v6(*client_fd, *listen_fd);
+	if (!ASSERT_OK(err, "connect_client_server_v6"))
+		return -1;
+	*service_fd = accept(*listen_fd, NULL, NULL);
+	if (!ASSERT_GE(*service_fd, 0, "service_fd"))
+		return -1;
+	err = join_root_cgroup();
+	if (!ASSERT_OK(err, "join_root_cgroup"))
+		return -1;
+	cp = write(*client_fd, "hello", 5);
+	if (!ASSERT_EQ(cp, 5, "write"))
+		return -1;
+	cp = read(*service_fd, buf, 5);
+	if (!ASSERT_EQ(cp, 5, "read"))
+		return -1;
+
+	return 0;
+}
+
+/* Connect to the server out of a cgroup from inside the cgroup. */
+static int talk_to_outside(int *client_fd, int *listen_fd, int *service_fd,
+			   struct cgroup_tcp_skb *skel)
+
+{
+	int err, cp;
+	char buf[5];
+
+	/* Create client & server socket */
+	err = join_root_cgroup();
+	if (!ASSERT_OK(err, "join_root_cgroup"))
+		return -1;
+	*listen_fd = create_server_sock_v6();
+	if (!ASSERT_GE(*listen_fd, 0, "listen_fd"))
+		return -1;
+	err = join_cgroup(CGROUP_TCP_SKB_PATH);
+	if (!ASSERT_OK(err, "join_cgroup"))
+		return -1;
+	*client_fd = create_client_sock_v6();
+	if (!ASSERT_GE(*client_fd, 0, "client_fd"))
+		return -1;
+	err = join_root_cgroup();
+	if (!ASSERT_OK(err, "join_root_cgroup"))
+		return -1;
+	skel->bss->g_sock_port = get_sock_port_v6(*listen_fd);
+
+	/* Connect client to server */
+	err = connect_client_server_v6(*client_fd, *listen_fd);
+	if (!ASSERT_OK(err, "connect_client_server_v6"))
+		return -1;
+	*service_fd = accept(*listen_fd, NULL, NULL);
+	if (!ASSERT_GE(*service_fd, 0, "service_fd"))
+		return -1;
+	cp = write(*client_fd, "hello", 5);
+	if (!ASSERT_EQ(cp, 5, "write"))
+		return -1;
+	cp = read(*service_fd, buf, 5);
+	if (!ASSERT_EQ(cp, 5, "read"))
+		return -1;
+
+	return 0;
+}
+
+static int close_connection(int *closing_fd, int *peer_fd, int *listen_fd,
+			    struct cgroup_tcp_skb *skel)
+{
+	__u32 saved_packet_count = 0;
+	int err;
+	int i;
+
+	/* Wait for ACKs to be sent */
+	saved_packet_count = skel->bss->g_packet_count;
+	usleep(100000);		/* 0.1s */
+	for (i = 0;
+	     skel->bss->g_packet_count != saved_packet_count && i < 10;
+	     i++) {
+		saved_packet_count = skel->bss->g_packet_count;
+		usleep(100000);	/* 0.1s */
+	}
+	if (!ASSERT_EQ(skel->bss->g_packet_count, saved_packet_count,
+		       "packet_count"))
+		return -1;
+
+	skel->bss->g_packet_count = 0;
+	saved_packet_count = 0;
+
+	/* Half shutdown to make sure the closing socket having a chance to
+	 * receive a FIN from the peer.
+	 */
+	err = shutdown(*closing_fd, SHUT_WR);
+	if (!ASSERT_OK(err, "shutdown closing_fd"))
+		return -1;
+
+	/* Wait for FIN and the ACK of the FIN to be observed */
+	for (i = 0;
+	     skel->bss->g_packet_count < saved_packet_count + 2 && i < 10;
+	     i++)
+		usleep(100000);	/* 0.1s */
+	if (!ASSERT_GE(skel->bss->g_packet_count, saved_packet_count + 2,
+		       "packet_count"))
+		return -1;
+
+	saved_packet_count = skel->bss->g_packet_count;
+
+	/* Fully shutdown the connection */
+	err = close(*peer_fd);
+	if (!ASSERT_OK(err, "close peer_fd"))
+		return -1;
+	*peer_fd = -1;
+
+	/* Wait for FIN and the ACK of the FIN to be observed */
+	for (i = 0;
+	     skel->bss->g_packet_count < saved_packet_count + 2 && i < 10;
+	     i++)
+		usleep(100000);	/* 0.1s */
+	if (!ASSERT_GE(skel->bss->g_packet_count, saved_packet_count + 2,
+		       "packet_count"))
+		return -1;
+
+	err = close(*closing_fd);
+	if (!ASSERT_OK(err, "close closing_fd"))
+		return -1;
+	*closing_fd = -1;
+
+	close(*listen_fd);
+	*listen_fd = -1;
+
+	return 0;
+}
+
+/* This test case includes four scenarios:
+ * 1. Connect to the server from outside the cgroup and close the connection
+ *    from outside the cgroup.
+ * 2. Connect to the server from outside the cgroup and close the connection
+ *    from inside the cgroup.
+ * 3. Connect to the server from inside the cgroup and close the connection
+ *    from outside the cgroup.
+ * 4. Connect to the server from inside the cgroup and close the connection
+ *    from inside the cgroup.
+ *
+ * The test case is to verify that cgroup_skb/{egress,ingress} filters
+ * receive expected packets including SYN, SYN/ACK, ACK, FIN, and FIN/ACK.
+ */
+void test_cgroup_tcp_skb(void)
+{
+	struct bpf_link *ingress_link = NULL;
+	struct bpf_link *egress_link = NULL;
+	int client_fd = -1, listen_fd = -1;
+	struct cgroup_tcp_skb *skel;
+	int service_fd = -1;
+	int cgroup_fd = -1;
+	int err;
+
+	skel = cgroup_tcp_skb__open_and_load();
+	if (!ASSERT_OK(!skel, "skel_open_load"))
+		return;
+
+	err = setup_cgroup_environment();
+	if (!ASSERT_OK(err, "setup_cgroup_environment"))
+		goto cleanup;
+
+	cgroup_fd = create_and_get_cgroup(CGROUP_TCP_SKB_PATH);
+	if (!ASSERT_GE(cgroup_fd, 0, "cgroup_fd"))
+		goto cleanup;
+
+	/* Scenario 1 */
+	err = install_filters(cgroup_fd, &egress_link, &ingress_link,
+			      skel->progs.server_egress,
+			      skel->progs.server_ingress,
+			      skel);
+	if (!ASSERT_OK(err, "install_filters"))
+		goto cleanup;
+
+	err = talk_to_cgroup(&client_fd, &listen_fd, &service_fd, skel);
+	if (!ASSERT_OK(err, "talk_to_cgroup"))
+		goto cleanup;
+
+	err = close_connection(&client_fd, &service_fd, &listen_fd, skel);
+	if (!ASSERT_OK(err, "close_connection"))
+		goto cleanup;
+
+	ASSERT_EQ(skel->bss->g_unexpected, 0, "g_unexpected");
+	ASSERT_EQ(skel->bss->g_sock_state, CLOSED, "g_sock_state");
+
+	uninstall_filters(&egress_link, &ingress_link);
+
+	/* Scenario 2 */
+	err = install_filters(cgroup_fd, &egress_link, &ingress_link,
+			      skel->progs.server_egress_srv,
+			      skel->progs.server_ingress_srv,
+			      skel);
+
+	err = talk_to_cgroup(&client_fd, &listen_fd, &service_fd, skel);
+	if (!ASSERT_OK(err, "talk_to_cgroup"))
+		goto cleanup;
+
+	err = close_connection(&service_fd, &client_fd, &listen_fd, skel);
+	if (!ASSERT_OK(err, "close_connection"))
+		goto cleanup;
+
+	ASSERT_EQ(skel->bss->g_unexpected, 0, "g_unexpected");
+	ASSERT_EQ(skel->bss->g_sock_state, TIME_WAIT, "g_sock_state");
+
+	uninstall_filters(&egress_link, &ingress_link);
+
+	/* Scenario 3 */
+	err = install_filters(cgroup_fd, &egress_link, &ingress_link,
+			      skel->progs.client_egress_srv,
+			      skel->progs.client_ingress_srv,
+			      skel);
+
+	err = talk_to_outside(&client_fd, &listen_fd, &service_fd, skel);
+	if (!ASSERT_OK(err, "talk_to_outside"))
+		goto cleanup;
+
+	err = close_connection(&service_fd, &client_fd, &listen_fd, skel);
+	if (!ASSERT_OK(err, "close_connection"))
+		goto cleanup;
+
+	ASSERT_EQ(skel->bss->g_unexpected, 0, "g_unexpected");
+	ASSERT_EQ(skel->bss->g_sock_state, CLOSED, "g_sock_state");
+
+	uninstall_filters(&egress_link, &ingress_link);
+
+	/* Scenario 4 */
+	err = install_filters(cgroup_fd, &egress_link, &ingress_link,
+			      skel->progs.client_egress,
+			      skel->progs.client_ingress,
+			      skel);
+
+	err = talk_to_outside(&client_fd, &listen_fd, &service_fd, skel);
+	if (!ASSERT_OK(err, "talk_to_outside"))
+		goto cleanup;
+
+	err = close_connection(&client_fd, &service_fd, &listen_fd, skel);
+	if (!ASSERT_OK(err, "close_connection"))
+		goto cleanup;
+
+	ASSERT_EQ(skel->bss->g_unexpected, 0, "g_unexpected");
+	ASSERT_EQ(skel->bss->g_sock_state, TIME_WAIT, "g_sock_state");
+
+	uninstall_filters(&egress_link, &ingress_link);
+
+cleanup:
+	close(client_fd);
+	close(listen_fd);
+	close(service_fd);
+	close(cgroup_fd);
+	bpf_link__destroy(egress_link);
+	bpf_link__destroy(ingress_link);
+	cleanup_cgroup_environment();
+	cgroup_tcp_skb__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/cgroup_tcp_skb.c b/tools/testing/selftests/bpf/progs/cgroup_tcp_skb.c
new file mode 100644
index 000000000000..1e2e73f3b749
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgroup_tcp_skb.c
@@ -0,0 +1,382 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
+#include <linux/bpf.h>
+#include <bpf/bpf_endian.h>
+#include <bpf/bpf_helpers.h>
+
+#include <linux/if_ether.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <linux/ipv6.h>
+#include <linux/tcp.h>
+
+#include <sys/types.h>
+#include <sys/socket.h>
+
+#include "cgroup_tcp_skb.h"
+
+char _license[] SEC("license") = "GPL";
+
+__u16 g_sock_port = 0;
+__u32 g_sock_state = 0;
+int g_unexpected = 0;
+__u32 g_packet_count = 0;
+
+int needed_tcp_pkt(struct __sk_buff *skb, struct tcphdr *tcph)
+{
+	struct ipv6hdr ip6h;
+
+	if (skb->protocol != bpf_htons(ETH_P_IPV6))
+		return 0;
+	if (bpf_skb_load_bytes(skb, 0, &ip6h, sizeof(ip6h)))
+		return 0;
+
+	if (ip6h.nexthdr != IPPROTO_TCP)
+		return 0;
+
+	if (bpf_skb_load_bytes(skb, sizeof(ip6h), tcph, sizeof(*tcph)))
+		return 0;
+
+	if (tcph->source != bpf_htons(g_sock_port) &&
+	    tcph->dest != bpf_htons(g_sock_port))
+		return 0;
+
+	return 1;
+}
+
+/* Run accept() on a socket in the cgroup to receive a new connection. */
+static int egress_accept(struct tcphdr *tcph)
+{
+	if (g_sock_state == SYN_RECV_SENDING_SYN_ACK) {
+		if (tcph->fin || !tcph->syn || !tcph->ack)
+			g_unexpected++;
+		else
+			g_sock_state = SYN_RECV;
+		return 1;
+	}
+
+	return 0;
+}
+
+static int ingress_accept(struct tcphdr *tcph)
+{
+	switch (g_sock_state) {
+	case INIT:
+		if (!tcph->syn || tcph->fin || tcph->ack)
+			g_unexpected++;
+		else
+			g_sock_state = SYN_RECV_SENDING_SYN_ACK;
+		break;
+	case SYN_RECV:
+		if (tcph->fin || tcph->syn || !tcph->ack)
+			g_unexpected++;
+		else
+			g_sock_state = ESTABLISHED;
+		break;
+	default:
+		return 0;
+	}
+
+	return 1;
+}
+
+/* Run connect() on a socket in the cgroup to start a new connection. */
+static int egress_connect(struct tcphdr *tcph)
+{
+	if (g_sock_state == INIT) {
+		if (!tcph->syn || tcph->fin || tcph->ack)
+			g_unexpected++;
+		else
+			g_sock_state = SYN_SENT;
+		return 1;
+	}
+
+	return 0;
+}
+
+static int ingress_connect(struct tcphdr *tcph)
+{
+	if (g_sock_state == SYN_SENT) {
+		if (tcph->fin || !tcph->syn || !tcph->ack)
+			g_unexpected++;
+		else
+			g_sock_state = ESTABLISHED;
+		return 1;
+	}
+
+	return 0;
+}
+
+/* The connection is closed by the peer outside the cgroup. */
+static int egress_close_remote(struct tcphdr *tcph)
+{
+	switch (g_sock_state) {
+	case ESTABLISHED:
+		break;
+	case CLOSE_WAIT_SENDING_ACK:
+		if (tcph->fin || tcph->syn || !tcph->ack)
+			g_unexpected++;
+		else
+			g_sock_state = CLOSE_WAIT;
+		break;
+	case CLOSE_WAIT:
+		if (!tcph->fin)
+			g_unexpected++;
+		else
+			g_sock_state = LAST_ACK;
+		break;
+	default:
+		return 0;
+	}
+
+	return 1;
+}
+
+static int ingress_close_remote(struct tcphdr *tcph)
+{
+	switch (g_sock_state) {
+	case ESTABLISHED:
+		if (tcph->fin)
+			g_sock_state = CLOSE_WAIT_SENDING_ACK;
+		break;
+	case LAST_ACK:
+		if (tcph->fin || tcph->syn || !tcph->ack)
+			g_unexpected++;
+		else
+			g_sock_state = CLOSED;
+		break;
+	default:
+		return 0;
+	}
+
+	return 1;
+}
+
+/* The connection is closed by the endpoint inside the cgroup. */
+static int egress_close_local(struct tcphdr *tcph)
+{
+	switch (g_sock_state) {
+	case ESTABLISHED:
+		if (tcph->fin)
+			g_sock_state = FIN_WAIT1;
+		break;
+	case TIME_WAIT_SENDING_ACK:
+		if (tcph->fin || tcph->syn || !tcph->ack)
+			g_unexpected++;
+		else
+			g_sock_state = TIME_WAIT;
+		break;
+	default:
+		return 0;
+	}
+
+	return 1;
+}
+
+static int ingress_close_local(struct tcphdr *tcph)
+{
+	switch (g_sock_state) {
+	case ESTABLISHED:
+		break;
+	case FIN_WAIT1:
+		if (tcph->fin || tcph->syn || !tcph->ack)
+			g_unexpected++;
+		else
+			g_sock_state = FIN_WAIT2;
+		break;
+	case FIN_WAIT2:
+		if (!tcph->fin || tcph->syn || !tcph->ack)
+			g_unexpected++;
+		else
+			g_sock_state = TIME_WAIT_SENDING_ACK;
+		break;
+	default:
+		return 0;
+	}
+
+	return 1;
+}
+
+/* Check the types of outgoing packets of a server socket to make sure they
+ * are consistent with the state of the server socket.
+ *
+ * The connection is closed by the client side.
+ */
+SEC("cgroup_skb/egress")
+int server_egress(struct __sk_buff *skb)
+{
+	struct tcphdr tcph;
+
+	if (!needed_tcp_pkt(skb, &tcph))
+		return 1;
+
+	g_packet_count++;
+
+	/* Egress of the server socket. */
+	if (egress_accept(&tcph) || egress_close_remote(&tcph))
+		return 1;
+
+	g_unexpected++;
+	return 1;
+}
+
+/* Check the types of incoming packets of a server socket to make sure they
+ * are consistent with the state of the server socket.
+ *
+ * The connection is closed by the client side.
+ */
+SEC("cgroup_skb/ingress")
+int server_ingress(struct __sk_buff *skb)
+{
+	struct tcphdr tcph;
+
+	if (!needed_tcp_pkt(skb, &tcph))
+		return 1;
+
+	g_packet_count++;
+
+	/* Ingress of the server socket. */
+	if (ingress_accept(&tcph) || ingress_close_remote(&tcph))
+		return 1;
+
+	g_unexpected++;
+	return 1;
+}
+
+/* Check the types of outgoing packets of a server socket to make sure they
+ * are consistent with the state of the server socket.
+ *
+ * The connection is closed by the server side.
+ */
+SEC("cgroup_skb/egress")
+int server_egress_srv(struct __sk_buff *skb)
+{
+	struct tcphdr tcph;
+
+	if (!needed_tcp_pkt(skb, &tcph))
+		return 1;
+
+	g_packet_count++;
+
+	/* Egress of the server socket. */
+	if (egress_accept(&tcph) || egress_close_local(&tcph))
+		return 1;
+
+	g_unexpected++;
+	return 1;
+}
+
+/* Check the types of incoming packets of a server socket to make sure they
+ * are consistent with the state of the server socket.
+ *
+ * The connection is closed by the server side.
+ */
+SEC("cgroup_skb/ingress")
+int server_ingress_srv(struct __sk_buff *skb)
+{
+	struct tcphdr tcph;
+
+	if (!needed_tcp_pkt(skb, &tcph))
+		return 1;
+
+	g_packet_count++;
+
+	/* Ingress of the server socket. */
+	if (ingress_accept(&tcph) || ingress_close_local(&tcph))
+		return 1;
+
+	g_unexpected++;
+	return 1;
+}
+
+/* Check the types of outgoing packets of a client socket to make sure they
+ * are consistent with the state of the client socket.
+ *
+ * The connection is closed by the server side.
+ */
+SEC("cgroup_skb/egress")
+int client_egress_srv(struct __sk_buff *skb)
+{
+	struct tcphdr tcph;
+
+	if (!needed_tcp_pkt(skb, &tcph))
+		return 1;
+
+	g_packet_count++;
+
+	/* Egress of the server socket. */
+	if (egress_connect(&tcph) || egress_close_remote(&tcph))
+		return 1;
+
+	g_unexpected++;
+	return 1;
+}
+
+/* Check the types of incoming packets of a client socket to make sure they
+ * are consistent with the state of the client socket.
+ *
+ * The connection is closed by the server side.
+ */
+SEC("cgroup_skb/ingress")
+int client_ingress_srv(struct __sk_buff *skb)
+{
+	struct tcphdr tcph;
+
+	if (!needed_tcp_pkt(skb, &tcph))
+		return 1;
+
+	g_packet_count++;
+
+	/* Ingress of the server socket. */
+	if (ingress_connect(&tcph) || ingress_close_remote(&tcph))
+		return 1;
+
+	g_unexpected++;
+	return 1;
+}
+
+/* Check the types of outgoing packets of a client socket to make sure they
+ * are consistent with the state of the client socket.
+ *
+ * The connection is closed by the client side.
+ */
+SEC("cgroup_skb/egress")
+int client_egress(struct __sk_buff *skb)
+{
+	struct tcphdr tcph;
+
+	if (!needed_tcp_pkt(skb, &tcph))
+		return 1;
+
+	g_packet_count++;
+
+	/* Egress of the server socket. */
+	if (egress_connect(&tcph) || egress_close_local(&tcph))
+		return 1;
+
+	g_unexpected++;
+	return 1;
+}
+
+/* Check the types of incoming packets of a client socket to make sure they
+ * are consistent with the state of the client socket.
+ *
+ * The connection is closed by the client side.
+ */
+SEC("cgroup_skb/ingress")
+int client_ingress(struct __sk_buff *skb)
+{
+	struct tcphdr tcph;
+
+	if (!needed_tcp_pkt(skb, &tcph))
+		return 1;
+
+	g_packet_count++;
+
+	/* Ingress of the server socket. */
+	if (ingress_connect(&tcph) || ingress_close_local(&tcph))
+		return 1;
+
+	g_unexpected++;
+	return 1;
+}
Hello:
This series was applied to bpf/bpf-next.git (master) by Daniel Borkmann <daniel@iogearbox.net>:
On Fri, 23 Jun 2023 18:45:58 -0700 you wrote:
TCP SYN/ACK packets of connections from processes/sockets outside a cgroup on the same host are not received by the cgroup's installed cgroup_skb filters.
There were two BPF cgroup_skb programs attached to a cgroup named "my_cgroup".
[...]
Here is the summary with links:
  - [bpf-next,v4,1/2] net: bpf: Check SKB ownership against full socket.
    https://git.kernel.org/bpf/bpf-next/c/223f5f79f2ce
  - [bpf-next,v4,2/2] selftests/bpf: Verify that the cgroup_skb filters receive expected packets.
    https://git.kernel.org/bpf/bpf-next/c/539c7e67aa4a
You are awesome, thank you!