Check SKB ownership of an SKB against full sockets instead of request_sock.
The filters were called only if an SKB is owned by the sock that the SKB is sent out through. In another words, skb->sk should point to the sock that it is sending through its egress. However, the filters would miss SYN/ACK SKBs that they are owned by a request_sock but sent through the listener sock, that is the socket listening incoming connections.
However, the listener socket is also the full socket of the request socket. We should use the full socket as the owner socket of an SKB instead.
What is the ownership check for? ====================================
BPF_CGROUP_RUN_PROG_INET_EGRESS() checked sk == skb->sk to ensure the ownership of an SKB. Alexei referred to a mailing list conversation that took place a few years ago. [1] In that conversation, Daniel Borkmann stated that:
Wouldn't that mean however, when you go through stacked devices that you'd run the same eBPF cgroup program for skb->sk multiple times?
According to what Daniel said, the ownership check mentioned earlier presumably prevents multiple calls of egress filters caused by an skb.
A test that reproduce this scenario shows that the BPF cgroup egress programs can be called multiple times for one SKB if this ownership check is not there. So, we can not just remove this check.
Test Stacked Devices =======================
We use L2TP to build an environment of stacked devices. L2TP (Layer 2 Tunneling Protocol) is a tunneling protocol used to support virtual private networks (VPNs). It relays encapsulated packets; for example in UDP, to its peer by using a socket.
Using L2TP, packets are first sent through the IP stack and should then arrive at an L2TP device. The device will expand its SKB header to encapsulate the packet. The SKB will be sent back to the IP stack using the socket that was made for the L2TP session. After that, the routing process will occur once more, but this time for a new destination.
We changed tools/testing/selftests/net/l2tp.sh to set up a test environment using L2TP. The run_ping() function in l2tp.sh is where the main change occurred.
run_ping() { local desc="$1"
sleep 10 run_cmd host-1 ${ping6} -s 227 -c 4 -i 10 -I fc00:101::1 fc00:101::2 log_test $? 0 "IPv6 route through L2TP tunnel ${desc}" sleep 10
}
The test will use L2TP devices to send PING messages. These messages will have a message size of 227 bytes as a special label to distinguish them. This is not an ideal solution, but works.
During the execution of the test script, bpftrace was attached to ip6_finish_output() and l2tp_xmit_skb(). BPF
bpftrace -e ' kfunc:ip6_finish_output { time("%H:%M:%S: "); printf("ip6_finish_output skb=%p skb->len=%d cgroup=%p sk=%p skb->sk=%p\n", args->skb, args->skb->len, args->sk->sk_cgrp_data.cgroup, args->sk, args->skb->sk); } kfunc:l2tp_xmit_skb { time("%H:%M:%S: "); printf("l2tp_xmit_skb skb=%p sk=%p\n", args->skb, args->session->tunnel->sock); }'
The following is part of the output messages printed by bpftrace.
16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=275 cgroup=0xffff88810741f800 sk=0xffff888105f3b900 skb->sk=0xffff888105f3b900
16:35:20: l2tp_xmit_skb skb=0xffff888103d8e600 sk=0xffff888103dd6300
16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=337 cgroup=0xffff88810741f800 sk=0xffff888103dd6300 skb->sk=0xffff888105f3b900
16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=337 cgroup=(nil) sk=(nil) skb->sk=(nil)
16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=275 cgroup=0xffffffff837741d0 sk=0xffff888101fe0000 skb->sk=0xffff888101fe0000
16:35:20: l2tp_xmit_skb skb=0xffff888103d8e000 sk=0xffff888103483180
16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=337 cgroup=0xffff88810741f800 sk=0xffff888103483180 skb->sk=0xffff888101fe0000
16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=337 cgroup=(nil) sk=(nil) skb->sk=(nil)
The first four entries describe a PING message that was sent using the ping command, whereas the following four entries describe the response received. Multiple sockets are used to send one skb, including the socket used by the L2TP session. This can be observed.
Based on this information, it seems that the ownership check is designed to avoid multiple calls of egress filters caused by a single skb.
[1] https://lore.kernel.org/all/58193E9D.7040201@iogearbox.net/,
Signed-off-by: Kui-Feng Lee kuifeng@meta.com --- include/linux/bpf-cgroup.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h index 57e9e109257e..8506690dbb9c 100644 --- a/include/linux/bpf-cgroup.h +++ b/include/linux/bpf-cgroup.h @@ -199,9 +199,9 @@ static inline bool cgroup_bpf_sock_enabled(struct sock *sk, #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb) \ ({ \ int __ret = 0; \ - if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk && sk == skb->sk) { \ + if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) { \ typeof(sk) __sk = sk_to_full_sk(sk); \ - if (sk_fullsock(__sk) && \ + if (sk_fullsock(__sk) && __sk == skb_to_full_sk(skb) && \ cgroup_bpf_sock_enabled(__sk, CGROUP_INET_EGRESS)) \ __ret = __cgroup_bpf_run_filter_skb(__sk, skb, \ CGROUP_INET_EGRESS); \