On 6/22/23 8:28 PM, Yonghong Song wrote:
On 6/22/23 10:15 AM, Kui-Feng Lee wrote:
On 6/21/23 20:37, Yonghong Song wrote:
On 6/20/23 10:14 AM, Kui-Feng Lee wrote:
Always call BPF filters if CGROUP BPF is enabled for EGRESS without checking skb->sk against sk.
The filters were called only if the skb is owned by the sock that the skb is sent out through. In other words, skb->sk should point to the sock that it is sent through on egress. However, the filters would miss SYNACK skbs, which are owned by a request_sock but sent through the listening sock, that is, the socket listening for incoming connections. This is an unnecessary restriction.
The original patch that introduced 'sk == skb->sk' is 3007098494be ("cgroup: add support for eBPF programs"). The commit message does not mention why 'sk == skb->sk' is needed, so it is possible that the check was simply a restriction for the use cases at that time. Now there are use cases where 'sk != skb->sk', so removing this check can enable the new use case. Maybe you can add this to your commit message so people can understand the history of 'sk == skb->sk'.
After checking the code and Alexei's comment[1] again, this check may be different from what I thought. In another post[2], Daniel Borkmann mentioned:
Wouldn't that mean however, when you go through stacked devices that you'd run the same eBPF cgroup program for skb->sk multiple times?
I read this paragraph several times. This check ensures the filters are only called for the device on the top of a stack. So, I probably should change the check to
sk == skb_to_full_sk(skb)
I think this should work. It exactly covers your use case, where SYNACK skbs are owned by a request_sock but sent through the listening sock (the socket listening for incoming connections), and it is still equivalent to sk == skb->sk in the non-request_sock/listening-sock case.
Just a thought, should the test look like the below?
    int __ret = 0;                                                     \
    if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) {                \
        typeof(sk) __sk = sk_to_full_sk(sk);                           \
        if (sk_fullsock(__sk) && __sk == skb_to_full_sk(skb) &&        \
            cgroup_bpf_sock_enabled(__sk, CGROUP_INET_EGRESS))         \
            __ret = __cgroup_bpf_run_filter_skb(__sk, skb,             \
                                                CGROUP_INET_EGRESS);   \
    }                                                                  \
IOW, we already convert sk to a full sk in __sk, so we should also use that for the test against skb_to_full_sk(skb).
Thanks, Daniel