2025/4/29 08:56, "Cong Wang" xiyou.wangcong@gmail.com wrote:
On Mon, Apr 28, 2025 at 04:16:52PM +0800, Jiayuan Chen wrote:
+bpf_sk_skb_set_redirect_cpu()
+^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+    int bpf_sk_skb_set_redirect_cpu(struct __sk_buff *s, int redir_cpu)
+This kfunc ``bpf_sk_skb_set_redirect_cpu()`` is available to
+``BPF_PROG_TYPE_SK_SKB`` BPF programs. It sets the CPU affinity, allowing the
+sockmap packet redirecting process to run on the specified CPU as much as
+possible, helping users reduce the interference between the sockmap redirecting
+background thread and other threads.
I am wondering if it is a better idea to use BPF_MAP_TYPE_CPUMAP for
redirection here instead? Like we did for bpf_redirect_map(). At least
we would not need to store CPU in psock with this approach.
Thanks.
You mean to use BPF_MAP_TYPE_CPUMAP with XDP to redirect packets to a specific CPU?
I tested that approach and found two sources of overhead:
1. The XDP program has to parse from the L2 header down to the L4 header to obtain the 5-tuple, and an additional map is needed to store the mapping from each 5-tuple to a process/CPU. In the multi-process scenario, by contrast, where each process is bound to one CPU and has its own map, I can simply use a global variable to tell the BPF program which thread it should use. This is especially convenient for programs that enable reuseport.
2. Furthermore, regarding performance, I tested with cpumap and the results were lower than expected. This is because loopback only supports xdp_generic mode, and the problem I described in the cover letter actually occurs on loopback...
Code:
'''
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_CPUMAP);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(struct bpf_cpumap_val));
    __uint(max_entries, 64);
} cpu_map SEC(".maps");

SEC("xdp")
int xdp_stats1_func(struct xdp_md *ctx)
{
    /* Real world:
     * 1. get 5-tuple from ctx
     * 2. get corresponding cpu of current skb through XX_MAP
     */
    int ret = bpf_redirect_map(&cpu_map, 3, 0); /* redirect to CPU 3 */
    return ret;
}
'''
Result:
'''
./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress --no-verify
Setting up benchmark 'sockmap'...
create socket fd c1:14 p1:15 c2:16 p2:17
Benchmark 'sockmap' started.
Iter   0 ( 36.439us): Send Speed  561.496 MB/s ... Rcv Speed   33.264 MB/s
Iter   1 ( -7.448us): Send Speed  558.443 MB/s ... Rcv Speed   32.611 MB/s
Iter   2 ( -2.245us): Send Speed  557.131 MB/s ... Rcv Speed   33.004 MB/s
Iter   3 ( -2.845us): Send Speed  547.374 MB/s ... Rcv Speed   33.331 MB/s
Iter   4 (  0.745us): Send Speed  562.891 MB/s ... Rcv Speed   34.117 MB/s
Iter   5 ( -2.056us): Send Speed  560.994 MB/s ... Rcv Speed   33.069 MB/s
Iter   6 (  5.343us): Send Speed  562.038 MB/s ... Rcv Speed   33.200 MB/s
'''
Instead, we can introduce a new kfunc to specify the CPU on which the backlog thread runs, which avoids the need for XDP entirely. After all, this is a "problem" introduced by the BPF L7 framework itself, so it is better to solve it ourselves.
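
To make point 1 above more concrete, the per-packet work on the XDP side would look roughly like the sketch below. It is plain userspace C rather than the XDP program used for the benchmark, and the names `struct five_tuple`, `parse_five_tuple()`, `struct flow_entry` and `lookup_cpu()` are hypothetical, introduced only for illustration. It assumes untagged Ethernet carrying IPv4 with TCP or UDP, and ignores VLAN tags, IPv6 and fragments.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN    14
#define ETHERTYPE_IPV4 0x0800
#define PROTO_TCP      6
#define PROTO_UDP      17

/* Hypothetical 5-tuple record; all fields are stored in host byte order. */
struct five_tuple {
    uint32_t saddr;
    uint32_t daddr;
    uint16_t sport;
    uint16_t dport;
    uint8_t  proto;
};

/* Hypothetical stand-in for the extra "5-tuple -> CPU" map. */
struct flow_entry {
    struct five_tuple key;
    int cpu;
};

static uint16_t load_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Step 1: walk L2 -> L3 -> L4 with bounds checks at each layer.
 * Returns 0 on success, -1 if the frame is too short or not IPv4 TCP/UDP.
 */
static int parse_five_tuple(const uint8_t *data, size_t len,
                            struct five_tuple *t)
{
    const uint8_t *ip, *l4;
    size_t ihl;

    if (len < ETH_HDR_LEN + 20)
        return -1;
    if (load_be16(data + 12) != ETHERTYPE_IPV4)
        return -1;

    ip = data + ETH_HDR_LEN;
    if ((ip[0] >> 4) != 4)
        return -1;
    ihl = (size_t)(ip[0] & 0x0f) * 4;   /* IPv4 header length in bytes */
    if (ihl < 20 || len < ETH_HDR_LEN + ihl + 4)
        return -1;
    if (ip[9] != PROTO_TCP && ip[9] != PROTO_UDP)
        return -1;

    l4 = ip + ihl;                      /* ports are the first 4 bytes */
    t->proto = ip[9];
    t->saddr = load_be32(ip + 12);
    t->daddr = load_be32(ip + 16);
    t->sport = load_be16(l4);
    t->dport = load_be16(l4 + 2);
    return 0;
}

/* Step 2: look up the CPU for this flow; returns -1 if the flow is unknown.
 * A linear scan is used here only to keep the sketch short.
 */
static int lookup_cpu(const struct flow_entry *tbl, size_t n,
                      const struct five_tuple *t)
{
    for (size_t i = 0; i < n; i++) {
        const struct five_tuple *k = &tbl[i].key;

        if (k->saddr == t->saddr && k->daddr == t->daddr &&
            k->sport == t->sport && k->dport == t->dport &&
            k->proto == t->proto)
            return tbl[i].cpu;
    }
    return -1;
}
```

Both steps would have to run for every packet, and the flow table would have to be kept in sync with socket creation and teardown from userspace. Neither is needed when each process can simply publish its CPU through a global variable.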