On 3/20/25 10:48 AM, Matthieu Baerts (NGI0) wrote:
From: Geliang Tang <tanggeliang@kylinos.cn>
It's necessary to traverse all subflows on the conn_list of an MPTCP socket and then call kfunc to modify the fields of each subflow. In kernel space, the mptcp_for_each_subflow() helper is used for this:

  mptcp_for_each_subflow(msk, subflow)
          kfunc(subflow);

But in the MPTCP BPF program, this has not yet been implemented. As Martin suggested recently, this conn_list walking + modify-by-kfunc usage fits the bpf_iter use case.
So this patch adds a new bpf_iter type named "mptcp_subflow" to do this and implements its helpers bpf_iter_mptcp_subflow_new()/_next()/_destroy(). It also registers these bpf_iter_mptcp_subflow kfuncs in the MPTCP common kfunc set. Then bpf_for_each() for mptcp_subflow can be used in a BPF program like this:

  bpf_for_each(mptcp_subflow, subflow, msk)
          kfunc(subflow);

Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
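
For readers less familiar with open-coded iterators, the new/next/destroy shape described in the commit message can be modelled in plain userspace C. This is only a sketch under stated assumptions: every type and function name below (subflow_model, msk_model, subflow_iter_*, mark_scheduled, mark_all_subflows) is a hypothetical stand-in, not the kernel's struct mptcp_sock, its conn_list, or the real bpf_iter_mptcp_subflow_*() kfuncs, and no locking is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; NOT the kernel's struct layouts. */
struct subflow_model {
	int id;
	int scheduled;              /* a field a kfunc might modify */
	struct subflow_model *next; /* stand-in for the conn_list linkage */
};

struct msk_model {
	struct subflow_model *conn_list; /* head of the subflow list */
};

/* Iterator state, mirroring the new/next/destroy triple of an iterator. */
struct subflow_iter_model {
	struct subflow_model *pos;
};

/* "new": position the iterator at the first subflow, or fail on bad input. */
static int subflow_iter_new(struct subflow_iter_model *it,
			    struct msk_model *msk)
{
	it->pos = NULL; /* keep "next" safe even if "new" fails */
	if (!msk)
		return -1;
	it->pos = msk->conn_list;
	return 0;
}

/* "next": return the current subflow and advance, or NULL at the end. */
static struct subflow_model *subflow_iter_next(struct subflow_iter_model *it)
{
	struct subflow_model *cur = it->pos;

	if (cur)
		it->pos = cur->next;
	return cur;
}

/* "destroy": nothing to release in this model, just reset the state. */
static void subflow_iter_destroy(struct subflow_iter_model *it)
{
	it->pos = NULL;
}

/* Stand-in for the kfunc applied to each subflow. */
static void mark_scheduled(struct subflow_model *sf)
{
	assert(sf != NULL);
	sf->scheduled = 1;
}

/* Walk every subflow, apply the "kfunc", and return how many were visited. */
static int mark_all_subflows(struct msk_model *msk)
{
	struct subflow_iter_model it;
	struct subflow_model *sf;
	int visited = 0;

	subflow_iter_new(&it, msk);
	while ((sf = subflow_iter_next(&it)) != NULL) {
		mark_scheduled(sf);
		visited++;
	}
	subflow_iter_destroy(&it);
	return visited;
}
```

In this model a failed "new" leaves the iterator empty, so the loop body simply never runs and "destroy" is still called once, which is the property that makes the loop easy to reason about.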
Notes:
 - v2:
   - Add BUILD_BUG_ON() checks, similar to the ones done with other bpf_iter_(...) helpers.
   - Replace msk_owned_by_me() by sock_owned_by_user_nocheck() and !spin_is_locked() (Martin).
 - v3:
   - Switch parameter from 'struct mptcp_sock' to 'struct sock' (Martin)
   - Remove unneeded !msk check (Martin)
   - Remove locks checks, add msk_owned_by_me for lockdep (Martin)
   - The following note and 2 questions have been added below.
This new bpf_iter will be used by our future BPF packet schedulers and path managers. To see how we are going to use them, please check our export branch [1], especially these two commits:
- "bpf: Add mptcp packet scheduler struct_ops": introduce a new struct_ops.
- "selftests/bpf: Add bpf_burst scheduler & test": new test showing how the new struct_ops and bpf_iter are being used.
[1] https://github.com/multipath-tcp/mptcp_net-next/commits/export
@BPF maintainers: we would like to allow this new mptcp_subflow bpf_iter to be used with struct_ops, but only with the two new MPTCP-specific ones we are going to introduce, and not with other struct_ops (TCP CC, sched_ext, etc.). We are not sure how to do that. Do you have examples or documentation you could point us to, so we can put this restriction in place?
bpf_qdisc.c already does that. Take a look at "bpf_qdisc_kfunc_filter()".
It is in net-next and bpf-next/net.
Also, for one of the two future MPTCP struct_ops, not all callbacks should be allowed to use this new bpf_iter, because they are called from different contexts. How can we ensure that such struct_ops callbacks cannot call the mptcp_subflow bpf_iter, without adding new dedicated checks, for all callbacks, that look at whether some locks are held? We understood that sched_ext wanted something similar, but we are not sure whether that code is ready or whether it is going to be accepted.
Same. Take a look at "bpf_qdisc_kfunc_filter()".
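
To make the idea concrete, here is a minimal userspace sketch of what a filter in that style could decide. It is not the bpf_qdisc code and not a proposed MPTCP implementation: prog_model, the ST_OPS_* and CB_* values, KFUNC_SUBFLOW_ITER and subflow_iter_filter() are all hypothetical names, and picking the path manager as the struct_ops with mixed-context callbacks is an arbitrary assumption for illustration only. The only points taken from the discussion are that the filter is keyed on which struct_ops the program is attached to, and, for one of them, on which callback it implements.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of the struct_ops a program can be attached to. */
enum st_ops_model {
	ST_OPS_MPTCP_SCHED, /* assumed: every callback may iterate subflows */
	ST_OPS_MPTCP_PM,    /* assumed: only some callbacks may do so */
	ST_OPS_TCP_CC,      /* unrelated struct_ops, must be rejected */
};

/* Hypothetical callback identifiers within a struct_ops. */
enum cb_model {
	CB_MSK_LOCKED,   /* assumed to run with the msk socket owned */
	CB_MSK_UNLOCKED, /* assumed to run from another context */
};

/* Hypothetical kfunc identifiers. */
enum kfunc_model {
	KFUNC_SUBFLOW_ITER, /* stands for the mptcp_subflow iterator kfuncs */
	KFUNC_OTHER,        /* any kfunc this filter does not care about */
};

/* What the filter knows about the program being verified (model only). */
struct prog_model {
	enum st_ops_model st_ops;
	enum cb_model callback;
};

/* Return 0 to allow the kfunc call, -EACCES to reject it. */
static int subflow_iter_filter(const struct prog_model *prog,
			       enum kfunc_model kfunc)
{
	assert(prog != NULL);

	/* Kfuncs outside the restricted set are not this filter's business. */
	if (kfunc != KFUNC_SUBFLOW_ITER)
		return 0;

	/* Only the two MPTCP struct_ops may use the iterator at all. */
	if (prog->st_ops != ST_OPS_MPTCP_SCHED &&
	    prog->st_ops != ST_OPS_MPTCP_PM)
		return -EACCES;

	/* For the mixed-context struct_ops, also check the callback. */
	if (prog->st_ops == ST_OPS_MPTCP_PM &&
	    prog->callback != CB_MSK_LOCKED)
		return -EACCES;

	return 0;
}
```

The point of this shape is that the decision is made once, when the program is checked, from static facts about where it is attached, so no runtime lock checks need to be added to the individual callbacks.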