Overall, we encountered a warning [1] that can be triggered by running the selftest I provided.
sockmap works by replacing sk_data_ready, recvmsg, sendmsg operations and implementing fast socket-level forwarding logic: 1. Users can obtain file descriptors through userspace socket()/accept() interfaces, then call BPF syscall to perform these replacements. 2. Users can also use the bpf_sock_hash_update helper (in sockops programs) to replace handlers when TCP connections enter ESTABLISHED state (BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB/BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB)
However, when combined with MPTCP, an issue arises: MPTCP creates subflow sk's and performs TCP handshakes, so the BPF program obtains subflow sk's and may incorrectly replace their sk_prot. We need to reject such operations. In patch 1, we set psock_update_sk_prot to NULL in the subflow's custom sk_prot.
Additionally, if the server's listening socket has MPTCP enabled and the client's TCP also uses MPTCP, we should allow the combination of subflow and sockmap. This is because the latest Golang programs have enabled MPTCP for listening sockets by default [2]. For programs already using sockmap, upgrading Golang should not cause sockmap functionality to fail.
Patch 2 prevents the panic from occurring.
Despite these patches fixing stream corruption, users of sockmap must set GODEBUG=multipathtcp=0 to disable MPTCP until sockmap fully supports it.
[1] truncated warning: ------------[ cut here ]------------
BUG: kernel NULL pointer dereference, address: 00000000000004bb PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 0 PID: 400 Comm: test_progs Not tainted 6.1.0+ #16 RIP: 0010:mptcp_stream_accept (./include/linux/list.h:88 net/mptcp/protocol.c:3719)
RSP: 0018:ffffc90000ef3cf0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8880089dcc58 RDX: 0000000000000003 RSI: 0000002c000000b0 RDI: 0000000000000000 RBP: ffffc90000ef3d38 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8880089dc600 R13: ffff88800b859e00 R14: ffff88800638c680 R15: 0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000004bb CR3: 000000000b8e8006 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ? apparmor_socket_accept (security/apparmor/lsm.c:966) do_accept (net/socket.c:1856) __sys_accept4 (net/socket.c:1897 net/socket.c:1927) __x64_sys_accept (net/socket.c:1941) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[2]: https://go-review.googlesource.com/c/go/+/607715
Jiayuan Chen (2): mptcp: disallow MPTCP subflows from sockmap net,mptcp: fix proto fallback detection with BPF
net/mptcp/protocol.c | 5 +++-- net/mptcp/subflow.c | 8 ++++++++ 2 files changed, 11 insertions(+), 2 deletions(-)
The sockmap feature allows bpf syscall from userspace, or based on bpf sockops, replacing the sk_prot of sockets during protocol stack processing with sockmap's custom read/write interfaces. ''' tcp_rcv_state_process() subflow_syn_recv_sock() tcp_init_transfer(BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB) bpf_skops_established <== sockops bpf_sock_map_update(sk) <== call bpf helper tcp_bpf_update_proto() <== update sk_prot ''' Consider two scenarios:
1. When the server has MPTCP enabled and the client also requests MPTCP, the sk passed to the BPF program is a subflow sk. Since subflows only handle partial data, replacing their sk_prot is meaningless and will cause traffic disruption.
2. When the server has MPTCP enabled but the client sends a TCP SYN without MPTCP, subflow_syn_recv_sock() performs a fallback on the subflow, replacing the subflow sk's sk_prot with the native sk_prot. ''' subflow_ulp_fallback() subflow_drop_ctx() mptcp_subflow_ops_undo_override() ''' Subsequently, accept::mptcp_stream_accept::mptcp_fallback_tcp_ops() converts the subflow to plain TCP.
For the first case, we should prevent it from being combined with sockmap by setting sk_prot->psock_update_sk_prot to NULL, which will be blocked by sockmap's own flow.
For the second case, since subflow_syn_recv_sock() has already restored sk_prot to native tcp_prot/tcpv6_prot, no further action is needed.
Fixes: d2f77c53342e ("mptcp: check for plain TCP sock at accept time") Reviewed-by: Matthieu Baerts (NGI0) matttbe@kernel.org Signed-off-by: Jiayuan Chen jiayuan.chen@linux.dev --- net/mptcp/subflow.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c index 2159b5f9988f..c922bbb12bd8 100644 --- a/net/mptcp/subflow.c +++ b/net/mptcp/subflow.c @@ -1936,6 +1936,10 @@ void __init mptcp_subflow_init(void)
tcp_prot_override = tcp_prot; tcp_prot_override.release_cb = tcp_release_cb_override; +#ifdef CONFIG_BPF_SYSCALL + /* Disable sockmap processing for subflows */ + tcp_prot_override.psock_update_sk_prot = NULL; +#endif
#if IS_ENABLED(CONFIG_MPTCP_IPV6) subflow_request_sock_ipv6_ops = tcp_request_sock_ipv6_ops; @@ -1957,6 +1961,10 @@ void __init mptcp_subflow_init(void)
tcpv6_prot_override = tcpv6_prot; tcpv6_prot_override.release_cb = tcp_release_cb_override; +#ifdef CONFIG_BPF_SYSCALL + /* Disable sockmap processing for subflows */ + tcpv6_prot_override.psock_update_sk_prot = NULL; +#endif #endif
mptcp_diag_subflow_init(&subflow_ulp_ops);
The sockmap feature allows bpf syscall from userspace, or based on bpf sockops, replacing the sk_prot of sockets during protocol stack processing with sockmap's custom read/write interfaces. ''' tcp_rcv_state_process() syn_recv_sock()/subflow_syn_recv_sock() tcp_init_transfer(BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB) bpf_skops_established <== sockops bpf_sock_map_update(sk) <== call bpf helper tcp_bpf_update_proto() <== update sk_prot '''
When the server has MPTCP enabled but the client sends a TCP SYN without MPTCP, subflow_syn_recv_sock() performs a fallback on the subflow, replacing the subflow sk's sk_prot with the native sk_prot. ''' subflow_syn_recv_sock() subflow_ulp_fallback() subflow_drop_ctx() mptcp_subflow_ops_undo_override() '''
Then, this subflow can be normally used by sockmap, which replaces the native sk_prot with sockmap's custom sk_prot. The issue occurs when the user executes accept::mptcp_stream_accept::mptcp_fallback_tcp_ops(). Here, it uses sk->sk_prot to compare with the native sk_prot, but this is incorrect when sockmap is used, as we may incorrectly set sk->sk_socket->ops.
This fix uses the more generic sk_family for the comparison instead.
Additionally, this also prevents a PANIC from occurring:
result from ./scripts/decode_stacktrace.sh: ------------[ cut here ]------------
BUG: kernel NULL pointer dereference, address: 00000000000004bb PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 0 PID: 400 Comm: test_progs Not tainted 6.1.0+ #16 RIP: 0010:mptcp_stream_accept (./include/linux/list.h:88 net/mptcp/protocol.c:3719)
RSP: 0018:ffffc90000ef3cf0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8880089dcc58 RDX: 0000000000000003 RSI: 0000002c000000b0 RDI: 0000000000000000 RBP: ffffc90000ef3d38 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8880089dc600 R13: ffff88800b859e00 R14: ffff88800638c680 R15: 0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000004bb CR3: 000000000b8e8006 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ? apparmor_socket_accept (security/apparmor/lsm.c:966) do_accept (net/socket.c:1856) __sys_accept4 (net/socket.c:1897 net/socket.c:1927) __x64_sys_accept (net/socket.c:1941) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
Fixes: d2f77c53342e ("mptcp: check for plain TCP sock at accept time") Reviewed-by: Jakub Sitnicki jakub@cloudflare.com Reviewed-by: Matthieu Baerts (NGI0) matttbe@kernel.org Signed-off-by: Jiayuan Chen jiayuan.chen@linux.dev --- net/mptcp/protocol.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 1dbc62537259..13e3510e6c8f 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -79,8 +79,9 @@ static u64 mptcp_wnd_end(const struct mptcp_sock *msk) static bool mptcp_is_tcpsk(struct sock *sk) { struct socket *sock = sk->sk_socket; + unsigned short family = READ_ONCE(sk->sk_family);
- if (unlikely(sk->sk_prot == &tcp_prot)) { + if (unlikely(family == AF_INET)) { /* we are being invoked after mptcp_accept() has * accepted a non-mp-capable flow: sk is a tcp_sk, * not an mptcp one. @@ -91,7 +92,7 @@ static bool mptcp_is_tcpsk(struct sock *sk) sock->ops = &inet_stream_ops; return true; #if IS_ENABLED(CONFIG_MPTCP_IPV6) - } else if (unlikely(sk->sk_prot == &tcpv6_prot)) { + } else if (unlikely(family == AF_INET6)) { sock->ops = &inet6_stream_ops; return true; #endif
Hi Jiayuan,
On 30/11/2025 04:23, Jiayuan Chen wrote:
The sockmap feature allows bpf syscall from userspace, or based on bpf sockops, replacing the sk_prot of sockets during protocol stack processing with sockmap's custom read/write interfaces.
(...)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 1dbc62537259..13e3510e6c8f 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -79,8 +79,9 @@ static u64 mptcp_wnd_end(const struct mptcp_sock *msk) static bool mptcp_is_tcpsk(struct sock *sk) { struct socket *sock = sk->sk_socket;
- unsigned short family = READ_ONCE(sk->sk_family);
- if (unlikely(sk->sk_prot == &tcp_prot)) {
- if (unlikely(family == AF_INET)) { /* we are being invoked after mptcp_accept() has
- accepted a non-mp-capable flow: sk is a tcp_sk,
- not an mptcp one.
@@ -91,7 +92,7 @@ static bool mptcp_is_tcpsk(struct sock *sk) sock->ops = &inet_stream_ops; return true; #if IS_ENABLED(CONFIG_MPTCP_IPV6)
- } else if (unlikely(sk->sk_prot == &tcpv6_prot)) {
- } else if (unlikely(family == AF_INET6)) {
These modifications here break MPTCP: this function (mptcp_is_tcpsk) is there to check if the socket is a "plain" TCP one (return "true") or an MPTCP one (return "false"). If it is not an MPTCP one, the sock ops is modified.
Here, you are saying: any IPv4 or IPv6 socket is a "plain" TCP one, never an MPTCP socket then.
I suggest adding ...
if (sk->sk_protocol == IPPROTO_MPTCP) return false;
... at the beginning of this function. I'm planning to send a patch later on including this check. Once it is sent, do you mind checking it with sockmap if you have the setup available, please?
Cheers, Matt
Hi Jiayuan,
On 30/11/2025 04:23, Jiayuan Chen wrote:
Overall, we encountered a warning [1] that can be triggered by running the selftest I provided.
Thank you for having shared these patches, but there are some issues with them:
- Patch 1/2 has already been queued, see [1].
- Patch 2/2 is the exact same patch as the one sent by Sasha [2], and recently dropped [3] because it breaks MPTCP. See my comment there.
- You need to follow the stable rules [4] when sending patches to stable. In short, here, you should have kept the original upstream commit message, then add the SHA and a note about the conflicts, e.g. similar to [5]. Without that, it is a new patch, not a backport (and the reviewed-by/acked-by/... cannot be kept).
[1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/tree... [2] https://lore.kernel.org/stable/9e6ef98f-12eb-4608-aece-cf321e0a38d7@kernel.o... [3] https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/comm... [4] https://docs.kernel.org/process/stable-kernel-rules.html [5] https://lore.kernel.org/stable/20251129165612.2125498-2-matttbe@kernel.org/T...
Cheers, Matt
2025/12/1 24:21, "Matthieu Baerts" <matttbe@kernel.org mailto:matttbe@kernel.org?to=%22Matthieu%20Baerts%22%20%3Cmatttbe%40kernel.org%3E > wrote:
Hi Jiayuan,
On 30/11/2025 04:23, Jiayuan Chen wrote:
The sockmap feature allows bpf syscall from userspace, or based on bpf sockops, replacing the sk_prot of sockets during protocol stack processing with sockmap's custom read/write interfaces.
(...)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 1dbc62537259..13e3510e6c8f 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -79,8 +79,9 @@ static u64 mptcp_wnd_end(const struct mptcp_sock *msk) static bool mptcp_is_tcpsk(struct sock *sk) { struct socket *sock = sk->sk_socket;
- unsigned short family = READ_ONCE(sk->sk_family);
- if (unlikely(sk->sk_prot == &tcp_prot)) {
- if (unlikely(family == AF_INET)) {
/* we are being invoked after mptcp_accept() has
- accepted a non-mp-capable flow: sk is a tcp_sk,
- not an mptcp one.
@@ -91,7 +92,7 @@ static bool mptcp_is_tcpsk(struct sock *sk) sock->ops = &inet_stream_ops; return true; #if IS_ENABLED(CONFIG_MPTCP_IPV6)
- } else if (unlikely(sk->sk_prot == &tcpv6_prot)) {
- } else if (unlikely(family == AF_INET6)) {
These modifications here break MPTCP: this function (mptcp_is_tcpsk) is there to check if the socket is a "plain" TCP one (return "true") or an MPTCP one (return "false"). If it is not an MPTCP one, the sock ops is modified.
Here, you are saying: any IPv4 or IPv6 socket is a "plain" TCP one, never an MPTCP socket then.
I suggest adding ...
if (sk->sk_protocol == IPPROTO_MPTCP) return false;
... at the beginning of this function. I'm planning to send a patch later on including this check. Once it is sent, do you mind checking it with sockmap if you have the setup available, please?
Yes, of course. I can test it once I receive the patch.
Cheers, Matt -- Sponsored by the NGI0 Core fund.
linux-stable-mirror@lists.linaro.org