On Thu, Apr 3, 2025 at 5:58 AM Ricardo Cañuelo Navarro rcn@igalia.com wrote:
Thanks for reviewing, answers below:
On Wed, Apr 02 2025 at 15:40:56, Xin Long lucien.xin@gmail.com wrote:
The data send path:
sctp_endpoint_lookup_assoc() -> sctp_sendmsg_to_asoc()
And the transport removal path:
sctp_sf_do_asconf() -> sctp_process_asconf() -> sctp_assoc_rm_peer()
are both protected by the same socket lock.
Additionally, when a path is removed, sctp_assoc_rm_peer() updates the transport of all existing chunks in the send queues (peer->transmitted and asoc->outqueue.out_chunk_list) to NULL.
It will be great if you can reproduce the issue locally and help check how the potential race occurs.
That's true but if there isn't enough space in the send buffer, then sctp_sendmsg_to_asoc() will release the lock temporarily.
Oh right, I missed that. Thanks.
The scenario that the reproducer generates is the following:
Thread A Thread B -------------------- --------------------
(1) sctp_sendmsg() lock_sock() sctp_sendmsg_to_asoc() sctp_wait_for_sndbuf() release_sock() sctp_setsockopt(SCTP_SOCKOPT_BINDX_REM) lock_sock() sctp_setsockopt_bindx() sctp_send_asconf_del_ip() ... release_sock() process rcv backlog: sctp_do_sm() sctp_sf_do_asconf() ... sctp_assoc_rm_peer() lock_sock() (2) chunk->transport = transport sctp_primitive_SEND() ... sctp_outq_select_transport() *BUG* switch (new_transport->state)
Notes:
Both threads operate on the same socket.
- Here, sctp_endpoint_lookup_assoc() finds and returns an existing
association and transport.
- At this point, `transport` is already deleted. chunk->transport is
not set to NULL because sctp_assoc_rm_peer() ran _before_ the transport was assigned to the chunk.
We should avoid an extra hashtable lookup on this hot TX path, as it would negatively impact performance.
Good point. I can't really tell the performance impact of the lookup here, my experience with the SCTP implementation is very limited. Do you have any suggestions or alternatives about how to deal with this?
I think the correct approach is to follow how sctp_assoc_rm_peer() handles this.
You can use asoc->peer.last_sent_to (which isn't really used elsewhere) to temporarily store the transport before releasing the socket lock and sleeping in sctp_sendmsg_to_asoc(). After waking up and reacquiring the lock, restore the transport back to asoc->peer.last_sent_to.
Additionally, during an ASCONF update, ensure asoc->peer.last_sent_to is set to a valid transport if it matches the transport being removed.
For example:
in sctp_wait_for_sndbuf():
asoc->peer.last_sent_to = *tp; release_sock(sk); current_timeo = schedule_timeout(current_timeo); lock_sock(sk); *tp = asoc->peer.last_sent_to; asoc->peer.last_sent_to = NULL;
in sctp_assoc_rm_peer():
if (asoc->peer.last_sent_to == peer) asoc->peer.last_sent_to = transport;