On 16/12/2024 12:50, Antonio Quartulli wrote:
On 16/12/2024 12:09, Sabrina Dubroca wrote: [...]
Maybe we should call cancel_work_sync(&ovpn_sock->work) inside ovpn_socket_get(), so that the latter returns NULL only once it is sure that the socket has been detached?
At that point we can skip the following return and continue along the "new socket" path.
What do you think?
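Roughly this shape, as a sketch only (hypothetical code and field names based on this thread, not the actual ovpn implementation); note that it assumes ovpn_sock is still valid at that point and that the detach work has already been queued, which is exactly what gets questioned below:

#include <linux/kref.h>
#include <linux/rcupdate.h>
#include <linux/workqueue.h>
#include <net/sock.h>

/* minimal stand-in, only the members relevant to this sketch */
struct ovpn_socket {
	struct kref refcount;
	struct work_struct work;	/* detach work, queued by the kref release */
};

static struct ovpn_socket *ovpn_socket_get(struct socket *sock)
{
	struct ovpn_socket *ovpn_sock;
	bool dying = false;

	rcu_read_lock();
	ovpn_sock = rcu_dereference_sk_user_data(sock->sk);
	if (ovpn_sock && !kref_get_unless_zero(&ovpn_sock->refcount))
		dying = true;
	rcu_read_unlock();

	if (dying) {
		/* the last reference is gone: wait for the detach work so
		 * that sk_user_data has been cleared by the time we report
		 * "no socket". cancel_work_sync() sleeps, hence it sits
		 * outside the RCU read side; ovpn_sock must not have been
		 * freed yet for this to be safe.
		 */
		cancel_work_sync(&ovpn_sock->work);
		return NULL;
	}

	return ovpn_sock;
}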
The work may not have been scheduled yet? There is a small window between the last kref_put() and the schedule_work() call, so cancel_work_sync() could return before the detach has actually been performed.
Maybe a completion [Documentation/scheduler/completion.rst] would solve it (but it makes things even more complex, unfortunately):
- at the end of ovpn_socket_detach: complete(&ovpn_sock->detached);
- in ovpn_socket_new when handling EALREADY: wait_for_completion(&ovpn_sock->detached);
- in ovpn_socket_new for the new socket: init_completion(&ovpn_sock->detached);
but ovpn_sock could be gone immediately after complete(). Maybe something with completion_done() before the kfree_rcu in ovpn_socket_detach? I'm not that familiar with the completion API.
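For reference, a minimal sketch of where those three completion calls would sit (hypothetical helpers just to show placement, names taken from the list above; the concern about ovpn_sock disappearing right after complete() still applies):

#include <linux/completion.h>
#include <linux/kref.h>
#include <linux/workqueue.h>

/* hypothetical layout, following the names used above */
struct ovpn_socket {
	struct kref refcount;
	struct work_struct work;
	struct completion detached;
};

/* ovpn_socket_new(), brand-new-socket path: arm the completion */
static void ovpn_socket_init_detached(struct ovpn_socket *ovpn_sock)
{
	init_completion(&ovpn_sock->detached);
}

/* end of ovpn_socket_detach(): signal that sk_user_data is gone */
static void ovpn_socket_signal_detached(struct ovpn_socket *ovpn_sock)
{
	complete(&ovpn_sock->detached);
}

/* ovpn_socket_new(), EALREADY path with a zero refcount: wait until the
 * old ovpn_socket has really been detached before attaching a new one
 */
static void ovpn_socket_wait_detached(struct ovpn_socket *ovpn_sock)
{
	wait_for_completion(&ovpn_sock->detached);
}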
It seems the solution we are aiming for is more complex than the concept of ovpn_socket per se :-D
I'll think a bit more about this... maybe we can avoid entering this situation at all.
I see that there are kref_put() variants (kref_put_lock(), kref_put_mutex()) that acquire a lock just before the refcount hits zero and the release callback runs.
If I implement a similar variant that acquires the socket lock via lock_sock(), I could perform the UDP detach under that lock, thus ensuring that zeroing the refcount and clearing sk_user_data happen while the socket lock is held.
This way I should be able to prevent the situation where "sk_user_data still says EALREADY, but the refcnt is actually 0".
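For illustration, this is the kind of helper I have in mind (the name kref_put_sock() and its exact semantics are made up here, open-coding the same dec-not-one / lock / dec-and-test dance as refcount_dec_and_lock(), since as far as I can see there is no lock_sock() equivalent of that helper):

#include <linux/kref.h>
#include <linux/refcount.h>
#include <net/sock.h>

/* Drop a reference; if it is the last one, call release() with the socket
 * locked, so that the refcount reaching zero and sk_user_data being cleared
 * cannot interleave with a reader that takes lock_sock() before looking at
 * sk_user_data. Must be called from process context (lock_sock() sleeps).
 */
static inline int kref_put_sock(struct kref *kref,
				void (*release)(struct kref *kref),
				struct sock *sk)
{
	/* fast path: we are definitely not the last user */
	if (refcount_dec_not_one(&kref->refcount))
		return 0;

	/* we may be the last user: take the socket lock before the final
	 * decrement so the release callback runs under lock_sock()
	 */
	lock_sock(sk);
	if (!refcount_dec_and_test(&kref->refcount)) {
		/* somebody grabbed a new reference in the meantime */
		release_sock(sk);
		return 0;
	}

	release(kref);	/* must not drop the last reference to sk itself */
	release_sock(sk);
	return 1;
}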
I hope adding this new API is fine.
I am giving it a try now.
Regards,