On Tue, Nov 7, 2023 at 12:44 PM Stanislav Fomichev <sdf@google.com> wrote:
On 11/06, Willem de Bruijn wrote:
I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that it somehow implies that I have an option of passing or not passing it for an individual system call. If we know that we're going to use dmabuf with the socket, maybe we should move this flag to the socket() syscall?
fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);
?
I think it should then be a setsockopt called before any data is exchanged, with no chance of modifying the mode later. We generally use setsockopts for the mode of a socket. This use of the protocol field in socket() for setting a mode would be novel. Also, it might miss passively opened connections, or be overly restrictive: one approach for all accepted child sockets.
I was thinking this is similar to SOCK_CLOEXEC or SOCK_NONBLOCK? There are plenty of bits we can grab. But setsockopt works as well!
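For concreteness, the two variants under discussion would look roughly as follows; SOCK_DEVMEM and SO_DEVMEM are placeholder names used for illustration only, not settled uapi:

        /* Variant A (sketch): a type flag, OR'd in like SOCK_CLOEXEC/SOCK_NONBLOCK */
        fd = socket(AF_INET6, SOCK_STREAM | SOCK_DEVMEM, 0);

        /* Variant B (sketch): a one-shot mode switch via setsockopt,
         * done before any data is exchanged on the socket */
        fd = socket(AF_INET6, SOCK_STREAM, 0);
        one = 1;
        setsockopt(fd, SOL_SOCKET, SO_DEVMEM, &one, sizeof(one));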
To follow up: if we have this flag on a socket, not on a per-message basis, can we also use recvmsg for the recycling part maybe?
while (true) {
        memset(msg, 0, ...);

        /* receive the tokens */
        ret = recvmsg(fd, &msg, 0);

        /* recycle the tokens from the above recvmsg() */
        ret = recvmsg(fd, &msg, MSG_RECYCLE);
}
recvmsg + MSG_RECYCLE can parse the same format that regular recvmsg exports (SO_DEVMEM_OFFSET) and we can also add extra cmsg option to recycle a range.
Will this be more straightforward than a setsockopt(SO_DEVMEM_DONTNEED)? Or is it more confusing?
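For comparison, a setsockopt(SO_DEVMEM_DONTNEED) based recycle would look roughly like the sketch below. The token and cmsg payload structs (dmabuf_token, devmem_cmsg) and their layouts are my approximation for illustration, not the final uapi:

        struct dmabuf_token {                /* layout is an assumption */
                __u32 token_start;
                __u32 token_count;
        };

        struct devmem_cmsg {                 /* hypothetical cmsg payload */
                __u32 frag_offset;
                __u32 frag_size;
                __u32 frag_token;
        };

        struct dmabuf_token tokens[128];
        unsigned int n = 0;
        struct cmsghdr *cm;

        /* collect token ranges from the SO_DEVMEM_OFFSET cmsgs that the
         * data recvmsg() returned */
        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                struct devmem_cmsg *dc = (struct devmem_cmsg *)CMSG_DATA(cm);

                if (cm->cmsg_level != SOL_SOCKET || cm->cmsg_type != SO_DEVMEM_OFFSET)
                        continue;
                tokens[n].token_start = dc->frag_token;
                tokens[n].token_count = 1;
                n++;
        }

        /* hand the buffers back to the kernel in one call */
        setsockopt(fd, SOL_SOCKET, SO_DEVMEM_DONTNEED,
                   tokens, n * sizeof(tokens[0]));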
It would have to be sendmsg, as recvmsg is a copy_to_user operation.
I am not aware of any precedent in multiplexing the data stream and a control operation stream in this manner. It would also require adding a branch in the sendmsg hot path.
Is it too much plumbing to copy_from_user the msg_control deep in the recvmsg stack, where we need it? Mixing it into sendmsg is indeed ugly :-(
I tried exactly the inverse of that when originally adding MSG_ZEROCOPY: to allow piggy-backing zerocopy completion notifications on sendmsg calls by writing to sendmsg msg_control on return to user. It required significant code churn, which the performance gains did not warrant. Doing so also breaks the simple rule that recv is for reading and send is for writing.
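For reference, this is roughly how MSG_ZEROCOPY completions ended up being delivered instead: pulled off the socket error queue with a separate recvmsg(MSG_ERRQUEUE), condensed here from Documentation/networking/msg_zerocopy.rst (reuse_buffers is a placeholder for the application's own bookkeeping):

        struct sock_extended_err *serr;      /* <linux/errqueue.h> */
        struct msghdr msg = {};
        struct cmsghdr *cm;
        char control[128];

        msg.msg_control = control;
        msg.msg_controllen = sizeof(control);

        if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1)
                return;

        cm = CMSG_FIRSTHDR(&msg);
        serr = (struct sock_extended_err *)CMSG_DATA(cm);
        if (serr->ee_errno == 0 && serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
                /* sends with notification ids [ee_info, ee_data] completed;
                 * their buffers may be reused */
                reuse_buffers(serr->ee_info, serr->ee_data);
        }

That keeps recvmsg-for-data a pure read and sendmsg a pure write, at the cost of a separate syscall to drain the error queue.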
Regarding the hot path: aren't we already doing copy_to_user for the tokens in this hot path, so one extra condition shouldn't hurt too much?
We're doing that in the optional cmsg handling of recvmsg, which is already a slow path (compared to the data read() itself).
The memory is associated with the socket, freed when the socket is closed as well as on SO_DEVMEM_DONTNEED. Fundamentally it is a socket state operation, for which setsockopt is the socket interface.
Is your request purely a dislike, or is there some technical concern with BPF and setsockopt?
It's mostly because I've been bitten too much by custom socket options that are not really on/off/update-value operations:
29ebbba7d461 - bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen
00e74ae08638 - bpf: Don't EFAULT for getsockopt with optval=NULL
9cacf81f8161 - bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE
d8fe449a9c51 - bpf: Don't return EINVAL from {get,set}sockopt when optlen > PAGE_SIZE
I do agree that this particular case of SO_DEVMEM_DONTNEED seems ok, but things tend to evolve and change.
I see. I'm a bit concerned if we start limiting what we can do in sockets because of dependencies that BPF processing places on them. The use case for BPF [gs]etsockopt is limited to specific control mode calls. Would it make sense to just exclude calls like SO_DEVMEM_DONTNEED from this interposition?
At a high level what we really want is a high rate metadata path from user to kernel. And there are no perfect solutions. From kernel to user we use the socket error queue for this. That was never intended for a high event rate itself: it dealt with ICMP errors and the like before timestamps and zerocopy notifications were added.
If I squint hard enough I can see some prior art in mixing data and high rate state changes within the same channel in NIC descriptor queues, where some devices do this, e.g., { "insert encryption key", "send packet" }. But fundamentally I think we should keep the socket queues for data only.