While looking at improving the user experience around the MPTCP endpoint
setup, I noticed that setting an endpoint with both the 'signal' and the
'subflow' flags -- as has been done in the past by users, according to
bug reports we got -- resulted in only announcing the endpoint, but
not using it to create subflows: the 'subflow' flag was then ignored.
My initial thought was to modify IPRoute2 to warn the user when the two
flags were set, but it doesn't sound normal to ignore one of them. I
then looked at modifying the kernel not to allow having the two flags
set, but when discussing that with Mat, we thought it was maybe not
ideal to do that: there might be use-cases, and we might break some
configs. Then I saw it was working before v5.17. So instead, I fixed the
support on the kernel side (patch 5) using Paolo's suggestion. This also
includes a fix on the options side (patch 1: for v5.11+), an explicit
deny of some options combinations (patch 2: for v5.18+), and some
refactoring (patches 3 and 4) to ease the inclusion of patch 5.
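The intended flag handling can be modelled with a minimal userspace
sketch. This is not the kernel code: the EX_* constants, struct
ex_endpoint, and the helper names are hypothetical, and they only mirror
the behaviour named in the patch titles (announce when 'signal' is set,
create subflows when 'subflow' is set, refuse 'signal' + 'subflow' when a
port is also given).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits, defined for this sketch only. */
#define EX_FLAG_SIGNAL  (1u << 0)
#define EX_FLAG_SUBFLOW (1u << 1)

struct ex_endpoint {
	uint8_t  flags;
	uint16_t port;	/* 0 means "no port given" */
};

/* Patch 2 idea: refuse signal + subflow when a port is also set. */
static bool ex_endpoint_is_valid(const struct ex_endpoint *ep)
{
	bool both = (ep->flags & EX_FLAG_SIGNAL) != 0 &&
		    (ep->flags & EX_FLAG_SUBFLOW) != 0;

	return !(both && ep->port != 0);
}

/* The endpoint is announced whenever 'signal' is set. */
static bool ex_should_announce(const struct ex_endpoint *ep)
{
	return (ep->flags & EX_FLAG_SIGNAL) != 0;
}

/* Patch 5 idea: 'subflow' is honoured even if 'signal' is set too. */
static bool ex_should_create_subflow(const struct ex_endpoint *ep)
{
	return (ep->flags & EX_FLAG_SUBFLOW) != 0;
}
```

In this model, an endpoint with both flags and no port is valid, and it is
both announced and used to create subflows.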
While at it, I added a new selftest (patch 7) to validate this case --
including a modification of the chk_add_nr helper to invert the sides
where the counters are checked (patch 6) -- and allowed ADD_ADDR echo
just after the MP_JOIN 3WHS.
The selftests modifications have the same Fixes tag as the previous
commit, but no 'Cc: Stable': if the backport can work, that's good --
but it still needs to be verified by running the selftests -- if not, no
need to worry, many CIs will use the selftests from the last stable
version to validate previous stable releases.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Matthieu Baerts (NGI0) (7):
mptcp: fully established after ADD_ADDR echo on MPJ
mptcp: pm: deny endp with signal + subflow + port
mptcp: pm: reduce indentation blocks
mptcp: pm: don't try to create sf if alloc failed
mptcp: pm: do not ignore 'subflow' if 'signal' flag is also set
selftests: mptcp: join: ability to invert ADD_ADDR check
selftests: mptcp: join: test both signal & subflow
net/mptcp/options.c | 3 +-
net/mptcp/pm_netlink.c | 47 +++++++++++++--------
tools/testing/selftests/net/mptcp/mptcp_join.sh | 55 ++++++++++++++++++-------
3 files changed, 73 insertions(+), 32 deletions(-)
---
base-commit: 0bf50cead4c4710d9f704778c32ab8af47ddf070
change-id: 20240731-upstream-net-20240731-mptcp-endp-subflow-signal-181d640cf5e8
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Corrected the typographical error in the word "different"
in the "name" field of the JSON object with ID "4319".
Signed-off-by: Karan Sanghavi <karansanghvi98(a)gmail.com>
---
tools/testing/selftests/tc-testing/tc-tests/filters/cgroup.json | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/tc-testing/tc-tests/filters/cgroup.json b/tools/testing/selftests/tc-testing/tc-tests/filters/cgroup.json
index 03723cf84..6897ff5ad 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/filters/cgroup.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/filters/cgroup.json
@@ -1189,7 +1189,7 @@
},
{
"id": "4319",
- "name": "Replace cgroup filter with diffferent match",
+ "name": "Replace cgroup filter with different match",
"category": [
"filter",
"cgroup"
--
2.43.0
From: Zijian Zhang <zijianzhang(a)bytedance.com>
The original notification mechanism needs poll + recvmmsg, which is not
easy for applications to accommodate. It also incurs non-negligible
overhead, including extra system calls.
While making maximum reuse of the existing MSG_ZEROCOPY related code,
this patch set introduces a new zerocopy socket notification mechanism.
Users of sendmsg pass a control message as a placeholder for the incoming
notifications. On return, the kernel embeds notifications directly into
the user arguments passed in. By doing so, we can reduce the complexity
and the overhead of managing notifications.
We also keep the logic for copying cmsgs to userspace in sendmsg generic,
for any possible use cases in the future. However, it introduces an ABI
change to sendmsg.
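To illustrate the placeholder idea, the sketch below prepares a sendmsg
control buffer with room for the notifications, without calling sendmsg.
EX_SCM_ZC_NOTIFICATION, struct ex_zc_info_elem, struct ex_zc_info, and
EX_ZC_MAX are hypothetical stand-ins: the real cmsg type and structure
layout are defined by the patches, and the lo/hi fields are an
assumption.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#define EX_SCM_ZC_NOTIFICATION 0x7fff	/* hypothetical cmsg type */
#define EX_ZC_MAX 8			/* hypothetical array capacity */

/* Hypothetical layout: one notification range per element. */
struct ex_zc_info_elem {
	uint32_t lo;
	uint32_t hi;
};

struct ex_zc_info {
	uint32_t size;	/* per the changelog, updated by the kernel */
	struct ex_zc_info_elem arr[EX_ZC_MAX];
};

/* Control buffer, aligned as struct cmsghdr requires. */
union ex_ctrl {
	char buf[CMSG_SPACE(sizeof(struct ex_zc_info))];
	struct cmsghdr align;
};

/* Attach a zeroed placeholder cmsg to msg; the caller would then
 * pass msg to sendmsg() and read the filled-in data on return. */
static struct cmsghdr *ex_prepare_placeholder(struct msghdr *msg,
					      union ex_ctrl *ctrl)
{
	struct cmsghdr *cm;

	memset(ctrl, 0, sizeof(*ctrl));
	msg->msg_control = ctrl->buf;
	msg->msg_controllen = sizeof(ctrl->buf);

	cm = CMSG_FIRSTHDR(msg);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = EX_SCM_ZC_NOTIFICATION;
	cm->cmsg_len = CMSG_LEN(sizeof(struct ex_zc_info));
	return cm;
}
```

The payload is left zeroed here because the exact input the kernel
expects in it is defined by the patches, not by this sketch.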
Changelog:
v1 -> v2:
- Reuse errormsg queue in the new notification mechanism,
users can actually use these two mechanisms in a hybrid way
if they want to do so.
- Update case SCM_ZC_NOTIFICATION in __sock_cmsg_send
1. Regardless of 32-bit, 64-bit program, we will always handle
u64 type user address.
2. The size of data to copy_to_user is precisely calculated
to avoid a kernel stack leak.
- fix (kbuild-bot)
1. Add SCM_ZC_NOTIFICATION to arch-specific header files.
2. header file types.h in include/uapi/linux/socket.h
v2 -> v3:
- 1. Users can now pass in the address of the zc_info_elem directly
with appropriate cmsg_len instead of the ugly user interface. Plus,
the handler is now compatible with MSG_CMSG_COMPAT and 32-bit
pointer.
- 2. Suggested by Willem, another strategy for getting zc info is
briefly taking the lock of sk_error_queue and moving the entries to a
private list, like net_rx_action. I thought sk_error_queue was protected
by sock_lock, so that the handling of zc info and a user's recvmsg from
the sk_error_queue could not happen at the same time.
However, sk_error_queue is protected by its own lock. I am afraid
that while the private list is being handled, users may
fail to get other error messages in the queue via recvmsg. Thus,
I don't implement the splice logic in this version. Any comments?
v3 -> v4:
- 1. Change SOCK_ZC_INFO_MAX to 64 to avoid large stack frame size.
- 2. Fix minor typos.
- 3. Change cfg_zerocopy from int to enum in msg_zerocopy.c
Initially, we expected users to pass the user address of the user array
as data in the cmsg, so that the kernel could copy_to_user to this
address directly.
As Willem commented,
> The main design issue with this series is this indirection, rather
> than passing the array of notifications as cmsg.
> This trick circumvents having to deal with compat issues and having to
> figure out copy_to_user in ____sys_sendmsg (as msg_control is an
> in-kernel copy).
> This is quite hacky, from an API design PoV.
> As is passing a pointer, but expecting msg_controllen to hold the
> length not of the pointer, but of the pointed to user buffer.
> I had also hoped for more significant savings. Especially with the
> higher syscall overhead due to meltdown and spectre mitigations vs
> when MSG_ZEROCOPY was introduced and I last tried this optimization.
We solve it by supporting put_cmsg to userspace in the sendmsg path,
starting from v5.
v4 -> v5:
- 1. Passing user address directly to kernel raises concerns about
ABI. In this version, we support put_cmsg to userspace in TX path
to solve this problem.
v5 -> v6:
- 1. Cleanly copy cmsg to user upon returning of ___sys_sendmsg
v6 -> v7:
- 1. Remove flag MSG_CMSG_COPY_TO_USER, use a member in msghdr instead
- 2. Pass msg to __sock_cmsg_send.
- 3. sendmsg_copy_cmsg_to_user should be put at the end of
____sys_sendmsg to make sure msg_sys->msg_control is a valid pointer.
- 4. Add struct zc_info to contain the array of zc_info_elem, so that
the kernel can update the zc_info->size. Another possible solution is
updating the cmsg_len directly, but it will break for_each_cmsghdr.
- 5. Update selftest to make cfg_notification_limit have the same
semantics in both methods for better comparison.
v7 -> v8:
- 1. Add a static_branch in ____sys_sendmsg to avoid overhead in the
hot path.
- 2. Add ZC_NOTIFICATION_MAX to limit the max size of zc_info->arr.
- 3. Minimize the code in SCM_ZC_NOTIFICATION handler by adding a local
sk_buff_head.
* Performance
We update selftests/net/msg_zerocopy.c to accommodate the new mechanism;
cfg_notification_limit has the same semantics for both methods. To
support zerocopy in the localhost test, we make skb_orphan_frags_rx
behave the same as skb_orphan_frags. Test results are as follows.
With cfg_notification_limit = 1, both methods get notifications after
each sendmsg call. In this case, the new method has around 17% cpu
savings in TCP and 23% cpu savings in UDP.
+----------------------+---------+---------+---------+---------+
| Test Type / Protocol | TCP v4 | TCP v6 | UDP v4 | UDP v6 |
+----------------------+---------+---------+---------+---------+
| ZCopy (MB) | 7523 | 7706 | 7489 | 7304 |
+----------------------+---------+---------+---------+---------+
| New ZCopy (MB) | 8834 | 8993 | 9053 | 9228 |
+----------------------+---------+---------+---------+---------+
| New ZCopy / ZCopy | 117.42% | 116.70% | 120.88% | 126.34% |
+----------------------+---------+---------+---------+---------+
With cfg_notification_limit = 32, both methods get notifications after
every 32 sendmsg calls, which means more chances to coalesce
notifications and less overhead from poll + recvmsg for the original
method. In this case, the new method has around 7% cpu savings in TCP and
slightly better cpu usage in UDP. In the selftest environment, TCP
notifications are more likely to be out of order than UDP ones, so it is
easier to coalesce more notifications in UDP. In UDP, the original method
can get one notification with a range of 32 in a single recvmsg most of
the time. In TCP, most notifications have a range of around 2, so the
original method needs around 16 recvmsgs to get notified in one round.
That's the reason for the difference in "New ZCopy / ZCopy" between TCP
and UDP here.
+----------------------+---------+---------+---------+---------+
| Test Type / Protocol | TCP v4 | TCP v6 | UDP v4 | UDP v6 |
+----------------------+---------+---------+---------+---------+
| ZCopy (MB) | 8842 | 8735 | 10072 | 9380 |
+----------------------+---------+---------+---------+---------+
| New ZCopy (MB) | 9366 | 9477 | 10108 | 9385 |
+----------------------+---------+---------+---------+---------+
| New ZCopy / ZCopy | 106.00% | 108.28% | 100.31% | 100.01% |
+----------------------+---------+---------+---------+---------+
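The coalescing argument above can be checked with a small model. It is a
simplified sketch, assuming that a completion is merged into the last
queued notification only when its ID directly follows that notification's
upper bound. The function name and the completion orders used to exercise
it are illustrative, not taken from the selftest.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the notifications queued for a given completion order.
 * Each queued notification needs its own recvmsg in the original
 * mechanism, so a lower count means fewer system calls. */
static size_t ex_count_notifications(const uint32_t *ids, size_t n)
{
	size_t count = 0;
	uint32_t tail_hi = 0;

	for (size_t i = 0; i < n; i++) {
		if (count > 0 && ids[i] == tail_hi + 1) {
			tail_hi = ids[i];	/* extend the tail range */
		} else {
			count++;		/* start a new range */
			tail_hi = ids[i];
		}
	}
	return count;
}
```

With 32 completions delivered in order, the model yields a single
notification covering all of them. If they arrive as out-of-order pairs
(2, 3, 0, 1, 6, 7, 4, 5, ...), every range has a length of 2 and the
model yields 16 notifications, in line with the roughly 16 recvmsgs
mentioned for TCP.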
In conclusion, when the notification interval is small or notifications
are hard to coalesce, the new mechanism is highly recommended. Otherwise,
the performance gain from the new mechanism is very limited.
Zijian Zhang (3):
sock: support copying cmsgs to the user space in sendmsg
sock: add MSG_ZEROCOPY notification mechanism based on msg_control
selftests: add MSG_ZEROCOPY msg_control notification test
arch/alpha/include/uapi/asm/socket.h | 2 +
arch/mips/include/uapi/asm/socket.h | 2 +
arch/parisc/include/uapi/asm/socket.h | 2 +
arch/sparc/include/uapi/asm/socket.h | 2 +
include/linux/socket.h | 8 ++
include/net/sock.h | 2 +-
include/uapi/asm-generic/socket.h | 2 +
include/uapi/linux/socket.h | 23 +++++
net/core/sock.c | 72 +++++++++++++-
net/ipv4/ip_sockglue.c | 2 +-
net/ipv6/datagram.c | 2 +-
net/socket.c | 63 +++++++++++-
tools/testing/selftests/net/msg_zerocopy.c | 101 ++++++++++++++++++--
tools/testing/selftests/net/msg_zerocopy.sh | 1 +
14 files changed, 265 insertions(+), 19 deletions(-)
--
2.20.1