This series contains 4 independent new features:
- Patch 1: use HMAC-SHA256 library instead of open-coded HMAC.
- Patches 2-3: make ADD_ADDR retransmission timeout adaptive + simplify selftests.
- Patch 4: selftests: check for unexpected fallback counter increments.
- Patches 5-6: record subflows in RPS table, for aRFS support.
Signed-off-by: Matthieu Baerts (NGI0) matttbe@kernel.org --- Christoph Paasch (2): net: Add rfs_needed() helper mptcp: record subflows in RPS table
Eric Biggers (1): mptcp: use HMAC-SHA256 library instead of open-coded HMAC
Gang Yan (1): selftests: mptcp: add checks for fallback counters
Geliang Tang (2): mptcp: make ADD_ADDR retransmission timeout adaptive selftests: mptcp: remove add_addr_timeout settings
Documentation/networking/mptcp-sysctl.rst | 8 +- include/net/rps.h | 85 ++++++++++------ net/mptcp/crypto.c | 35 +------ net/mptcp/pm.c | 28 +++++- net/mptcp/protocol.c | 21 ++++ tools/testing/selftests/net/mptcp/mptcp_join.sh | 126 +++++++++++++++++++++++- 6 files changed, 231 insertions(+), 72 deletions(-) --- base-commit: d23ad54de795ec0054f90ecb03b41e8f2c410f3a change-id: 20250829-net-next-mptcp-misc-feat-6-18-722fa87a60f1
Best regards,
From: Eric Biggers ebiggers@kernel.org
Now that there are easy-to-use HMAC-SHA256 library functions, use these in net/mptcp/crypto.c instead of open-coding the HMAC algorithm.
Remove the WARN_ON_ONCE() for messages longer than SHA256_DIGEST_SIZE. The new implementation handles all message lengths correctly.
The mptcp-crypto KUnit test still passes after this change.
Signed-off-by: Eric Biggers ebiggers@kernel.org Reviewed-by: Matthieu Baerts (NGI0) matttbe@kernel.org Signed-off-by: Matthieu Baerts (NGI0) matttbe@kernel.org --- net/mptcp/crypto.c | 35 ++--------------------------------- 1 file changed, 2 insertions(+), 33 deletions(-)
diff --git a/net/mptcp/crypto.c b/net/mptcp/crypto.c index b08ba959ac4fd485bb833043ff58d4c8bac8a37a..31948e18d97da7ee0ee2ae9e4f7c9ca0e3b330a7 100644 --- a/net/mptcp/crypto.c +++ b/net/mptcp/crypto.c @@ -22,7 +22,6 @@
#include <linux/kernel.h> #include <crypto/sha2.h> -#include <linux/unaligned.h>
#include "protocol.h"
@@ -43,39 +42,9 @@ void mptcp_crypto_key_sha(u64 key, u32 *token, u64 *idsn)
void mptcp_crypto_hmac_sha(u64 key1, u64 key2, u8 *msg, int len, void *hmac) { - u8 input[SHA256_BLOCK_SIZE + SHA256_DIGEST_SIZE]; - u8 key1be[8]; - u8 key2be[8]; - int i; + __be64 key[2] = { cpu_to_be64(key1), cpu_to_be64(key2) };
- if (WARN_ON_ONCE(len > SHA256_DIGEST_SIZE)) - len = SHA256_DIGEST_SIZE; - - put_unaligned_be64(key1, key1be); - put_unaligned_be64(key2, key2be); - - /* Generate key xored with ipad */ - memset(input, 0x36, SHA256_BLOCK_SIZE); - for (i = 0; i < 8; i++) - input[i] ^= key1be[i]; - for (i = 0; i < 8; i++) - input[i + 8] ^= key2be[i]; - - memcpy(&input[SHA256_BLOCK_SIZE], msg, len); - - /* emit sha256(K1 || msg) on the second input block, so we can - * reuse 'input' for the last hashing - */ - sha256(input, SHA256_BLOCK_SIZE + len, &input[SHA256_BLOCK_SIZE]); - - /* Prepare second part of hmac */ - memset(input, 0x5C, SHA256_BLOCK_SIZE); - for (i = 0; i < 8; i++) - input[i] ^= key1be[i]; - for (i = 0; i < 8; i++) - input[i + 8] ^= key2be[i]; - - sha256(input, SHA256_BLOCK_SIZE + SHA256_DIGEST_SIZE, hmac); + hmac_sha256_usingrawkey((const u8 *)key, sizeof(key), msg, len, hmac); }
#if IS_MODULE(CONFIG_MPTCP_KUNIT_TEST)
From: Geliang Tang tanggeliang@kylinos.cn
Currently the ADD_ADDR option is retransmitted with a fixed timeout. This patch makes the retransmission timeout adaptive by using the maximum RTO among all the subflows, while still capping it at the configured maximum value (add_addr_timeout_max). This improves responsiveness when establishing new subflows.
Specifically: 1. Adds mptcp_adjust_add_addr_timeout() helper to compute the adaptive timeout. 2. Uses maximum subflow RTO (icsk_rto) when available. 3. Applies exponential backoff based on retransmission count. 4. Maintains fallback to configured max timeout when no RTO data exists.
This slightly changes the behaviour of the MPTCP "add_addr_timeout" sysctl knob to be used as a maximum instead of a fixed value. But this is seen as an improvement: the ADD_ADDR might be sent quicker than before to improve the overall MPTCP connection. Also, the default value is set to 2 min, which was already way too long, and caused the ADD_ADDR not to be retransmitted for connections shorter than 2 minutes.
Suggested-by: Matthieu Baerts (NGI0) matttbe@kernel.org Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/576 Reviewed-by: Christoph Paasch cpaasch@openai.com Signed-off-by: Geliang Tang tanggeliang@kylinos.cn Reviewed-by: Matthieu Baerts (NGI0) matttbe@kernel.org Signed-off-by: Matthieu Baerts (NGI0) matttbe@kernel.org --- Documentation/networking/mptcp-sysctl.rst | 8 +++++--- net/mptcp/pm.c | 28 ++++++++++++++++++++++++---- 2 files changed, 29 insertions(+), 7 deletions(-)
diff --git a/Documentation/networking/mptcp-sysctl.rst b/Documentation/networking/mptcp-sysctl.rst index 1683c139821e3ba6d9eaa3c59330a523d29f1164..1eb6af26b4a7acdedd575a126c576210a78f0d4d 100644 --- a/Documentation/networking/mptcp-sysctl.rst +++ b/Documentation/networking/mptcp-sysctl.rst @@ -8,9 +8,11 @@ MPTCP Sysfs variables ===============================
add_addr_timeout - INTEGER (seconds) - Set the timeout after which an ADD_ADDR control message will be - resent to an MPTCP peer that has not acknowledged a previous - ADD_ADDR message. + Set the maximum value of timeout after which an ADD_ADDR control message + will be resent to an MPTCP peer that has not acknowledged a previous + ADD_ADDR message. A dynamically estimated retransmission timeout based + on the estimated connection round-trip-time is used if this value is + lower than the maximum one.
Do not retransmit if set to 0.
diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c index 136a380602cae872b76560649c924330e5f42533..204e1f61212e2be77a8476f024b59be67d04b80a 100644 --- a/net/mptcp/pm.c +++ b/net/mptcp/pm.c @@ -268,6 +268,27 @@ int mptcp_pm_mp_prio_send_ack(struct mptcp_sock *msk, return -EINVAL; }
+static unsigned int mptcp_adjust_add_addr_timeout(struct mptcp_sock *msk) +{ + const struct net *net = sock_net((struct sock *)msk); + unsigned int rto = mptcp_get_add_addr_timeout(net); + struct mptcp_subflow_context *subflow; + unsigned int max = 0; + + mptcp_for_each_subflow(msk, subflow) { + struct sock *ssk = mptcp_subflow_tcp_sock(subflow); + struct inet_connection_sock *icsk = inet_csk(ssk); + + if (icsk->icsk_rto > max) + max = icsk->icsk_rto; + } + + if (max && max < rto) + rto = max; + + return rto; +} + static void mptcp_pm_add_timer(struct timer_list *timer) { struct mptcp_pm_add_entry *entry = timer_container_of(entry, timer, @@ -292,7 +313,7 @@ static void mptcp_pm_add_timer(struct timer_list *timer) goto out; }
- timeout = mptcp_get_add_addr_timeout(sock_net(sk)); + timeout = mptcp_adjust_add_addr_timeout(msk); if (!timeout) goto out;
@@ -307,7 +328,7 @@ static void mptcp_pm_add_timer(struct timer_list *timer)
if (entry->retrans_times < ADD_ADDR_RETRANS_MAX) sk_reset_timer(sk, timer, - jiffies + timeout); + jiffies + (timeout << entry->retrans_times));
spin_unlock_bh(&msk->pm.lock);
@@ -348,7 +369,6 @@ bool mptcp_pm_alloc_anno_list(struct mptcp_sock *msk, { struct mptcp_pm_add_entry *add_entry = NULL; struct sock *sk = (struct sock *)msk; - struct net *net = sock_net(sk); unsigned int timeout;
lockdep_assert_held(&msk->pm.lock); @@ -374,7 +394,7 @@ bool mptcp_pm_alloc_anno_list(struct mptcp_sock *msk,
timer_setup(&add_entry->add_timer, mptcp_pm_add_timer, 0); reset_timer: - timeout = mptcp_get_add_addr_timeout(net); + timeout = mptcp_adjust_add_addr_timeout(msk); if (timeout) sk_reset_timer(sk, &add_entry->add_timer, jiffies + timeout);
From: Geliang Tang tanggeliang@kylinos.cn
Now that add_addr_timeout can be dynamically adjusted, there is no need to set specific timeout values in the mptcp_join.sh tests. This patch removes the explicit sysctl settings for net.mptcp.add_addr_timeout from the test scripts.
The change simplifies the test setup and ensures the tests work with the default or dynamically adjusted timeout values.
Signed-off-by: Geliang Tang tanggeliang@kylinos.cn Reviewed-by: Matthieu Baerts (NGI0) matttbe@kernel.org Signed-off-by: Matthieu Baerts (NGI0) matttbe@kernel.org --- tools/testing/selftests/net/mptcp/mptcp_join.sh | 3 --- 1 file changed, 3 deletions(-)
diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh index 82cae37d9c2026cc55466636d53a76f929a03452..e85bb62046e020dbacbbd44e1f9e110e1d0104c7 100755 --- a/tools/testing/selftests/net/mptcp/mptcp_join.sh +++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh @@ -347,8 +347,6 @@ reset_with_add_addr_timeout() tables="${ip6tables}" fi
- ip netns exec $ns1 sysctl -q net.mptcp.add_addr_timeout=1 - if ! ip netns exec $ns2 $tables -A OUTPUT -p tcp \ -m tcp --tcp-option 30 \ -m bpf --bytecode \ @@ -2183,7 +2181,6 @@ signal_address_tests() pm_nl_add_endpoint $ns2 10.0.4.2 flags signal
# the peer could possibly miss some addr notification, allow retransmission - ip netns exec $ns1 sysctl -q net.mptcp.add_addr_timeout=1 speed=slow \ run_tests $ns1 $ns2 10.0.1.1
From: Gang Yan yangang@kylinos.cn
Recently, some mib counters about fallback has been added, this patch provides a method to check the expected behavior of these mib counters during the test execution.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/571 Signed-off-by: Gang Yan yangang@kylinos.cn Reviewed-by: Matthieu Baerts (NGI0) matttbe@kernel.org Signed-off-by: Matthieu Baerts (NGI0) matttbe@kernel.org --- tools/testing/selftests/net/mptcp/mptcp_join.sh | 123 ++++++++++++++++++++++++ 1 file changed, 123 insertions(+)
diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh index e85bb62046e020dbacbbd44e1f9e110e1d0104c7..a97b568104bc284f050b2f0e09fe3fdd3341c5cb 100755 --- a/tools/testing/selftests/net/mptcp/mptcp_join.sh +++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh @@ -74,6 +74,17 @@ unset join_create_err unset join_bind_err unset join_connect_err
+unset fb_ns1 +unset fb_ns2 +unset fb_infinite_map_tx +unset fb_dss_corruption +unset fb_simult_conn +unset fb_mpc_passive +unset fb_mpc_active +unset fb_mpc_data +unset fb_md5_sig +unset fb_dss + # generated using "nfbpf_compile '(ip && (ip[54] & 0xf0) == 0x30) || # (ip6 && (ip6[74] & 0xf0) == 0x30)'" CBPF_MPTCP_SUBOPTION_ADD_ADDR="14, @@ -1397,6 +1408,115 @@ chk_join_tx_nr() print_results "join Tx" ${rc} }
+chk_fallback_nr() +{ + local infinite_map_tx=${fb_infinite_map_tx:-0} + local dss_corruption=${fb_dss_corruption:-0} + local simult_conn=${fb_simult_conn:-0} + local mpc_passive=${fb_mpc_passive:-0} + local mpc_active=${fb_mpc_active:-0} + local mpc_data=${fb_mpc_data:-0} + local md5_sig=${fb_md5_sig:-0} + local dss=${fb_dss:-0} + local rc=${KSFT_PASS} + local ns=$1 + local count + + count=$(mptcp_lib_get_counter ${!ns} "MPTcpExtInfiniteMapTx") + if [ -z "$count" ]; then + rc=${KSFT_SKIP} + elif [ "$count" != "$infinite_map_tx" ]; then + rc=${KSFT_FAIL} + print_check "$ns infinite map tx fallback" + fail_test "got $count infinite map tx fallback[s] in $ns expected $infinite_map_tx" + fi + + count=$(mptcp_lib_get_counter ${!ns} "MPTcpExtDSSCorruptionFallback") + if [ -z "$count" ]; then + rc=${KSFT_SKIP} + elif [ "$count" != "$dss_corruption" ]; then + rc=${KSFT_FAIL} + print_check "$ns dss corruption fallback" + fail_test "got $count dss corruption fallback[s] in $ns expected $dss_corruption" + fi + + count=$(mptcp_lib_get_counter ${!ns} "MPTcpExtSimultConnectFallback") + if [ -z "$count" ]; then + rc=${KSFT_SKIP} + elif [ "$count" != "$simult_conn" ]; then + rc=${KSFT_FAIL} + print_check "$ns simult conn fallback" + fail_test "got $count simult conn fallback[s] in $ns expected $simult_conn" + fi + + count=$(mptcp_lib_get_counter ${!ns} "MPTcpExtMPCapableFallbackACK") + if [ -z "$count" ]; then + rc=${KSFT_SKIP} + elif [ "$count" != "$mpc_passive" ]; then + rc=${KSFT_FAIL} + print_check "$ns mpc passive fallback" + fail_test "got $count mpc passive fallback[s] in $ns expected $mpc_passive" + fi + + count=$(mptcp_lib_get_counter ${!ns} "MPTcpExtMPCapableFallbackSYNACK") + if [ -z "$count" ]; then + rc=${KSFT_SKIP} + elif [ "$count" != "$mpc_active" ]; then + rc=${KSFT_FAIL} + print_check "$ns mpc active fallback" + fail_test "got $count mpc active fallback[s] in $ns expected $mpc_active" + fi + + count=$(mptcp_lib_get_counter ${!ns} "MPTcpExtMPCapableDataFallback") + if [ -z "$count" ]; then + rc=${KSFT_SKIP} + elif [ "$count" != "$mpc_data" ]; then + rc=${KSFT_FAIL} + print_check "$ns mpc data fallback" + fail_test "got $count mpc data fallback[s] in $ns expected $mpc_data" + fi + + count=$(mptcp_lib_get_counter ${!ns} "MPTcpExtMD5SigFallback") + if [ -z "$count" ]; then + rc=${KSFT_SKIP} + elif [ "$count" != "$md5_sig" ]; then + rc=${KSFT_FAIL} + print_check "$ns MD5 Sig fallback" + fail_test "got $count MD5 Sig fallback[s] in $ns expected $md5_sig" + fi + + count=$(mptcp_lib_get_counter ${!ns} "MPTcpExtDssFallback") + if [ -z "$count" ]; then + rc=${KSFT_SKIP} + elif [ "$count" != "$dss" ]; then + rc=${KSFT_FAIL} + print_check "$ns dss fallback" + fail_test "got $count dss fallback[s] in $ns expected $dss" + fi + + return $rc +} + +chk_fallback_nr_all() +{ + local netns=("ns1" "ns2") + local fb_ns=("fb_ns1" "fb_ns2") + local rc=${KSFT_PASS} + + for i in 0 1; do + if [ -n "${!fb_ns[i]}" ]; then + eval "${!fb_ns[i]}" \ + chk_fallback_nr ${netns[i]} || rc=${?} + else + chk_fallback_nr ${netns[i]} || rc=${?} + fi + done + + if [ "${rc}" != "${KSFT_PASS}" ]; then + print_results "fallback" ${rc} + fi +} + chk_join_nr() { local syn_nr=$1 @@ -1482,6 +1602,8 @@ chk_join_nr() join_syn_tx="${join_syn_tx:-${syn_nr}}" \ chk_join_tx_nr
+ chk_fallback_nr_all + if $validate_checksum; then chk_csum_nr $csum_ns1 $csum_ns2 chk_fail_nr $fail_nr $fail_nr @@ -3334,6 +3456,7 @@ fail_tests() join_csum_ns1=+1 join_csum_ns2=+0 \ join_fail_nr=1 join_rst_nr=0 join_infi_nr=1 \ join_corrupted_pkts="$(pedit_action_pkts)" \ + fb_ns1="fb_dss=1" fb_ns2="fb_infinite_map_tx=1" \ chk_join_nr 0 0 0 chk_fail_nr 1 -1 invert fi
From: Christoph Paasch cpaasch@openai.com
Add a helper to check if RFS is needed or not. Allows to make the code a bit cleaner and the next patch to have MPTCP use this helper to decide whether or not to iterate over the subflows.
tun_flow_update() was calling sock_rps_record_flow_hash() regardless of the state of rfs_needed. This was not really a bug as sock_flow_table simply ends up being NULL and thus everything will be fine. This commit here thus also implicitly makes tun_flow_update() respect the state of rfs_needed.
Suggested-by: Matthieu Baerts matttbe@kernel.org Signed-off-by: Christoph Paasch cpaasch@openai.com Acked-by: Matthieu Baerts (NGI0) matttbe@kernel.org Signed-off-by: Matthieu Baerts (NGI0) matttbe@kernel.org --- include/net/rps.h | 85 ++++++++++++++++++++++++++++++++++++------------------- 1 file changed, 56 insertions(+), 29 deletions(-)
diff --git a/include/net/rps.h b/include/net/rps.h index 9917dce42ca457e9c25d9e84ee450235f771d09b..f1794cd2e7fb32a36bde9959fab651663ab190fd 100644 --- a/include/net/rps.h +++ b/include/net/rps.h @@ -85,11 +85,8 @@ static inline void rps_record_sock_flow(struct rps_sock_flow_table *table, WRITE_ONCE(table->ents[index], val); }
-#endif /* CONFIG_RPS */ - -static inline void sock_rps_record_flow_hash(__u32 hash) +static inline void _sock_rps_record_flow_hash(__u32 hash) { -#ifdef CONFIG_RPS struct rps_sock_flow_table *sock_flow_table;
if (!hash) @@ -99,42 +96,33 @@ static inline void sock_rps_record_flow_hash(__u32 hash) if (sock_flow_table) rps_record_sock_flow(sock_flow_table, hash); rcu_read_unlock(); -#endif }
-static inline void sock_rps_record_flow(const struct sock *sk) +static inline void _sock_rps_record_flow(const struct sock *sk) { -#ifdef CONFIG_RPS - if (static_branch_unlikely(&rfs_needed)) { - /* Reading sk->sk_rxhash might incur an expensive cache line - * miss. - * - * TCP_ESTABLISHED does cover almost all states where RFS - * might be useful, and is cheaper [1] than testing : - * IPv4: inet_sk(sk)->inet_daddr - * IPv6: ipv6_addr_any(&sk->sk_v6_daddr) - * OR an additional socket flag - * [1] : sk_state and sk_prot are in the same cache line. + /* Reading sk->sk_rxhash might incur an expensive cache line + * miss. + * + * TCP_ESTABLISHED does cover almost all states where RFS + * might be useful, and is cheaper [1] than testing : + * IPv4: inet_sk(sk)->inet_daddr + * IPv6: ipv6_addr_any(&sk->sk_v6_daddr) + * OR an additional socket flag + * [1] : sk_state and sk_prot are in the same cache line. + */ + if (sk->sk_state == TCP_ESTABLISHED) { + /* This READ_ONCE() is paired with the WRITE_ONCE() + * from sock_rps_save_rxhash() and sock_rps_reset_rxhash(). */ - if (sk->sk_state == TCP_ESTABLISHED) { - /* This READ_ONCE() is paired with the WRITE_ONCE() - * from sock_rps_save_rxhash() and sock_rps_reset_rxhash(). - */ - sock_rps_record_flow_hash(READ_ONCE(sk->sk_rxhash)); - } + _sock_rps_record_flow_hash(READ_ONCE(sk->sk_rxhash)); } -#endif }
-static inline void sock_rps_delete_flow(const struct sock *sk) +static inline void _sock_rps_delete_flow(const struct sock *sk) { -#ifdef CONFIG_RPS struct rps_sock_flow_table *table; u32 hash, index;
- if (!static_branch_unlikely(&rfs_needed)) - return; - hash = READ_ONCE(sk->sk_rxhash); if (!hash) return; @@ -147,6 +135,45 @@ static inline void sock_rps_delete_flow(const struct sock *sk) WRITE_ONCE(table->ents[index], RPS_NO_CPU); } rcu_read_unlock(); +} +#endif /* CONFIG_RPS */ + +static inline bool rfs_is_needed(void) +{ +#ifdef CONFIG_RPS + return static_branch_unlikely(&rfs_needed); +#else + return false; +#endif +} + +static inline void sock_rps_record_flow_hash(__u32 hash) +{ +#ifdef CONFIG_RPS + if (!rfs_is_needed()) + return; + + _sock_rps_record_flow_hash(hash); +#endif +} + +static inline void sock_rps_record_flow(const struct sock *sk) +{ +#ifdef CONFIG_RPS + if (!rfs_is_needed()) + return; + + _sock_rps_record_flow(sk); +#endif +} + +static inline void sock_rps_delete_flow(const struct sock *sk) +{ +#ifdef CONFIG_RPS + if (!rfs_is_needed()) + return; + + _sock_rps_delete_flow(sk); #endif }
From: Christoph Paasch cpaasch@openai.com
Accelerated Receive Flow Steering (aRFS) relies on sockets recording their RX flow hash into the rps_sock_flow_table so that incoming packets are steered to the CPU where the application runs.
With MPTCP, the application interacts with the parent MPTCP socket while data is carried over per-subflow TCP sockets. Without recording these subflows, aRFS cannot steer interrupts and RX processing for the flows to the desired CPU.
Record all subflows in the RPS table by calling sock_rps_record_flow() for each subflow at the start of mptcp_sendmsg(), mptcp_recvmsg() and mptcp_stream_accept(), by using the new helper mptcp_rps_record_subflows().
It does not by itself improve throughput, but ensures that IRQ and RX processing are directed to the right CPU, which is a prerequisite for effective aRFS.
Signed-off-by: Christoph Paasch cpaasch@openai.com Reviewed-by: Matthieu Baerts (NGI0) matttbe@kernel.org Signed-off-by: Matthieu Baerts (NGI0) matttbe@kernel.org --- net/mptcp/protocol.c | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index ad41c48126e44fda646f1ec1c81957db1407a6cc..a8d57b88578dfea807d3d55e430849aa8005c637 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -12,6 +12,7 @@ #include <linux/sched/signal.h> #include <linux/atomic.h> #include <net/aligned_data.h> +#include <net/rps.h> #include <net/sock.h> #include <net/inet_common.h> #include <net/inet_hashtables.h> @@ -1740,6 +1741,20 @@ static u32 mptcp_send_limit(const struct sock *sk) return limit - not_sent; }
+static void mptcp_rps_record_subflows(const struct mptcp_sock *msk) +{ + struct mptcp_subflow_context *subflow; + + if (!rfs_is_needed()) + return; + + mptcp_for_each_subflow(msk, subflow) { + struct sock *ssk = mptcp_subflow_tcp_sock(subflow); + + sock_rps_record_flow(ssk); + } +} + static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) { struct mptcp_sock *msk = mptcp_sk(sk); @@ -1753,6 +1768,8 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
lock_sock(sk);
+ mptcp_rps_record_subflows(msk); + if (unlikely(inet_test_bit(DEFER_CONNECT, sk) || msg->msg_flags & MSG_FASTOPEN)) { int copied_syn = 0; @@ -2131,6 +2148,8 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, goto out_err; }
+ mptcp_rps_record_subflows(msk); + timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
len = min_t(size_t, len, INT_MAX); @@ -3922,6 +3941,8 @@ static int mptcp_stream_accept(struct socket *sock, struct socket *newsock, mptcp_sock_graft(ssk, newsock); }
+ mptcp_rps_record_subflows(msk); + /* Do late cleanup for the first subflow as necessary. Also * deal with bad peers not doing a complete shutdown. */
On Mon, 01 Sep 2025 11:39:09 +0200 Matthieu Baerts (NGI0) wrote:
This series contains 4 independent new features:
Patch 1: use HMAC-SHA256 library instead of open-coded HMAC.
Patches 2-3: make ADD_ADDR retransmission timeout adaptive + simplify selftests.
Patch 4: selftests: check for unexpected fallback counter increments.
Patches 5-6: record subflows in RPS table, for aRFS support.
I don't see why, but kmemleak started to hit this with the join test 2 branches ago :\ Have you seen any kmemleak issues on your side? We also see occasional leaked skb in driver tests which makes no sense.
unreferenced object 0xffff8880029d3340 (size 3016): comm "softirq", pid 0, jiffies 4297316940 hex dump (first 32 bytes): 0a 00 01 02 0a 00 01 01 00 00 00 00 9e b8 7d 27 ..............}' 0a 00 07 41 00 00 00 00 00 00 00 00 00 00 00 00 ...A............ backtrace (crc 3653d88c): kmem_cache_alloc_noprof+0x284/0x330 sk_prot_alloc.constprop.0+0x4e/0x1b0 sk_clone_lock+0x4b/0x10d0 mptcp_sk_clone_init+0x2e/0x10d0 subflow_syn_recv_sock+0x9d1/0x1680 tcp_check_req+0x3a4/0x1910 tcp_v4_rcv+0x1004/0x30a0 ip_protocol_deliver_rcu+0x82/0x350 ip_local_deliver_finish+0x35d/0x620 ip_local_deliver+0x19c/0x470 ip_rcv+0xc2/0x370 __netif_receive_skb_one_core+0x108/0x180 process_backlog+0x3c1/0x13e0 __napi_poll.constprop.0+0x9f/0x460 net_rx_action+0x54f/0xda0 handle_softirqs+0x215/0x610
Hi Jakub,
On 02/09/2025 16:26, Jakub Kicinski wrote:
On Mon, 01 Sep 2025 11:39:09 +0200 Matthieu Baerts (NGI0) wrote:
This series contains 4 independent new features:
Patch 1: use HMAC-SHA256 library instead of open-coded HMAC.
Patches 2-3: make ADD_ADDR retransmission timeout adaptive + simplify selftests.
Patch 4: selftests: check for unexpected fallback counter increments.
Patches 5-6: record subflows in RPS table, for aRFS support.
I don't see why, but kmemleak started to hit this with the join test 2 branches ago :\ Have you seen any kmemleak issues on your side? We also see occasional leaked skb in driver tests which makes no sense.
unreferenced object 0xffff8880029d3340 (size 3016): comm "softirq", pid 0, jiffies 4297316940 hex dump (first 32 bytes): 0a 00 01 02 0a 00 01 01 00 00 00 00 9e b8 7d 27 ..............}' 0a 00 07 41 00 00 00 00 00 00 00 00 00 00 00 00 ...A............ backtrace (crc 3653d88c): kmem_cache_alloc_noprof+0x284/0x330 sk_prot_alloc.constprop.0+0x4e/0x1b0 sk_clone_lock+0x4b/0x10d0 mptcp_sk_clone_init+0x2e/0x10d0 subflow_syn_recv_sock+0x9d1/0x1680 tcp_check_req+0x3a4/0x1910 tcp_v4_rcv+0x1004/0x30a0 ip_protocol_deliver_rcu+0x82/0x350 ip_local_deliver_finish+0x35d/0x620 ip_local_deliver+0x19c/0x470 ip_rcv+0xc2/0x370 __netif_receive_skb_one_core+0x108/0x180 process_backlog+0x3c1/0x13e0 __napi_poll.constprop.0+0x9f/0x460 net_rx_action+0x54f/0xda0 handle_softirqs+0x215/0x610
Thank you for this notification!
No, I didn't notice that on our side. For KMemLeak, now I'm waiting 5 seconds, then I force the scan, and check for issues once. On NIPA, I see that there are still 2 scans + cat, and apparently, the issue was always visible during the 2nd scan:
https://netdev-3.bots.linux.dev/vmksft-mptcp-dbg/results/279881/1-mptcp-join...
https://netdev-3.bots.linux.dev/vmksft-mptcp-dbg/results/280062/1-mptcp-join...
It is unclear why a second scan is needed and only the second one caught something. Was it the same with the strange issues you mentioned in driver tests? Do you think I should re-add the second scan + cat?
When looking at the modifications of this series, it is unclear what could cause that.
Cheers, Matt
On Tue, 2 Sep 2025 16:51:47 +0200 Matthieu Baerts wrote:
It is unclear why a second scan is needed and only the second one caught something. Was it the same with the strange issues you mentioned in driver tests? Do you think I should re-add the second scan + cat?
Not sure, cc: Catalin, from experience it seems like second scan often surfaces issues the first scan missed.
When looking at the modifications of this series, it is unclear what could cause that.
Yes, I don't think it's related to the series. For one thing the series a couple of before the first report.
On Tue, Sep 02, 2025 at 08:27:59AM -0700, Jakub Kicinski wrote:
On Tue, 2 Sep 2025 16:51:47 +0200 Matthieu Baerts wrote:
It is unclear why a second scan is needed and only the second one caught something. Was it the same with the strange issues you mentioned in driver tests? Do you think I should re-add the second scan + cat?
Not sure, cc: Catalin, from experience it seems like second scan often surfaces issues the first scan missed.
It's some of the kmemleak heuristics to reduce false positives. It does a checksum of the object during scanning and only reports a leak if the checksum is the same in two consecutive scans.
Hi Catalin,
2 Sept 2025 20:25:19 Catalin Marinas catalin.marinas@arm.com:
On Tue, Sep 02, 2025 at 08:27:59AM -0700, Jakub Kicinski wrote:
On Tue, 2 Sep 2025 16:51:47 +0200 Matthieu Baerts wrote:
It is unclear why a second scan is needed and only the second one caught something. Was it the same with the strange issues you mentioned in driver tests? Do you think I should re-add the second scan + cat?
Not sure, cc: Catalin, from experience it seems like second scan often surfaces issues the first scan missed.
It's some of the kmemleak heuristics to reduce false positives. It does a checksum of the object during scanning and only reports a leak if the checksum is the same in two consecutive scans.
Thank you for the explanation!
Does that mean a scan should be triggered at the end of the tests, then wait 5 second for the grace period, then trigger another scan and check the results?
Or wait 5 seconds, then trigger two consecutive scans?
Cheers, Matt
On Tue, Sep 02, 2025 at 08:50:19PM +0200, Matthieu Baerts wrote:
Hi Catalin,
2 Sept 2025 20:25:19 Catalin Marinas catalin.marinas@arm.com:
On Tue, Sep 02, 2025 at 08:27:59AM -0700, Jakub Kicinski wrote:
On Tue, 2 Sep 2025 16:51:47 +0200 Matthieu Baerts wrote:
It is unclear why a second scan is needed and only the second one caught something. Was it the same with the strange issues you mentioned in driver tests? Do you think I should re-add the second scan + cat?
Not sure, cc: Catalin, from experience it seems like second scan often surfaces issues the first scan missed.
It's some of the kmemleak heuristics to reduce false positives. It does a checksum of the object during scanning and only reports a leak if the checksum is the same in two consecutive scans.
Thank you for the explanation!
Does that mean a scan should be triggered at the end of the tests, then wait 5 second for the grace period, then trigger another scan and check the results?
Or wait 5 seconds, then trigger two consecutive scans?
The 5 seconds is the minimum age of an object before it gets reported as a leak. It's not related to the scanning process. So you could do two scans in succession and wait 5 seconds before checking for leaks.
However, I'd go with the first option - do a scan, wait 5 seconds and do another. That's mostly because at the end of the scan kmemleak prints if it found new unreferenced objects. It might not print the message if a leaked object is younger than 5 seconds. In practice, though, the scan may take longer, depending on how loaded your system is.
The second option works as well but waiting between them has a better chance of removing false positives if, say, some objects are moved between lists and two consecutive scans do not detect the list_head change (and update the object's checksum).
2 Sept 2025 23:18:56 Catalin Marinas catalin.marinas@arm.com:
On Tue, Sep 02, 2025 at 08:50:19PM +0200, Matthieu Baerts wrote:
Hi Catalin,
2 Sept 2025 20:25:19 Catalin Marinas catalin.marinas@arm.com:
On Tue, Sep 02, 2025 at 08:27:59AM -0700, Jakub Kicinski wrote:
On Tue, 2 Sep 2025 16:51:47 +0200 Matthieu Baerts wrote:
It is unclear why a second scan is needed and only the second one caught something. Was it the same with the strange issues you mentioned in driver tests? Do you think I should re-add the second scan + cat?
Not sure, cc: Catalin, from experience it seems like second scan often surfaces issues the first scan missed.
It's some of the kmemleak heuristics to reduce false positives. It does a checksum of the object during scanning and only reports a leak if the checksum is the same in two consecutive scans.
Thank you for the explanation!
Does that mean a scan should be triggered at the end of the tests, then wait 5 second for the grace period, then trigger another scan and check the results?
Or wait 5 seconds, then trigger two consecutive scans?
The 5 seconds is the minimum age of an object before it gets reported as a leak. It's not related to the scanning process. So you could do two scans in succession and wait 5 seconds before checking for leaks.
However, I'd go with the first option - do a scan, wait 5 seconds and do another. That's mostly because at the end of the scan kmemleak prints if it found new unreferenced objects. It might not print the message if a leaked object is younger than 5 seconds. In practice, though, the scan may take longer, depending on how loaded your system is.
The second option works as well but waiting between them has a better chance of removing false positives if, say, some objects are moved between lists and two consecutive scans do not detect the list_head change (and update the object's checksum).
Thank you for this very nice reply, that's very clear!
I will then adapt our CI having CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF to do a manual scan at the very end, wait 5 seconds and do another.
Cheers, Matt
On Tue, Sep 2, 2025 at 2:38 PM Matthieu Baerts matttbe@kernel.org wrote:
2 Sept 2025 23:18:56 Catalin Marinas catalin.marinas@arm.com:
On Tue, Sep 02, 2025 at 08:50:19PM +0200, Matthieu Baerts wrote:
Hi Catalin,
2 Sept 2025 20:25:19 Catalin Marinas catalin.marinas@arm.com:
On Tue, Sep 02, 2025 at 08:27:59AM -0700, Jakub Kicinski wrote:
On Tue, 2 Sep 2025 16:51:47 +0200 Matthieu Baerts wrote:
It is unclear why a second scan is needed and only the second one caught something. Was it the same with the strange issues you mentioned in driver tests? Do you think I should re-add the second scan + cat?
Not sure, cc: Catalin, from experience it seems like second scan often surfaces issues the first scan missed.
It's some of the kmemleak heuristics to reduce false positives. It does a checksum of the object during scanning and only reports a leak if the checksum is the same in two consecutive scans.
Thank you for the explanation!
Does that mean a scan should be triggered at the end of the tests, then wait 5 second for the grace period, then trigger another scan and check the results?
Or wait 5 seconds, then trigger two consecutive scans?
The 5 seconds is the minimum age of an object before it gets reported as a leak. It's not related to the scanning process. So you could do two scans in succession and wait 5 seconds before checking for leaks.
However, I'd go with the first option - do a scan, wait 5 seconds and do another. That's mostly because at the end of the scan kmemleak prints if it found new unreferenced objects. It might not print the message if a leaked object is younger than 5 seconds. In practice, though, the scan may take longer, depending on how loaded your system is.
The second option works as well but waiting between them has a better chance of removing false positives if, say, some objects are moved between lists and two consecutive scans do not detect the list_head change (and update the object's checksum).
Thank you for this very nice reply, that's very clear!
I will then adapt our CI having CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF to do a manual scan at the very end, wait 5 seconds and do another.
FWIW - I am able to pretty reliably reproduce the kmemleak. However, I also tried adding an inline kmemleak scan to the test harness (did it once with, once without a sleep). When I do that the kmemleak disappears :-)
(not saying that adding the scan isn't useful, just pointing out that this particular leak seems to be related to how quickly we iterate over the testcases)
Christoph
Hello,
On 01/09/2025 11:39, Matthieu Baerts (NGI0) wrote:
This series contains 4 independent new features:
Patch 1: use HMAC-SHA256 library instead of open-coded HMAC.
Patches 2-3: make ADD_ADDR retransmission timeout adaptive + simplify selftests.
I just noticed that NIPA reported some issues due to these 2 patches. In short, some packets (MPTCP ADD_ADDR notifications) can now be retransmitted quicker, but some tests check MIB counters and don't expect retransmissions. If the environment is a bit slow, it is possible to have more retransmissions. We should adapt the tests to avoid false positives.
Is it possible to drop just these two patches? Or do you prefer to mark the whole series as "Changes requested"?
Patch 4: selftests: check for unexpected fallback counter increments.
Patches 5-6: record subflows in RPS table, for aRFS support.
Cheers, Matt
On Tue, 2 Sep 2025 16:29:33 +0200 Matthieu Baerts wrote:
On 01/09/2025 11:39, Matthieu Baerts (NGI0) wrote:
This series contains 4 independent new features:
Patch 1: use HMAC-SHA256 library instead of open-coded HMAC.
Patches 2-3: make ADD_ADDR retransmission timeout adaptive + simplify selftests.
I just noticed that NIPA reported some issues due to these 2 patches. In short, some packets (MPTCP ADD_ADDR notifications) can now be retransmitted quicker, but some tests check MIB counters and don't expect retransmissions. If the environment is a bit slow, it is possible to have more retransmissions. We should adapt the tests to avoid false positives.
Is it possible to drop just these two patches? Or do you prefer to mark the whole series as "Changes requested"?
Your call, we can also apply as is. mptcp-join is ignored, anyway.
2 Sept 2025 21:09:36 Jakub Kicinski kuba@kernel.org:
On Tue, 2 Sep 2025 16:29:33 +0200 Matthieu Baerts wrote:
On 01/09/2025 11:39, Matthieu Baerts (NGI0) wrote:
This series contains 4 independent new features:
Patch 1: use HMAC-SHA256 library instead of open-coded HMAC.
Patches 2-3: make ADD_ADDR retransmission timeout adaptive + simplify
selftests.
I just noticed that NIPA reported some issues due to these 2 patches. In short, some packets (MPTCP ADD_ADDR notifications) can now be retransmitted quicker, but some tests check MIB counters and don't expect retransmissions. If the environment is a bit slow, it is possible to have more retransmissions. We should adapt the tests to avoid false positives.
Is it possible to drop just these two patches? Or do you prefer to mark the whole series as "Changes requested"?
Your call, we can also apply as is. mptcp-join is ignored, anyway.
I realised patch 3/6 is going to cause issues when running on older kernels, so we would need to revert it if we want to apply all patches.
But if you prefer a v2 for the whole series instead of applying 1,4-6, I can also do that :)
Cheers, Matt
On Tue, 2 Sep 2025 21:25:33 +0200 (GMT+02:00) Matthieu Baerts wrote:
I just noticed that NIPA reported some issues due to these 2 patches. In short, some packets (MPTCP ADD_ADDR notifications) can now be retransmitted quicker, but some tests check MIB counters and don't expect retransmissions. If the environment is a bit slow, it is possible to have more retransmissions. We should adapt the tests to avoid false positives.
Is it possible to drop just these two patches? Or do you prefer to mark the whole series as "Changes requested"?
Your call, we can also apply as is. mptcp-join is ignored, anyway.
I realised patch 3/6 is going to cause issues when running on older kernels, so we would need to revert it if we want to apply all patches.
But if you prefer a v2 for the whole series instead of applying 1,4-6, I can also do that :)
Alright, please send a v2, then. Sorry for the flip-flop.
linux-kselftest-mirror@lists.linaro.org