All packets in the same flow (L3/L4 depending on multipath hash policy) should be directed to the same target, but after [0]/[1] we see stray packets directed towards other targets. This, for instance, causes RST to be sent on TCP connections.
The first two patches solve the problem by ignoring route hints for destinations that are part of multipath group, by using new SKB flags for IPv4 and IPv6. The third patch is a selftest that tests the scenario.
Thanks to Ido, for reviewing and suggesting a way forward in [2] and also suggesting how to write a selftest for this.
v2->v3: - Add NULL check for skb in fib6_select_path (Ido Schimmel) - Use fib_tests.sh for selftest instead of the forwarding suite (Ido Schimmel) v1->v2: - Update to commit messages describing the solution (Ido Schimmel) - Use perf stat to count fib table lookups in selftest (Ido Schimmel)
Sriram Yagnaraman (3): ipv4: ignore dst hint for multipath routes ipv6: ignore dst hint for multipath routes selftests: fib_tests: Add multipath list receive tests
include/linux/ipv6.h | 1 + include/net/ip.h | 1 + net/ipv4/ip_input.c | 3 +- net/ipv4/route.c | 1 + net/ipv6/ip6_input.c | 3 +- net/ipv6/route.c | 3 + tools/testing/selftests/net/fib_tests.sh | 150 ++++++++++++++++++++++- 7 files changed, 159 insertions(+), 3 deletions(-)
Route hints when the nexthop is part of a multipath group causes packets in the same receive batch to be sent to the same nexthop irrespective of the multipath hash of the packet. So, do not extract route hint for packets whose destination is part of a multipath group.
A new SKB flag IPSKB_MULTIPATH is introduced for this purpose, set the flag when route is looked up in ip_mkroute_input() and use it in ip_extract_route_hint() to check for the existence of the flag.
Fixes: 02b24941619f ("ipv4: use dst hint for ipv4 list receive") Signed-off-by: Sriram Yagnaraman sriram.yagnaraman@est.tech --- include/net/ip.h | 1 + net/ipv4/ip_input.c | 3 ++- net/ipv4/route.c | 1 + 3 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/include/net/ip.h b/include/net/ip.h index 19adacd5ece0..464176a88f86 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -57,6 +57,7 @@ struct inet_skb_parm { #define IPSKB_FRAG_PMTU BIT(6) #define IPSKB_L3SLAVE BIT(7) #define IPSKB_NOPOLICY BIT(8) +#define IPSKB_MULTIPATH BIT(9)
u16 frag_max_size; }; diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c index fe9ead9ee863..5e9c8156656a 100644 --- a/net/ipv4/ip_input.c +++ b/net/ipv4/ip_input.c @@ -584,7 +584,8 @@ static void ip_sublist_rcv_finish(struct list_head *head) static struct sk_buff *ip_extract_route_hint(const struct net *net, struct sk_buff *skb, int rt_type) { - if (fib4_has_custom_rules(net) || rt_type == RTN_BROADCAST) + if (fib4_has_custom_rules(net) || rt_type == RTN_BROADCAST || + IPCB(skb)->flags & IPSKB_MULTIPATH) return NULL;
return skb; diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 92fede388d52..33626619aee7 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -2144,6 +2144,7 @@ static int ip_mkroute_input(struct sk_buff *skb, int h = fib_multipath_hash(res->fi->fib_net, NULL, skb, hkeys);
fib_select_multipath(res, h); + IPCB(skb)->flags |= IPSKB_MULTIPATH; } #endif
On Mon, Aug 28, 2023 at 01:32:19PM +0200, Sriram Yagnaraman wrote:
Route hints when the nexthop is part of a multipath group causes packets in the same receive batch to be sent to the same nexthop irrespective of the multipath hash of the packet. So, do not extract route hint for packets whose destination is part of a multipath group.
A new SKB flag IPSKB_MULTIPATH is introduced for this purpose, set the flag when route is looked up in ip_mkroute_input() and use it in ip_extract_route_hint() to check for the existence of the flag.
Fixes: 02b24941619f ("ipv4: use dst hint for ipv4 list receive") Signed-off-by: Sriram Yagnaraman sriram.yagnaraman@est.tech
Reviewed-by: Ido Schimmel idosch@nvidia.com
On 8/28/23 5:32 AM, Sriram Yagnaraman wrote:
Route hints when the nexthop is part of a multipath group causes packets in the same receive batch to be sent to the same nexthop irrespective of the multipath hash of the packet. So, do not extract route hint for packets whose destination is part of a multipath group.
A new SKB flag IPSKB_MULTIPATH is introduced for this purpose, set the flag when route is looked up in ip_mkroute_input() and use it in ip_extract_route_hint() to check for the existence of the flag.
Fixes: 02b24941619f ("ipv4: use dst hint for ipv4 list receive") Signed-off-by: Sriram Yagnaraman sriram.yagnaraman@est.tech
include/net/ip.h | 1 + net/ipv4/ip_input.c | 3 ++- net/ipv4/route.c | 1 + 3 files changed, 4 insertions(+), 1 deletion(-)
Reviewed-by: David Ahern dsahern@kernel.org
Route hints when the nexthop is part of a multipath group causes packets in the same receive batch to be sent to the same nexthop irrespective of the multipath hash of the packet. So, do not extract route hint for packets whose destination is part of a multipath group.
A new SKB flag IP6SKB_MULTIPATH is introduced for this purpose, set the flag when route is looked up in fib6_select_path() and use it in ip6_can_use_hint() to check for the existence of the flag.
Fixes: 197dbf24e360 ("ipv6: introduce and uses route look hints for list input.") Signed-off-by: Sriram Yagnaraman sriram.yagnaraman@est.tech --- include/linux/ipv6.h | 1 + net/ipv6/ip6_input.c | 3 ++- net/ipv6/route.c | 3 +++ 3 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h index 839247a4f48e..fe3492a67b35 100644 --- a/include/linux/ipv6.h +++ b/include/linux/ipv6.h @@ -146,6 +146,7 @@ struct inet6_skb_parm { #define IP6SKB_JUMBOGRAM 128 #define IP6SKB_SEG6 256 #define IP6SKB_FAKEJUMBO 512 +#define IP6SKB_MULTIPATH 1024 };
#if defined(CONFIG_NET_L3_MASTER_DEV) diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c index d94041bb4287..b8378814532c 100644 --- a/net/ipv6/ip6_input.c +++ b/net/ipv6/ip6_input.c @@ -99,7 +99,8 @@ static bool ip6_can_use_hint(const struct sk_buff *skb, static struct sk_buff *ip6_extract_route_hint(const struct net *net, struct sk_buff *skb) { - if (fib6_routes_require_src(net) || fib6_has_custom_rules(net)) + if (fib6_routes_require_src(net) || fib6_has_custom_rules(net) || + IP6CB(skb)->flags & IP6SKB_MULTIPATH) return NULL;
return skb; diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 56a55585eb79..a02328c93a53 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -424,6 +424,9 @@ void fib6_select_path(const struct net *net, struct fib6_result *res, if (match->nh && have_oif_match && res->nh) return;
+ if (skb) + IP6CB(skb)->flags |= IP6SKB_MULTIPATH; + /* We might have already computed the hash for ICMPv6 errors. In such * case it will always be non-zero. Otherwise now is the time to do it. */
On Mon, Aug 28, 2023 at 01:32:20PM +0200, Sriram Yagnaraman wrote:
Route hints when the nexthop is part of a multipath group causes packets in the same receive batch to be sent to the same nexthop irrespective of the multipath hash of the packet. So, do not extract route hint for packets whose destination is part of a multipath group.
A new SKB flag IP6SKB_MULTIPATH is introduced for this purpose, set the flag when route is looked up in fib6_select_path() and use it in ip6_can_use_hint() to check for the existence of the flag.
Fixes: 197dbf24e360 ("ipv6: introduce and uses route look hints for list input.") Signed-off-by: Sriram Yagnaraman sriram.yagnaraman@est.tech
Reviewed-by: Ido Schimmel idosch@nvidia.com
On 8/28/23 5:32 AM, Sriram Yagnaraman wrote:
Route hints when the nexthop is part of a multipath group causes packets in the same receive batch to be sent to the same nexthop irrespective of the multipath hash of the packet. So, do not extract route hint for packets whose destination is part of a multipath group.
A new SKB flag IP6SKB_MULTIPATH is introduced for this purpose, set the flag when route is looked up in fib6_select_path() and use it in ip6_can_use_hint() to check for the existence of the flag.
Fixes: 197dbf24e360 ("ipv6: introduce and uses route look hints for list input.") Signed-off-by: Sriram Yagnaraman sriram.yagnaraman@est.tech
include/linux/ipv6.h | 1 + net/ipv6/ip6_input.c | 3 ++- net/ipv6/route.c | 3 +++ 3 files changed, 6 insertions(+), 1 deletion(-)
Reviewed-by: David Ahern dsahern@kernel.org
The test uses perf stat to count the number of fib:fib_table_lookup tracepoint hits for IPv4 and the number of fib6:fib6_table_lookup for IPv6. The measured count is checked to be within 5% of the total number of packets sent via veth1.
Signed-off-by: Sriram Yagnaraman sriram.yagnaraman@est.tech --- tools/testing/selftests/net/fib_tests.sh | 150 ++++++++++++++++++++++- 1 file changed, 149 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh index 35d89dfa6f11..1cf78cf4d346 100755 --- a/tools/testing/selftests/net/fib_tests.sh +++ b/tools/testing/selftests/net/fib_tests.sh @@ -9,7 +9,7 @@ ret=0 ksft_skip=4
# all tests in this script. Can be overridden with -t option -TESTS="unregister down carrier nexthop suppress ipv6_notify ipv4_notify ipv6_rt ipv4_rt ipv6_addr_metric ipv4_addr_metric ipv6_route_metrics ipv4_route_metrics ipv4_route_v6_gw rp_filter ipv4_del_addr ipv4_mangle ipv6_mangle ipv4_bcast_neigh" +TESTS="unregister down carrier nexthop suppress ipv6_notify ipv4_notify ipv6_rt ipv4_rt ipv6_addr_metric ipv4_addr_metric ipv6_route_metrics ipv4_route_metrics ipv4_route_v6_gw rp_filter ipv4_del_addr ipv4_mangle ipv6_mangle ipv4_bcast_neigh ipv4_mpath_list ipv6_mpath_list"
VERBOSE=0 PAUSE_ON_FAIL=no @@ -2138,6 +2138,152 @@ ipv4_bcast_neigh_test() cleanup }
+mpath_dep_check() +{ + if [ ! -x "$(command -v mausezahn)" ]; then + echo "mausezahn command not found. Skipping test" + return 1 + fi + + if [ ! -x "$(command -v jq)" ]; then + echo "jq command not found. Skipping test" + return 1 + fi + + if [ ! -x "$(command -v bc)" ]; then + echo "bc command not found. Skipping test" + return 1 + fi + + if [ ! -x "$(command -v perf)" ]; then + echo "perf command not found. Skipping test" + return 1 + fi + + perf list fib:* | grep -q fib_table_lookup + if [ $? -ne 0 ]; then + echo "IPv4 FIB tracepoint not found. Skipping test" + return 1 + fi + + perf list fib6:* | grep -q fib6_table_lookup + if [ $? -ne 0 ]; then + echo "IPv6 FIB tracepoint not found. Skipping test" + return 1 + fi + + return 0 +} + +list_rcv_eval() +{ + local name=$1; shift + local file=$1; shift + local expected=$1; shift + local exp=$1; shift + + + local count=$(tail -n 1 $file | jq '.["counter-value"] | tonumber | floor') + local ratio=$(echo "scale=2; $count / $expected" | bc -l) + local res=$(echo "$ratio $exp" | bc) + [[ $res -eq 1 ]] + log_test $? 0 "$name route hit ratio ($ratio)" +} + +ipv4_mpath_list_test() +{ + echo + echo "IPv4 multipath list receive tests" + + mpath_dep_check || return 1 + + route_setup + + set -e + run_cmd "ip netns exec ns1 ethtool -K veth1 tcp-segmentation-offload off" + + run_cmd "ip netns exec ns2 bash -c "echo 20000 > /sys/class/net/veth2/gro_flush_timeout"" + run_cmd "ip netns exec ns2 bash -c "echo 1 > /sys/class/net/veth2/napi_defer_hard_irqs"" + run_cmd "ip netns exec ns2 ethtool -K veth2 generic-receive-offload on" + run_cmd "ip -n ns2 link add name nh1 up type dummy" + run_cmd "ip -n ns2 link add name nh2 up type dummy" + run_cmd "ip -n ns2 address add 172.16.201.1/24 dev nh1" + run_cmd "ip -n ns2 address add 172.16.202.1/24 dev nh2" + run_cmd "ip -n ns2 neigh add 172.16.201.2 lladdr 00:11:22:33:44:55 nud perm dev nh1" + run_cmd "ip -n ns2 neigh add 172.16.202.2 lladdr 00:aa:bb:cc:dd:ee nud perm dev nh2" + run_cmd "ip -n ns2 route add 203.0.113.0/24 + nexthop via 172.16.201.2 nexthop via 172.16.202.2" + run_cmd "ip netns exec ns2 sysctl -qw net.ipv4.fib_multipath_hash_policy=1" + set +e + + local dmac=$(ip -n ns2 -j link show dev veth2 | jq -r '.[]["address"]') + local tmp_file=$(mktemp) + local cmd="ip netns exec ns1 mausezahn veth1 -a own -b $dmac + -A 172.16.101.1 -B 203.0.113.1 -t udp 'sp=12345,dp=0-65535' -q" + + # Packets forwarded in a list using a multipath route must not reuse a + # cached result so that a flow always hits the same nexthop. In other + # words, the FIB lookup tracepoint needs to be triggered for every + # packet. + run_cmd "perf stat -e fib:fib_table_lookup --filter 'err == 0' -j -o $tmp_file -- $cmd" + list_rcv_eval "Multipath" $tmp_file 65536 ">= 0.95" + + # The same is not true for a single path route. + run_cmd "ip -n ns2 route replace 203.0.113.0/24 nexthop via 172.16.201.2" + run_cmd "perf stat -e fib:fib_table_lookup --filter 'err == 0' -j -o $tmp_file -- $cmd" + list_rcv_eval "Single path" $tmp_file 65536 "< 0.95" + + rm $tmp_file + route_cleanup +} + +ipv6_mpath_list_test() +{ + echo + echo "IPv6 multipath list receive tests" + + mpath_dep_check || return 1 + + route_setup + + set -e + run_cmd "ip netns exec ns1 ethtool -K veth1 tcp-segmentation-offload off" + + run_cmd "ip netns exec ns2 bash -c "echo 20000 > /sys/class/net/veth2/gro_flush_timeout"" + run_cmd "ip netns exec ns2 bash -c "echo 1 > /sys/class/net/veth2/napi_defer_hard_irqs"" + run_cmd "ip netns exec ns2 ethtool -K veth2 generic-receive-offload on" + run_cmd "ip -n ns2 link add name nh1 up type dummy" + run_cmd "ip -n ns2 link add name nh2 up type dummy" + run_cmd "ip -n ns2 -6 address add 2001:db8:201::1/64 dev nh1" + run_cmd "ip -n ns2 -6 address add 2001:db8:202::1/64 dev nh2" + run_cmd "ip -n ns2 -6 neigh add 2001:db8:201::2 lladdr 00:11:22:33:44:55 nud perm dev nh1" + run_cmd "ip -n ns2 -6 neigh add 2001:db8:202::2 lladdr 00:aa:bb:cc:dd:ee nud perm dev nh2" + run_cmd "ip -n ns2 -6 route add 2001:db8:301::/64 + nexthop via 2001:db8:201::2 nexthop via 2001:db8:202::2" + run_cmd "ip netns exec ns2 sysctl -qw net.ipv6.fib_multipath_hash_policy=1" + set +e + + local dmac=$(ip -n ns2 -j link show dev veth2 | jq -r '.[]["address"]') + local tmp_file=$(mktemp) + local cmd="ip netns exec ns1 mausezahn -6 veth1 -a own -b $dmac + -A 2001:db8:101::1 -B 2001:db8:301::1 -t udp 'sp=12345,dp=0-65535' -q" + + # Packets forwarded in a list using a multipath route must not reuse a + # cached result so that a flow always hits the same nexthop. In other + # words, the FIB lookup tracepoint needs to be triggered for every + # packet. + run_cmd "perf stat -e fib6:fib6_table_lookup --filter 'err == 0' -j -o $tmp_file -- $cmd" + list_rcv_eval "Multipath" $tmp_file 65536 ">= 0.95" + + # The same is not true for a single path route. + run_cmd "ip -n ns2 route replace 2001:db8:301::/64 nexthop via 2001:db8:201::2" + run_cmd "perf stat -e fib6:fib6_table_lookup --filter 'err == 0' -j -o $tmp_file -- $cmd" + list_rcv_eval "Single path" $tmp_file 65536 "< 0.95" + + rm $tmp_file + route_cleanup +} + ################################################################################ # usage
@@ -2217,6 +2363,8 @@ do ipv4_mangle) ipv4_mangle_test;; ipv6_mangle) ipv6_mangle_test;; ipv4_bcast_neigh) ipv4_bcast_neigh_test;; + ipv4_mpath_list) ipv4_mpath_list_test;; + ipv6_mpath_list) ipv6_mpath_list_test;;
help) echo "Test names: $TESTS"; exit 0;; esac
On Mon, Aug 28, 2023 at 01:32:21PM +0200, Sriram Yagnaraman wrote:
The test uses perf stat to count the number of fib:fib_table_lookup tracepoint hits for IPv4 and the number of fib6:fib6_table_lookup for IPv6. The measured count is checked to be within 5% of the total number of packets sent via veth1.
Signed-off-by: Sriram Yagnaraman sriram.yagnaraman@est.tech
I just tested this with a debug config and noticed that the single path test is not very stable. It's not really related to the bug fix, so I think you can simply remove it.
Jakub / Paolo, this change conflicts with changes in net-next and I assume that the next PR that you are going to send is from net-next. What is your preference in this case? Wait for the PR to be accepted and for master to be merged into net?
Thanks
On 8/28/23 9:24 AM, Ido Schimmel wrote:
On Mon, Aug 28, 2023 at 01:32:21PM +0200, Sriram Yagnaraman wrote:
The test uses perf stat to count the number of fib:fib_table_lookup tracepoint hits for IPv4 and the number of fib6:fib6_table_lookup for IPv6. The measured count is checked to be within 5% of the total number of packets sent via veth1.
Signed-off-by: Sriram Yagnaraman sriram.yagnaraman@est.tech
I just tested this with a debug config and noticed that the single path test is not very stable. It's not really related to the bug fix, so I think you can simply remove it.
Jakub / Paolo, this change conflicts with changes in net-next and I assume that the next PR that you are going to send is from net-next. What is your preference in this case? Wait for the PR to be accepted and for master to be merged into net?
Thanks
Ido, thanks for staying on top of this change and the details with the test cases.
On Mon, 28 Aug 2023 18:24:20 +0300 Ido Schimmel wrote:
Jakub / Paolo, this change conflicts with changes in net-next and I assume that the next PR that you are going to send is from net-next. What is your preference in this case? Wait for the PR to be accepted and for master to be merged into net?
The trees will be merged before the PR, in the next 24h. As soon as they are you can resend for net-next.
-----Original Message----- From: Ido Schimmel idosch@idosch.org Sent: Monday, 28 August 2023 17:24 To: Sriram Yagnaraman sriram.yagnaraman@est.tech; kuba@kernel.org; pabeni@redhat.com Cc: netdev@vger.kernel.org; linux-kselftest@vger.kernel.org; David S . Miller davem@davemloft.net; Eric Dumazet edumazet@google.com; Jakub Kicinski kuba@kernel.org; Paolo Abeni pabeni@redhat.com; David Ahern dsahern@kernel.org; Ido Schimmel idosch@nvidia.com; Shuah Khan shuah@kernel.org; Petr Machata petrm@nvidia.com Subject: Re: [PATCH net v3 3/3] selftests: fib_tests: Add multipath list receive tests
On Mon, Aug 28, 2023 at 01:32:21PM +0200, Sriram Yagnaraman wrote:
The test uses perf stat to count the number of fib:fib_table_lookup tracepoint hits for IPv4 and the number of fib6:fib6_table_lookup for IPv6. The measured count is checked to be within 5% of the total number of packets sent via veth1.
Signed-off-by: Sriram Yagnaraman sriram.yagnaraman@est.tech
I just tested this with a debug config and noticed that the single path test is not very stable. It's not really related to the bug fix, so I think you can simply remove it.
Sent v4 with just the multipath test and rebased to latest after the merge with net-next.
If it is OK with all of you here, should I try to improve this test to verify TCP resets don't happen when the nexthop is in a multipath group, perhaps using iperf3? I can send another patch if/when I get something working.
On Wed, Aug 30, 2023 at 09:17:22AM +0000, Sriram Yagnaraman wrote:
If it is OK with all of you here, should I try to improve this test to verify TCP resets don't happen when the nexthop is in a multipath group, perhaps using iperf3? I can send another patch if/when I get something working.
Yes, just make sure it's stable. That is, the test reliably fails without the fixes and reliably passes with the fixes.
linux-kselftest-mirror@lists.linaro.org