On 11/6/24 6:30 AM, Vladimir Vdovin wrote:
Check the number of paths with fib_info_num_path(), and call update_or_create_fnhe() for every path. The problem is that the pmtu is cached only for the oif that received the ICMP "need to frag" message; the other oifs still use the "default" iface mtu.
An example topology showing the problem:
                        |  host1
                   +---------+
                   |  dummy0 | 10.179.20.18/32  mtu9000
                   +---------+
        +---------------+---------------+
    +---------+                     +---------+
    | ens17f0 |  10.179.2.141/31    | ens17f1 |  10.179.2.13/31
    +---------+                     +---------+
        |    (all here have mtu 9000)   |
    +------+                        +------+
    | ro1  |  10.179.2.140/31       | ro2  |  10.179.2.12/31
    +------+                        +------+
        |                               |
--------+---------------+---------------+------
                        |
                     +-----+
                     | ro3 | 10.10.10.10  mtu1500
                     +-----+
                        |
    ========================================
                  some networks
    ========================================
                        |
                     +-----+
                     | eth0| 10.10.30.30  mtu9000
                     +-----+
                        |  host2
host1 has multipath enabled and sysctl net.ipv4.fib_multipath_hash_policy = 1:
default proto static src 10.179.20.18
        nexthop via 10.179.2.12 dev ens17f1 weight 1
        nexthop via 10.179.2.140 dev ens17f0 weight 1
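To illustrate why two flows to the same destination can leave through different interfaces, here is a minimal user-space sketch of L4 (5-tuple) nexthop selection. It is not the kernel's algorithm: flow_hash() uses SHA-256 as a stand-in for the kernel flow hash, the modulo-by-weight selection is a simplification, and the names NEXTHOPS, flow_hash() and select_nexthop() are hypothetical.

```python
import hashlib

# Nexthops from the route above: (gateway, device, weight).
NEXTHOPS = [
    ("10.179.2.12", "ens17f1", 1),
    ("10.179.2.140", "ens17f0", 1),
]


def flow_hash(src, dst, proto, sport, dport):
    """Stand-in for the kernel's L4 flow hash (NOT the real algorithm)."""
    key = f"{src}|{dst}|{proto}|{sport}|{dport}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")


def select_nexthop(src, dst, proto, sport, dport, nexthops=NEXTHOPS):
    """Pick a nexthop in proportion to its weight, keyed on the 5-tuple."""
    total = sum(weight for _, _, weight in nexthops)
    slot = flow_hash(src, dst, proto, sport, dport) % total
    for gateway, dev, weight in nexthops:
        if slot < weight:
            return gateway, dev
        slot -= weight
    raise AssertionError("unreachable: slot is always below the total weight")


if __name__ == "__main__":
    # Same addresses, different source ports: the chosen oif may differ.
    for sport in (30000, 30033):
        gw, dev = select_nexthop("10.179.20.18", "10.10.30.30", "tcp", sport, 443)
        print(f"sport {sport} -> via {gw} dev {dev}")
```

Because the source port is part of the hash key, a given flow always uses the same nexthop, while different flows between the same two hosts spread over both. The ports that map to each device in this sketch will not match the ones a real kernel picks.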
When host1 tries to do pmtud from 10.179.20.18/32 to host2, host1 receives on the ens17f1 iface an ICMP packet from ro3 saying that ro3's mtu is 1500, and host1 caches it in the nexthop exceptions cache.
The problem is that it is cached only for the iface that received the ICMP packet, and there is no way to make ro3 send the ICMP message to host1 via the other path.
host1 now has these routes to host2:
ip r g 10.10.30.30 sport 30000 dport 443
10.10.30.30 via 10.179.2.12 dev ens17f1 src 10.179.20.18 uid 0
    cache expires 521sec mtu 1500

ip r g 10.10.30.30 sport 30033 dport 443
10.10.30.30 via 10.179.2.140 dev ens17f0 src 10.179.20.18 uid 0
    cache
So when host1 again tries to reach host2 with packets larger than 1500 bytes, a flow that is lucky enough to be hashed to oif=ens17f1 works, while a flow hashed to oif=ens17f0 is blackholed and keeps triggering ICMP messages from ro3 that arrive on ens17f1, until the lucky day when ro3 sends one through another flow to ens17f0.
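The difference between caching the pmtu for one oif and for every path can be sketched with a toy model. This is not kernel code: the class MultipathRoute, its methods, and the dictionary layout are hypothetical, and only mimic the idea of a per-nexthop exception cache keyed by destination.

```python
DEFAULT_MTU = 9000


class MultipathRoute:
    """Toy multipath route with a per-nexthop PMTU exception cache."""

    def __init__(self, devs, default_mtu=DEFAULT_MTU):
        self.devs = list(devs)
        self.default_mtu = default_mtu
        # exceptions[dev][dst] = learned path MTU (hypothetical layout)
        self.exceptions = {dev: {} for dev in self.devs}

    def mtu(self, dev, dst):
        """MTU that a flow leaving through dev would use toward dst."""
        return self.exceptions[dev].get(dst, self.default_mtu)

    def on_frag_needed(self, icmp_dev, dst, mtu, all_paths):
        """Handle an ICMP 'need to frag' message received on icmp_dev.

        all_paths=False models caching for the receiving oif only;
        all_paths=True models updating the exception on every path.
        """
        targets = self.devs if all_paths else [icmp_dev]
        for dev in targets:
            self.exceptions[dev][dst] = mtu


if __name__ == "__main__":
    for all_paths in (False, True):
        route = MultipathRoute(["ens17f1", "ens17f0"])
        route.on_frag_needed("ens17f1", "10.10.30.30", 1500, all_paths)
        print(all_paths, {dev: route.mtu(dev, "10.10.30.30") for dev in route.devs})
```

With all_paths=False only ens17f1 learns mtu 1500 and ens17f0 keeps 9000, which is the blackhole case described above. With all_paths=True both devices use 1500 after a single ICMP message.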
Signed-off-by: Vladimir Vdovin <deliran@verdict.gg>
V8:
selftests in pmtu.sh:
- Change var names from "dummy" to "host"
- Fix errors caused by incorrect iface arguments pass
- Add src addr to setup_multipath_new
- Change multipath* func order
- Change route_get_dst_exception() && route_get_dst_pmtu_from_exception() and arguments pass where they are used as Ido suggested in https://lore.kernel.org/all/ZykH_fdcMBdFgXix@shredder/
V7:
selftest in pmtu.sh:
- add setup_multipath() with old and new nh tests
- add global "dummy_v4" addr variables
- add documentation
- remove dummy netdev usage in mp nh test
- remove useless sysctl opts in mp nh test
V6:
- make commit message cleaner
V5:
- make self test cleaner
V4:
- fix selftest, do route lookup before checking cached exceptions
V3:
- add selftest
- fix compile error
V2:
- fix fib_info_num_path parameter pass
 net/ipv4/route.c                    |  13 +++
 tools/testing/selftests/net/pmtu.sh | 119 ++++++++++++++++++++++++----
 2 files changed, 115 insertions(+), 17 deletions(-)
Reviewed-by: David Ahern <dsahern@kernel.org>