Hi,
The very recent 6.4-rc3 kernel build with AlmaLinux 8.7 on LENOVO 10TX000VCR desktop box fails one test:
[root@host net]# ./fcnal-test.sh [...] TEST: ping out, vrf device+address bind - ns-B loopback IPv6 [ OK ] TEST: ping out, vrf device+address bind - ns-B IPv6 LLA [FAIL] TEST: ping in - ns-A IPv6 [ OK ] [...] Tests passed: 887 Tests failed: 1 [root@host net]#
Please find the config, + dmesg and lshw output here:
https://domac.alu.unizg.hr/~mtodorov/linux/selftests/net-fcnal-test/config-6... https://domac.alu.unizg.hr/~mtodorov/linux/selftests/net-fcnal-test/dmesg.lo... https://domac.alu.unizg.hr/~mtodorov/linux/selftests/net-fcnal-test/lshw.txt
I believe that I have all required configs merged for the selftest/net tests.
Maybe we have a regression?
My knowledge of fcnal-test.sh isn't sufficient to build a smaller reproducer.
Guillaume said in January he could help with the net/fcnal-test.sh, but I was doing the other things in the meantime. Tempus fugit :-/
Best regards, Mirsad
On Wed, May 24, 2023 at 02:17:09PM +0200, Mirsad Todorovac wrote:
Hi,
Hi Mirsad,
The very recent 6.4-rc3 kernel build with AlmaLinux 8.7 on LENOVO 10TX000VCR desktop box fails one test:
[root@host net]# ./fcnal-test.sh [...] TEST: ping out, vrf device+address bind - ns-B loopback IPv6 [ OK ] TEST: ping out, vrf device+address bind - ns-B IPv6 LLA [FAIL] TEST: ping in - ns-A IPv6 [ OK ] [...] Tests passed: 887 Tests failed: 1 [root@host net]#
This test also fails on -net. The problem is specific to ping sockets (same test passes with raw sockets). I believe this test has always failed since fcnal-test.sh started using net.ipv4.ping_group_range (commit e71b7f1f44d3 ("selftests: add ping test with ping_group_range tuned")).
The executed command is:
ip netns exec ns-A ip vrf exec red /usr/bin/ping6 -c1 -w1 -I 2001:db8:3::1 fe80::a846:b5ff:fe4c:da4e%eth1
So ping6 is executed inside VRF 'red' and sets .sin6_scope_id to 'eth1' (which is a slave device of VRF 'red'). Therefore, we have sk->sk_bound_dev_if == 'red' and .sin6_scope_id == 'eth1'. This fails because ping_v6_sendmsg() expects them to be equal:
static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) { ... if (__ipv6_addr_needs_scope_id(ipv6_addr_type(daddr))) oif = u->sin6_scope_id; ... if ((__ipv6_addr_needs_scope_id(addr_type) && !oif) || (addr_type & IPV6_ADDR_MAPPED) || (oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if)) <-- oif='eth1', but ->sk_bound_dev_if='red' return -EINVAL; ... }
I believe this condition should be relaxed to allow the case where ->sk_bound_dev_if is oif's master device (and maybe there are other VRF cases to also consider).
On 5/31/23 20:11, Guillaume Nault wrote:
On Wed, May 24, 2023 at 02:17:09PM +0200, Mirsad Todorovac wrote:
Hi,
Hi Mirsad,
Hi Guillaume,
The very recent 6.4-rc3 kernel build with AlmaLinux 8.7 on LENOVO 10TX000VCR desktop box fails one test:
[root@host net]# ./fcnal-test.sh [...] TEST: ping out, vrf device+address bind - ns-B loopback IPv6 [ OK ] TEST: ping out, vrf device+address bind - ns-B IPv6 LLA [FAIL] TEST: ping in - ns-A IPv6 [ OK ] [...] Tests passed: 887 Tests failed: 1 [root@host net]#
This test also fails on -net. The problem is specific to ping sockets (same test passes with raw sockets). I believe this test has always failed since fcnal-test.sh started using net.ipv4.ping_group_range (commit e71b7f1f44d3 ("selftests: add ping test with ping_group_range tuned")).
The executed command is:
ip netns exec ns-A ip vrf exec red /usr/bin/ping6 -c1 -w1 -I 2001:db8:3::1 fe80::a846:b5ff:fe4c:da4e%eth1
So ping6 is executed inside VRF 'red' and sets .sin6_scope_id to 'eth1' (which is a slave device of VRF 'red'). Therefore, we have sk->sk_bound_dev_if == 'red' and .sin6_scope_id == 'eth1'. This fails because ping_v6_sendmsg() expects them to be equal:
static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) { ... if (__ipv6_addr_needs_scope_id(ipv6_addr_type(daddr))) oif = u->sin6_scope_id; ... if ((__ipv6_addr_needs_scope_id(addr_type) && !oif) || (addr_type & IPV6_ADDR_MAPPED) || (oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if)) <-- oif='eth1', but ->sk_bound_dev_if='red' return -EINVAL; ... }
Thank you for your thorough investigation. It helps a great deal to understand the issue.
I am really not that into the network stack, though I can always smuggle the work on the network stack as a work on high-bandwidth multimedia and do it in day hours.
Probably I need to catch up with the network stack homework.
I believe this condition should be relaxed to allow the case where ->sk_bound_dev_if is oif's master device (and maybe there are other VRF cases to also consider).
I have looked into the code, but currently my knowledge of the code is not sufficient for the intervention.
Thank you, Mirsad
On 5/31/23 20:11, Guillaume Nault wrote:
On Wed, May 24, 2023 at 02:17:09PM +0200, Mirsad Todorovac wrote:
Hi,
Hi Mirsad,
The very recent 6.4-rc3 kernel build with AlmaLinux 8.7 on LENOVO 10TX000VCR desktop box fails one test:
[root@host net]# ./fcnal-test.sh [...] TEST: ping out, vrf device+address bind - ns-B loopback IPv6 [ OK ] TEST: ping out, vrf device+address bind - ns-B IPv6 LLA [FAIL] TEST: ping in - ns-A IPv6 [ OK ] [...] Tests passed: 887 Tests failed: 1 [root@host net]#
This test also fails on -net. The problem is specific to ping sockets (same test passes with raw sockets). I believe this test has always failed since fcnal-test.sh started using net.ipv4.ping_group_range (commit e71b7f1f44d3 ("selftests: add ping test with ping_group_range tuned")).
The executed command is:
ip netns exec ns-A ip vrf exec red /usr/bin/ping6 -c1 -w1 -I 2001:db8:3::1 fe80::a846:b5ff:fe4c:da4e%eth1
So ping6 is executed inside VRF 'red' and sets .sin6_scope_id to 'eth1' (which is a slave device of VRF 'red'). Therefore, we have sk->sk_bound_dev_if == 'red' and .sin6_scope_id == 'eth1'. This fails because ping_v6_sendmsg() expects them to be equal:
static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) { ... if (__ipv6_addr_needs_scope_id(ipv6_addr_type(daddr))) oif = u->sin6_scope_id; ... if ((__ipv6_addr_needs_scope_id(addr_type) && !oif) || (addr_type & IPV6_ADDR_MAPPED) || (oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if)) <-- oif='eth1', but ->sk_bound_dev_if='red' return -EINVAL; ... }
I believe this condition should be relaxed to allow the case where ->sk_bound_dev_if is oif's master device (and maybe there are other VRF cases to also consider).
I've tried something like this, but something makes the kernel stuck here:
TEST: ping out, blocked by route - ns-B loopback IPv6 [ OK ] TEST: ping out, device bind, blocked by route - ns-B loopback IPv6 [ OK ] TEST: ping in, blocked by route - ns-A loopback IPv6 [ OK ] TEST: ping out, unreachable route - ns-B loopback IPv6 [ OK ] TEST: ping out, device bind, unreachable route - ns-B loopback IPv6 [ OK ]
################################################################# With VRF
[hanged process and kernel won't shutdown]
The code is:
--- net/ipv6/ping.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c index c4835dbdfcff..81293e902293 100644 --- a/net/ipv6/ping.c +++ b/net/ipv6/ping.c @@ -73,6 +73,9 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) struct rt6_info *rt; struct pingfakehdr pfh; struct ipcm6_cookie ipc6; + struct net *net = sock_net(sk); + struct net_device *dev = NULL; + struct net_device *mdev = NULL;
err = ping_common_sendmsg(AF_INET6, msg, len, &user_icmph, sizeof(user_icmph)); @@ -111,10 +114,17 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) else if (!oif) oif = np->ucast_oif;
+ if (oif) { + dev = dev_get_by_index(net, oif); + mdev = netdev_master_upper_dev_get(dev); + } + addr_type = ipv6_addr_type(daddr); if ((__ipv6_addr_needs_scope_id(addr_type) && !oif) || (addr_type & IPV6_ADDR_MAPPED) || - (oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if)) + (oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if && + !(mdev && sk->sk_bound_dev_if && + mdev != dev_get_by_index(net, sk->sk_bound_dev_if)))) return -EINVAL;
ipcm6_init_sk(&ipc6, np);
I am obviously doing something very stupid.
Regards, Mirsad
On Tue, Jun 06, 2023 at 08:24:54AM +0200, Mirsad Goran Todorovac wrote:
On 5/31/23 20:11, Guillaume Nault wrote:
I believe this condition should be relaxed to allow the case where ->sk_bound_dev_if is oif's master device (and maybe there are other VRF cases to also consider).
I've tried something like this, but something makes the kernel stuck here:
TEST: ping out, blocked by route - ns-B loopback IPv6 [ OK ] TEST: ping out, device bind, blocked by route - ns-B loopback IPv6 [ OK ] TEST: ping in, blocked by route - ns-A loopback IPv6 [ OK ] TEST: ping out, unreachable route - ns-B loopback IPv6 [ OK ] TEST: ping out, device bind, unreachable route - ns-B loopback IPv6 [ OK ]
################################################################# With VRF
[hanged process and kernel won't shutdown]
The code is:
net/ipv6/ping.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c index c4835dbdfcff..81293e902293 100644 --- a/net/ipv6/ping.c +++ b/net/ipv6/ping.c @@ -73,6 +73,9 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) struct rt6_info *rt; struct pingfakehdr pfh; struct ipcm6_cookie ipc6;
struct net *net = sock_net(sk);
struct net_device *dev = NULL;
struct net_device *mdev = NULL; err = ping_common_sendmsg(AF_INET6, msg, len, &user_icmph, sizeof(user_icmph));
@@ -111,10 +114,17 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) else if (!oif) oif = np->ucast_oif;
if (oif) {
dev = dev_get_by_index(net, oif);
mdev = netdev_master_upper_dev_get(dev);
}
addr_type = ipv6_addr_type(daddr); if ((__ipv6_addr_needs_scope_id(addr_type) && !oif) || (addr_type & IPV6_ADDR_MAPPED) ||
(oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if))
(oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if &&
!(mdev && sk->sk_bound_dev_if &&
mdev != dev_get_by_index(net, sk->sk_bound_dev_if)))) return -EINVAL; ipcm6_init_sk(&ipc6, np);
I am obviously doing something very stupid.
The problem is that dev_get_by_index() holds a reference on 'dev' which your code never releases. Also netdev_master_upper_dev_get() needs rtnl protection. These should have generated some kernel oops.
You can try this instead:
-------- >8 --------
diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c index c4835dbdfcff..f804c11e2146 100644 --- a/net/ipv6/ping.c +++ b/net/ipv6/ping.c @@ -114,7 +114,8 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) addr_type = ipv6_addr_type(daddr); if ((__ipv6_addr_needs_scope_id(addr_type) && !oif) || (addr_type & IPV6_ADDR_MAPPED) || - (oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if)) + (oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if && + l3mdev_master_ifindex_by_index(sock_net(sk), oif) != sk->sk_bound_dev_if)) return -EINVAL;
ipcm6_init_sk(&ipc6, np);
On 6/6/23 15:46, Guillaume Nault wrote:
On Tue, Jun 06, 2023 at 08:24:54AM +0200, Mirsad Goran Todorovac wrote:
On 5/31/23 20:11, Guillaume Nault wrote:
I believe this condition should be relaxed to allow the case where ->sk_bound_dev_if is oif's master device (and maybe there are other VRF cases to also consider).
I've tried something like this, but something makes the kernel stuck here:
TEST: ping out, blocked by route - ns-B loopback IPv6 [ OK ] TEST: ping out, device bind, blocked by route - ns-B loopback IPv6 [ OK ] TEST: ping in, blocked by route - ns-A loopback IPv6 [ OK ] TEST: ping out, unreachable route - ns-B loopback IPv6 [ OK ] TEST: ping out, device bind, unreachable route - ns-B loopback IPv6 [ OK ]
################################################################# With VRF
[hanged process and kernel won't shutdown]
The code is:
net/ipv6/ping.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c index c4835dbdfcff..81293e902293 100644 --- a/net/ipv6/ping.c +++ b/net/ipv6/ping.c @@ -73,6 +73,9 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) struct rt6_info *rt; struct pingfakehdr pfh; struct ipcm6_cookie ipc6;
struct net *net = sock_net(sk);
struct net_device *dev = NULL;
struct net_device *mdev = NULL; err = ping_common_sendmsg(AF_INET6, msg, len, &user_icmph, sizeof(user_icmph));
@@ -111,10 +114,17 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) else if (!oif) oif = np->ucast_oif;
if (oif) {
dev = dev_get_by_index(net, oif);
mdev = netdev_master_upper_dev_get(dev);
}
addr_type = ipv6_addr_type(daddr); if ((__ipv6_addr_needs_scope_id(addr_type) && !oif) || (addr_type & IPV6_ADDR_MAPPED) ||
(oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if))
(oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if &&
!(mdev && sk->sk_bound_dev_if &&
mdev != dev_get_by_index(net, sk->sk_bound_dev_if)))) return -EINVAL; ipcm6_init_sk(&ipc6, np);
I am obviously doing something very stupid.
The problem is that dev_get_by_index() holds a reference on 'dev' which your code never releases. Also netdev_master_upper_dev_get() needs rtnl protection. These should have generated some kernel oops.
You can try this instead:
-------- >8 --------
diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c index c4835dbdfcff..f804c11e2146 100644 --- a/net/ipv6/ping.c +++ b/net/ipv6/ping.c @@ -114,7 +114,8 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) addr_type = ipv6_addr_type(daddr); if ((__ipv6_addr_needs_scope_id(addr_type) && !oif) || (addr_type & IPV6_ADDR_MAPPED) ||
(oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if))
(oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if &&
return -EINVAL;l3mdev_master_ifindex_by_index(sock_net(sk), oif) != sk->sk_bound_dev_if))
ipcm6_init_sk(&ipc6, np);
Yes, I stupidly forgot that.
I came across a fixed version:
------ diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c index c4835dbdfcff..c1d81c49b775 100644 --- a/net/ipv6/ping.c +++ b/net/ipv6/ping.c @@ -73,6 +73,10 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) struct rt6_info *rt; struct pingfakehdr pfh; struct ipcm6_cookie ipc6; + struct net *net = sock_net(sk); + struct net_device *dev = NULL; + struct net_device *mdev = NULL; + struct net_device *bdev = NULL;
err = ping_common_sendmsg(AF_INET6, msg, len, &user_icmph, sizeof(user_icmph)); @@ -111,10 +115,26 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) else if (!oif) oif = np->ucast_oif;
+ if (oif) { + rcu_read_lock(); + dev = dev_get_by_index_rcu(net, oif); + rcu_read_unlock(); + rtnl_lock(); + mdev = netdev_master_upper_dev_get(dev); + rtnl_unlock(); + } + + if (sk->sk_bound_dev_if) { + rcu_read_lock(); + bdev = dev_get_by_index_rcu(net, sk->sk_bound_dev_if); + rcu_read_unlock(); + } + addr_type = ipv6_addr_type(daddr); if ((__ipv6_addr_needs_scope_id(addr_type) && !oif) || (addr_type & IPV6_ADDR_MAPPED) || - (oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if)) + (oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if && + !(mdev && sk->sk_bound_dev_if && bdev && mdev == bdev))) return -EINVAL;
ipcm6_init_sk(&ipc6, np);
However, this works by the test (888 passed) but your two liner is obviously better :-)
Best regards, Mirsad
On Tue, Jun 06, 2023 at 03:57:35PM +0200, Mirsad Todorovac wrote:
diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c index c4835dbdfcff..c1d81c49b775 100644 --- a/net/ipv6/ping.c +++ b/net/ipv6/ping.c @@ -73,6 +73,10 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) struct rt6_info *rt; struct pingfakehdr pfh; struct ipcm6_cookie ipc6;
struct net *net = sock_net(sk);
struct net_device *dev = NULL;
struct net_device *mdev = NULL;
struct net_device *bdev = NULL; err = ping_common_sendmsg(AF_INET6, msg, len, &user_icmph, sizeof(user_icmph));
@@ -111,10 +115,26 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) else if (!oif) oif = np->ucast_oif;
if (oif) {
rcu_read_lock();
dev = dev_get_by_index_rcu(net, oif);
rcu_read_unlock();
You can't assume '*dev' is still valid after rcu_read_unlock() unless you hold a reference on it.
rtnl_lock();
mdev = netdev_master_upper_dev_get(dev);
rtnl_unlock();
Because of that, 'dev' might have already disappeared at the time netdev_master_upper_dev_get() is called. So it may dereference an invalid pointer here.
}
if (sk->sk_bound_dev_if) {
rcu_read_lock();
bdev = dev_get_by_index_rcu(net, sk->sk_bound_dev_if);
rcu_read_unlock();
}
addr_type = ipv6_addr_type(daddr); if ((__ipv6_addr_needs_scope_id(addr_type) && !oif) || (addr_type & IPV6_ADDR_MAPPED) ||
(oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if))
(oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if &&
!(mdev && sk->sk_bound_dev_if && bdev && mdev == bdev))) return -EINVAL; ipcm6_init_sk(&ipc6, np);
However, this works by the test (888 passed) but your two liner is obviously better :-)
:)
Best regards, Mirsad
-- Mirsad Goran Todorovac Sistem inženjer Grafički fakultet | Akademija likovnih umjetnosti Sveučilište u Zagrebu
System engineer Faculty of Graphic Arts | Academy of Fine Arts University of Zagreb, Republic of Croatia
On 6/6/23 16:11, Guillaume Nault wrote:
On Tue, Jun 06, 2023 at 03:57:35PM +0200, Mirsad Todorovac wrote:
diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c index c4835dbdfcff..c1d81c49b775 100644 --- a/net/ipv6/ping.c +++ b/net/ipv6/ping.c @@ -73,6 +73,10 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) struct rt6_info *rt; struct pingfakehdr pfh; struct ipcm6_cookie ipc6;
struct net *net = sock_net(sk);
struct net_device *dev = NULL;
struct net_device *mdev = NULL;
struct net_device *bdev = NULL; err = ping_common_sendmsg(AF_INET6, msg, len, &user_icmph, sizeof(user_icmph));
@@ -111,10 +115,26 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) else if (!oif) oif = np->ucast_oif;
if (oif) {
rcu_read_lock();
dev = dev_get_by_index_rcu(net, oif);
rcu_read_unlock();
You can't assume '*dev' is still valid after rcu_read_unlock() unless you hold a reference on it.
rtnl_lock();
mdev = netdev_master_upper_dev_get(dev);
rtnl_unlock();
Because of that, 'dev' might have already disappeared at the time netdev_master_upper_dev_get() is called. So it may dereference an invalid pointer here.
Good point, thanks. I didn't expect those to change.
This can be fixed, provided that RCU and RTNL locks can be nested:
rcu_read_lock(); if (oif) { dev = dev_get_by_index_rcu(net, oif); rtnl_lock(); mdev = netdev_master_upper_dev_get(dev); rtnl_unlock(); }
if (sk->sk_bound_dev_if) { bdev = dev_get_by_index_rcu(net, sk->sk_bound_dev_if); }
addr_type = ipv6_addr_type(daddr); if ((__ipv6_addr_needs_scope_id(addr_type) && !oif) || (addr_type & IPV6_ADDR_MAPPED) || (oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if && !(mdev && sk->sk_bound_dev_if && bdev && mdev == bdev))) { rcu_read_unlock(); return -EINVAL; } rcu_read_unlock();
But again this is still probably not race-free (bdev might also disappear before the mdev == bdev test), even if it passed fcnal-test.sh, there is much duplication of code, so your one-line solution is obviously by far better. :-)
Much obliged.
Best regards, Mirsad
}
if (sk->sk_bound_dev_if) {
rcu_read_lock();
bdev = dev_get_by_index_rcu(net, sk->sk_bound_dev_if);
rcu_read_unlock();
}
addr_type = ipv6_addr_type(daddr); if ((__ipv6_addr_needs_scope_id(addr_type) && !oif) || (addr_type & IPV6_ADDR_MAPPED) ||
(oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if))
(oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if &&
!(mdev && sk->sk_bound_dev_if && bdev && mdev == bdev))) return -EINVAL; ipcm6_init_sk(&ipc6, np);
However, this works by the test (888 passed) but your two liner is obviously better :-)
:)
On Tue, Jun 06, 2023 at 04:28:02PM +0200, Mirsad Todorovac wrote:
On 6/6/23 16:11, Guillaume Nault wrote:
On Tue, Jun 06, 2023 at 03:57:35PM +0200, Mirsad Todorovac wrote:
if (oif) {
rcu_read_lock();
dev = dev_get_by_index_rcu(net, oif);
rcu_read_unlock();
You can't assume '*dev' is still valid after rcu_read_unlock() unless you hold a reference on it.
rtnl_lock();
mdev = netdev_master_upper_dev_get(dev);
rtnl_unlock();
Because of that, 'dev' might have already disappeared at the time netdev_master_upper_dev_get() is called. So it may dereference an invalid pointer here.
Good point, thanks. I didn't expect those to change.
This can be fixed, provided that RCU and RTNL locks can be nested:
Well, yes and no. You can call rcu_read_{lock,unlock}() while under the rtnl protection, but not the other way around.
rcu_read_lock(); if (oif) { dev = dev_get_by_index_rcu(net, oif); rtnl_lock(); mdev = netdev_master_upper_dev_get(dev); rtnl_unlock(); }
This is invalid: rtnl_lock() uses a mutex, so it can sleep and that's forbidden inside an RCU critical section.
if (sk->sk_bound_dev_if) { bdev = dev_get_by_index_rcu(net, sk->sk_bound_dev_if); } addr_type = ipv6_addr_type(daddr); if ((__ipv6_addr_needs_scope_id(addr_type) && !oif) || (addr_type & IPV6_ADDR_MAPPED) || (oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if && !(mdev && sk->sk_bound_dev_if && bdev && mdev == bdev))) { rcu_read_unlock(); return -EINVAL;
} rcu_read_unlock();
But again this is still probably not race-free (bdev might also disappear before the mdev == bdev test), even if it passed fcnal-test.sh, there is much duplication of code, so your one-line solution is obviously by far better. :-)
The real problem is choosing the right function for getting the master device. In particular netdev_master_upper_dev_get() was a bad choice. It forces you to take the rtnl, which is unnatural here and obliges you to add extra code, while all this shouldn't be necessary in the first place.
Much obliged.
Best regards, Mirsad
On 6/6/23 20:50, Guillaume Nault wrote:
On Tue, Jun 06, 2023 at 04:28:02PM +0200, Mirsad Todorovac wrote:
On 6/6/23 16:11, Guillaume Nault wrote:
On Tue, Jun 06, 2023 at 03:57:35PM +0200, Mirsad Todorovac wrote:
if (oif) {
rcu_read_lock();
dev = dev_get_by_index_rcu(net, oif);
rcu_read_unlock();
You can't assume '*dev' is still valid after rcu_read_unlock() unless you hold a reference on it.
rtnl_lock();
mdev = netdev_master_upper_dev_get(dev);
rtnl_unlock();
Because of that, 'dev' might have already disappeared at the time netdev_master_upper_dev_get() is called. So it may dereference an invalid pointer here.
Good point, thanks. I didn't expect those to change.
This can be fixed, provided that RCU and RTNL locks can be nested:
Well, yes and no. You can call rcu_read_{lock,unlock}() while under the rtnl protection, but not the other way around.
rcu_read_lock(); if (oif) { dev = dev_get_by_index_rcu(net, oif); rtnl_lock(); mdev = netdev_master_upper_dev_get(dev); rtnl_unlock(); }
This is invalid: rtnl_lock() uses a mutex, so it can sleep and that's forbidden inside an RCU critical section.
Obviously, that's bad. Mea culpa.
if (sk->sk_bound_dev_if) { bdev = dev_get_by_index_rcu(net, sk->sk_bound_dev_if); } addr_type = ipv6_addr_type(daddr); if ((__ipv6_addr_needs_scope_id(addr_type) && !oif) || (addr_type & IPV6_ADDR_MAPPED) || (oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if && !(mdev && sk->sk_bound_dev_if && bdev && mdev == bdev))) { rcu_read_unlock(); return -EINVAL;
} rcu_read_unlock();
But again this is still probably not race-free (bdev might also disappear before the mdev == bdev test), even if it passed fcnal-test.sh, there is much duplication of code, so your one-line solution is obviously by far better. :-)
The real problem is choosing the right function for getting the master device. In particular netdev_master_upper_dev_get() was a bad choice. It forces you to take the rtnl, which is unnatural here and obliges you to add extra code, while all this shouldn't be necessary in the first place.
Thank you for the additional insight. I had poor luck with Googling on these.
I made a blunder after blunder. But it was insightful and brainstorming. Good exercise for my little grey cells.
However, learning without making any errors appears to be simply a lot of blunt memorising. :-/
It's good to be in an environment when one can learn from errors.
:-)
Regards, Mirsad
On Tue, Jun 06, 2023 at 09:17:24PM +0200, Mirsad Goran Todorovac wrote:
On 6/6/23 20:50, Guillaume Nault wrote:
On Tue, Jun 06, 2023 at 04:28:02PM +0200, Mirsad Todorovac wrote:
On 6/6/23 16:11, Guillaume Nault wrote:
On Tue, Jun 06, 2023 at 03:57:35PM +0200, Mirsad Todorovac wrote:
if (oif) {
rcu_read_lock();
dev = dev_get_by_index_rcu(net, oif);
rcu_read_unlock();
You can't assume '*dev' is still valid after rcu_read_unlock() unless you hold a reference on it.
rtnl_lock();
mdev = netdev_master_upper_dev_get(dev);
rtnl_unlock();
Because of that, 'dev' might have already disappeared at the time netdev_master_upper_dev_get() is called. So it may dereference an invalid pointer here.
Good point, thanks. I didn't expect those to change.
This can be fixed, provided that RCU and RTNL locks can be nested:
Well, yes and no. You can call rcu_read_{lock,unlock}() while under the rtnl protection, but not the other way around.
rcu_read_lock(); if (oif) { dev = dev_get_by_index_rcu(net, oif); rtnl_lock(); mdev = netdev_master_upper_dev_get(dev); rtnl_unlock(); }
This is invalid: rtnl_lock() uses a mutex, so it can sleep and that's forbidden inside an RCU critical section.
Obviously, that's bad. Mea culpa.
if (sk->sk_bound_dev_if) { bdev = dev_get_by_index_rcu(net, sk->sk_bound_dev_if); } addr_type = ipv6_addr_type(daddr); if ((__ipv6_addr_needs_scope_id(addr_type) && !oif) || (addr_type & IPV6_ADDR_MAPPED) || (oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if && !(mdev && sk->sk_bound_dev_if && bdev && mdev == bdev))) { rcu_read_unlock(); return -EINVAL;
} rcu_read_unlock();
But again this is still probably not race-free (bdev might also disappear before the mdev == bdev test), even if it passed fcnal-test.sh, there is much duplication of code, so your one-line solution is obviously by far better. :-)
The real problem is choosing the right function for getting the master device. In particular netdev_master_upper_dev_get() was a bad choice. It forces you to take the rtnl, which is unnatural here and obliges you to add extra code, while all this shouldn't be necessary in the first place.
Thank you for the additional insight. I had poor luck with Googling on these.
I made a blunder after blunder. But it was insightful and brainstorming. Good exercise for my little grey cells.
However, learning without making any errors appears to be simply a lot of blunt memorising. :-/
It's good to be in an environment when one can learn from errors.
:-)
I'm happy you found this useful.
Regards, Mirsad
On 6/6/23 15:46, Guillaume Nault wrote:
On Tue, Jun 06, 2023 at 08:24:54AM +0200, Mirsad Goran Todorovac wrote:
On 5/31/23 20:11, Guillaume Nault wrote:
I believe this condition should be relaxed to allow the case where ->sk_bound_dev_if is oif's master device (and maybe there are other VRF cases to also consider).
I've tried something like this, but something makes the kernel stuck here:
TEST: ping out, blocked by route - ns-B loopback IPv6 [ OK ] TEST: ping out, device bind, blocked by route - ns-B loopback IPv6 [ OK ] TEST: ping in, blocked by route - ns-A loopback IPv6 [ OK ] TEST: ping out, unreachable route - ns-B loopback IPv6 [ OK ] TEST: ping out, device bind, unreachable route - ns-B loopback IPv6 [ OK ]
################################################################# With VRF
[hanged process and kernel won't shutdown]
The code is:
net/ipv6/ping.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c index c4835dbdfcff..81293e902293 100644 --- a/net/ipv6/ping.c +++ b/net/ipv6/ping.c @@ -73,6 +73,9 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) struct rt6_info *rt; struct pingfakehdr pfh; struct ipcm6_cookie ipc6;
struct net *net = sock_net(sk);
struct net_device *dev = NULL;
struct net_device *mdev = NULL; err = ping_common_sendmsg(AF_INET6, msg, len, &user_icmph, sizeof(user_icmph));
@@ -111,10 +114,17 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) else if (!oif) oif = np->ucast_oif;
if (oif) {
dev = dev_get_by_index(net, oif);
mdev = netdev_master_upper_dev_get(dev);
}
addr_type = ipv6_addr_type(daddr); if ((__ipv6_addr_needs_scope_id(addr_type) && !oif) || (addr_type & IPV6_ADDR_MAPPED) ||
(oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if))
(oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if &&
!(mdev && sk->sk_bound_dev_if &&
mdev != dev_get_by_index(net, sk->sk_bound_dev_if)))) return -EINVAL; ipcm6_init_sk(&ipc6, np);
I am obviously doing something very stupid.
The problem is that dev_get_by_index() holds a reference on 'dev' which your code never releases. Also netdev_master_upper_dev_get() needs rtnl protection. These should have generated some kernel oops.
You can try this instead:
-------- >8 --------
diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c index c4835dbdfcff..f804c11e2146 100644 --- a/net/ipv6/ping.c +++ b/net/ipv6/ping.c @@ -114,7 +114,8 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) addr_type = ipv6_addr_type(daddr); if ((__ipv6_addr_needs_scope_id(addr_type) && !oif) || (addr_type & IPV6_ADDR_MAPPED) ||
(oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if))
(oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if &&
return -EINVAL;l3mdev_master_ifindex_by_index(sock_net(sk), oif) != sk->sk_bound_dev_if))
ipcm6_init_sk(&ipc6, np);
The problem appears to be fixed:
# ./fcnal-test.sh [...] TEST: ping out, vrf device+address bind - ns-B loopback IPv6 [ OK ] TEST: ping out, vrf device+address bind - ns-B IPv6 LLA [ OK ] TEST: ping in - ns-A IPv6 [ OK ] [...] Tests passed: 888 Tests failed: 0 #
The test passed in both environments that manifested the bug.
Tested-by: Mirsad Todorovac mirsad.todorovac@alu.unizg.hr
However, test on my AMD Ubuntu 22.04 box with 6.4-rc5 commit a4d7d7011219 has shown additional four failed tests:
root@host # grep -n FAIL ../fcnal-test-4.log 90:TEST: ping local, VRF bind - VRF IP [FAIL] 92:TEST: ping local, device bind - ns-A IP [FAIL] 116:TEST: ping local, VRF bind - VRF IP [FAIL] 118:TEST: ping local, device bind - ns-A IP [FAIL] root@host #
But you would probably want me to file a separate bug report?
Best regards, Mirsad
On Tue, Jun 06, 2023 at 08:07:36PM +0200, Mirsad Goran Todorovac wrote:
On 6/6/23 15:46, Guillaume Nault wrote:
diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c index c4835dbdfcff..f804c11e2146 100644 --- a/net/ipv6/ping.c +++ b/net/ipv6/ping.c @@ -114,7 +114,8 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) addr_type = ipv6_addr_type(daddr); if ((__ipv6_addr_needs_scope_id(addr_type) && !oif) || (addr_type & IPV6_ADDR_MAPPED) ||
(oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if))
(oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if &&
return -EINVAL; ipcm6_init_sk(&ipc6, np);l3mdev_master_ifindex_by_index(sock_net(sk), oif) != sk->sk_bound_dev_if))
The problem appears to be fixed:
# ./fcnal-test.sh [...] TEST: ping out, vrf device+address bind - ns-B loopback IPv6 [ OK ] TEST: ping out, vrf device+address bind - ns-B IPv6 LLA [ OK ] TEST: ping in - ns-A IPv6 [ OK ] [...] Tests passed: 888 Tests failed: 0 #
The test passed in both environments that manifested the bug.
Tested-by: Mirsad Todorovac mirsad.todorovac@alu.unizg.hr
Thanks. I'll review and post this patch (probably tomorrow).
However, test on my AMD Ubuntu 22.04 box with 6.4-rc5 commit a4d7d7011219 has shown additional four failed tests:
root@host # grep -n FAIL ../fcnal-test-4.log 90:TEST: ping local, VRF bind - VRF IP [FAIL] 92:TEST: ping local, device bind - ns-A IP [FAIL] 116:TEST: ping local, VRF bind - VRF IP [FAIL] 118:TEST: ping local, device bind - ns-A IP [FAIL] root@host #
But you would probably want me to file a separate bug report?
Are these new failures? Do they also happen on the -net tree?
Best regards, Mirsad
On 6/6/23 20:57, Guillaume Nault wrote:
On Tue, Jun 06, 2023 at 08:07:36PM +0200, Mirsad Goran Todorovac wrote:
On 6/6/23 15:46, Guillaume Nault wrote:
diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c index c4835dbdfcff..f804c11e2146 100644 --- a/net/ipv6/ping.c +++ b/net/ipv6/ping.c @@ -114,7 +114,8 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) addr_type = ipv6_addr_type(daddr); if ((__ipv6_addr_needs_scope_id(addr_type) && !oif) || (addr_type & IPV6_ADDR_MAPPED) ||
(oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if))
(oif && sk->sk_bound_dev_if && oif != sk->sk_bound_dev_if &&
ipcm6_init_sk(&ipc6, np);l3mdev_master_ifindex_by_index(sock_net(sk), oif) != sk->sk_bound_dev_if)) return -EINVAL;
The problem appears to be fixed:
# ./fcnal-test.sh [...] TEST: ping out, vrf device+address bind - ns-B loopback IPv6 [ OK ] TEST: ping out, vrf device+address bind - ns-B IPv6 LLA [ OK ] TEST: ping in - ns-A IPv6 [ OK ] [...] Tests passed: 888 Tests failed: 0 #
The test passed in both environments that manifested the bug.
Tested-by: Mirsad Todorovac mirsad.todorovac@alu.unizg.hr
Thanks. I'll review and post this patch (probably tomorrow).
However, test on my AMD Ubuntu 22.04 box with 6.4-rc5 commit a4d7d7011219 has shown additional four failed tests:
root@host # grep -n FAIL ../fcnal-test-4.log 90:TEST: ping local, VRF bind - VRF IP [FAIL] 92:TEST: ping local, device bind - ns-A IP [FAIL] 116:TEST: ping local, VRF bind - VRF IP [FAIL] 118:TEST: ping local, device bind - ns-A IP [FAIL] root@host #
But you would probably want me to file a separate bug report?
Are these new failures? Do they also happen on the -net tree?
I cannot tell if those are new for the architecture (Ubuntu 22.04 + AMD Ryzen)
However, Ubuntu's unsigned 6.3.1 generic mainline kernel is also affected. So, it might seem like an old problem.
(If you could isolate the exact tests, I could try a bisect.)
[...] TEST: ping local, VRF bind - ns-A IP [ OK ] TEST: ping local, VRF bind - VRF IP [FAIL] TEST: ping local, VRF bind - loopback [ OK ] TEST: ping local, device bind - ns-A IP [FAIL] TEST: ping local, device bind - VRF IP [ OK ] [...]
SYSCTL: net.ipv4.raw_l3mdev_accept=1
[...] TEST: ping local, VRF bind - ns-A IP [ OK ] TEST: ping local, VRF bind - VRF IP [FAIL] TEST: ping local, VRF bind - loopback [ OK ] TEST: ping local, device bind - ns-A IP [FAIL] TEST: ping local, device bind - VRF IP [ OK ] [...]
Yes, just tested, w commit 42510dffd0e2 these are still present in fcnal-test.sh output:
[...] TEST: ping local, VRF bind - ns-A IP [ OK ] TEST: ping local, VRF bind - VRF IP [FAIL] TEST: ping local, VRF bind - loopback [ OK ] TEST: ping local, device bind - ns-A IP [FAIL] TEST: ping local, device bind - VRF IP [ OK ] [...] TEST: ping local, VRF bind - ns-A IP [ OK ] TEST: ping local, VRF bind - VRF IP [FAIL] TEST: ping local, VRF bind - loopback [ OK ] TEST: ping local, device bind - ns-A IP [FAIL] TEST: ping local, device bind - VRF IP [ OK ] [...]
I have attached the config used and lshw.
Best regards, Mirsad
On Wed, Jun 07, 2023 at 12:04:52AM +0200, Mirsad Goran Todorovac wrote:
I cannot tell if those are new for the architecture (Ubuntu 22.04 + AMD Ryzen)
However, Ubuntu's unsigned 6.3.1 generic mainline kernel is also affected. So, it might seem like an old problem.
(If you could isolate the exact tests, I could try a bisect.)
[...] TEST: ping local, VRF bind - ns-A IP [ OK ] TEST: ping local, VRF bind - VRF IP [FAIL] TEST: ping local, VRF bind - loopback [ OK ] TEST: ping local, device bind - ns-A IP [FAIL] TEST: ping local, device bind - VRF IP [ OK ] [...]
SYSCTL: net.ipv4.raw_l3mdev_accept=1
[...] TEST: ping local, VRF bind - ns-A IP [ OK ] TEST: ping local, VRF bind - VRF IP [FAIL] TEST: ping local, VRF bind - loopback [ OK ] TEST: ping local, device bind - ns-A IP [FAIL] TEST: ping local, device bind - VRF IP [ OK ] [...]
Yes, just tested, w commit 42510dffd0e2 these are still present in fcnal-test.sh output:
[...] TEST: ping local, VRF bind - ns-A IP [ OK ] TEST: ping local, VRF bind - VRF IP [FAIL] TEST: ping local, VRF bind - loopback [ OK ] TEST: ping local, device bind - ns-A IP [FAIL] TEST: ping local, device bind - VRF IP [ OK ] [...] TEST: ping local, VRF bind - ns-A IP [ OK ] TEST: ping local, VRF bind - VRF IP [FAIL] TEST: ping local, VRF bind - loopback [ OK ] TEST: ping local, device bind - ns-A IP [FAIL] TEST: ping local, device bind - VRF IP [ OK ] [...]
I have the same failures here. They don't seem to be recent. I'll take a look.
I have attached the config used and lshw.
Best regards, Mirsad
On 6/7/23 18:51, Guillaume Nault wrote:
On Wed, Jun 07, 2023 at 12:04:52AM +0200, Mirsad Goran Todorovac wrote:
I cannot tell if those are new for the architecture (Ubuntu 22.04 + AMD Ryzen)
However, Ubuntu's unsigned 6.3.1 generic mainline kernel is also affected. So, it might seem like an old problem.
(If you could isolate the exact tests, I could try a bisect.)
[...] TEST: ping local, VRF bind - ns-A IP [ OK ] TEST: ping local, VRF bind - VRF IP [FAIL] TEST: ping local, VRF bind - loopback [ OK ] TEST: ping local, device bind - ns-A IP [FAIL] TEST: ping local, device bind - VRF IP [ OK ] [...]
SYSCTL: net.ipv4.raw_l3mdev_accept=1
[...] TEST: ping local, VRF bind - ns-A IP [ OK ] TEST: ping local, VRF bind - VRF IP [FAIL] TEST: ping local, VRF bind - loopback [ OK ] TEST: ping local, device bind - ns-A IP [FAIL] TEST: ping local, device bind - VRF IP [ OK ] [...]
Yes, just tested, w commit 42510dffd0e2 these are still present in fcnal-test.sh output:
[...] TEST: ping local, VRF bind - ns-A IP [ OK ] TEST: ping local, VRF bind - VRF IP [FAIL] TEST: ping local, VRF bind - loopback [ OK ] TEST: ping local, device bind - ns-A IP [FAIL] TEST: ping local, device bind - VRF IP [ OK ] [...] TEST: ping local, VRF bind - ns-A IP [ OK ] TEST: ping local, VRF bind - VRF IP [FAIL] TEST: ping local, VRF bind - loopback [ OK ] TEST: ping local, device bind - ns-A IP [FAIL] TEST: ping local, device bind - VRF IP [ OK ] [...]
I have the same failures here. They don't seem to be recent. I'll take a look.
Certainly. I thought it might be something architecture-specific?
I have reproduced it also on a Lenovo IdeaPad 3 with Ubuntu 22.10, but on Lenovo desktop with AlmaLinux 8.8 (CentOS fork), the result was "888/888 passed".
However, I have a question:
In the ping + "With VRF" section, the tests with net.ipv4.raw_l3mdev_accept=1 are repeated twice, while "No VRF" section has the versions:
SYSCTL: net.ipv4.raw_l3mdev_accept=0
and
SYSCTL: net.ipv4.raw_l3mdev_accept=1
The same happens with the IPv6 ping tests.
In that case, it could be that we have only 2 actual FAIL cases, because the error is reported twice.
Is this intentional?
Thanks, Mirsad
74 ################################################################# 75 With VRF 76 77 SYSCTL: net.ipv4.raw_l3mdev_accept=1 78 79 TEST: ping out, VRF bind - ns-B IP [ OK ] 80 TEST: ping out, device bind - ns-B IP [ OK ] 81 TEST: ping out, vrf device + dev address bind - ns-B IP [ OK ] 82 TEST: ping out, vrf device + vrf address bind - ns-B IP [ OK ] 83 TEST: ping out, VRF bind - ns-B loopback IP [ OK ] 84 TEST: ping out, device bind - ns-B loopback IP [ OK ] 85 TEST: ping out, vrf device + dev address bind - ns-B loopback IP [ OK ] 86 TEST: ping out, vrf device + vrf address bind - ns-B loopback IP [ OK ] 87 TEST: ping in - ns-A IP [ OK ] 88 TEST: ping in - VRF IP [ OK ] 89 TEST: ping local, VRF bind - ns-A IP [ OK ] 90 TEST: ping local, VRF bind - VRF IP [FAIL] 91 TEST: ping local, VRF bind - loopback [ OK ] 92 TEST: ping local, device bind - ns-A IP [FAIL] 93 TEST: ping local, device bind - VRF IP [ OK ] 94 TEST: ping local, device bind - loopback [ OK ] 95 TEST: ping out, vrf bind, blocked by rule - ns-B loopback IP [ OK ] 96 TEST: ping out, device bind, blocked by rule - ns-B loopback IP [ OK ] 97 TEST: ping in, blocked by rule - ns-A loopback IP [ OK ] 98 TEST: ping out, vrf bind, unreachable route - ns-B loopback IP [ OK ] 99 TEST: ping out, device bind, unreachable route - ns-B loopback IP [ OK ] 100 TEST: ping in, unreachable route - ns-A loopback IP [ OK ] 101 SYSCTL: net.ipv4.ping_group_range=0 2147483647 102 103 SYSCTL: net.ipv4.raw_l3mdev_accept=1 104 105 TEST: ping out, VRF bind - ns-B IP [ OK ] 106 TEST: ping out, device bind - ns-B IP [ OK ] 107 TEST: ping out, vrf device + dev address bind - ns-B IP [ OK ] 108 TEST: ping out, vrf device + vrf address bind - ns-B IP [ OK ] 109 TEST: ping out, VRF bind - ns-B loopback IP [ OK ] 110 TEST: ping out, device bind - ns-B loopback IP [ OK ] 111 TEST: ping out, vrf device + dev address bind - ns-B loopback IP [ OK ] 112 TEST: ping out, vrf device + vrf address bind - ns-B loopback IP [ OK ] 113 TEST: ping in - ns-A IP [ OK ] 114 TEST: ping in - VRF IP [ OK ] 115 TEST: ping local, VRF bind - ns-A IP [ OK ] 116 TEST: ping local, VRF bind - VRF IP [FAIL] 117 TEST: ping local, VRF bind - loopback [ OK ] 118 TEST: ping local, device bind - ns-A IP [FAIL] 119 TEST: ping local, device bind - VRF IP [ OK ]
On Thu, Jun 08, 2023 at 07:37:15AM +0200, Mirsad Goran Todorovac wrote:
On 6/7/23 18:51, Guillaume Nault wrote:
On Wed, Jun 07, 2023 at 12:04:52AM +0200, Mirsad Goran Todorovac wrote:
[...] TEST: ping local, VRF bind - ns-A IP [ OK ] TEST: ping local, VRF bind - VRF IP [FAIL] TEST: ping local, VRF bind - loopback [ OK ] TEST: ping local, device bind - ns-A IP [FAIL] TEST: ping local, device bind - VRF IP [ OK ] [...] TEST: ping local, VRF bind - ns-A IP [ OK ] TEST: ping local, VRF bind - VRF IP [FAIL] TEST: ping local, VRF bind - loopback [ OK ] TEST: ping local, device bind - ns-A IP [FAIL] TEST: ping local, device bind - VRF IP [ OK ] [...]
I have the same failures here. They don't seem to be recent. I'll take a look.
Certainly. I thought it might be something architecture-specific?
I have reproduced it also on a Lenovo IdeaPad 3 with Ubuntu 22.10, but on Lenovo desktop with AlmaLinux 8.8 (CentOS fork), the result was "888/888 passed".
I've taken a deeper look at these failures. That's actually a problem in ping. That's probably why you have different results depending on the distribution.
The problem is that, for some versions, 'ping -I netdev ...' doesn't bind the socket to 'netdev' if the IPv4 address to ping is set on that same device. The VRF tests depend on this socket binding, so they fail when ping refuses to bind. That was fixed upstream with commit 92ce8ef21393 ("Revert "ping: do not bind to device when destination IP is on device"") (https://github.com/iputils/iputils/commit/92ce8ef2139353da3bf55fe2280bd4abd2...).
Long story short, the tests should pass with the latest upstream ping version.
Alternatively, you can modify the commands run by fcnal-test.sh and provide the -I option twice: one for setting the device binding and one for setting the source IPv4 address. This way ping should accept to bind its socket.
Something like (not tested):
- run_cmd ping -c1 -w1 -I ${VRF} ${a} + run_cmd ping -c1 -w1 -I ${VRF} -I ${a} ${a} [...] - run_cmd ping -c1 -w1 -I ${NSA_DEV} ${a} + run_cmd ping -c1 -w1 -I ${NSA_DEV} -I ${a} ${a}
However, I have a question:
In the ping + "With VRF" section, the tests with net.ipv4.raw_l3mdev_accept=1 are repeated twice, while "No VRF" section has the versions:
SYSCTL: net.ipv4.raw_l3mdev_accept=0
and
SYSCTL: net.ipv4.raw_l3mdev_accept=1
The same happens with the IPv6 ping tests.
In that case, it could be that we have only 2 actual FAIL cases, because the error is reported twice.
Is this intentional?
I don't know why the non-VRF tests are run once with raw_l3mdev_accept=0 and once with raw_l3mdev_accept=1. Unless I'm missing something, this option shouldn't affect non-VRF users. Maybe the objective is to make sure that it really doesn't affect them. David certainly knows better.
Thanks, Mirsad
On 6/9/23 18:13, Guillaume Nault wrote:
On Thu, Jun 08, 2023 at 07:37:15AM +0200, Mirsad Goran Todorovac wrote:
On 6/7/23 18:51, Guillaume Nault wrote:
On Wed, Jun 07, 2023 at 12:04:52AM +0200, Mirsad Goran Todorovac wrote:
[...] TEST: ping local, VRF bind - ns-A IP [ OK ] TEST: ping local, VRF bind - VRF IP [FAIL] TEST: ping local, VRF bind - loopback [ OK ] TEST: ping local, device bind - ns-A IP [FAIL] TEST: ping local, device bind - VRF IP [ OK ] [...] TEST: ping local, VRF bind - ns-A IP [ OK ] TEST: ping local, VRF bind - VRF IP [FAIL] TEST: ping local, VRF bind - loopback [ OK ] TEST: ping local, device bind - ns-A IP [FAIL] TEST: ping local, device bind - VRF IP [ OK ] [...]
I have the same failures here. They don't seem to be recent. I'll take a look.
Certainly. I thought it might be something architecture-specific?
I have reproduced it also on a Lenovo IdeaPad 3 with Ubuntu 22.10, but on Lenovo desktop with AlmaLinux 8.8 (CentOS fork), the result was "888/888 passed".
I've taken a deeper look at these failures. That's actually a problem in ping. That's probably why you have different results depending on the distribution.
Thank you for your work. I feel encouraged by your aim to get to the bottom of the problem ...
The problem is that, for some versions, 'ping -I netdev ...' doesn't bind the socket to 'netdev' if the IPv4 address to ping is set on that same device. The VRF tests depend on this socket binding, so they fail when ping refuses to bind. That was fixed upstream with commit 92ce8ef21393 ("Revert "ping: do not bind to device when destination IP is on device"") (https://github.com/iputils/iputils/commit/92ce8ef2139353da3bf55fe2280bd4abd2...).
Long story short, the tests should pass with the latest upstream ping version.
Alternatively, you can modify the commands run by fcnal-test.sh and provide the -I option twice: one for setting the device binding and one for setting the source IPv4 address. This way ping should accept to bind its socket.
Something like (not tested):
run_cmd ping -c1 -w1 -I ${VRF} ${a}
run_cmd ping -c1 -w1 -I ${VRF} -I ${a} ${a}
[...]
run_cmd ping -c1 -w1 -I ${NSA_DEV} ${a}
run_cmd ping -c1 -w1 -I ${NSA_DEV} -I ${a} ${a}
I have tested this and the fix appears to work:
################################################################# With VRF
SYSCTL: net.ipv4.raw_l3mdev_accept=1
TEST: ping out, VRF bind - ns-B IP [ OK ] TEST: ping out, device bind - ns-B IP [ OK ] TEST: ping out, vrf device + dev address bind - ns-B IP [ OK ] TEST: ping out, vrf device + vrf address bind - ns-B IP [ OK ] TEST: ping out, VRF bind - ns-B loopback IP [ OK ] TEST: ping out, device bind - ns-B loopback IP [ OK ] TEST: ping out, vrf device + dev address bind - ns-B loopback IP [ OK ] TEST: ping out, vrf device + vrf address bind - ns-B loopback IP [ OK ] TEST: ping in - ns-A IP [ OK ] TEST: ping in - VRF IP [ OK ] TEST: ping local, VRF bind - ns-A IP [ OK ] TEST: ping local, VRF bind - VRF IP [ OK ] TEST: ping local, VRF bind - loopback [ OK ] TEST: ping local, device bind - ns-A IP [ OK ] TEST: ping local, device bind - VRF IP [ OK ] TEST: ping local, device bind - loopback [ OK ] TEST: ping out, vrf bind, blocked by rule - ns-B loopback IP [ OK ] TEST: ping out, device bind, blocked by rule - ns-B loopback IP [ OK ] TEST: ping in, blocked by rule - ns-A loopback IP [ OK ] TEST: ping out, vrf bind, unreachable route - ns-B loopback IP [ OK ] TEST: ping out, device bind, unreachable route - ns-B loopback IP [ OK ] TEST: ping in, unreachable route - ns-A loopback IP [ OK ] SYSCTL: net.ipv4.ping_group_range=0 2147483647
SYSCTL: net.ipv4.raw_l3mdev_accept=1
TEST: ping out, VRF bind - ns-B IP [ OK ] TEST: ping out, device bind - ns-B IP [ OK ] TEST: ping out, vrf device + dev address bind - ns-B IP [ OK ] TEST: ping out, vrf device + vrf address bind - ns-B IP [ OK ] TEST: ping out, VRF bind - ns-B loopback IP [ OK ] TEST: ping out, device bind - ns-B loopback IP [ OK ] TEST: ping out, vrf device + dev address bind - ns-B loopback IP [ OK ] TEST: ping out, vrf device + vrf address bind - ns-B loopback IP [ OK ] TEST: ping in - ns-A IP [ OK ] TEST: ping in - VRF IP [ OK ] TEST: ping local, VRF bind - ns-A IP [ OK ] TEST: ping local, VRF bind - VRF IP [ OK ] TEST: ping local, VRF bind - loopback [ OK ] TEST: ping local, device bind - ns-A IP [ OK ] TEST: ping local, device bind - VRF IP [ OK ] TEST: ping local, device bind - loopback [ OK ] TEST: ping out, vrf bind, blocked by rule - ns-B loopback IP [ OK ] TEST: ping out, device bind, blocked by rule - ns-B loopback IP [ OK ] TEST: ping in, blocked by rule - ns-A loopback IP [ OK ] TEST: ping out, vrf bind, unreachable route - ns-B loopback IP [ OK ] TEST: ping out, device bind, unreachable route - ns-B loopback IP [ OK ] TEST: ping in, unreachable route - ns-A loopback IP [ OK ]
###########################################################################
This also works on the Lenovo IdeaPad 3 Ubuntu 22.10 laptop, but on the AlmaLinux 8.8 Lenovo desktop I have a problem:
[root@pc-mtodorov net]# grep FAIL ../fcnal-test-4.log TEST: ping local, VRF bind - ns-A IP [FAIL] TEST: ping local, VRF bind - VRF IP [FAIL] TEST: ping local, device bind - ns-A IP [FAIL] TEST: ping local, VRF bind - ns-A IP [FAIL] TEST: ping local, VRF bind - VRF IP [FAIL] TEST: ping local, device bind - ns-A IP [FAIL] [root@pc-mtodorov net]#
Kernel is the recent one:
[root@pc-mtodorov net]# uname -rms Linux 6.4.0-rc5-testnet-00003-g5b23878f7ed9 x86_64 [root@pc-mtodorov net]#
However, I have a question:
In the ping + "With VRF" section, the tests with net.ipv4.raw_l3mdev_accept=1 are repeated twice, while "No VRF" section has the versions:
SYSCTL: net.ipv4.raw_l3mdev_accept=0
and
SYSCTL: net.ipv4.raw_l3mdev_accept=1
The same happens with the IPv6 ping tests.
In that case, it could be that we have only 2 actual FAIL cases, because the error is reported twice.
Is this intentional?
I don't know why the non-VRF tests are run once with raw_l3mdev_accept=0 and once with raw_l3mdev_accept=1. Unless I'm missing something, this option shouldn't affect non-VRF users. Maybe the objective is to make sure that it really doesn't affect them. David certainly knows better.
The problem appears to be that non-VRF tests are being ran with raw_l3mdev_accept={0|1}, while VRF tests w raw_l3mdev_accept={1|1} ...
I will try to fix that, but I am not sure of the semantics either.
Regards, Mirsad
On Sat, Jun 10, 2023 at 08:04:02PM +0200, Mirsad Goran Todorovac wrote:
This also works on the Lenovo IdeaPad 3 Ubuntu 22.10 laptop, but on the AlmaLinux 8.8 Lenovo desktop I have a problem:
[root@pc-mtodorov net]# grep FAIL ../fcnal-test-4.log TEST: ping local, VRF bind - ns-A IP [FAIL] TEST: ping local, VRF bind - VRF IP [FAIL] TEST: ping local, device bind - ns-A IP [FAIL] TEST: ping local, VRF bind - ns-A IP [FAIL] TEST: ping local, VRF bind - VRF IP [FAIL] TEST: ping local, device bind - ns-A IP [FAIL] [root@pc-mtodorov net]#
Kernel is the recent one:
[root@pc-mtodorov net]# uname -rms Linux 6.4.0-rc5-testnet-00003-g5b23878f7ed9 x86_64 [root@pc-mtodorov net]#
Maybe a problem with the ping version used by the distribution. You can try "./fcnal-test.sh -t ipv4_ping -p -v" to view the commands run and make the script stop when there's a test failure (so that you can see the ping output and try your own commands in the testing environment).
However, I have a question:
In the ping + "With VRF" section, the tests with net.ipv4.raw_l3mdev_accept=1 are repeated twice, while "No VRF" section has the versions:
SYSCTL: net.ipv4.raw_l3mdev_accept=0
and
SYSCTL: net.ipv4.raw_l3mdev_accept=1
The same happens with the IPv6 ping tests.
In that case, it could be that we have only 2 actual FAIL cases, because the error is reported twice.
Is this intentional?
I don't know why the non-VRF tests are run once with raw_l3mdev_accept=0 and once with raw_l3mdev_accept=1. Unless I'm missing something, this option shouldn't affect non-VRF users. Maybe the objective is to make sure that it really doesn't affect them. David certainly knows better.
The problem appears to be that non-VRF tests are being ran with raw_l3mdev_accept={0|1}, while VRF tests w raw_l3mdev_accept={1|1} ...
The reason the VRF tests run twice is to test both raw and ping sockets (using the "net.ipv4.ping_group_range" sysctl). It doesn't seem anyone ever intended to run the VRF tests with raw_l3mdev_accept=0.
Only the non-VRF tests were intended to be tested with raw_l3mdev_accept=0 (see commit c032dd8cc7e2 ("selftests: Add ipv4 ping tests to fcnal-test")). But I have no idea why.
I will try to fix that, but I am not sure of the semantics either.
Regards, Mirsad
On 6/14/23 10:47, Guillaume Nault wrote:
On Sat, Jun 10, 2023 at 08:04:02PM +0200, Mirsad Goran Todorovac wrote:
This also works on the Lenovo IdeaPad 3 Ubuntu 22.10 laptop, but on the AlmaLinux 8.8 Lenovo desktop I have a problem:
[root@pc-mtodorov net]# grep FAIL ../fcnal-test-4.log TEST: ping local, VRF bind - ns-A IP [FAIL] TEST: ping local, VRF bind - VRF IP [FAIL] TEST: ping local, device bind - ns-A IP [FAIL] TEST: ping local, VRF bind - ns-A IP [FAIL] TEST: ping local, VRF bind - VRF IP [FAIL] TEST: ping local, device bind - ns-A IP [FAIL] [root@pc-mtodorov net]#
Kernel is the recent one:
[root@pc-mtodorov net]# uname -rms Linux 6.4.0-rc5-testnet-00003-g5b23878f7ed9 x86_64 [root@pc-mtodorov net]#
Maybe a problem with the ping version used by the distribution. You can try "./fcnal-test.sh -t ipv4_ping -p -v" to view the commands run and make the script stop when there's a test failure (so that you can see the ping output and try your own commands in the testing environment).
Thank you for taking the time for the reply. And thanks for the hint. But I am sort of on ebb tide on this.
It would be good to have the test run on both versions of Linux to test the actual kernel faults. Maybe pack a version of ping command w the test? But I cannot deploy too much time in this.
I hope then the upgrade AlmaLinux 8.8 -> 9.x (or CentOS clones in general) would solve the issue, but it is not guaranteed, and I would lose bisect to the old kernels. Which is why I do not upgrade to the latest releases in the first place. :-/
If it is just the AlmaLinux ping, then it is just an exotic distro, but it is a CentOS clone, so the issue might exist in the more popular Rocky, too.
I am not sure what is the right way to do in this case or I would already have done it. Presumptuous maybe, but true.
However, I have a question:
In the ping + "With VRF" section, the tests with net.ipv4.raw_l3mdev_accept=1 are repeated twice, while "No VRF" section has the versions:
SYSCTL: net.ipv4.raw_l3mdev_accept=0
and
SYSCTL: net.ipv4.raw_l3mdev_accept=1
The same happens with the IPv6 ping tests.
In that case, it could be that we have only 2 actual FAIL cases, because the error is reported twice.
Is this intentional?
I don't know why the non-VRF tests are run once with raw_l3mdev_accept=0 and once with raw_l3mdev_accept=1. Unless I'm missing something, this option shouldn't affect non-VRF users. Maybe the objective is to make sure that it really doesn't affect them. David certainly knows better.
The problem appears to be that non-VRF tests are being ran with raw_l3mdev_accept={0|1}, while VRF tests w raw_l3mdev_accept={1|1} ...
The reason the VRF tests run twice is to test both raw and ping sockets (using the "net.ipv4.ping_group_range" sysctl). It doesn't seem anyone ever intended to run the VRF tests with raw_l3mdev_accept=0.
Only the non-VRF tests were intended to be tested with raw_l3mdev_accept=0 (see commit c032dd8cc7e2 ("selftests: Add ipv4 ping tests to fcnal-test")). But I have no idea why.
Well, you are not to blame if it is not documented.
This thing doesn't come out of the testsuite save by prayer and fasting, I'm afraid ;-)
Best regards, Mirsad
linux-kselftest-mirror@lists.linaro.org