The rtnetlink test for preferred lifetime of an address is quite flaky. Problems started around the 6.16 merge window in May. The test fails with:
FAIL: preferred_lft addresses remaining
and unlike most of our flakes this one fails on the "normal" kernel builds, not the builds with kernel/configs/debug.config. I suspect the flakes may be related to power saving, since the expirations run from a "power efficient" workqueue. Adding a short sleep seems to decrease the flakes by 8x but they still happen. With this patch in place we get a flake every couple of weeks, not every couple of days. Better ideas welcome..
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: liuhangbin@gmail.com
CC: shuah@kernel.org
CC: linux-kselftest@vger.kernel.org
---
 tools/testing/selftests/net/rtnetlink.sh | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index 2e8243a65b50..b9e1497ea27a 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -299,6 +299,11 @@ kci_test_addrlft()
 	done
 
 	sleep 5
+	# Schedule out for a bit, address GC runs from the power efficient WQ
+	# if the long sleep above has put the whole system into sleep state
+	# the WQ may have not had a chance to run.
+	sleep 0.1
+
 	run_cmd_grep_fail "10.23.11." ip addr show dev "$devdummy"
 	if [ $? -eq 0 ]; then
 		check_err 1
On Thu, Jul 10, 2025 at 07:53:12AM -0700, Jakub Kicinski wrote:
> The rtnetlink test for preferred lifetime of an address is quite flaky. Problems started around the 6.16 merge window in May. The test fails with:
> FAIL: preferred_lft addresses remaining
> and unlike most of our flakes this one fails on the "normal" kernel builds, not the builds with kernel/configs/debug.config. I suspect the flakes may be related to power saving, since the expirations run from a "power efficient" workqueue. Adding a short sleep seems to decrease the flakes by 8x but they still happen. With this patch in place we get a flake every couple of weeks, not every couple of days. Better ideas welcome..
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> CC: liuhangbin@gmail.com
> CC: shuah@kernel.org
> CC: linux-kselftest@vger.kernel.org
> ---
>  tools/testing/selftests/net/rtnetlink.sh | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
> index 2e8243a65b50..b9e1497ea27a 100755
> --- a/tools/testing/selftests/net/rtnetlink.sh
> +++ b/tools/testing/selftests/net/rtnetlink.sh
> @@ -299,6 +299,11 @@ kci_test_addrlft()
>  	done
>
>  	sleep 5
> +	# Schedule out for a bit, address GC runs from the power efficient WQ
> +	# if the long sleep above has put the whole system into sleep state
> +	# the WQ may have not had a chance to run.
> +	sleep 0.1
> +
How about using slowwait to check if the address still exists? e.g.

check_addr_not_exist()
{
	dev=$1
	addr=$2
	# Succeed only once no matching address is left on the device
	if ip addr show dev "$dev" | grep -q "$addr"; then
		return 1
	else
		return 0
	fi
}

slowwait 5 check_addr_not_exist "$devdummy" "10.23.11."
>  	run_cmd_grep_fail "10.23.11." ip addr show dev "$devdummy"
>  	if [ $? -eq 0 ]; then
>  		check_err 1
> --
> 2.50.0
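
The polling idea suggested above is to retry a predicate at short intervals until it passes or a timeout expires, instead of checking once after a fixed sleep. A minimal sketch of that idea follows. `poll_until` and `addr_gone` are hypothetical names used only for illustration; this is not the selftests' actual slowwait helper, which may differ in detail.

```shell
#!/bin/sh
# poll_until TIMEOUT_SEC CMD...: hypothetical stand-in for a slowwait-style
# helper. Retries CMD every 0.1s until it succeeds or TIMEOUT_SEC has elapsed.
poll_until()
{
	local timeout_sec=$1; shift
	local deadline=$(( $(date +%s) + timeout_sec ))

	while true; do
		# Succeed as soon as the predicate passes.
		if "$@"; then
			return 0
		fi
		# Give up once the deadline has been reached.
		if [ "$(date +%s)" -ge "$deadline" ]; then
			return 1
		fi
		sleep 0.1
	done
}

# addr_gone DEV PREFIX: hypothetical predicate, succeeds once no address
# containing PREFIX is listed on DEV.
addr_gone()
{
	local dev=$1
	local prefix=$2

	! ip addr show dev "$dev" 2>/dev/null | grep -qF "$prefix"
}
```

In the test this would be called roughly as `poll_until 5 addr_gone "$devdummy" "10.23.11."`. The predicate is evaluated at least once, so a timeout of 0 degrades to a single check.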
On Fri, 11 Jul 2025 02:14:03 +0000 Hangbin Liu wrote:
> > 	sleep 5
> > +	# Schedule out for a bit, address GC runs from the power efficient WQ
> > +	# if the long sleep above has put the whole system into sleep state
> > +	# the WQ may have not had a chance to run.
> > +	sleep 0.1
>
> How about using slowwait to check if the address still exists?
Weirdly, if we read the addresses twice they disappear. I haven't looked into the code to see why, but it seems like using slowwait could potentially mask the addresses sticking around when nobody runs the Netlink handlers for a while? Dunno..
I queued this debug patch a couple of months ago:
 	sleep 5
-	run_cmd_grep_fail "10.23.11." ip addr show dev "$devdummy"
+	ip addr show dev "$devdummy" > /tmp/a
+	run_cmd_grep_fail "10.23.11." cat /tmp/a
 	if [ $? -eq 0 ]; then
-		check_err 1
-		end_test "FAIL: preferred_lft addresses remaining"
+		check_err 1
+		cat /tmp/a
+		echo "==="
+		ip addr show dev "$devdummy"
+		end_test "FAIL: preferred_lft addresses remaining ($lft)"
 		return
 	fi
And when it flakes the output looks like this:
# 7.23 [+7.00] 297: test-dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
# 7.23 [+0.00]     link/ether 9e:a6:c4:c2:1b:16 brd ff:ff:ff:ff:ff:ff
# 7.23 [+0.00]     inet 10.23.11.81/32 scope global deprecated dynamic test-dummy0
# 7.23 [+0.00]        valid_lft 0sec preferred_lft 0sec
# 7.23 [+0.00]     inet 10.23.11.84/32 scope global deprecated dynamic test-dummy0
# 7.24 [+0.00]        valid_lft 0sec preferred_lft 0sec
# 7.24 [+0.00]     inet 10.23.11.93/32 scope global deprecated dynamic test-dummy0
# 7.24 [+0.00]        valid_lft 0sec preferred_lft 0sec
# 7.24 [+0.00]     inet 10.23.11.94/32 scope global deprecated dynamic test-dummy0
# 7.24 [+0.00]        valid_lft 0sec preferred_lft 0sec
# 7.24 [+0.00]     inet 10.23.11.97/32 scope global deprecated dynamic test-dummy0
# 7.24 [+0.00]        valid_lft 0sec preferred_lft 0sec
# 7.24 [+0.00]     inet 10.23.11.99/32 scope global deprecated dynamic test-dummy0
# 7.24 [+0.00]        valid_lft 0sec preferred_lft 0sec
# 7.24 [+0.00]     inet6 fe80::9ca6:c4ff:fec2:1b16/64 scope link proto kernel_ll
# 7.24 [+0.00]        valid_lft forever preferred_lft forever
# 7.24 [+0.00] ===
# 7.25 [+0.00] 297: test-dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
# 7.25 [+0.00]     link/ether 9e:a6:c4:c2:1b:16 brd ff:ff:ff:ff:ff:ff
# 7.25 [+0.00]     inet6 fe80::9ca6:c4ff:fec2:1b16/64 scope link proto kernel_ll
# 7.25 [+0.00]        valid_lft forever preferred_lft forever
# 7.25 [+0.00] FAIL: preferred_lft addresses remaining (1)
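
The debug patch above captures the listing once, so the check and the printed dump see the same data, and then reads the device a second time for comparison. A generic sketch of that snapshot-then-recheck pattern is below. `snapshot_check` is a hypothetical helper name, and the use of `mktemp` instead of a fixed `/tmp/a` is a choice made for this sketch, not something taken from the patch.

```shell
#!/bin/sh
# snapshot_check PATTERN CMD...: hypothetical helper, not part of rtnetlink.sh.
# Runs CMD once, saves its output, and searches the saved copy for PATTERN
# (fixed string). If PATTERN is found, prints the snapshot, a "===" separator,
# and the output of a second run of CMD, then returns 1. Otherwise returns 0.
snapshot_check()
{
	local pattern=$1; shift
	local snap

	snap=$(mktemp) || return 2

	# First read: everything below inspects this saved copy.
	"$@" > "$snap" 2>&1
	if grep -qF -- "$pattern" "$snap"; then
		cat "$snap"
		echo "==="
		# Second read: shows whether the state changed in between.
		"$@"
		rm -f "$snap"
		return 1
	fi

	rm -f "$snap"
	return 0
}
```

A call along the lines of `snapshot_check "10.23.11." ip addr show dev "$devdummy"` would print both reads only on failure, which is what makes the "present on first read, gone on second" behaviour visible in the log.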
On Fri, Jul 11, 2025 at 07:17:29AM -0700, Jakub Kicinski wrote:
> On Fri, 11 Jul 2025 02:14:03 +0000 Hangbin Liu wrote:
> > > 	sleep 5
> > > +	# Schedule out for a bit, address GC runs from the power efficient WQ
> > > +	# if the long sleep above has put the whole system into sleep state
> > > +	# the WQ may have not had a chance to run.
> > > +	sleep 0.1
> >
> > How about using slowwait to check if the address still exists?
>
> Weirdly, if we read the addresses twice they disappear. I haven't looked into the code to see why, but it seems like using slowwait could potentially mask the addresses sticking around when nobody runs the Netlink handlers for a while? Dunno..
Not sure if I understand correctly. Do you mean the addresses will stay there if we use slowwait?

Thanks
Hangbin
> I queued this debug patch a couple of months ago:
>
>  	sleep 5
> -	run_cmd_grep_fail "10.23.11." ip addr show dev "$devdummy"
> +	ip addr show dev "$devdummy" > /tmp/a
> +	run_cmd_grep_fail "10.23.11." cat /tmp/a
>  	if [ $? -eq 0 ]; then
> -		check_err 1
> -		end_test "FAIL: preferred_lft addresses remaining"
> +		check_err 1
> +		cat /tmp/a
> +		echo "==="
> +		ip addr show dev "$devdummy"
> +		end_test "FAIL: preferred_lft addresses remaining ($lft)"
>  		return
>  	fi
> And when it flakes the output looks like this:
>
> # 7.23 [+7.00] 297: test-dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
> # 7.23 [+0.00]     link/ether 9e:a6:c4:c2:1b:16 brd ff:ff:ff:ff:ff:ff
> # 7.23 [+0.00]     inet 10.23.11.81/32 scope global deprecated dynamic test-dummy0
> # 7.23 [+0.00]        valid_lft 0sec preferred_lft 0sec
> # 7.23 [+0.00]     inet 10.23.11.84/32 scope global deprecated dynamic test-dummy0
> # 7.24 [+0.00]        valid_lft 0sec preferred_lft 0sec
> # 7.24 [+0.00]     inet 10.23.11.93/32 scope global deprecated dynamic test-dummy0
> # 7.24 [+0.00]        valid_lft 0sec preferred_lft 0sec
> # 7.24 [+0.00]     inet 10.23.11.94/32 scope global deprecated dynamic test-dummy0
> # 7.24 [+0.00]        valid_lft 0sec preferred_lft 0sec
> # 7.24 [+0.00]     inet 10.23.11.97/32 scope global deprecated dynamic test-dummy0
> # 7.24 [+0.00]        valid_lft 0sec preferred_lft 0sec
> # 7.24 [+0.00]     inet 10.23.11.99/32 scope global deprecated dynamic test-dummy0
> # 7.24 [+0.00]        valid_lft 0sec preferred_lft 0sec
> # 7.24 [+0.00]     inet6 fe80::9ca6:c4ff:fec2:1b16/64 scope link proto kernel_ll
> # 7.24 [+0.00]        valid_lft forever preferred_lft forever
> # 7.24 [+0.00] ===
> # 7.25 [+0.00] 297: test-dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
> # 7.25 [+0.00]     link/ether 9e:a6:c4:c2:1b:16 brd ff:ff:ff:ff:ff:ff
> # 7.25 [+0.00]     inet6 fe80::9ca6:c4ff:fec2:1b16/64 scope link proto kernel_ll
> # 7.25 [+0.00]        valid_lft forever preferred_lft forever
> # 7.25 [+0.00] FAIL: preferred_lft addresses remaining (1)
On Mon, 14 Jul 2025 07:19:09 +0000 Hangbin Liu wrote:
> > > How about using slowwait to check if the address still exists?
> >
> > Weirdly, if we read the addresses twice they disappear. I haven't looked into the code to see why, but it seems like using slowwait could potentially mask the addresses sticking around when nobody runs the Netlink handlers for a while? Dunno..
>
> Not sure if I understand correctly. Do you mean the addresses will stay there if we use slowwait?
No, I mean there may be false negatives, not false positives. But maybe it's fine; it will definitely prevent flakes. Could you post the slowwait patch officially?