We keep seeing flakes in packetdrill on debug kernels, while non-debug kernels are stable: not a single flake in 200 runs. Time to give up; debug kernels appear to suffer from 10 msec latency spikes, and any timing-sensitive test is bound to flake.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: shuah@kernel.org
CC: willemb@google.com
CC: matttbe@kernel.org
CC: linux-kselftest@vger.kernel.org
---
 .../selftests/net/packetdrill/ksft_runner.sh | 19 +------------------
 1 file changed, 1 insertion(+), 18 deletions(-)
diff --git a/tools/testing/selftests/net/packetdrill/ksft_runner.sh b/tools/testing/selftests/net/packetdrill/ksft_runner.sh
index c5b01e1bd4c7..a7e790af38ff 100755
--- a/tools/testing/selftests/net/packetdrill/ksft_runner.sh
+++ b/tools/testing/selftests/net/packetdrill/ksft_runner.sh
@@ -35,24 +35,7 @@ failfunc=ktap_test_fail
 
 if [[ -n "${KSFT_MACHINE_SLOW}" ]]; then
 	optargs+=('--tolerance_usecs=14000')
-
-	# xfail tests that are known flaky with dbg config, not fixable.
-	# still run them for coverage (and expect 100% pass without dbg).
-	declare -ar xfail_list=(
-		"tcp_blocking_blocking-connect.pkt"
-		"tcp_blocking_blocking-read.pkt"
-		"tcp_eor_no-coalesce-retrans.pkt"
-		"tcp_fast_recovery_prr-ss.*.pkt"
-		"tcp_sack_sack-route-refresh-ip-tos.pkt"
-		"tcp_slow_start_slow-start-after-win-update.pkt"
-		"tcp_timestamping.*.pkt"
-		"tcp_user_timeout_user-timeout-probe.pkt"
-		"tcp_zerocopy_cl.*.pkt"
-		"tcp_zerocopy_epoll_.*.pkt"
-		"tcp_tcp_info_tcp-info-.*-limited.pkt"
-	)
-	readonly xfail_regex="^($(printf '%s|' "${xfail_list[@]}"))$"
-	[[ "$script" =~ ${xfail_regex} ]] && failfunc=ktap_test_xfail
+	failfunc=ktap_test_xfail
 fi
 
 ktap_print_header
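After this change, the slow-machine branch no longer matches the script name against an xfail list: on a machine flagged with KSFT_MACHINE_SLOW, every failing test is reported as XFAIL rather than FAIL. A minimal standalone sketch of the resulting selection logic (`pick_failfunc` is a hypothetical helper for illustration; the real runner sets the `failfunc` variable inline, as in the diff above):

```shell
#!/bin/bash
# Sketch of the simplified reporting logic in ksft_runner.sh:
# if the environment marks the machine as slow, any failure is
# treated as an expected failure (XFAIL) instead of a hard FAIL.
pick_failfunc() {
	local slow="$1"
	if [[ -n "$slow" ]]; then
		echo ktap_test_xfail   # slow machine: flakes are expected
	else
		echo ktap_test_fail    # normal machine: failures are real
	fi
}

pick_failfunc ""      # -> ktap_test_fail
pick_failfunc "yes"   # -> ktap_test_xfail
```

The trade-off is the one discussed in the replies below: coverage is preserved (the tests still run), but timing-related failures on debug kernels no longer turn the run red.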
Jakub Kicinski wrote:
> We keep seeing flakes in packetdrill on debug kernels, while non-debug kernels are stable: not a single flake in 200 runs. Time to give up; debug kernels appear to suffer from 10 msec latency spikes, and any timing-sensitive test is bound to flake.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
On Fri, 01 Aug 2025 17:00:35 -0400 Willem de Bruijn wrote:
> Jakub Kicinski wrote:
>> We keep seeing flakes in packetdrill on debug kernels, while non-debug kernels are stable: not a single flake in 200 runs. Time to give up; debug kernels appear to suffer from 10 msec latency spikes, and any timing-sensitive test is bound to flake.
>>
>> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
>
> Reviewed-by: Willem de Bruijn <willemb@google.com>
I should have added "Willem was right" 'cause you suggested this a while back. But didn't know how to phrase it in the commit msg :)
Jakub Kicinski wrote:
> On Fri, 01 Aug 2025 17:00:35 -0400 Willem de Bruijn wrote:
>> Jakub Kicinski wrote:
>>> We keep seeing flakes in packetdrill on debug kernels, while non-debug kernels are stable: not a single flake in 200 runs. Time to give up; debug kernels appear to suffer from 10 msec latency spikes, and any timing-sensitive test is bound to flake.
>>>
>>> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
>>
>> Reviewed-by: Willem de Bruijn <willemb@google.com>
>
> I should have added "Willem was right" 'cause you suggested this a while back. But didn't know how to phrase it in the commit msg :)
Ha, did I? I was hoping that the short allow-list would work. But if latency spikes can happen anytime, then that clearly does not.
Hi Jakub, Willem,
On 01/08/2025 20:16, Jakub Kicinski wrote:
> We keep seeing flakes in packetdrill on debug kernels, while non-debug kernels are stable: not a single flake in 200 runs. Time to give up; debug kernels appear to suffer from 10 msec latency spikes, and any timing-sensitive test is bound to flake.
Thank you for the patch!
Another solution might be to increase the tolerance, but I don't think it will fix all issues. I quickly looked at the last 100 runs, and I think most failures might be fixed by a higher tolerance, e.g.
# tcp_ooo-before-and-after-accept.pkt:19: timing error: expected inbound packet at 0.101619 sec but happened at 0.115894 sec; tolerance 0.014000 sec
(0.275ms above the limit!)
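For clarity, the 0.275 ms figure follows directly from the error message: the packet arrived 0.115894 - 0.101619 = 0.014275 s after the expected time, against a tolerance of 0.014000 s. A quick standalone check (not part of the selftest; awk is used because shell arithmetic is integer-only):

```shell
# Values taken from the packetdrill timing error above.
expected=0.101619   # seconds: when the inbound packet was expected
actual=0.115894     # seconds: when it actually arrived
tolerance=0.014000  # seconds: --tolerance_usecs=14000

# Compute how far past the tolerance the arrival was, in milliseconds.
overshoot_ms=$(awk -v e="$expected" -v a="$actual" -v t="$tolerance" \
	'BEGIN { printf "%.3f", ((a - e) - t) * 1000 }')
echo "exceeded tolerance by ${overshoot_ms} ms"
```

So this particular failure missed the current 14 ms tolerance by a hair, which is why Matthieu suggests a larger tolerance would absorb most, but not all, of these flakes.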
On MPTCP, we used to have a very high tolerance with debug kernels (>0.5s) when public CIs were very limited in terms of CPU resources. I guess having a tolerance of 0.1s would be enough, but for these MPTCP packetdrill tests, I put 0.2s for the tolerance with a debug kernel, just to be on the safe side.
Still, I think increasing the tolerance would not fix all issues. On the MPTCP side, the latency introduced by a debug kernel caused unexpected retransmissions due to a too-low RTO. I took the time to make sure injected packets were always sent with enough delay, but for the TCP packetdrill tests here, that is possibly not enough, judging by some recent errors, e.g.
tcp_zerocopy_batch.pkt:26: error handling packet: live packet payload: expected 4000 bytes vs actual 5000 bytes
In the end, and as previously mentioned, these adaptations for debug kernels are perhaps not worth it: in this environment, it is probably enough to ignore packetdrill results and focus on kernel warnings.
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Cheers, Matt
Hello:
This patch was applied to netdev/net.git (main) by Jakub Kicinski <kuba@kernel.org>:
On Fri, 1 Aug 2025 11:16:38 -0700 you wrote:
> We keep seeing flakes in packetdrill on debug kernels, while non-debug kernels are stable: not a single flake in 200 runs. Time to give up; debug kernels appear to suffer from 10 msec latency spikes, and any timing-sensitive test is bound to flake.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
[...]
Here is the summary with links:
  - [net] selftests: net: packetdrill: xfail all problems on slow machines
    https://git.kernel.org/netdev/net/c/5ef7fdf52c0f
You are awesome, thank you!