Changes in v2:
- Dropped "selftests/net: Clean-up double assignment", going to send it to net-next with other changes (Simon)
- Added a patch to rectify RST selftests.
- Link to v1: https://lore.kernel.org/r/20240118-tcp-ao-test-key-mgmt-v1-0-3583ca147113@ar...
Two typo fixes, noticed by Mohammad's review. And a fix for an issue that got uncovered.
Signed-off-by: Dmitry Safonov dima@arista.com
---
Dmitry Safonov (2):
      selftests/net: Rectify key counters checks
      selftests/net: Repair RST passive reset selftest

Mohammad Nassiri (1):
      selftests/net: Argument value mismatch when calling verify_counters()

 .../testing/selftests/net/tcp_ao/key-management.c |  46 ++++---
 tools/testing/selftests/net/tcp_ao/lib/sock.c      |  12 +-
 tools/testing/selftests/net/tcp_ao/rst.c           | 138 ++++++++++++++-------
 3 files changed, 124 insertions(+), 72 deletions(-)
---
base-commit: ecb1b8288dc7ccbdcb3b9df005fa1c0e0c0388a7
change-id: 20240118-tcp-ao-test-key-mgmt-bb51a5fe15a2
Best regards,
From: Mohammad Nassiri mnassiri@ciena.com
The end_server() function only operates in the server thread and always takes an accept socket instead of a listen socket as its input argument. To align with this, invert the boolean values used when calling verify_counters() within the end_server() function.
As a result of this typo, the test didn't correctly check the asymmetrical scenario, where e.g. peer-A uses a key <100:200> to send data, but peer-B uses another key <105:205> to send its data. In other words, different keys for TX and RX.
Fixes: 3c3ead555648 ("selftests/net: Add TCP-AO key-management test")
Signed-off-by: Mohammad Nassiri mnassiri@ciena.com
Link: https://lore.kernel.org/all/934627c5-eebb-4626-be23-cfb134c01d1a@arista.com/
[amended 'Fixes' tag, added the issue description and carried-over to lkml]
Signed-off-by: Dmitry Safonov dima@arista.com
---
 tools/testing/selftests/net/tcp_ao/key-management.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/tcp_ao/key-management.c b/tools/testing/selftests/net/tcp_ao/key-management.c
index c48b4970ca17..f6a9395e3cd7 100644
--- a/tools/testing/selftests/net/tcp_ao/key-management.c
+++ b/tools/testing/selftests/net/tcp_ao/key-management.c
@@ -843,7 +843,7 @@ static void end_server(const char *tst_name, int sk,
 	synchronize_threads(); /* 4: verified => closed */
 	close(sk);
 
-	verify_counters(tst_name, true, false, begin, &end);
+	verify_counters(tst_name, false, true, begin, &end);
 	synchronize_threads(); /* 5: counters */
 }
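To make the swapped arguments concrete, here is a minimal sketch of how the pre-patch verify_counters() chose a per-key flag from its two booleans. Only the parameter order (is_listen_sk, then server) and the three flag names are taken from the diffs in this series; the enum and pick_flag() are hypothetical names used for illustration only.

```c
#include <stdbool.h>

/* Hypothetical labels for the three pre-patch flags in struct test_key */
enum used_flag {
	USED_ON_CLIENT,		/* used_on_client */
	USED_ON_HANDSHAKE,	/* used_on_handshake */
	USED_AFTER_ACCEPT,	/* used_after_accept */
};

/* Mirrors the old branch structure: the client ignores is_listen_sk */
static enum used_flag pick_flag(bool is_listen_sk, bool server)
{
	if (!server)
		return USED_ON_CLIENT;
	return is_listen_sk ? USED_ON_HANDSHAKE : USED_AFTER_ACCEPT;
}
```

With the old call, pick_flag(true, false), the server thread ended up consulting the client-side flag. The corrected call, pick_flag(false, true), selects the flag meant for an accept socket on the server.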
As the names of (struct test_key) members didn't reflect whether the key was used for TX or RX, the verification for the counters was done incorrectly for asymmetrical selftests.
Rename these with a _tx suffix and fix the checks in verify_counters(). While at it, as the checks are now correct, introduce skip_counters_checks, which is intended for tests where a key that was set with setsockopt(sk, IPPROTO_TCP, TCP_AO_INFO, ...) might have had no chance of getting used on the wire.
Fixes the following failures, exposed by the previous commit:
not ok 51 server: Check current != rnext keys set before connect(): Counter pkt_good was expected to increase 0 => 0 for key 132:5
not ok 52 server: Check current != rnext keys set before connect(): Counter pkt_good was not expected to increase 0 => 21 for key 137:10

not ok 63 server: Check current flapping back on peer's RnextKey request: Counter pkt_good was expected to increase 0 => 0 for key 132:5
not ok 64 server: Check current flapping back on peer's RnextKey request: Counter pkt_good was not expected to increase 0 => 40 for key 137:10
Cc: Mohammad Nassiri mnassiri@ciena.com
Fixes: 3c3ead555648 ("selftests/net: Add TCP-AO key-management test")
Signed-off-by: Dmitry Safonov dima@arista.com
---
 .../testing/selftests/net/tcp_ao/key-management.c | 44 ++++++++++++----------
 1 file changed, 25 insertions(+), 19 deletions(-)
diff --git a/tools/testing/selftests/net/tcp_ao/key-management.c b/tools/testing/selftests/net/tcp_ao/key-management.c
index f6a9395e3cd7..24e62120b792 100644
--- a/tools/testing/selftests/net/tcp_ao/key-management.c
+++ b/tools/testing/selftests/net/tcp_ao/key-management.c
@@ -417,9 +417,9 @@ struct test_key {
 		matches_vrf		: 1,
 		is_current		: 1,
 		is_rnext		: 1,
-		used_on_handshake	: 1,
-		used_after_accept	: 1,
-		used_on_client		: 1;
+		used_on_server_tx	: 1,
+		used_on_client_tx	: 1,
+		skip_counters_checks	: 1;
 };
 
 struct key_collection {
@@ -609,16 +609,14 @@ static int key_collection_socket(bool server, unsigned int port)
 			addr = &this_ip_dest;
 			sndid = key->client_keyid;
 			rcvid = key->server_keyid;
-			set_current = key->is_current;
-			set_rnext = key->is_rnext;
+			key->used_on_client_tx = set_current = key->is_current;
+			key->used_on_server_tx = set_rnext = key->is_rnext;
 		}
 
 		if (test_add_key_cr(sk, key->password, key->len,
 				    *addr, vrf, sndid, rcvid, key->maclen,
 				    key->alg, set_current, set_rnext))
 			test_key_error("setsockopt(TCP_AO_ADD_KEY)", key);
-		if (set_current || set_rnext)
-			key->used_on_handshake = 1;
 #ifdef DEBUG
 		test_print("%s [%u/%u] key: { %s, %u:%u, %u, %u:%u:%u:%u (%u)}",
 			   server ? "server" : "client", i, collection.nr_keys,
@@ -640,22 +638,22 @@ static void verify_counters(const char *tst_name, bool is_listen_sk, bool server
 	for (i = 0; i < collection.nr_keys; i++) {
 		struct test_key *key = &collection.keys[i];
 		uint8_t sndid, rcvid;
-		bool was_used;
+		bool rx_cnt_expected;
 
+		if (key->skip_counters_checks)
+			continue;
 		if (server) {
 			sndid = key->server_keyid;
 			rcvid = key->client_keyid;
-			if (is_listen_sk)
-				was_used = key->used_on_handshake;
-			else
-				was_used = key->used_after_accept;
+			rx_cnt_expected = key->used_on_client_tx;
 		} else {
 			sndid = key->client_keyid;
 			rcvid = key->server_keyid;
-			was_used = key->used_on_client;
+			rx_cnt_expected = key->used_on_server_tx;
 		}
 
-		test_tcp_ao_key_counters_cmp(tst_name, a, b, was_used,
+		test_tcp_ao_key_counters_cmp(tst_name, a, b,
+					     rx_cnt_expected ? TEST_CNT_KEY_GOOD : 0,
 					     sndid, rcvid);
 	}
 	test_tcp_ao_counters_free(a);
@@ -916,9 +914,8 @@ static int run_client(const char *tst_name, unsigned int port,
 		current_index = nr_keys - 1;
 	if (rnext_index < 0)
 		rnext_index = nr_keys - 1;
-	collection.keys[current_index].used_on_handshake = 1;
-	collection.keys[rnext_index].used_after_accept = 1;
-	collection.keys[rnext_index].used_on_client = 1;
+	collection.keys[current_index].used_on_client_tx = 1;
+	collection.keys[rnext_index].used_on_server_tx = 1;
 
 	synchronize_threads(); /* 3: accepted => send data */
 	if (test_client_verify(sk, msg_sz, msg_nr, TEST_TIMEOUT_SEC)) {
@@ -1059,7 +1056,16 @@ static void check_current_back(const char *tst_name, unsigned int port,
 		test_error("Can't change the current key");
 	if (test_client_verify(sk, msg_len, nr_packets, TEST_TIMEOUT_SEC))
 		test_fail("verify failed");
-	collection.keys[rotate_to_index].used_after_accept = 1;
+	/* There is a race here: between setting the current_key with
+	 * setsockopt(TCP_AO_INFO) and starting to send some data - there
+	 * might have been a segment received with the desired
+	 * RNext_key set. In turn that would mean that the first outgoing
+	 * segment will have the desired current_key (flipped back).
+	 * Which is what the user/test wants. As it's racy, skip checking
+	 * the counters, yet check what are the resulting current/rnext
+	 * keys on both sides.
+	 */
+	collection.keys[rotate_to_index].skip_counters_checks = 1;
 
 	end_client(tst_name, sk, nr_keys, current_index, rnext_index, &tmp);
 }
@@ -1089,7 +1095,7 @@ static void roll_over_keys(const char *tst_name, unsigned int port,
 		}
 		verify_current_rnext(tst_name, sk, -1,
 				     collection.keys[i].server_keyid);
-		collection.keys[i].used_on_client = 1;
+		collection.keys[i].used_on_server_tx = 1;
 		synchronize_threads(); /* verify current/rnext */
 	}
 	end_client(tst_name, sk, nr_keys, current_index, rnext_index, &tmp);
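The rule the patch settles on can be sketched in isolation: a side's RX counter for a key should grow only if the peer used that key for TX. The field names used_on_client_tx, used_on_server_tx and skip_counters_checks come from the diff; struct key_sketch and the three-way return value are assumptions made for this standalone illustration, not the selftest's actual code.

```c
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-in for struct test_key */
struct key_sketch {
	bool used_on_client_tx;		/* client signs outgoing segments with it */
	bool used_on_server_tx;		/* server signs outgoing segments with it */
	bool skip_counters_checks;	/* racy test: do not check this key */
};

/*
 * Returns 1 if the RX "good" counter is expected to increase,
 * 0 if it is expected to stay put, -1 if the key is skipped.
 */
static int rx_cnt_expected(const struct key_sketch *key, bool server)
{
	if (key->skip_counters_checks)
		return -1;
	/* What one side receives is what the other side transmits */
	return server ? key->used_on_client_tx : key->used_on_server_tx;
}
```

For the asymmetrical case from patch 1 (one key for the client's TX, another for the server's TX), the server expects RX hits only on the client's TX key, and the client only on the server's TX key.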
Currently, the test is racy and seems to not pass anymore.
In order to rectify it, aim at TCP_TW_RST. The sleep() part doesn't look too good, but it seems a reasonable compromise for the test. There is a plan in an in-line comment on how to improve it; I'm going to do that on top. At this moment I want it to run on the netdev/patchwork selftests dashboard.
It also slightly changes tcp_ao-lib in order to get SO_ERROR propagated to test_client_verify() return value.
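As a rough standalone illustration of that propagation, the sketch below separates a failing getsockopt() from a pending socket error and passes negative error codes through to the caller. getsockopt(SO_ERROR) is the standard socket API; sock_pending_error() and verify_result() are hypothetical helper names, not functions from tcp_ao-lib.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* 0 if no error is pending, otherwise a negative errno value */
static int sock_pending_error(int sk)
{
	int ret = 0;
	socklen_t slen = sizeof(ret);

	if (getsockopt(sk, SOL_SOCKET, SO_ERROR, &ret, &slen))
		return -errno;		/* the getsockopt() call itself failed */
	return ret ? -ret : 0;		/* error stored on the socket, if any */
}

/* Hypothetical mapping of a send-loop result to a verify-style return value */
static int verify_result(ssize_t sent, size_t expected)
{
	if (sent < 0)
		return (int)sent;	/* keep e.g. -ECONNRESET visible to the caller */
	return (size_t)sent != expected ? -1 : 0;
}
```

Keeping the negative errno intact is what lets a caller tell a reset (-ECONNRESET) apart from a timeout or a short write.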
Fixes: c6df7b2361d7 ("selftests/net: Add TCP-AO RST test")
Signed-off-by: Dmitry Safonov dima@arista.com
---
 tools/testing/selftests/net/tcp_ao/lib/sock.c |  12 ++-
 tools/testing/selftests/net/tcp_ao/rst.c      | 138 +++++++++++++++++---------
 2 files changed, 98 insertions(+), 52 deletions(-)
diff --git a/tools/testing/selftests/net/tcp_ao/lib/sock.c b/tools/testing/selftests/net/tcp_ao/lib/sock.c
index c75d82885a2e..15aeb0963058 100644
--- a/tools/testing/selftests/net/tcp_ao/lib/sock.c
+++ b/tools/testing/selftests/net/tcp_ao/lib/sock.c
@@ -62,7 +62,9 @@ int test_wait_fd(int sk, time_t sec, bool write)
 		return -ETIMEDOUT;
 	}
 
-	if (getsockopt(sk, SOL_SOCKET, SO_ERROR, &ret, &slen) || ret)
+	if (getsockopt(sk, SOL_SOCKET, SO_ERROR, &ret, &slen))
+		return -errno;
+	if (ret)
 		return -ret;
 	return 0;
 }
@@ -584,9 +586,11 @@ int test_client_verify(int sk, const size_t msg_len, const size_t nr,
 {
 	size_t buf_sz = msg_len * nr;
 	char *buf = alloca(buf_sz);
+	ssize_t ret;
 
 	randomize_buffer(buf, buf_sz);
-	if (test_client_loop(sk, buf, buf_sz, msg_len, timeout_sec) != buf_sz)
-		return -1;
-	return 0;
+	ret = test_client_loop(sk, buf, buf_sz, msg_len, timeout_sec);
+	if (ret < 0)
+		return (int)ret;
+	return ret != buf_sz ? -1 : 0;
 }
diff --git a/tools/testing/selftests/net/tcp_ao/rst.c b/tools/testing/selftests/net/tcp_ao/rst.c
index ac06009a7f5f..7df8b8700e39 100644
--- a/tools/testing/selftests/net/tcp_ao/rst.c
+++ b/tools/testing/selftests/net/tcp_ao/rst.c
@@ -1,10 +1,33 @@
 // SPDX-License-Identifier: GPL-2.0
-/* Author: Dmitry Safonov dima@arista.com */
+/*
+ * The test checks that both active and passive reset have correct TCP-AO
+ * signature. An "active" reset (abort) here is procured from closing
+ * listen() socket with non-accepted connections in the queue:
+ * inet_csk_listen_stop() => inet_child_forget() =>
+ *                        => tcp_disconnect() => tcp_send_active_reset()
+ *
+ * The passive reset is quite hard to get on established TCP connections.
+ * It could be procured from non-established states, but the synchronization
+ * part from userspace in order to reliably get RST seems uneasy.
+ * So, instead it's procured by corrupting SEQ number on TIMED-WAIT state.
+ *
+ * It's important to test both passive and active RST as they go through
+ * different code-paths:
+ * - tcp_send_active_reset() makes no-data skb, sends it with tcp_transmit_skb()
+ * - tcp_v*_send_reset() create their reply skbs and send them with
+ *   ip_send_unicast_reply()
+ *
+ * In both cases TCP-AO signatures have to be correct, which is verified by
+ * (1) checking that the TCP-AO connection was reset and (2) TCP-AO counters.
+ *
+ * Author: Dmitry Safonov dima@arista.com
+ */
 #include <inttypes.h>
 #include "../../../../include/linux/kernel.h"
 #include "aolib.h"
 
 const size_t quota = 1000;
+const size_t packet_sz = 100;
 /*
  * Backlog == 0 means 1 connection in queue, see:
  * commit 64a146513f8f ("[NET]: Revert incorrect accept queue...")
@@ -59,26 +82,6 @@ static void close_forced(int sk)
 	close(sk);
 }
 
-static int test_wait_for_exception(int sk, time_t sec)
-{
-	struct timeval tv = { .tv_sec = sec };
-	struct timeval *ptv = NULL;
-	fd_set efds;
-	int ret;
-
-	FD_ZERO(&efds);
-	FD_SET(sk, &efds);
-
-	if (sec)
-		ptv = &tv;
-
-	errno = 0;
-	ret = select(sk + 1, NULL, NULL, &efds, ptv);
-	if (ret < 0)
-		return -errno;
-	return ret ? sk : 0;
-}
-
 static void test_server_active_rst(unsigned int port)
 {
 	struct tcp_ao_counters cnt1, cnt2;
@@ -155,17 +158,16 @@ static void test_server_passive_rst(unsigned int port)
 		test_fail("server returned %zd", bytes);
 	}
 
-	synchronize_threads(); /* 3: chekpoint/restore the connection */
+	synchronize_threads(); /* 3: checkpoint the client */
+	synchronize_threads(); /* 4: close the server, creating twsk */
 	if (test_get_tcp_ao_counters(sk, &ao2))
 		test_error("test_get_tcp_ao_counters()");
-
-	synchronize_threads(); /* 4: terminate server + send more on client */
-	bytes = test_server_run(sk, quota, TEST_RETRANSMIT_SEC);
 	close(sk);
+
+	synchronize_threads(); /* 5: restore the socket, send more data */
 	test_tcp_ao_counters_cmp("passive RST server", &ao1, &ao2, TEST_CNT_GOOD);
 
-	synchronize_threads(); /* 5: verified => closed */
-	close(sk);
+	synchronize_threads(); /* 6: server exits */
 }
 
 static void *server_fn(void *arg)
@@ -284,7 +286,7 @@ static void test_client_active_rst(unsigned int port)
 		test_error("test_wait_fds(): %d", err);
 
 	synchronize_threads(); /* 3: close listen socket */
-	if (test_client_verify(sk[0], 100, quota / 100, TEST_TIMEOUT_SEC))
+	if (test_client_verify(sk[0], packet_sz, quota / packet_sz, TEST_TIMEOUT_SEC))
 		test_fail("Failed to send data on connected socket");
 	else
 		test_ok("Verified established tcp connection");
@@ -323,7 +325,6 @@ static void test_client_passive_rst(unsigned int port)
 	struct tcp_sock_state img;
 	sockaddr_af saddr;
 	int sk, err;
-	socklen_t slen = sizeof(err);
 
 	sk = socket(test_family, SOCK_STREAM, IPPROTO_TCP);
 	if (sk < 0)
@@ -337,18 +338,51 @@ static void test_client_passive_rst(unsigned int port)
 		test_error("failed to connect()");
 
 	synchronize_threads(); /* 2: accepted => send data */
-	if (test_client_verify(sk, 100, quota / 100, TEST_TIMEOUT_SEC))
+	if (test_client_verify(sk, packet_sz, quota / packet_sz, TEST_TIMEOUT_SEC))
 		test_fail("Failed to send data on connected socket");
 	else
 		test_ok("Verified established tcp connection");
 
-	synchronize_threads(); /* 3: chekpoint/restore the connection */
+	synchronize_threads(); /* 3: checkpoint the client */
 	test_enable_repair(sk);
 	test_sock_checkpoint(sk, &img, &saddr);
 	test_ao_checkpoint(sk, &ao_img);
-	test_kill_sk(sk);
+	test_disable_repair(sk);
 
-	img.out.seq += quota;
+	synchronize_threads(); /* 4: close the server, creating twsk */
+
+	/*
+	 * The "corruption" in SEQ has to be small enough to fit into TCP
+	 * window, see tcp_timewait_state_process() for out-of-window
+	 * segments.
+	 */
+	img.out.seq += 5; /* 5 is more noticeable in tcpdump than 1 */
+
+	/*
+	 * FIXME: This is kind-of ugly and dirty, but it works.
+	 *
+	 * At this moment, the server has close'ed(sk).
+	 * The passive RST that is being targeted here is new data after
+	 * half-duplex close, see tcp_timewait_state_process() => TCP_TW_RST
+	 *
+	 * What is needed here is:
+	 * (1) wait for FIN from the server
+	 * (2) make sure that the ACK from the client went out
+	 * (3) make sure that the ACK was received and processed by the server
+	 *
+	 * Otherwise, the data that will be sent from "repaired" socket
+	 * post SEQ corruption may get to the server before it's in
+	 * TCP_FIN_WAIT2.
+	 *
+	 * (1) is easy with select()/poll()
+	 * (2) is possible by polling tcpi_state from TCP_INFO
+	 * (3) is quite complex: as server's socket was already closed,
+	 *     probably the way to do it would be tcp-diag.
+	 */
+	sleep(TEST_RETRANSMIT_SEC);
+
+	synchronize_threads(); /* 5: restore the socket, send more data */
+	test_kill_sk(sk);
 
 	sk = socket(test_family, SOCK_STREAM, IPPROTO_TCP);
 	if (sk < 0)
@@ -366,25 +400,33 @@ static void test_client_passive_rst(unsigned int port)
 	test_disable_repair(sk);
 	test_sock_state_free(&img);
 
-	synchronize_threads(); /* 4: terminate server + send more on client */
-	if (test_client_verify(sk, 100, quota / 100, 2 * TEST_TIMEOUT_SEC))
-		test_ok("client connection broken post-seq-adjust");
+	/*
+	 * This is how "passive reset" is acquired in this test from TCP_TW_RST:
+	 *
+	 * IP 10.0.254.1.7011 > 10.0.1.1.59772: Flags [P.], seq 901:1001, ack 1001, win 249,
+	 *    options [tcp-ao keyid 100 rnextkeyid 100 mac 0x10217d6c36a22379086ef3b1], length 100
+	 * IP 10.0.254.1.7011 > 10.0.1.1.59772: Flags [F.], seq 1001, ack 1001, win 249,
+	 *    options [tcp-ao keyid 100 rnextkeyid 100 mac 0x104ffc99b98c10a5298cc268], length 0
+	 * IP 10.0.1.1.59772 > 10.0.254.1.7011: Flags [.], ack 1002, win 251,
+	 *    options [tcp-ao keyid 100 rnextkeyid 100 mac 0xe496dd4f7f5a8a66873c6f93,nop,nop,sack 1 {1001:1002}], length 0
+	 * IP 10.0.1.1.59772 > 10.0.254.1.7011: Flags [P.], seq 1006:1106, ack 1001, win 251,
+	 *    options [tcp-ao keyid 100 rnextkeyid 100 mac 0x1b5f3330fb23fbcd0c77d0ca], length 100
+	 * IP 10.0.254.1.7011 > 10.0.1.1.59772: Flags [R], seq 3215596252, win 0,
+	 *    options [tcp-ao keyid 100 rnextkeyid 100 mac 0x0bcfbbf497bce844312304b2], length 0
+	 */
+	err = test_client_verify(sk, packet_sz, quota / packet_sz, 2 * TEST_TIMEOUT_SEC);
+	/* Make sure that the connection was reset, not timeouted */
+	if (err && err == -ECONNRESET)
+		test_ok("client sock was passively reset post-seq-adjust");
+	else if (err)
+		test_fail("client sock was not reset post-seq-adjust: %d", err);
 	else
-		test_fail("client connection still works post-seq-adjust");
-
-	test_wait_for_exception(sk, TEST_TIMEOUT_SEC);
-
-	if (getsockopt(sk, SOL_SOCKET, SO_ERROR, &err, &slen))
-		test_error("getsockopt()");
-	if (err != ECONNRESET && err != EPIPE)
-		test_fail("client connection was not reset: %d", err);
-	else
-		test_ok("client connection was reset");
+		test_fail("client sock is yet connected post-seq-adjust");
 
 	if (test_get_tcp_ao_counters(sk, &ao2))
 		test_error("test_get_tcp_ao_counters()");
 
-	synchronize_threads(); /* 5: verified => closed */
+	synchronize_threads(); /* 6: server exits */
 	close(sk);
 	test_tcp_ao_counters_cmp("client passive RST", &ao1, &ao2, TEST_CNT_GOOD);
 }
@@ -410,6 +452,6 @@ static void *client_fn(void *arg)
 
 int main(int argc, char *argv[])
 {
-	test_init(15, server_fn, client_fn);
+	test_init(14, server_fn, client_fn);
 	return 0;
 }
On 2/1/24 00:36, Jakub Kicinski wrote:
> On Tue, 30 Jan 2024 03:51:51 +0000 Dmitry Safonov wrote:
>> Two typo fixes, noticed by Mohammad's review. And a fix for an issue that got uncovered.
>
> I can confirm that all tests pass now :) Thank you!
Thanks Jakub!
Please, let me know if there will be other issues with tcp-ao tests :)
Going to work on tracepoints and some other TCP-AO stuff for net-next.
Thanks, Dmitry
On Thu, 1 Feb 2024 00:50:46 +0000 Dmitry Safonov wrote:
> Please, let me know if there will be other issues with tcp-ao tests :)
>
> Going to work on tracepoints and some other TCP-AO stuff for net-next.
Since you're being nice and helpful I figured I'll try testing TCP-AO with debug options enabled :) (kernel/configs/debug.config and kernel/configs/x86_debug.config included), that slows things down and causes a bit of flakiness in unsigned-md5-* tests:
https://netdev.bots.linux.dev/flakes.html?br-cnt=75&tn-needle=tcp-ao
This has links to outputs: https://netdev.bots.linux.dev/contest.html?executor=vmksft-tcp-ao-dbg&pa...
If it's a timing thing - FWIW we started exporting KSFT_MACHINE_SLOW=yes on the slow runners.
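For illustration, a selftest could consume that variable roughly as sketched below. The variable name KSFT_MACHINE_SLOW and the value "yes" come from the message above; scaled_timeout() and the factor of 4 are arbitrary assumptions, not existing selftest code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: stretch a timeout when the runner is marked slow */
static unsigned int scaled_timeout(unsigned int base_sec)
{
	const char *slow = getenv("KSFT_MACHINE_SLOW");

	if (slow && !strcmp(slow, "yes"))
		return base_sec * 4;	/* assumed factor, pick what fits the test */
	return base_sec;
}
```

Stretching timeouts helps when a check merely runs late. It does not help when late segments bump counters that a test expects to stay unchanged.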
Hi Jakub,
On 2/1/24 21:21, Jakub Kicinski wrote:
> On Thu, 1 Feb 2024 00:50:46 +0000 Dmitry Safonov wrote:
>> Please, let me know if there will be other issues with tcp-ao tests :)
>>
>> Going to work on tracepoints and some other TCP-AO stuff for net-next.
>
> Since you're being nice and helpful I figured I'll try testing TCP-AO with debug options enabled :) (kernel/configs/debug.config and kernel/configs/x86_debug.config included),
Haha :)
> that slows things down and causes a bit of flakiness in unsigned-md5-* tests:
>
> https://netdev.bots.linux.dev/flakes.html?br-cnt=75&tn-needle=tcp-ao
>
> This has links to outputs: https://netdev.bots.linux.dev/contest.html?executor=vmksft-tcp-ao-dbg&pa...
>
> If it's a timing thing - FWIW we started exporting KSFT_MACHINE_SLOW=yes on the slow runners.
I think I know what happens here:

# ok 8 AO server (AO_REQUIRED): AO client: counter TCPAOGood increased 4 => 6
# ok 9 AO server (AO_REQUIRED): unsigned client
# ok 10 AO server (AO_REQUIRED): unsigned client: counter TCPAORequired increased 1 => 2
# not ok 11 AO server (AO_REQUIRED): unsigned client: Counter netns_ao_good was not expected to increase 7 => 8

For each of the tests the server listens on a new port, but re-uses the same namespaces+veth. If the node/machine is quite slow, I guess a segment might have been retransmitted after the test that initiated it had already finished. As a result, the per-namespace counters are incremented, which makes the test fail (IOW, the test expects all segments in the ns to be dropped).

So, I should do one of the options:

1. relax per-namespace checks (the per-socket and per-key counters are checked)
2. unshare(net) + veth setup for each test
3. split the selftest into smaller ones (as they create new net-ns in initialization)
I'd probably prefer (2), albeit it slows down that slow machine even more, but I don't think creating 2 net-ns + veth pair per each test would add a lot more overhead even on some rpi board. But let's see, maybe I'll just go with (1) as that's really easy.
I'll cook a patch this week.
Thanks, Dmitry
On 2/1/24 22:25, Dmitry Safonov wrote:
> Hi Jakub,
>
> On 2/1/24 21:21, Jakub Kicinski wrote:
>> On Thu, 1 Feb 2024 00:50:46 +0000 Dmitry Safonov wrote:
>>> Please, let me know if there will be other issues with tcp-ao tests :)
>>>
>>> Going to work on tracepoints and some other TCP-AO stuff for net-next.
>>
>> Since you're being nice and helpful I figured I'll try testing TCP-AO with debug options enabled :) (kernel/configs/debug.config and kernel/configs/x86_debug.config included),
>
> Haha :)
>
>> that slows things down and causes a bit of flakiness in unsigned-md5-* tests:
>>
>> https://netdev.bots.linux.dev/flakes.html?br-cnt=75&tn-needle=tcp-ao
>>
>> This has links to outputs: https://netdev.bots.linux.dev/contest.html?executor=vmksft-tcp-ao-dbg&pa...
>>
>> If it's a timing thing - FWIW we started exporting KSFT_MACHINE_SLOW=yes on the slow runners.
>
> I think I know what happens here:
>
> # ok 8 AO server (AO_REQUIRED): AO client: counter TCPAOGood increased 4 => 6
> # ok 9 AO server (AO_REQUIRED): unsigned client
> # ok 10 AO server (AO_REQUIRED): unsigned client: counter TCPAORequired increased 1 => 2
> # not ok 11 AO server (AO_REQUIRED): unsigned client: Counter netns_ao_good was not expected to increase 7 => 8
>
> For each of the tests the server listens on a new port, but re-uses the same namespaces+veth. If the node/machine is quite slow, I guess a segment might have been retransmitted after the test that initiated it had already finished. As a result, the per-namespace counters are incremented, which makes the test fail (IOW, the test expects all segments in the ns to be dropped).
>
> So, I should do one of the options:
>
> 1. relax per-namespace checks (the per-socket and per-key counters are checked)
> 2. unshare(net) + veth setup for each test
> 3. split the selftest into smaller ones (as they create new net-ns in initialization)
Actually, I think there may be an easier fix:
4. Make sure that client close()s TCP-AO first, making it twsk. And also make sure that net-ns counters are read post server's close().
Will do this, let's see if this fixes the flakiness on the netdev bot :)
> I'd probably prefer (2), albeit it slows down that slow machine even more, but I don't think creating 2 net-ns + veth pair per each test would add a lot more overhead even on some rpi board. But let's see, maybe I'll just go with (1) as that's really easy.
>
> I'll cook a patch this week.
Thanks, Dmitry
On 2/1/24 23:37, Dmitry Safonov wrote:
> On 2/1/24 22:25, Dmitry Safonov wrote:
>> Hi Jakub,
>>
>> On 2/1/24 21:21, Jakub Kicinski wrote:
>>> On Thu, 1 Feb 2024 00:50:46 +0000 Dmitry Safonov wrote:
>>>> Please, let me know if there will be other issues with tcp-ao tests :)
>>>>
>>>> Going to work on tracepoints and some other TCP-AO stuff for net-next.
>>>
>>> Since you're being nice and helpful I figured I'll try testing TCP-AO with debug options enabled :) (kernel/configs/debug.config and kernel/configs/x86_debug.config included),
>>
>> Haha :)
>>
>>> that slows things down and causes a bit of flakiness in unsigned-md5-* tests:
>>>
>>> https://netdev.bots.linux.dev/flakes.html?br-cnt=75&tn-needle=tcp-ao
>>>
>>> This has links to outputs: https://netdev.bots.linux.dev/contest.html?executor=vmksft-tcp-ao-dbg&pa...
>>>
>>> If it's a timing thing - FWIW we started exporting KSFT_MACHINE_SLOW=yes on the slow runners.
>>
>> I think I know what happens here:
>>
>> # ok 8 AO server (AO_REQUIRED): AO client: counter TCPAOGood increased 4 => 6
>> # ok 9 AO server (AO_REQUIRED): unsigned client
>> # ok 10 AO server (AO_REQUIRED): unsigned client: counter TCPAORequired increased 1 => 2
>> # not ok 11 AO server (AO_REQUIRED): unsigned client: Counter netns_ao_good was not expected to increase 7 => 8
>>
>> For each of the tests the server listens on a new port, but re-uses the same namespaces+veth. If the node/machine is quite slow, I guess a segment might have been retransmitted after the test that initiated it had already finished. As a result, the per-namespace counters are incremented, which makes the test fail (IOW, the test expects all segments in the ns to be dropped).
>>
>> So, I should do one of the options:
>>
>> 1. relax per-namespace checks (the per-socket and per-key counters are checked)
>> 2. unshare(net) + veth setup for each test
>> 3. split the selftest into smaller ones (as they create new net-ns in initialization)
>
> Actually, I think there may be an easier fix:
>
> 4. Make sure that client close()s TCP-AO first, making it twsk. And also make sure that net-ns counters are read post server's close().
>
> Will do this, let's see if this fixes the flakiness on the netdev bot :)
FWIW, I ended up with this: https://lore.kernel.org/all/20240202-unsigned-md5-netns-counters-v1-1-8b90c3...
I reproduced the issue once, running unsigned-md5* in a loop, while in another terminal building linux-next with all cores. With the patch above, it survived 77 iterations of both ipv4/ipv6 tests so far. So, there is a chance it fixes the issue :)
Thanks, Dmitry
On Fri, 2 Feb 2024 02:30:52 +0000 Dmitry Safonov wrote:
>> Actually, I think there may be an easier fix:
>>
>> 4. Make sure that client close()s TCP-AO first, making it twsk. And also make sure that net-ns counters are read post server's close().
>>
>> Will do this, let's see if this fixes the flakiness on the netdev bot :)
>
> FWIW, I ended up with this: https://lore.kernel.org/all/20240202-unsigned-md5-netns-counters-v1-1-8b90c3...
>
> I reproduced the issue once, running unsigned-md5* in a loop, while in another terminal building linux-next with all cores. With the patch above, it survived 77 iterations of both ipv4/ipv6 tests so far. So, there is a chance it fixes the issue :)
That was quick! Fingers crossed :)
Hello:
This series was applied to netdev/net.git (main) by Jakub Kicinski kuba@kernel.org:
On Tue, 30 Jan 2024 03:51:51 +0000 you wrote:
> Changes in v2:
> - Dropped "selftests/net: Clean-up double assignment", going to send it to net-next with other changes (Simon)
> - Added a patch to rectify RST selftests.
> - Link to v1: https://lore.kernel.org/r/20240118-tcp-ao-test-key-mgmt-v1-0-3583ca147113@ar...
>
> Two typo fixes, noticed by Mohammad's review. And a fix for an issue that got uncovered.
>
> [...]
Here is the summary with links:
  - [v2,1/3] selftests/net: Argument value mismatch when calling verify_counters()
    https://git.kernel.org/netdev/net/c/d8f5df1fcea5
  - [v2,2/3] selftests/net: Rectify key counters checks
    https://git.kernel.org/netdev/net/c/384aa16d3776
  - [v2,3/3] selftests/net: Repair RST passive reset selftest
    https://git.kernel.org/netdev/net/c/6caf3adcc877
You are awesome, thank you!
linux-kselftest-mirror@lists.linaro.org