On 23/01/30 12:35PM, Willem de Bruijn wrote:
On Mon, Jan 30, 2023 at 12:31 PM Andrei Gherzan andrei.gherzan@canonical.com wrote:
On 23/01/30 11:29AM, Willem de Bruijn wrote:
On Mon, Jan 30, 2023 at 11:23 AM Andrei Gherzan andrei.gherzan@canonical.com wrote:
On 23/01/30 04:15PM, Andrei Gherzan wrote:
On 23/01/30 11:03AM, Willem de Bruijn wrote:
On Mon, Jan 30, 2023 at 9:28 AM Andrei Gherzan andrei.gherzan@canonical.com wrote: > > On 23/01/30 08:35AM, Willem de Bruijn wrote: > > On Mon, Jan 30, 2023 at 7:51 AM Andrei Gherzan > > andrei.gherzan@canonical.com wrote: > > > > > > On 23/01/30 09:26AM, Paolo Abeni wrote: > > > > On Fri, 2023-01-27 at 17:03 -0500, Willem de Bruijn wrote: > > > > > On Fri, Jan 27, 2023 at 1:16 PM Andrei Gherzan > > > > > andrei.gherzan@canonical.com wrote: > > > > > > > > > > > > The tx and rx test programs are used in a couple of test scripts including > > > > > > "udpgro_bench.sh". Taking this as an example, when the rx/tx programs > > > > > > are invoked subsequently, there is a chance that the rx one is not ready to > > > > > > accept socket connections. This racing bug could fail the test with at > > > > > > least one of the following: > > > > > > > > > > > > ./udpgso_bench_tx: connect: Connection refused > > > > > > ./udpgso_bench_tx: sendmsg: Connection refused > > > > > > ./udpgso_bench_tx: write: Connection refused > > > > > > > > > > > > This change addresses this by adding routines that retry the socket > > > > > > operations with an exponential back off algorithm from 100ms to 2s. > > > > > > > > > > > > Fixes: 3a687bef148d ("selftests: udp gso benchmark") > > > > > > Signed-off-by: Andrei Gherzan andrei.gherzan@canonical.com > > > > > > > > > > Synchronizing the two processes is indeed tricky. > > > > > > > > > > Perhaps more robust is opening an initial TCP connection, with > > > > > SO_RCVTIMEO to bound the waiting time. That covers all tests in one > > > > > go. > > > > > > > > Another option would be waiting for the listener(tcp)/receiver(udp) > > > > socket to show up in 'ss' output before firing-up the client - quite > > > > alike what mptcp self-tests are doing. > > > > > > I like this idea. I have tested it and it works as expected with the > > > exeception of: > > > > > > ./udpgso_bench_tx: sendmsg: No buffer space available > > > > > > Any ideas on how to handle this? I could retry and that works. > > > > This happens (also) without the zerocopy flag, right? That > > > > It might mean reaching the sndbuf limit, which can be adjusted with > > SO_SNDBUF (or SO_SNDBUFFORCE if CAP_NET_ADMIN). Though I would not > > expect this test to bump up against that limit. > > > > A few zerocopy specific reasons are captured in > > https://www.kernel.org/doc/html/latest/networking/msg_zerocopy.html#transmis.... > > I have dug a bit more into this, and it does look like your hint was in > the right direction. The fails I'm seeing are only with the zerocopy > flag. > > From the reasons (doc) above I can only assume optmem limit as I've > reproduced it with unlimited locked pages and the fails are transient. > That leaves optmem limit. Bumping the value I have by default (20480) to > (2048000) made the sendmsg succeed as expected. On the other hand, the > tests started to fail with something like: > > ./udpgso_bench_tx: Unexpected number of Zerocopy completions: 774783 > expected 773707 received
More zerocopy completions than number of sends. I have not seen this before.
The completions are ranges of IDs, one per send call for datagram sockets.
Even with segmentation offload, the counter increases per call, not per segment.
Do you experience this without any other changes to udpgso_bench_tx.c. Or are there perhaps additional sendmsg calls somewhere (during initial sync) that are not accounted to num_sends?
Indeed, that looks off. No, I have run into this without any changes in the tests (besides the retry routine in the shell script that waits for rx to come up). Also, as a data point.
Actually wait. I don't think that is the case here. "expected" is the number of sends. In this case we sent 1076 more messages than completions. Am I missing something obvious?
Oh indeed.
Receiving fewer completions than transmission is more likely.
Exactly, yes.
This should be the result of datagrams still being somewhere in the system. In a qdisc, or waiting for the network interface to return a completion notification, say.
Does this remain if adding a longer wait before the final flush_errqueue?
Yes and no. But not realiably unless I go overboard.
Or, really, the right fix is to keep polling there until the two are equal, up to some timeout. Currently flush_errqueue calls poll only once.
That makes sense. I have implemented a retry and this ran for a good while now.
flush_errqueue(fd, true);
while (true) {
flush_errqueue(fd, true);
if ((stat_zcopies == num_sends) || (delay >= MAX_DELAY))
break;
usleep(delay);
delay *= 2;
}
What do you think?
Thanks for running experiments.
We can avoid the unconditional sleep, as the poll() inside flush_errqueue already takes a timeout.
One option is to use start_time = clock_gettime(..) or gettimeofday before poll, and restart poll until either the exit condition or timeout is reached, with timeout = orig_time - elapsed_time.
Yes, this was more of a quick draft. I was thinking to move it into the flush function (while making it aware of num_sends via a parameter):
if (do_poll) { struct pollfd fds = {0}; int ret; unsigned long tnow, tstop;
fds.fd = fd; tnow = gettimeofday_ms(); tstop = tnow + POLL_LOOP_TIMEOUT_MS; while ((stat_zcopies != num_sends) && (tnow < tstop)) { ret = poll(&fds, 1, 500); if (ret == 0) { if (cfg_verbose) fprintf(stderr, "poll timeout\n"); } else if (ret < 0) { error(1, errno, "poll"); } tnow = gettimeofday_ms(); } }
Does this make more sense?