Re: [PATCH net 4/4] net/udpgso_bench_tx: audit error queue

27 May 2019


      ...
On May 27, 2019, at 2:46 PM, Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
...
Also, I my v2 fix in net is still up for debate. In its current state, it
meets my application’s requirements, but may not meet all of yours.
...
I gave more specific feedback on issues with it (referencing zerocopy
and IP_TOS, say).
Unfortunately I don’t have a very good email setup, and I found a
bunch of your comments in my junk folder. That was on Saturday,
and on Sunday I spent some time implementing your suggestions.
I have not pushed the changes up yet.
I wanted to discuss whether or not to attach a buffer to the 
recvmsg(fd, &msg, MSG_ERRQUEUE). Without it, I have
MSG_TRUNC errors in my msg_flags. Either I have to add
a buffer, or ignore that error flag.
...
Also, it is safer to update only the relevant timestamp bits in
tx_flags, rather that blanket overwrite, given that some bits are
already set in skb_segment. I have not checked whether this is
absolutely necessary.
I agree. See tcp_fragment_tstamp().
I think this should work.
skb_shinfo(seg)->tx_flags |=
    		(skb_shinfo(gso_skb)->tx_flags & SKBTX_ANY_TSTAMP);
...
...
I am still open to suggestions, but so far I don’t have an alternate
solution that doesn’t break what I need working.
Did you see my response yesterday? I can live with the first segment.
Even if I don't think that it buys much in practice given xmit_more
(and it does cost something, e.g., during requeueing).
I’m sorry, I didn’t receive a response. Once again, I am struggling
with crappy email setup. Hopefully as of today my junk mail filters are
set up properly.
I’d like to see that comment. I have been wondering about xmit_more
myself. I don’t think it changes anything for software timestamps,
but it may with hardware timestamps.
I have service contracts with Intel and Mellanox. I can open up a ticket
with them to see exactly when the timestamp is taken. I believe you
mentioned before that this is vendor specific.
...
It is not strictly necessary, but indeed often a nice to have. We
generally reference by SHA1, so wait with submitting the test until
the fix is merged. See also the ipv6 flowlabel test that I just sent
for one example.
Thanks. I will hold off with the test until I get a final fix in net, and I’ll use
your example.
...
...
Below is a sample output of the
test, including a failure on IPv6 TCP Zerocopy audit (a failure that may
lead to a memory leak).
Can you elaborate on this suspected memory leak?
A user program cannot free a zerocopy buffer until it is reported as free.
If zerocopy events are not reported, that could be a memory leak.
I may have a fix. I have added a -P option when I am running an audit.
It doesn’t appear to affect performance, and since implementing it I have
received all error messages expected for both timestamp and zerocopy.
I am still testing.
...
...
I wanted to review the report with you before
I push up the v2 patch into net-next.
Are these extra tests what you were expecting? Is it OK that it doesn’t
flow well?
Do you mean how the output looks? That seems fine.
Good. Thanks.
...
...
Also, there is a failure about every 3rd time I run it,
indicating that some TX or Zerocopy messages are lost. Is that OK?
No that is not. These tests are run in a continuous test
infrastructure. We should try hard to avoid flakiness.
As per above comment, I think I removed the flakiness. I will run
overnight to confirm.
...
If this intermittent failure is due to a real kernel bug, please move
that part to a flag (or just comment out) to temporarily exclude it
from continuous testing.
More commonly it is an issue with the test itself. My SO_TXTIME test
from last week depends on timing, which has me somewhat worried when
run across a wide variety of (likely virtualized) platforms. I
purposely chose large timescales to minimize the risk.
On a related note, tests run as part of continuous testing should run
as briefly as possible. Perhaps we need to reduce the time per run to
accommodate for the new variants you are adding.
I could reduce testing from 4 to 2 seconds. Anything below that and I
miss some reports. When I found flakey results, I found I could reproduce
them in as little as 1 second.
...
...
Summary over 4.000 seconds...
sum tcp tx:   6921 MB/s     458580 calls (114645/s)     458580 msgs (114645/s)
./udpgso_bench_tx: Unexpected number of Zerocopy completions:    458580 expected    458578 received
Is this the issue you're referring to? Good catch. Clearly this is a
good test to have :) That is likely due to some timing issue in the
test, e.g., no waiting long enough to harvest all completions. That is
something I can look into after the code is merged.
Thanks.
Should the test have failed at this point? I did return an error(), but
the script kept running.
As stated, I don’t want to push up until I have tested more fully, and
the fix is accepted (which requires a v3). If you want to review what
I have, I can push it up now with the understanding that I may still
fine tune things.

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH net 4/4] net/udpgso_bench_tx: audit error queue