Fixes an issue where TX timestamps do not arrive on the error queue when the UDP_SEGMENT CMSG type is combined with the SO_TIMESTAMPING CMSG type.
This series also updates the UDP GSO selftests to optionally stress test this condition, and to report the reliability and performance of both TX timestamping and zerocopy messages.
Fred Klassen (4):
  net/udp_gso: Allow TX timestamp with UDP GSO
  net/udpgso_bench_tx: options to exercise TX CMSG
  net/udpgso_bench_tx: fix sendmmsg on unconnected socket
  net/udpgso_bench_tx: audit error queue
 net/ipv4/udp_offload.c                        |   4 +
 tools/testing/selftests/net/udpgso_bench_tx.c | 376 ++++++++++++++++++++++++--
 2 files changed, 358 insertions(+), 22 deletions(-)
Fixes an issue where TX timestamps do not arrive on the error queue when the UDP_SEGMENT CMSG type is combined with the SO_TIMESTAMPING CMSG type. This can be illustrated with the updated udpgso_bench_tx program, which includes the '-T' option to test for this condition.
./udpgso_bench_tx -4ucTPv -S 1472 -l2 -D 172.16.120.18
poll timeout
udp tx:      0 MB/s        1 calls/s      1 msg/s
The "poll timeout" message above indicates that TX timestamp never arrived.
It also appears that other TX CMSG types cause similar issues, for example trying to set SOL_IP/IP_TOS.
./udpgso_bench_tx -4ucPv -S 1472 -q 182 -l2 -D 172.16.120.18
poll timeout
udp tx:      0 MB/s        1 calls/s      1 msg/s
This patch preserves tx_flags for the first UDP GSO segment. This mirrors the stack's behaviour for IPv4 fragments.
Fixes: ee80d1ebe5ba ("udp: add udp gso")
Signed-off-by: Fred Klassen <fklassen@appneta.com>
---
 net/ipv4/udp_offload.c | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 065334b41d57..33de347695ae 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -228,6 +228,10 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 	seg = segs;
 	uh = udp_hdr(seg);
 
+	/* preserve TX timestamp and zero-copy info for first segment */
+	skb_shinfo(seg)->tskey = skb_shinfo(gso_skb)->tskey;
+	skb_shinfo(seg)->tx_flags = skb_shinfo(gso_skb)->tx_flags;
+
 	/* compute checksum adjustment based on old length versus new */
 	newlen = htons(sizeof(*uh) + mss);
 	check = csum16_add(csum16_sub(uh->check, uh->len), newlen);
On Thu, May 23, 2019 at 5:09 PM Fred Klassen fklassen@appneta.com wrote:
Fixes an issue where TX timestamps do not arrive on the error queue when the UDP_SEGMENT CMSG type is combined with the SO_TIMESTAMPING CMSG type. This can be illustrated with the updated udpgso_bench_tx program, which includes the '-T' option to test for this condition.
Thanks for the report.
Zerocopy notification reference count is managed in skb_segment. That should work.
Support for timestamping with the new GSO feature is indeed an oversight. The solution is similar to how TCP associates the timestamp with the right segment in tcp_gso_tstamp.
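For reference, tcp_gso_tstamp() (lightly paraphrased from net/ipv4/tcp_offload.c; flag names may differ across kernel versions) walks the segment list until it reaches the segment containing the byte recorded in tskey, and moves the timestamp request onto that segment:

static void tcp_gso_tstamp(struct sk_buff *skb, unsigned int ts_seq,
			   unsigned int seq, unsigned int mss)
{
	while (skb) {
		/* ts_seq is the sequence number of the timestamped byte */
		if (before(ts_seq, seq + mss)) {
			skb_shinfo(skb)->tx_flags |= SKBTX_SW_TSTAMP;
			skb_shinfo(skb)->tskey = ts_seq;
			return;
		}

		skb = skb->next;
		seq += mss;
	}
}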
Only, I think we want to transfer the timestamp request to the last datagram, not the first. For send timestamp, the final byte leaving the host is usually more interesting.
TX timestamping the last packet of a datagram is something that would work poorly for our application. We need to measure the time from when the first bit is sent until the first bit of the last packet is received. Timestamping the last packet of a burst seems somewhat random to me and would not be useful. Essentially we would be timestamping a random byte in a UDP GSO buffer.
I believe there is precedent for timestamping the first packet. With fragmented IPv4 packets, the first fragment is timestamped and the remaining fragments are not.
On Thu, May 23, 2019 at 9:38 PM Fred Klassen fklassen@appneta.com wrote:
I believe there is precedent for timestamping the first packet. With fragmented IPv4 packets, the first fragment is timestamped and the remaining fragments are not.
Interesting. TCP timestamping takes the opposite choice and does timestamp the last byte in the sendmsg request.
It sounds like it depends on the workload. Perhaps this then needs to be configurable with an SOF_.. flag.
Another option would be to return a timestamp for every segment. But they would all return the same tskey. And it causes different behavior with and without hardware offload.
Interesting. TCP timestamping takes the opposite choice and does timestamp the last byte in the sendmsg request.
I have a difficult time with the philosophy of TX timestamping the last segment. The actual timestamp occurs just before the last segment is sent. This is neither the start nor the end of a GSO packet, which to me seems somewhat arbitrary. It is even more arbitrary when using software TX timestamping. These timestamps represent the time that the packet is queued onto the NIC's buffer, not the actual time leaving the wire. Queuing to a ring buffer is usually much faster than wire rate. Therefore the timestamp of the last 1500 byte segment of a 64K GSO packet may in reality represent a time about half way through the burst.
Since the timestamp of a TX packet occurs just before any data is sent, I have found it most valuable to timestamp just before the first byte of the packet or burst. Conversely, I find it most valuable to get an RX timestamp after the last byte arrives.
It sounds like it depends on the workload. Perhaps this then needs to be configurable with an SOF_.. flag.
It would be interesting if a practical case can be made for timestamping the last segment. In my mind, I don’t see how that would be valuable.
Another option would be to return a timestamp for every segment. But they would all return the same tskey. And it causes different behavior with and without hardware offload.
When it comes to RX packets, getting per-packet (or per-segment) timestamps is invaluable. They represent actual wire times. However, my previous research into TX timestamping has led me to conclude that there is no practical value in timestamping every packet of a back-to-back burst.
When using software TX timestamping, the inter-packet timestamps are typically much faster than line rate. Although you may be sending on a GigE link, you may measure 20 Gbps. At higher rates, I have found that the overhead of per-packet software timestamping can produce gaps in the packet stream.
When using hardware timestamping, I think you will find that nearly all adapters only allow one timestamp at a time. Therefore only one packet in a burst would get timestamped. There are exceptions; for example, I am playing with a 100G Mellanox adapter that has per-packet TX timestamping. However, I suspect that when I am done testing, all I will see is timestamps that represent wire rate (e.g. 123 nsec per 1500 byte packet).
Beyond testing the accuracy of a NIC’s timestamping capabilities, I see very little value in doing per-segment timestamping.
On Fri, May 24, 2019 at 12:34 PM Fred Klassen fklassen@appneta.com wrote:
I have a difficult time with the philosophy of TX timestamping the last segment. The actual timestamp occurs just before the last segment is sent. This is neither the start nor the end of a GSO packet, which to me seems somewhat arbitrary. It is even more arbitrary when using software TX timestamping. These timestamps represent the time that the packet is queued onto the NIC's buffer, not the actual time leaving the wire.
It is the last moment that a timestamp can be generated for the last byte, I don't see how that is "neither the start nor the end of a GSO packet".
It would be interesting if a practical case can be made for timestamping the last segment. In my mind, I don’t see how that would be valuable.
It depends whether you are interested in measuring network latency or host transmit path latency.
For the latter, knowing the time from the start of the sendmsg call to the moment the last byte hits the wire is most relevant. Or in absence of (well defined) hardware support, the last byte being queued to the device is the next best thing.
It would make sense for this software implementation to follow established hardware behavior. But as far as I know, the exact time a hardware timestamp is taken is not consistent across devices, either.
For fine grained timestamped data, perhaps GSO is simply not a good mechanism. That said, it still has to queue a timestamp if requested.
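To make the host-transmit-path measurement concrete, here is a minimal hypothetical sketch. It assumes fd already has SO_TIMESTAMPING enabled with SOF_TIMESTAMPING_TX_SOFTWARE | SOF_TIMESTAMPING_SOFTWARE, that tx describes the outgoing datagram, and the usual selftest includes (<linux/errqueue.h>, <sys/socket.h>, <time.h>, <error.h>, <errno.h>):

static long tx_queue_latency_ns(int fd, struct msghdr *tx)
{
	char control[256], junk[64];
	struct iovec iov = { .iov_base = junk, .iov_len = sizeof(junk) };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
			      .msg_control = control,
			      .msg_controllen = sizeof(control) };
	struct scm_timestamping *tss;
	struct cmsghdr *cm;
	struct timespec t0;

	clock_gettime(CLOCK_REALTIME, &t0);
	if (sendmsg(fd, tx, 0) == -1)
		error(1, errno, "sendmsg");

	/* the software tx timestamp is delivered on the error queue
	 * (busy-waits on EAGAIN for brevity)
	 */
	while (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1)
		if (errno != EAGAIN)
			error(1, errno, "errqueue");

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		if (cm->cmsg_level == SOL_SOCKET &&
		    cm->cmsg_type == SO_TIMESTAMPING) {
			tss = (struct scm_timestamping *)CMSG_DATA(cm);
			/* ts[0] holds the software timestamp */
			return (tss->ts[0].tv_sec - t0.tv_sec) * 1000000000L +
			       (tss->ts[0].tv_nsec - t0.tv_nsec);
		}
	}
	return -1;
}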
When using hardware timestamping, I think you will find that nearly all adapters only allow one timestamp at a time. Therefore only one packet in a burst would get timestamped.
Can you elaborate? When the host queues N packets all with hardware timestamps requested, all N completions will have a timestamp? Or is that not guaranteed?
There are exceptions; for example, I am playing with a 100G Mellanox adapter that has per-packet TX timestamping. However, I suspect that when I am done testing, all I will see is timestamps that represent wire rate (e.g. 123 nsec per 1500 byte packet).
Beyond testing the accuracy of a NIC’s timestamping capabilities, I see very little value in doing per-segment timestamping.
Ack. Great detailed argument, thanks.
On May 24, 2019, at 12:29 PM, Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
It is the last moment that a timestamp can be generated for the last byte, I don't see how that is "neither the start nor the end of a GSO packet”.
My misunderstanding. I thought TCP did last segment timestamping, not last byte. In that case, your statements make sense.
It would be interesting if a practical case can be made for timestamping the last segment. In my mind, I don’t see how that would be valuable.
It depends whether you are interested in measuring network latency or host transmit path latency.
For the latter, knowing the time from the start of the sendmsg call to the moment the last byte hits the wire is most relevant. Or in absence of (well defined) hardware support, the last byte being queued to the device is the next best thing.
It would make sense for this software implementation to follow established hardware behavior. But as far as I know, the exact time a hardware timestamp is taken is not consistent across devices, either.
For fine grained timestamped data, perhaps GSO is simply not a good mechanism. That said, it still has to queue a timestamp if requested.
I see your point. Makes sense to me.
When using hardware timestamping, I think you will find that nearly all adapters only allow one timestamp at a time. Therefore only one packet in a burst would get timestamped.
Can you elaborate? When the host queues N packets all with hardware timestamps requested, all N completions will have a timestamp? Or is that not guaranteed?
It is not guaranteed. The best example is in ixgbe_main.c; search for ‘SKBTX_HW_TSTAMP’. If there is a PTP TX timestamp in progress, ‘__IXGBE_PTP_TX_IN_PROGRESS’ is set and no other timestamps are possible. The flag is cleared after the transmit softirq, and only then can another TX timestamp be taken.
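The pattern in question looks roughly like this in ixgbe's xmit path (paraphrased from ixgbe_main.c; simplified): a new timestamp request is silently skipped while a previous one is still in flight:

	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) &&
	    adapter->ptp_clock &&
	    !test_and_set_bit_lock(__IXGBE_PTP_TX_IN_PROGRESS,
				   &adapter->state)) {
		skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
		tx_flags |= IXGBE_TX_FLAGS_TSTAMP;

		/* schedule a check for the Tx timestamp in the PTP worker */
		adapter->ptp_tx_skb = skb_get(skb);
		adapter->ptp_tx_start = jiffies;
		schedule_work(&adapter->ptp_tx_work);
	}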
Ack. Great detailed argument, thanks.
Thanks. I’m a timestamping nerd and have learned lots with this discussion.
On Fri, May 24, 2019 at 6:01 PM Fred Klassen fklassen@appneta.com wrote:
It would be interesting if a practical case can be made for timestamping the last segment. In my mind, I don’t see how that would be valuable.
It depends whether you are interested in measuring network latency or host transmit path latency.
For the latter, knowing the time from the start of the sendmsg call to the moment the last byte hits the wire is most relevant. Or in absence of (well defined) hardware support, the last byte being queued to the device is the next best thing.
Sounds to me like both cases have a legitimate use case, and we want to support both.
Implementation constraints are that storage for this timestamp information is scarce and we cannot add new cold cacheline accesses in the datapath.
The simplest approach would be to unconditionally timestamp both the first and last segment. With the same ID. Not terribly elegant. But it works.
If conditional, tx_flags has only one bit left. I think we can harvest some, as not all defined bits are in use at the same stages in the datapath, but that is not a trivial change. Some might also better be set in the skb, instead of skb_shinfo, which would also avoid touching that cacheline. We could possibly repurpose bits from u32 tskey.
All that can come later. Initially, unless we can come up with something more elegant, I would suggest that UDP follows the rule established by TCP and timestamps the last byte. And we add an explicit SOF_TIMESTAMPING_OPT_FIRSTBYTE that is initially only supported for UDP, sets a new SKBTX_TX_FB_TSTAMP bit in __sock_tx_timestamp and is interpreted in __udp_gso_segment.
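A rough sketch of that proposal, for illustration only: SOF_TIMESTAMPING_OPT_FIRSTBYTE and SKBTX_TX_FB_TSTAMP are hypothetical names that do not exist upstream, and segs/tail stand for the first and last segments returned by skb_segment():

	/* hypothetical: in __udp_gso_segment(), after segmentation */
	if (unlikely(skb_shinfo(gso_skb)->tx_flags & SKBTX_ANY_TSTAMP)) {
		/* default: last segment, following TCP */
		struct sk_buff *ts_skb = tail;

		/* hypothetical opt-in flag selects the first segment */
		if (skb_shinfo(gso_skb)->tx_flags & SKBTX_TX_FB_TSTAMP)
			ts_skb = segs;

		skb_shinfo(ts_skb)->tx_flags = skb_shinfo(gso_skb)->tx_flags;
		skb_shinfo(ts_skb)->tskey = skb_shinfo(gso_skb)->tskey;
	}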
It is not guaranteed. The best example is in ixgbe_main.c; search for ‘SKBTX_HW_TSTAMP’. If there is a PTP TX timestamp in progress, ‘__IXGBE_PTP_TX_IN_PROGRESS’ is set and no other timestamps are possible. The flag is cleared after the transmit softirq, and only then can another TX timestamp be taken.
Interesting, thanks. I had no idea.
On May 25, 2019, at 8:20 AM, Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
All that can come later. Initially, unless we can come up with something more elegant, I would suggest that UDP follows the rule established by TCP and timestamps the last byte. And we add an explicit SOF_TIMESTAMPING_OPT_FIRSTBYTE that is initially only supported for UDP, sets a new SKBTX_TX_FB_TSTAMP bit in __sock_tx_timestamp and is interpreted in __udp_gso_segment.
I don’t see how to practically TX timestamp the last byte of any packet (UDP GSO or otherwise). The best we could do is timestamp the last segment, or rather the time that the last segment is queued. Let me attempt to explain.
First let’s look at software TX timestamps, which are generated by skb_tx_timestamp() in nearly every network driver’s xmit routine. It states:
—————————— cut ————————————
 * Ethernet MAC Drivers should call this function in their hard_xmit()
 * function immediately before giving the sk_buff to the MAC hardware.
—————————— cut ————————————
That means that the sk_buff will get timestamped just before rather than just after it is sent. To truly capture the timestamp of the last byte, this routine would have to be called a second time, right after sending to the MAC hardware. Then the user program would have to sort out the two timestamps. My guess is that this isn’t something that NIC vendors would be willing to implement in their drivers.
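For reference, the helper itself is tiny (paraphrased from include/linux/skbuff.h); it fires once, on entry to the driver's xmit routine:

static inline void skb_tx_timestamp(struct sk_buff *skb)
{
	/* clone the skb for PHY/hardware timestamping, if configured */
	skb_clone_tx_timestamp(skb);
	/* the software timestamp is taken here, before the DMA handoff */
	if (skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP)
		skb_tstamp_tx(skb, NULL);
}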
So, the best we can do is timestamp just before the last segment. Suppose UDP GSO sends 3000 bytes to a 1500 byte MTU adapter. If we set the SKBTX_HW_TSTAMP flag on the last segment, the timestamp occurs half way through the burst. But it may not be exactly half way, because the segments may get queued much faster than wire rate. Therefore the time between segment 1 and segment 2 may be much smaller than their spacing on the wire. I would not find this useful.
I propose that we stick with the method used for IP fragments, which is timestamping just before the first byte is sent. Put another way, I propose that we start the clock in an automobile race just before the front of the first car crosses the start line rather than when the front of the last car crosses the start line.
TX timestamping in hardware has even more limitations. For the most part, we can only do one timestamp per packet or burst. If we requested a timestamp of only the last segment of a packet, we would have to work backwards to calculate the start time of the packet, but that would only be a best guess. For extremely time sensitive applications (such as the one we develop), this would not be practical.
We could still consider setting a flag that would allow timestamping the last segment rather than the first. However, since we cannot truly measure the timestamp of the last byte, I would question the value in doing so.
On Sat, May 25, 2019 at 1:47 PM Fred Klassen fklassen@appneta.com wrote:
So, the best we can do is timestamp just before the last segment. Suppose UDP GSO sends 3000 bytes to a 1500 byte MTU adapter. If we set the SKBTX_HW_TSTAMP flag on the last segment, the timestamp occurs half way through the burst. But it may not be exactly half way, because the segments may get queued much faster than wire rate. Therefore the time between segment 1 and segment 2 may be much smaller than their spacing on the wire. I would not find this useful.
For measuring host queueing latency, a timestamp at the existing skb_tx_timestamp() for the last segment is perfectly informative.
I propose that we stick with the method used for IP fragments, which is timestamping just before the first byte is sent.
I understand that this addresses your workload. It simply ignores the other use case identified earlier in this thread.
TX timestamping in hardware has even more limitations. For the most part, we can only do one timestamp per packet or burst. If we requested a timestamp of only the last segment of a packet, we would have to work backwards to calculate the start time of the packet, but that would only be a best guess. For extremely time sensitive applications (such as the one we develop), this would not be practical.
Note that for any particularly sensitive measurements, a segment can always be sent separately.
On Sun, May 26, 2019 at 8:30 PM Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
For measuring host queueing latency, a timestamp at the existing skb_tx_timestamp() for the last segment is perfectly informative.
In most cases all segments will be sent in a single xmit_more train, in which case the device doorbell is rung when the last segment is queued.
A device may also pause in the middle of a train, causing the rest of the list to be requeued and resent after a tx completion frees up descriptors and wakes the device. This seems like a relevant exception to be able to measure.
That said, I am not opposed to the first segment, if we have to make a binary choice for a default. Either option has cons. See more specific revision requests in the v2 patch.
On May 23, 2019, at 2:39 PM, Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
Zerocopy notification reference count is managed in skb_segment. That should work.
I’m trying to understand the context of reference counting in skb_segment. I assume that there is an opportunity to optimize the count of outstanding zerocopy buffers, but I can’t see it. Please clarify.
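For context, the mechanism being referred to appears to be skb_zerocopy_clone(), which skb_segment() calls for each new segment; via skb_zcopy_set() every segment takes a reference on the shared ubuf_info, so the zerocopy completion notification fires only after the last segment is freed. A hedged sketch, based on include/linux/skbuff.h around v5.2 (simplified; helper names changed in later kernels):

/* reached via skb_zerocopy_clone() from skb_segment(), once per segment */
static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg)
{
	if (skb && uarg && !skb_zcopy(skb)) {
		sock_zerocopy_get(uarg);	/* refcount_inc on the shared uarg */
		skb_shinfo(skb)->destructor_arg = uarg;
		skb_shinfo(skb)->tx_flags |= SKBTX_ZEROCOPY_FRAG;
	}
}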
On Thu, May 23, 2019 at 5:09 PM Fred Klassen fklassen@appneta.com wrote:
It also appears that other TX CMSG types cause similar issues, for example trying to set SOL_IP/IP_TOS.
./udpgso_bench_tx -4ucPv -S 1472 -q 182 -l2 -D 172.16.120.18
poll timeout
udp tx:      0 MB/s        1 calls/s      1 msg/s
what exactly is the issue with IP_TOS?
If I understand correctly, the issue here is that the new 'P' option that polls on the error queue times out. This is unrelated to specifying TOS bits? Without zerocopy or timestamps, no message is expected on the error queue.
On May 23, 2019, at 2:59 PM, Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
what exactly is the issue with IP_TOS?
If I understand correctly, the issue here is that the new 'P' option that polls on the error queue times out. This is unrelated to specifying TOS bits? Without zerocopy or timestamps, no message is expected on the error queue.
I was not able to get to the root cause, but I noticed that IP_TOS CMSG was lost until I applied this fix. I also found it confusing as to why that may be the case.
On May 23, 2019, at 2:59 PM, Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
what exactly is the issue with IP_TOS?
Please disregard last message. I think I was chasing a non-issue with TOS bits. I will remove all references to TOS.
This enhancement adds options that facilitate load testing with additional TX CMSG options, and that optionally print the results of various send CMSG operations.
These options are especially useful in isolating situations where error-queue messages are lost when combined with other CMSG operations (e.g. SO_ZEROCOPY).
New options:
-T    add TX CMSG that requests TX software timestamps
-H    similar to -T, except requests TX hardware timestamps
-q    add IP_TOS/IPV6_TCLASS TX CMSG
-P    call poll() before reading error queue
-v    print detailed results
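For example, software TX timestamps can be combined with UDP GSO, polling and verbose output in a single run (the destination address below is illustrative):

./udpgso_bench_tx -4ucTPv -S 1472 -l2 -D 172.16.120.18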
Fixes: 3a687bef148d ("selftests: udp gso benchmark")
Signed-off-by: Fred Klassen <fklassen@appneta.com>
---
 tools/testing/selftests/net/udpgso_bench_tx.c | 290 ++++++++++++++++++++++++--
 1 file changed, 273 insertions(+), 17 deletions(-)
diff --git a/tools/testing/selftests/net/udpgso_bench_tx.c b/tools/testing/selftests/net/udpgso_bench_tx.c index 4074538b5df5..a900f016b9e7 100644 --- a/tools/testing/selftests/net/udpgso_bench_tx.c +++ b/tools/testing/selftests/net/udpgso_bench_tx.c @@ -5,6 +5,8 @@ #include <arpa/inet.h> #include <errno.h> #include <error.h> +#include <linux/errqueue.h> +#include <linux/net_tstamp.h> #include <netinet/if_ether.h> #include <netinet/in.h> #include <netinet/ip.h> @@ -19,6 +21,7 @@ #include <string.h> #include <sys/socket.h> #include <sys/time.h> +#include <sys/poll.h> #include <sys/types.h> #include <unistd.h>
@@ -34,6 +37,10 @@ #define SO_ZEROCOPY 60 #endif
+#ifndef SO_EE_ORIGIN_ZEROCOPY +#define SO_EE_ORIGIN_ZEROCOPY 5 +#endif + #ifndef MSG_ZEROCOPY #define MSG_ZEROCOPY 0x4000000 #endif @@ -48,9 +55,14 @@ static uint16_t cfg_mss; static int cfg_payload_len = (1472 * 42); static int cfg_port = 8000; static int cfg_runtime_ms = -1; +static bool cfg_poll; static bool cfg_segment; static bool cfg_sendmmsg; static bool cfg_tcp; +static uint32_t cfg_tx_ts = SOF_TIMESTAMPING_TX_SOFTWARE; +static bool cfg_tx_tstamp; +static uint32_t cfg_tos; +static bool cfg_verbose; static bool cfg_zerocopy; static int cfg_msg_nr; static uint16_t cfg_gso_size; @@ -58,6 +70,10 @@ static uint16_t cfg_gso_size; static socklen_t cfg_alen; static struct sockaddr_storage cfg_dst_addr;
+struct my_scm_timestamping { + struct timespec ts[3]; +}; + static bool interrupted; static char buf[NUM_PKT][ETH_MAX_MTU];
@@ -89,20 +105,20 @@ static int set_cpu(int cpu)
static void setup_sockaddr(int domain, const char *str_addr, void *sockaddr) { - struct sockaddr_in6 *addr6 = (void *) sockaddr; - struct sockaddr_in *addr4 = (void *) sockaddr; + struct sockaddr_in6 *addr6 = (void *)sockaddr; + struct sockaddr_in *addr4 = (void *)sockaddr;
switch (domain) { case PF_INET: addr4->sin_family = AF_INET; addr4->sin_port = htons(cfg_port); - if (inet_pton(AF_INET, str_addr, &(addr4->sin_addr)) != 1) + if (inet_pton(AF_INET, str_addr, &addr4->sin_addr) != 1) error(1, 0, "ipv4 parse error: %s", str_addr); break; case PF_INET6: addr6->sin6_family = AF_INET6; addr6->sin6_port = htons(cfg_port); - if (inet_pton(AF_INET6, str_addr, &(addr6->sin6_addr)) != 1) + if (inet_pton(AF_INET6, str_addr, &addr6->sin6_addr) != 1) error(1, 0, "ipv6 parse error: %s", str_addr); break; default: @@ -110,23 +126,143 @@ static void setup_sockaddr(int domain, const char *str_addr, void *sockaddr) } }
-static void flush_zerocopy(int fd) +static void flush_cmsg(struct cmsghdr *cmsg) +{ + int i; + + switch (cmsg->cmsg_level) { + case SOL_SOCKET: + if (cmsg->cmsg_type == SO_TIMESTAMPING) { + i = (cfg_tx_ts == SOF_TIMESTAMPING_TX_HARDWARE) ? 2 : 0; + struct my_scm_timestamping *tss; + + tss = (struct my_scm_timestamping *)CMSG_DATA(cmsg); + fprintf(stderr, "tx timestamp = %lu.%09lu\n", + tss->ts[i].tv_sec, tss->ts[i].tv_nsec); + } else { + error(1, 0, + "unknown SOL_SOCKET cmsg type=%u level=%u\n", + cmsg->cmsg_type, cmsg->cmsg_level); + } + break; + case SOL_IP: + case SOL_IPV6: + switch (cmsg->cmsg_type) { + case IP_RECVERR: + case IPV6_RECVERR: + { + struct sock_extended_err *err; + + err = (struct sock_extended_err *)CMSG_DATA(cmsg); + switch (err->ee_origin) { + case SO_EE_ORIGIN_TIMESTAMPING: + // Got a TX timestamp from error queue + fprintf(stderr, + "got SO_EE_ORIGIN_TIMESTAMPING\n"); + break; + case SO_EE_ORIGIN_ICMP: + case SO_EE_ORIGIN_ICMP6: + fprintf(stderr, + "received ICMP error: type=%u, code=%u\n", + err->ee_type, err->ee_code); + break; + case SO_EE_ORIGIN_ZEROCOPY: + { + __u32 lo = err->ee_info; + __u32 hi = err->ee_data; + + if (hi == lo - 1) + // TX was aborted + fprintf(stderr, + "Zerocopy TX aborted: lo=%u hi=%u\n", + lo, hi); + if (hi == lo) + // single ID acknowledged + fprintf(stderr, + "Zerocopy TX ack ID: %u\n", + lo); + else + // range of IDs acknowledged + fprintf(stderr, + "Zerocopy TX ack %u IDs %u to %u\n", + hi - lo + 1, lo, hi); + break; + } + case SO_EE_ORIGIN_LOCAL: + fprintf(stderr, + "received packet with local origin: %u\n", + err->ee_origin); + break; + default: + error(0, 1, + "received packet with origin: %u\n", + err->ee_origin); + } + + break; + } + default: + error(0, 1, "unknown IP msg type=%u level=%u\n", + cmsg->cmsg_type, cmsg->cmsg_level); + break; + } + break; + default: + error(0, 1, "unknown cmsg type=%u level=%u\n", + cmsg->cmsg_type, cmsg->cmsg_level); + } +} + +static void flush_errqueue_recv(int fd) { struct msghdr msg = {0}; /* flush */ + struct cmsghdr *cmsg; + struct iovec entry; + char control[1024]; + char buf[1500]; int ret;
+ entry.iov_base = buf; + entry.iov_len = sizeof(buf); + msg.msg_iovlen = 1; + msg.msg_iov = &entry; + msg.msg_control = control; + msg.msg_controllen = sizeof(control); + while (1) { ret = recvmsg(fd, &msg, MSG_ERRQUEUE); if (ret == -1 && errno == EAGAIN) break; if (ret == -1) error(1, errno, "errqueue"); - if (msg.msg_flags != (MSG_ERRQUEUE | MSG_CTRUNC)) - error(1, 0, "errqueue: flags 0x%x\n", msg.msg_flags); msg.msg_flags = 0; + if (cfg_verbose) { + for (cmsg = CMSG_FIRSTHDR(&msg); + cmsg; + cmsg = CMSG_NXTHDR(&msg, cmsg)) + flush_cmsg(cmsg); + } } }
+static void flush_errqueue(int fd) +{ + if (cfg_poll) { + struct pollfd fds = { 0 }; + int ret; + + fds.fd = fd; + fds.events = POLLERR; + ret = poll(&fds, 1, 1000); + if (ret == 0) + error(1, 0, "poll timeout"); + else if (ret < 0) + error(1, errno, "poll"); + } + + flush_errqueue_recv(fd); +} + static int send_tcp(int fd, char *data) { int ret, done = 0, count = 0; @@ -168,16 +304,70 @@ static int send_udp(int fd, char *data) return count; }
+static void send_ts_cmsg(struct cmsghdr *cm) +{ + uint32_t *valp; + + cm->cmsg_level = SOL_SOCKET; + cm->cmsg_type = SO_TIMESTAMPING; + cm->cmsg_len = CMSG_LEN(sizeof(cfg_tx_ts)); + valp = (void *)CMSG_DATA(cm); + *valp = cfg_tx_ts; +} + +static void send_tos_cmsg(struct cmsghdr *cm) +{ + uint32_t *valp; + int level, type; + + if (cfg_family == PF_INET) { + level = SOL_IP; + type = IP_TOS; + } else { + level = SOL_IPV6; + type = IPV6_TCLASS; + } + + cm->cmsg_level = level; + cm->cmsg_type = type; + cm->cmsg_len = CMSG_LEN(sizeof(cfg_tos)); + valp = (void *)CMSG_DATA(cm); + *valp = cfg_tos; +} + static int send_udp_sendmmsg(int fd, char *data) { + char control[CMSG_SPACE(sizeof(cfg_tos)) + + CMSG_SPACE(sizeof(cfg_tx_ts))] = {0}; const int max_nr_msg = ETH_MAX_MTU / ETH_DATA_LEN; struct mmsghdr mmsgs[max_nr_msg]; struct iovec iov[max_nr_msg]; unsigned int off = 0, left; + size_t msg_controllen = 0; int i = 0, ret;
memset(mmsgs, 0, sizeof(mmsgs));
+ if (cfg_tx_tstamp || cfg_tos) { + struct msghdr msg = {0}; + struct cmsghdr *cmsg; + + msg.msg_control = control; + msg.msg_controllen = sizeof(control); + cmsg = CMSG_FIRSTHDR(&msg); + if (cfg_tos) { + send_tos_cmsg(cmsg); + msg_controllen += CMSG_SPACE(sizeof(cfg_tos)); + } + + if (cfg_tx_tstamp) { + if (msg_controllen) + cmsg = CMSG_NXTHDR(&msg, cmsg); + send_ts_cmsg(cmsg); + msg_controllen += CMSG_SPACE(sizeof(cfg_tx_ts)); + } + } + left = cfg_payload_len; while (left) { if (i == max_nr_msg) @@ -188,6 +378,10 @@ static int send_udp_sendmmsg(int fd, char *data)
mmsgs[i].msg_hdr.msg_iov = iov + i; mmsgs[i].msg_hdr.msg_iovlen = 1; + if (msg_controllen) { + mmsgs[i].msg_hdr.msg_control = control; + mmsgs[i].msg_hdr.msg_controllen = msg_controllen; + }
off += iov[i].iov_len; left -= iov[i].iov_len; @@ -214,9 +408,13 @@ static void send_udp_segment_cmsg(struct cmsghdr *cm)
static int send_udp_segment(int fd, char *data) { - char control[CMSG_SPACE(sizeof(cfg_gso_size))] = {0}; + char control[CMSG_SPACE(sizeof(cfg_gso_size)) + + CMSG_SPACE(sizeof(cfg_tos)) + + CMSG_SPACE(sizeof(cfg_tx_ts))] = {0}; struct msghdr msg = {0}; struct iovec iov = {0}; + size_t msg_controllen; + struct cmsghdr *cmsg; int ret;
iov.iov_base = data; @@ -227,8 +425,22 @@ static int send_udp_segment(int fd, char *data)
msg.msg_control = control; msg.msg_controllen = sizeof(control); - send_udp_segment_cmsg(CMSG_FIRSTHDR(&msg)); + cmsg = CMSG_FIRSTHDR(&msg); + send_udp_segment_cmsg(cmsg); + msg_controllen = CMSG_SPACE(sizeof(cfg_mss)); + if (cfg_tx_tstamp) { + cmsg = CMSG_NXTHDR(&msg, cmsg); + send_ts_cmsg(cmsg); + msg_controllen += CMSG_SPACE(sizeof(cfg_tx_ts)); + }
+ if (cfg_tos) { + cmsg = CMSG_NXTHDR(&msg, cmsg); + send_tos_cmsg(cmsg); + msg_controllen += CMSG_SPACE(sizeof(cfg_tos)); + } + + msg.msg_controllen = msg_controllen; msg.msg_name = (void *)&cfg_dst_addr; msg.msg_namelen = cfg_alen;
@@ -236,15 +448,16 @@ static int send_udp_segment(int fd, char *data) if (ret == -1) error(1, errno, "sendmsg"); if (ret != iov.iov_len) - error(1, 0, "sendmsg: %u != %lu\n", ret, iov.iov_len); + error(1, 0, "sendmsg: %u != %lu", ret, iov.iov_len);
return 1; }
static void usage(const char *filepath) { - error(1, 0, "Usage: %s [-46cmtuz] [-C cpu] [-D dst ip] [-l secs] [-m messagenr] [-p port] [-s sendsize] [-S gsosize]", - filepath); + error(1, 0, + "Usage: %s [-46cmHPtTuvz] [-C cpu] [-D dst ip] [-l secs] [-M messagenr] [-p port] [-q tos] [-s sendsize] [-S gsosize]", + filepath); }
static void parse_opts(int argc, char **argv) @@ -252,7 +465,7 @@ static void parse_opts(int argc, char **argv) int max_len, hdrlen; int c;
- while ((c = getopt(argc, argv, "46cC:D:l:mM:p:s:S:tuz")) != -1) { + while ((c = getopt(argc, argv, "46cC:D:Hl:mM:p:s:q:PS:tTuvz")) != -1) { switch (c) { case '4': if (cfg_family != PF_UNSPEC) @@ -287,19 +500,35 @@ static void parse_opts(int argc, char **argv) case 'p': cfg_port = strtoul(optarg, NULL, 0); break; + case 'P': + cfg_poll = true; + break; case 's': cfg_payload_len = strtoul(optarg, NULL, 0); break; + case 'q': + cfg_tos = strtoul(optarg, NULL, 0); + break; case 'S': cfg_gso_size = strtoul(optarg, NULL, 0); cfg_segment = true; break; + case 'H': + cfg_tx_ts = SOF_TIMESTAMPING_TX_HARDWARE; + cfg_tx_tstamp = true; + break; case 't': cfg_tcp = true; break; + case 'T': + cfg_tx_tstamp = true; + break; case 'u': cfg_connected = false; break; + case 'v': + cfg_verbose = true; + break; case 'z': cfg_zerocopy = true; break; @@ -315,6 +544,12 @@ static void parse_opts(int argc, char **argv) error(1, 0, "connectionless tcp makes no sense"); if (cfg_segment && cfg_sendmmsg) error(1, 0, "cannot combine segment offload and sendmmsg"); + if (cfg_tx_tstamp && !(cfg_segment || cfg_sendmmsg)) + error(1, 0, "Options -T and -H require either -S or -m option"); + if (cfg_tos && !(cfg_segment || cfg_sendmmsg)) + error(1, 0, "Option -q requires either -S or -m option"); + if (cfg_poll && !(cfg_segment || cfg_sendmmsg)) + error(1, 0, "Poll option -P requires either -S or -m option");
if (cfg_family == PF_INET) hdrlen = sizeof(struct iphdr) + sizeof(struct udphdr); @@ -349,6 +584,19 @@ static void set_pmtu_discover(int fd, bool is_ipv4) error(1, errno, "setsockopt path mtu"); }
+static void set_tx_timestamping(int fd) +{ + int val = SOF_TIMESTAMPING_OPT_CMSG | SOF_TIMESTAMPING_OPT_ID; + + if (cfg_tx_ts == SOF_TIMESTAMPING_TX_SOFTWARE) + val |= SOF_TIMESTAMPING_SOFTWARE; + else + val |= SOF_TIMESTAMPING_RAW_HARDWARE; + + if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val))) + error(1, errno, "setsockopt tx timestamping"); +} + int main(int argc, char **argv) { unsigned long num_msgs, num_sends; @@ -384,7 +632,11 @@ int main(int argc, char **argv) if (cfg_segment) set_pmtu_discover(fd, cfg_family == PF_INET);
- num_msgs = num_sends = 0; + if (cfg_tx_tstamp) + set_tx_timestamping(fd); + + num_msgs = 0; + num_sends = 0; tnow = gettimeofday_ms(); tstop = tnow + cfg_runtime_ms; treport = tnow + 1000; @@ -400,8 +652,8 @@ int main(int argc, char **argv) else num_sends += send_udp(fd, buf[i]); num_msgs++; - if (cfg_zerocopy && ((num_msgs & 0xF) == 0)) - flush_zerocopy(fd); + if ((cfg_zerocopy && (num_msgs & 0xF) == 0) || cfg_tx_tstamp) + flush_errqueue(fd);
if (cfg_msg_nr && num_msgs >= cfg_msg_nr) break; @@ -413,7 +665,8 @@ int main(int argc, char **argv) cfg_tcp ? "tcp" : "udp", (num_msgs * cfg_payload_len) >> 20, num_sends, num_msgs); - num_msgs = num_sends = 0; + num_msgs = 0; + num_sends = 0; treport = tnow + 1000; }
@@ -423,6 +676,9 @@ int main(int argc, char **argv)
} while (!interrupted && (cfg_runtime_ms == -1 || tnow < tstop));
+ if (cfg_zerocopy || cfg_tx_tstamp) + flush_errqueue(fd); + if (close(fd)) error(1, errno, "close");
On Thu, May 23, 2019 at 5:11 PM Fred Klassen fklassen@appneta.com wrote:
Fixes: 3a687bef148d ("selftests: udp gso benchmark")
This is not a fix, but an extension. Fixes tags help with backporting to stable kernels. There is something to be said for backporting the main change to support SO_TIMESTAMPING + UDP_GSO on older kernels, especially since it is very concise. But the tests should probably be in a separate patch set targeting net-next.
+#ifndef SO_EE_ORIGIN_ZEROCOPY
+#define SO_EE_ORIGIN_ZEROCOPY 5
+#endif
+struct my_scm_timestamping {
+	struct timespec ts[3];
+};
This and the above should not be needed if including <linux/errqueue.h>
It may be absent if relying on the host header files, but the kselftest build system should correctly use the files from the kernel source tree.
 static void setup_sockaddr(int domain, const char *str_addr, void *sockaddr)
 {
-	struct sockaddr_in6 *addr6 = (void *) sockaddr;
-	struct sockaddr_in *addr4 = (void *) sockaddr;
+	struct sockaddr_in6 *addr6 = (void *)sockaddr;
+	struct sockaddr_in *addr4 = (void *)sockaddr;
 
 	switch (domain) {
 	case PF_INET:
 		addr4->sin_family = AF_INET;
 		addr4->sin_port = htons(cfg_port);
-		if (inet_pton(AF_INET, str_addr, &(addr4->sin_addr)) != 1)
+		if (inet_pton(AF_INET, str_addr, &addr4->sin_addr) != 1)
 			error(1, 0, "ipv4 parse error: %s", str_addr);
 		break;
 	case PF_INET6:
 		addr6->sin6_family = AF_INET6;
 		addr6->sin6_port = htons(cfg_port);
-		if (inet_pton(AF_INET6, str_addr, &(addr6->sin6_addr)) != 1)
+		if (inet_pton(AF_INET6, str_addr, &addr6->sin6_addr) != 1)
Please do not include style changes like these. Try to minimize changes required to add the new feature.
On Thu, May 23, 2019 at 5:11 PM Fred Klassen fklassen@appneta.com wrote:
-T    add TX CMSG that requests TX software timestamps
-H    similar to -T, except requests TX hardware timestamps
-q    add IP_TOS/IPV6_TCLASS TX CMSG
To ensure that we do not regress, when adding options, please consider (just a general suggestion, not a strong request for this patch set) updating the kselftest to run a variant of the test with the new code coverage. In this case, make the test pass/fail instead of only user-interpretable, and add variants to udpgso.sh.
+static void flush_errqueue_recv(int fd)
 {
 	struct msghdr msg = {0};	/* flush */
+	struct cmsghdr *cmsg;
+	struct iovec entry;
+	char control[1024];
can use more precise CMSG_SPACE based on worst case expectations, like in udp_sendmmsg
char buf[1500];
no need for payload
+static void flush_errqueue(int fd)
+{
+	if (cfg_poll) {
+		struct pollfd fds = { 0 };
+		int ret;
+
+		fds.fd = fd;
+		fds.events = POLLERR;
no need to pass POLLERR, it is always returned in revents.
To ensure that we do not regress, when adding options, please consider (just a general suggestion, not a strong request for this patch set) updating the kselftest to run a variant of the test with the new code coverage. In this case, make the test pass/fail instead of only user-interpretable, and add variants to udpgso.sh.
I had a look at how kselftest works, and I absolutely want to see this work with these changes. I’ll investigate and implement it in the v2 patch.
I will most likely do a 5 second test. This seems to be sufficient to get meaningful results.
Fixes the following error if using both -m and -u options:
# ./udpgso_bench -4um -D 172.16.120.189
./udpgso_bench: sendmmsg: Destination address required
Fixes: 3a687bef148d ("selftests: udp gso benchmark")
Signed-off-by: Fred Klassen <fklassen@appneta.com>
---
 tools/testing/selftests/net/udpgso_bench_tx.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/net/udpgso_bench_tx.c b/tools/testing/selftests/net/udpgso_bench_tx.c
index a900f016b9e7..56e0d890b066 100644
--- a/tools/testing/selftests/net/udpgso_bench_tx.c
+++ b/tools/testing/selftests/net/udpgso_bench_tx.c
@@ -378,6 +378,8 @@ static int send_udp_sendmmsg(int fd, char *data)
 
 		mmsgs[i].msg_hdr.msg_iov = iov + i;
 		mmsgs[i].msg_hdr.msg_iovlen = 1;
+		mmsgs[i].msg_hdr.msg_name = (void *)&cfg_dst_addr;
+		mmsgs[i].msg_hdr.msg_namelen = cfg_alen;
 		if (msg_controllen) {
 			mmsgs[i].msg_hdr.msg_control = control;
 			mmsgs[i].msg_hdr.msg_controllen = msg_controllen;
This enhancement adds the '-a' option, which will count all CMSG messages on the error queue and print a summary report.
Example:
# ./udpgso_bench_tx -4uT -a -l5 -S 1472 -D 172.16.120.189
udp tx:    492 MB/s     8354 calls/s   8354 msg/s
udp tx:    477 MB/s     8106 calls/s   8106 msg/s
udp tx:    488 MB/s     8288 calls/s   8288 msg/s
udp tx:    882 MB/s    14975 calls/s  14975 msg/s
Summary over 5.000 seconds ...
sum udp tx:    696 MB/s      57696 calls (11539/s)      57696 msgs (11539/s)
Tx Timestamps: received:     57696 errors: 0
This can be useful in tracking loss of messages when under load. For example, adding the '-z' option results in loss of TX timestamp messages:
# ./udpgso_bench_tx -4ucT -a -l5 -S 1472 -D 172.16.120.189 -p 3239 -z
udp tx:    490 MB/s     8325 calls/s   8325 msg/s
udp tx:    500 MB/s     8492 calls/s   8492 msg/s
udp tx:    883 MB/s    14985 calls/s  14985 msg/s
udp tx:    756 MB/s    12823 calls/s  12823 msg/s
Summary over 5.000 seconds ...
sum udp tx:    657 MB/s      54429 calls (10885/s)      54429 msgs (10885/s)
Tx Timestamps: received:     34046 errors: 0
Zerocopy acks: received:     54422 errors: 0
Fixes: 3a687bef148d ("selftests: udp gso benchmark")
Signed-off-by: Fred Klassen <fklassen@appneta.com>
---
 tools/testing/selftests/net/udpgso_bench_tx.c | 152 +++++++++++++++++++-------
 1 file changed, 113 insertions(+), 39 deletions(-)
diff --git a/tools/testing/selftests/net/udpgso_bench_tx.c b/tools/testing/selftests/net/udpgso_bench_tx.c
index 56e0d890b066..9924342a0b03 100644
--- a/tools/testing/selftests/net/udpgso_bench_tx.c
+++ b/tools/testing/selftests/net/udpgso_bench_tx.c
@@ -62,10 +62,19 @@ static bool cfg_tcp;
 static uint32_t cfg_tx_ts = SOF_TIMESTAMPING_TX_SOFTWARE;
 static bool cfg_tx_tstamp;
 static uint32_t cfg_tos;
+static bool cfg_audit;
 static bool cfg_verbose;
 static bool cfg_zerocopy;
 static int cfg_msg_nr;
 static uint16_t cfg_gso_size;
+static unsigned long total_num_msgs;
+static unsigned long total_num_sends;
+static unsigned long stat_tx_ts;
+static unsigned long stat_tx_ts_errors;
+static unsigned long tstart;
+static unsigned long tend;
+static unsigned long stat_zcopies;
+static unsigned long stat_zcopy_errors;
 
 static socklen_t cfg_alen;
 static struct sockaddr_storage cfg_dst_addr;
@@ -137,8 +146,11 @@ static void flush_cmsg(struct cmsghdr *cmsg)
 			struct my_scm_timestamping *tss;
 
 			tss = (struct my_scm_timestamping *)CMSG_DATA(cmsg);
-			fprintf(stderr, "tx timestamp = %lu.%09lu\n",
-				tss->ts[i].tv_sec, tss->ts[i].tv_nsec);
+			if (tss->ts[i].tv_sec == 0)
+				stat_tx_ts_errors++;
+			if (cfg_verbose)
+				fprintf(stderr, "tx timestamp = %lu.%09lu\n",
+					tss->ts[i].tv_sec, tss->ts[i].tv_nsec);
 		} else {
 			error(1, 0, "unknown SOL_SOCKET cmsg type=%u level=%u\n",
@@ -157,41 +169,52 @@ static void flush_cmsg(struct cmsghdr *cmsg)
 		switch (err->ee_origin) {
 		case SO_EE_ORIGIN_TIMESTAMPING:
 			// Got a TX timestamp from error queue
-			fprintf(stderr,
-				"got SO_EE_ORIGIN_TIMESTAMPING\n");
+			stat_tx_ts++;
+			if (cfg_verbose)
+				fprintf(stderr,
+					"got SO_EE_ORIGIN_TIMESTAMPING\n");
 			break;
 		case SO_EE_ORIGIN_ICMP:
 		case SO_EE_ORIGIN_ICMP6:
-			fprintf(stderr,
-				"received ICMP error: type=%u, code=%u\n",
-				err->ee_type, err->ee_code);
+			if (cfg_verbose)
+				fprintf(stderr,
+					"received ICMP error: type=%u, code=%u\n",
+					err->ee_type, err->ee_code);
 			break;
 		case SO_EE_ORIGIN_ZEROCOPY: {
 			__u32 lo = err->ee_info;
 			__u32 hi = err->ee_data;
 
-			if (hi == lo - 1)
+			if (hi == lo - 1) {
 				// TX was aborted
-				fprintf(stderr,
-					"Zerocopy TX aborted: lo=%u hi=%u\n",
-					lo, hi);
-			if (hi == lo)
+				stat_zcopy_errors++;
+				if (cfg_verbose)
+					fprintf(stderr,
+						"Zerocopy TX aborted: lo=%u hi=%u\n",
+						lo, hi);
+			} else if (hi == lo) {
 				// single ID acknowledged
-				fprintf(stderr,
-					"Zerocopy TX ack ID: %u\n",
-					lo);
-			else
+				stat_zcopies++;
+				if (cfg_verbose)
+					fprintf(stderr,
+						"Zerocopy TX ack ID: %u\n",
+						lo);
+			} else {
 				// range of IDs acknowledged
-				fprintf(stderr,
-					"Zerocopy TX ack %u IDs %u to %u\n",
-					hi - lo + 1, lo, hi);
+				stat_zcopies += hi - lo + 1;
+				if (cfg_verbose)
+					fprintf(stderr,
+						"Zerocopy TX ack %u IDs %u to %u\n",
+						hi - lo + 1, lo, hi);
+			}
 			break;
 		}
 		case SO_EE_ORIGIN_LOCAL:
-			fprintf(stderr,
-				"received packet with local origin: %u\n",
-				err->ee_origin);
+			if (cfg_verbose)
+				fprintf(stderr,
+					"received packet with local origin: %u\n",
+					err->ee_origin);
 			break;
 		default:
 			error(0, 1,
@@ -236,7 +259,7 @@ static void flush_errqueue_recv(int fd)
 		if (ret == -1)
 			error(1, errno, "errqueue");
 		msg.msg_flags = 0;
-		if (cfg_verbose) {
+		if (cfg_audit || cfg_verbose) {
 			for (cmsg = CMSG_FIRSTHDR(&msg);
 			     cmsg;
 			     cmsg = CMSG_NXTHDR(&msg, cmsg))
@@ -245,19 +268,21 @@ static void flush_errqueue_recv(int fd)
 	}
 }
 
-static void flush_errqueue(int fd)
+static void flush_errqueue(int fd, const bool do_poll)
 {
-	if (cfg_poll) {
+	if (do_poll) {
 		struct pollfd fds = { 0 };
 		int ret;
 
 		fds.fd = fd;
 		fds.events = POLLERR;
-		ret = poll(&fds, 1, 1000);
-		if (ret == 0)
-			error(1, 0, "poll timeout");
-		else if (ret < 0)
+		ret = poll(&fds, 1, 500);
+		if (ret == 0) {
+			if (cfg_verbose)
+				fprintf(stderr, "poll timeout\n");
+		} else if (ret < 0) {
 			error(1, errno, "poll");
+		}
 	}
 
 	flush_errqueue_recv(fd);
@@ -458,7 +483,7 @@ static int send_udp_segment(int fd, char *data)
 static void usage(const char *filepath)
 {
 	error(1, 0,
-	      "Usage: %s [-46cmHPtTuvz] [-C cpu] [-D dst ip] [-l secs] [-M messagenr] [-p port] [-q tos] [-s sendsize] [-S gsosize]",
+	      "Usage: %s [-46acmHPtTuvz] [-C cpu] [-D dst ip] [-l secs] [-M messagenr] [-p port] [-q tos] [-s sendsize] [-S gsosize]",
	      filepath);
 }
 
@@ -467,7 +492,7 @@ static void parse_opts(int argc, char **argv)
 	int max_len, hdrlen;
 	int c;
 
-	while ((c = getopt(argc, argv, "46cC:D:Hl:mM:p:s:q:PS:tTuvz")) != -1) {
+	while ((c = getopt(argc, argv, "46acC:D:Hl:mM:p:s:q:PS:tTuvz")) != -1) {
 		switch (c) {
 		case '4':
 			if (cfg_family != PF_UNSPEC)
@@ -481,6 +506,9 @@ static void parse_opts(int argc, char **argv)
 			cfg_family = PF_INET6;
 			cfg_alen = sizeof(struct sockaddr_in6);
 			break;
+		case 'a':
+			cfg_audit = true;
+			break;
 		case 'c':
 			cfg_cache_trash = true;
 			break;
@@ -599,6 +627,51 @@ static void set_tx_timestamping(int fd)
 		error(1, errno, "setsockopt tx timestamping");
 }
 
+static void print_final_report(unsigned long num_msgs, unsigned long num_sends)
+{
+	unsigned long tdelta;
+
+	tdelta = tend - tstart;
+	if (!tdelta)
+		return;
+
+	fprintf(stderr, "Summary over %lu.%03lu seconds ...\n", tdelta / 1000,
+		tdelta % 1000);
+	fprintf(stderr,
+		"sum %s tx: %6lu MB/s %10lu calls (%lu/s) %10lu msgs (%lu/s)\n",
+		cfg_tcp ? "tcp" : "udp",
+		((num_msgs * cfg_payload_len) >> 10) / tdelta,
+		num_sends, num_sends * 1000 / tdelta,
+		num_msgs, num_msgs * 1000 / tdelta);
+	if (cfg_tx_tstamp)
+		fprintf(stderr,
+			"Tx Timestamps: received: %9lu errors: %lu\n",
+			stat_tx_ts, stat_tx_ts_errors);
+
+	if (cfg_zerocopy)
+		fprintf(stderr,
+			"Zerocopy acks: received: %9lu errors: %lu\n",
+			stat_zcopies, stat_zcopy_errors);
+}
+
+static void print_report(unsigned long num_msgs, unsigned long num_sends,
+			 const bool final)
+{
+	if (!final)
+		fprintf(stderr,
+			"%s tx: %6lu MB/s %8lu calls/s %6lu msg/s\n",
+			cfg_tcp ? "tcp" : "udp",
+			(num_msgs * cfg_payload_len) >> 20,
+			num_sends, num_msgs);
+
+	if (cfg_audit) {
+		total_num_msgs += num_msgs;
+		total_num_sends += num_sends;
+		if (final)
+			print_final_report(total_num_msgs, total_num_sends);
+	}
+}
+
 int main(int argc, char **argv)
 {
 	unsigned long num_msgs, num_sends;
@@ -640,6 +713,8 @@ int main(int argc, char **argv)
 	num_msgs = 0;
 	num_sends = 0;
 	tnow = gettimeofday_ms();
+	tstart = tnow;
+	tend = tnow;
 	tstop = tnow + cfg_runtime_ms;
 	treport = tnow + 1000;
 
@@ -654,19 +729,15 @@ int main(int argc, char **argv)
 		else
 			num_sends += send_udp(fd, buf[i]);
 		num_msgs++;
-		if ((cfg_zerocopy && (num_msgs & 0xF) == 0) || cfg_tx_tstamp)
-			flush_errqueue(fd);
+		if (cfg_tx_tstamp || (cfg_zerocopy && (num_msgs & 0xF) == 0))
+			flush_errqueue(fd, cfg_poll);
 
 		if (cfg_msg_nr && num_msgs >= cfg_msg_nr)
 			break;
 
 		tnow = gettimeofday_ms();
 		if (tnow > treport) {
-			fprintf(stderr,
-				"%s tx: %6lu MB/s %8lu calls/s %6lu msg/s\n",
-				cfg_tcp ? "tcp" : "udp",
-				(num_msgs * cfg_payload_len) >> 20,
-				num_sends, num_msgs);
+			print_report(num_msgs, num_sends, false);
 			num_msgs = 0;
 			num_sends = 0;
 			treport = tnow + 1000;
@@ -679,10 +750,13 @@ int main(int argc, char **argv)
 	} while (!interrupted && (cfg_runtime_ms == -1 || tnow < tstop));
 
 	if (cfg_zerocopy || cfg_tx_tstamp)
-		flush_errqueue(fd);
+		flush_errqueue(fd, true);
 
 	if (close(fd))
 		error(1, errno, "close");
 
+	tend = tnow;
+	print_report(num_msgs, num_sends, true);
+
 	return 0;
 }
On Thu, May 23, 2019 at 5:11 PM Fred Klassen fklassen@appneta.com wrote:
This enhancement adds the '-a' option, which will count all CMSG messages on the error queue and print a summary report.
Fixes: 3a687bef148d ("selftests: udp gso benchmark")
Also not a fix, but an extension.
Example:
# ./udpgso_bench_tx -4uT -a -l5 -S 1472 -D 172.16.120.189
udp tx: 492 MB/s 8354 calls/s 8354 msg/s
udp tx: 477 MB/s 8106 calls/s 8106 msg/s
udp tx: 488 MB/s 8288 calls/s 8288 msg/s
udp tx: 882 MB/s 14975 calls/s 14975 msg/s
Summary over 5.000 seconds ...
sum udp tx: 696 MB/s 57696 calls (11539/s) 57696 msgs (11539/s)
Tx Timestamps: received: 57696 errors: 0
This can be useful in tracking loss of messages when under load. For example, adding the '-z' option results in loss of TX timestamp messages:
# ./udpgso_bench_tx -4ucT -a -l5 -S 1472 -D 172.16.120.189 -p 3239 -z
udp tx: 490 MB/s 8325 calls/s 8325 msg/s
udp tx: 500 MB/s 8492 calls/s 8492 msg/s
udp tx: 883 MB/s 14985 calls/s 14985 msg/s
udp tx: 756 MB/s 12823 calls/s 12823 msg/s
Summary over 5.000 seconds ...
sum udp tx: 657 MB/s 54429 calls (10885/s) 54429 msgs (10885/s)
Tx Timestamps: received: 34046 errors: 0
Zerocopy acks: received: 54422 errors: 0
This would probably also be more useful as a regression test if it is in the form of a pass/fail test: if timestamps are requested and the total count is zero, then the feature is broken and the process should exit with an error.
Fixes: 3a687bef148d ("selftests: udp gso benchmark")
Repeated
Signed-off-by: Fred Klassen fklassen@appneta.com
 tools/testing/selftests/net/udpgso_bench_tx.c | 152 +++++++++++++++++++-------
 1 file changed, 113 insertions(+), 39 deletions(-)
diff --git a/tools/testing/selftests/net/udpgso_bench_tx.c b/tools/testing/selftests/net/udpgso_bench_tx.c
index 56e0d890b066..9924342a0b03 100644
--- a/tools/testing/selftests/net/udpgso_bench_tx.c
+++ b/tools/testing/selftests/net/udpgso_bench_tx.c
@@ -62,10 +62,19 @@ static bool cfg_tcp;
 static uint32_t cfg_tx_ts = SOF_TIMESTAMPING_TX_SOFTWARE;
 static bool cfg_tx_tstamp;
 static uint32_t cfg_tos;
+static bool cfg_audit;
 static bool cfg_verbose;
 static bool cfg_zerocopy;
 static int cfg_msg_nr;
 static uint16_t cfg_gso_size;
+static unsigned long total_num_msgs;
+static unsigned long total_num_sends;
+static unsigned long stat_tx_ts;
+static unsigned long stat_tx_ts_errors;
+static unsigned long tstart;
+static unsigned long tend;
+static unsigned long stat_zcopies;
+static unsigned long stat_zcopy_errors;
 
 static socklen_t cfg_alen;
 static struct sockaddr_storage cfg_dst_addr;
@@ -137,8 +146,11 @@ static void flush_cmsg(struct cmsghdr *cmsg)
 			struct my_scm_timestamping *tss;
 
 			tss = (struct my_scm_timestamping *)CMSG_DATA(cmsg);
-			fprintf(stderr, "tx timestamp = %lu.%09lu\n",
-				tss->ts[i].tv_sec, tss->ts[i].tv_nsec);
+			if (tss->ts[i].tv_sec == 0)
+				stat_tx_ts_errors++;
+			if (cfg_verbose)
+				fprintf(stderr, "tx timestamp = %lu.%09lu\n",
+					tss->ts[i].tv_sec, tss->ts[i].tv_nsec);
changes unrelated to this feature?
Willem, this is only my 2nd patch, and my last one was a one-liner. I’ll try to work through this, but let me know if I am making a rookie mistake (learning curve and all).
On May 23, 2019, at 2:56 PM, Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
On Thu, May 23, 2019 at 5:11 PM Fred Klassen fklassen@appneta.com wrote:
This enhancement adds the '-a' option, which will count all CMSG messages on the error queue and print a summary report.
Fixes: 3a687bef148d ("selftests: udp gso benchmark")
Also not a fix, but an extension.
I’ll make a v2 patch and remove “Fixes:”.
Example:
# ./udpgso_bench_tx -4uT -a -l5 -S 1472 -D 172.16.120.189
udp tx: 492 MB/s 8354 calls/s 8354 msg/s
udp tx: 477 MB/s 8106 calls/s 8106 msg/s
udp tx: 488 MB/s 8288 calls/s 8288 msg/s
udp tx: 882 MB/s 14975 calls/s 14975 msg/s
Summary over 5.000 seconds ...
sum udp tx: 696 MB/s 57696 calls (11539/s) 57696 msgs (11539/s)
Tx Timestamps: received: 57696 errors: 0
This can be useful in tracking loss of messages when under load. For example, adding the '-z' option results in loss of TX timestamp messages:
# ./udpgso_bench_tx -4ucT -a -l5 -S 1472 -D 172.16.120.189 -p 3239 -z
udp tx: 490 MB/s 8325 calls/s 8325 msg/s
udp tx: 500 MB/s 8492 calls/s 8492 msg/s
udp tx: 883 MB/s 14985 calls/s 14985 msg/s
udp tx: 756 MB/s 12823 calls/s 12823 msg/s
Summary over 5.000 seconds ...
sum udp tx: 657 MB/s 54429 calls (10885/s) 54429 msgs (10885/s)
Tx Timestamps: received: 34046 errors: 0
Zerocopy acks: received: 54422 errors: 0
This would probably also be more useful as a regression test if it is in the form of a pass/fail test: if timestamps are requested and the total count is zero, then the feature is broken and the process should exit with an error.
I’ll add a hard failure for zero response for TX Timestamps or Zerocopy, or if any errors occur.
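A minimal sketch of what that hard failure could look like, reusing the audit counters from the patch above (validate_audit() is a hypothetical helper name; the exact form in v2 may differ):

static void validate_audit(void)
{
        /* fail hard if a requested notification type never arrived,
         * or if any error was counted along the way */
        if (cfg_tx_tstamp && (stat_tx_ts_errors || !stat_tx_ts))
                error(1, 0, "expected tx timestamps, got %lu (%lu errors)",
                      stat_tx_ts, stat_tx_ts_errors);
        if (cfg_zerocopy && (stat_zcopy_errors || !stat_zcopies))
                error(1, 0, "expected zerocopy completions, got %lu (%lu errors)",
                      stat_zcopies, stat_zcopy_errors);
}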
Fixes: 3a687bef148d ("selftests: udp gso benchmark")
Repeated
Will fix.
Signed-off-by: Fred Klassen fklassen@appneta.com
 tools/testing/selftests/net/udpgso_bench_tx.c | 152 +++++++++++++++++++-------
 1 file changed, 113 insertions(+), 39 deletions(-)
diff --git a/tools/testing/selftests/net/udpgso_bench_tx.c b/tools/testing/selftests/net/udpgso_bench_tx.c
index 56e0d890b066..9924342a0b03 100644
--- a/tools/testing/selftests/net/udpgso_bench_tx.c
+++ b/tools/testing/selftests/net/udpgso_bench_tx.c
@@ -62,10 +62,19 @@ static bool cfg_tcp;
 static uint32_t cfg_tx_ts = SOF_TIMESTAMPING_TX_SOFTWARE;
 static bool cfg_tx_tstamp;
 static uint32_t cfg_tos;
+static bool cfg_audit;
 static bool cfg_verbose;
 static bool cfg_zerocopy;
 static int cfg_msg_nr;
 static uint16_t cfg_gso_size;
+static unsigned long total_num_msgs;
+static unsigned long total_num_sends;
+static unsigned long stat_tx_ts;
+static unsigned long stat_tx_ts_errors;
+static unsigned long tstart;
+static unsigned long tend;
+static unsigned long stat_zcopies;
+static unsigned long stat_zcopy_errors;
 
 static socklen_t cfg_alen;
 static struct sockaddr_storage cfg_dst_addr;
@@ -137,8 +146,11 @@ static void flush_cmsg(struct cmsghdr *cmsg)
 			struct my_scm_timestamping *tss;
 
 			tss = (struct my_scm_timestamping *)CMSG_DATA(cmsg);
-			fprintf(stderr, "tx timestamp = %lu.%09lu\n",
-				tss->ts[i].tv_sec, tss->ts[i].tv_nsec);
+			if (tss->ts[i].tv_sec == 0)
+				stat_tx_ts_errors++;
+			if (cfg_verbose)
+				fprintf(stderr, "tx timestamp = %lu.%09lu\n",
+					tss->ts[i].tv_sec, tss->ts[i].tv_nsec);
changes unrelated to this feature?
I’ll remove. Do you think that I should pull out any messages related to “cfg_verbose”?
On Thu, May 23, 2019 at 9:27 PM Fred Klassen fklassen@appneta.com wrote:
Willem, this is only my 2nd patch, and my last one was a one-liner. I’ll try to work through this, but let me know if I am making a rookie mistake (learning curve and all).
Not at all. The fix makes perfect sense.
The test patches 2 and 4 are not fixes, so are better suited to net-next. Perhaps the changes to the test can also be more concise, just the minimal changes needed to demonstrate the bug and fix.
 			tss = (struct my_scm_timestamping *)CMSG_DATA(cmsg);
-			fprintf(stderr, "tx timestamp = %lu.%09lu\n",
-				tss->ts[i].tv_sec, tss->ts[i].tv_nsec);
+			if (tss->ts[i].tv_sec == 0)
+				stat_tx_ts_errors++;
+			if (cfg_verbose)
+				fprintf(stderr, "tx timestamp = %lu.%09lu\n",
+					tss->ts[i].tv_sec, tss->ts[i].tv_nsec);
changes unrelated to this feature?
I’ll remove. Do you think that I should pull out any messages related to “cfg_verbose”?
This change did not seem relevant to the main feature of the patch.
Willem,
Thanks for spending so much time with me on this patch. I’m pretty new to this, so I know I am making lots of mistakes. I have been working on a v2 of the selftests in net-next, but want to review the layout of the report before I submit (see below).
Also, my v2 fix in net is still up for debate. In its current state, it meets my application’s requirements, but may not meet all of yours. I am still open to suggestions, but so far I don’t have an alternate solution that doesn’t break what I need working.
I also have a question about submitting a fix in net and a related enhancement in net-next. I feel I should be referencing the fix in net in my net-next commit, but I don’t know how it should be done. Any suggestions?
On May 23, 2019, at 2:56 PM, Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
Example:
# ./udpgso_bench_tx -4uT -a -l5 -S 1472 -D 172.16.120.189
udp tx: 492 MB/s 8354 calls/s 8354 msg/s
udp tx: 477 MB/s 8106 calls/s 8106 msg/s
udp tx: 488 MB/s 8288 calls/s 8288 msg/s
udp tx: 882 MB/s 14975 calls/s 14975 msg/s
Summary over 5.000 seconds ...
sum udp tx: 696 MB/s 57696 calls (11539/s) 57696 msgs (11539/s)
Tx Timestamps: received: 57696 errors: 0
This can be useful in tracking loss of messages when under load. For example, adding the '-z' option results in loss of TX timestamp messages:
# ./udpgso_bench_tx -4ucT -a -l5 -S 1472 -D 172.16.120.189 -p 3239 -z
udp tx: 490 MB/s 8325 calls/s 8325 msg/s
udp tx: 500 MB/s 8492 calls/s 8492 msg/s
udp tx: 883 MB/s 14985 calls/s 14985 msg/s
udp tx: 756 MB/s 12823 calls/s 12823 msg/s
Summary over 5.000 seconds ...
sum udp tx: 657 MB/s 54429 calls (10885/s) 54429 msgs (10885/s)
Tx Timestamps: received: 34046 errors: 0
Zerocopy acks: received: 54422 errors: 0
This would probably also be more useful as a regression test if it is in the form of a pass/fail test: if timestamps are requested and the total count is zero, then the feature is broken and the process should exit with an error.
I have it set up as a pass/fail test. Below is a sample output of the test, including a failure on IPv6 TCP Zerocopy audit (a failure that may lead to a memory leak). I wanted to review the report with you before I push up the v2 patch into net-next.
Are these extra tests what you were expecting? Is it OK that it doesn’t flow well? Also, there is a failure about every 3rd time I run it, indicating that some TX or Zerocopy messages are lost. Is that OK?
ipv4
tcp
tcp rx: 11665 MB/s 189014 calls/s
tcp tx: 11665 MB/s 197859 calls/s 197859 msg/s
tcp rx: 11706 MB/s 190749 calls/s
tcp tx: 11706 MB/s 198545 calls/s 198545 msg/s
tcp tx: 11653 MB/s 197645 calls/s 197645 msg/s
tcp rx: 11653 MB/s 189292 calls/s
tcp zerocopy
tcp rx: 6658 MB/s 111849 calls/s
tcp tx: 6658 MB/s 112940 calls/s 112940 msg/s
tcp tx: 6824 MB/s 115742 calls/s 115742 msg/s
tcp rx: 6824 MB/s 115711 calls/s
tcp rx: 6865 MB/s 116402 calls/s
tcp tx: 6865 MB/s 116440 calls/s 116440 msg/s
tcp zerocopy audit
tcp tx: 6646 MB/s 112729 calls/s 112729 msg/s
tcp rx: 6646 MB/s 111445 calls/s
tcp rx: 6672 MB/s 112833 calls/s
tcp tx: 6672 MB/s 113164 calls/s 113164 msg/s
tcp tx: 6624 MB/s 112355 calls/s 112355 msg/s
tcp rx: 6624 MB/s 110877 calls/s
Summary over 4.000 seconds...
sum tcp tx: 6813 MB/s 451402 calls (112850/s) 451402 msgs (112850/s)
Zerocopy acks: 451402 received 0 errors
udp
udp rx: 914 MB/s 651407 calls/s
udp tx: 925 MB/s 659274 calls/s 15697 msg/s
udp rx: 919 MB/s 654859 calls/s
udp tx: 919 MB/s 654864 calls/s 15592 msg/s
udp rx: 914 MB/s 651668 calls/s
udp tx: 914 MB/s 651630 calls/s 15515 msg/s
udp rx: 918 MB/s 654642 calls/s
udp gso
udp rx: 2271 MB/s 1618319 calls/s
udp tx: 2515 MB/s 42658 calls/s 42658 msg/s
udp rx: 2337 MB/s 1665280 calls/s
udp tx: 2551 MB/s 43276 calls/s 43276 msg/s
udp rx: 2338 MB/s 1665792 calls/s
udp tx: 2557 MB/s 43384 calls/s 43384 msg/s
udp rx: 2339 MB/s 1666560 calls/s
udp gso zerocopy
udp rx: 1725 MB/s 1229087 calls/s
udp tx: 1840 MB/s 31219 calls/s 31219 msg/s
udp rx: 1850 MB/s 1318460 calls/s
udp tx: 1850 MB/s 31388 calls/s 31388 msg/s
udp rx: 1845 MB/s 1314428 calls/s
udp tx: 1845 MB/s 31299 calls/s 31299 msg/s
udp gso timestamp
udp rx: 2286 MB/s 1628928 calls/s
udp tx: 2446 MB/s 41499 calls/s 41499 msg/s
udp rx: 2354 MB/s 1677312 calls/s
udp tx: 2473 MB/s 41952 calls/s 41952 msg/s
udp rx: 2342 MB/s 1668864 calls/s
udp tx: 2471 MB/s 41925 calls/s 41925 msg/s
udp rx: 2333 MB/s 1662464 calls/s
udp gso zerocopy audit
udp rx: 1787 MB/s 1273481 calls/s
udp tx: 1832 MB/s 31082 calls/s 31082 msg/s
udp rx: 1881 MB/s 1340476 calls/s
udp tx: 1881 MB/s 31916 calls/s 31916 msg/s
udp rx: 1880 MB/s 1339880 calls/s
udp tx: 1881 MB/s 31904 calls/s 31904 msg/s
udp rx: 1881 MB/s 1340010 calls/s
Summary over 4.000 seconds...
sum udp tx: 1912 MB/s 126694 calls (31673/s) 126694 msgs (31673/s)
Zerocopy acks: 126694 received 0 errors
udp gso timestamp audit
udp rx: 2259 MB/s 1609327 calls/s
udp tx: 2410 MB/s 40879 calls/s 40879 msg/s
udp rx: 2346 MB/s 1671168 calls/s
udp tx: 2446 MB/s 41491 calls/s 41491 msg/s
udp rx: 2358 MB/s 1680128 calls/s
udp tx: 2448 MB/s 41522 calls/s 41522 msg/s
udp rx: 2356 MB/s 1678848 calls/s
Summary over 4.000 seconds...
sum udp tx: 2494 MB/s 165276 calls (41319/s) 165276 msgs (41319/s)
Tx Timestamps: 165276 received 0 errors
udp gso zerocopy timestamp audit
udp rx: 1658 MB/s 1181647 calls/s
udp tx: 1678 MB/s 28466 calls/s 28466 msg/s
udp rx: 1718 MB/s 1224010 calls/s
udp tx: 1718 MB/s 29146 calls/s 29146 msg/s
udp rx: 1718 MB/s 1224086 calls/s
udp tx: 1718 MB/s 29144 calls/s 29144 msg/s
udp rx: 1717 MB/s 1223464 calls/s
Summary over 4.000 seconds...
sum udp tx: 1747 MB/s 115771 calls (28942/s) 115771 msgs (28942/s)
Tx Timestamps: 115771 received 0 errors
Zerocopy acks: 115771 received 0 errors
ipv6
tcp
tcp tx: 11470 MB/s 194547 calls/s 194547 msg/s
tcp rx: 11470 MB/s 188442 calls/s
tcp tx: 11803 MB/s 200193 calls/s 200193 msg/s
tcp rx: 11803 MB/s 194526 calls/s
tcp tx: 11832 MB/s 200681 calls/s 200681 msg/s
tcp rx: 11832 MB/s 194459 calls/s
tcp zerocopy
tcp rx: 7482 MB/s 80176 calls/s
tcp tx: 7482 MB/s 126908 calls/s 126908 msg/s
tcp rx: 6641 MB/s 112609 calls/s
tcp tx: 6641 MB/s 112649 calls/s 112649 msg/s
tcp tx: 6584 MB/s 111683 calls/s 111683 msg/s
tcp rx: 6584 MB/s 111617 calls/s
tcp zerocopy audit
tcp tx: 6719 MB/s 113968 calls/s 113968 msg/s
tcp rx: 6719 MB/s 113893 calls/s
tcp rx: 6772 MB/s 114831 calls/s
tcp tx: 6772 MB/s 114872 calls/s 114872 msg/s
tcp tx: 6793 MB/s 115226 calls/s 115226 msg/s
tcp rx: 7075 MB/s 116595 calls/s
Summary over 4.000 seconds...
sum tcp tx: 6921 MB/s 458580 calls (114645/s) 458580 msgs (114645/s)
./udpgso_bench_tx: Unexpected number of Zerocopy completions: 458580 expected 458578 received
udp
udp rx: 833 MB/s 607639 calls/s
udp tx: 851 MB/s 621264 calls/s 14448 msg/s
udp rx: 860 MB/s 627246 calls/s
udp tx: 860 MB/s 627284 calls/s 14588 msg/s
udp rx: 862 MB/s 628718 calls/s
udp tx: 861 MB/s 628574 calls/s 14618 msg/s
udp rx: 863 MB/s 630058 calls/s
udp gso
udp rx: 2310 MB/s 1683314 calls/s
udp tx: 2472 MB/s 41931 calls/s 41931 msg/s
udp rx: 2343 MB/s 1708032 calls/s
udp tx: 2493 MB/s 42298 calls/s 42298 msg/s
udp rx: 2322 MB/s 1692160 calls/s
udp tx: 2496 MB/s 42347 calls/s 42347 msg/s
udp gso zerocopy
udp rx: 1752 MB/s 1278423 calls/s
udp tx: 1778 MB/s 30162 calls/s 30162 msg/s
udp rx: 1804 MB/s 1316058 calls/s
udp tx: 1804 MB/s 30605 calls/s 30605 msg/s
udp rx: 1807 MB/s 1318120 calls/s
udp tx: 1808 MB/s 30681 calls/s 30681 msg/s
udp rx: 1730 MB/s 1261819 calls/s
udp gso timestamp
udp rx: 2296 MB/s 1673728 calls/s
udp tx: 2375 MB/s 40284 calls/s 40284 msg/s
udp rx: 2334 MB/s 1701632 calls/s
udp tx: 2377 MB/s 40319 calls/s 40319 msg/s
udp rx: 2335 MB/s 1702093 calls/s
udp tx: 2377 MB/s 40319 calls/s 40319 msg/s
udp rx: 2336 MB/s 1702656 calls/s
udp gso zerocopy audit
udp rx: 1717 MB/s 1252554 calls/s
udp tx: 1759 MB/s 29839 calls/s 29839 msg/s
udp rx: 1811 MB/s 1321003 calls/s
udp tx: 1811 MB/s 30722 calls/s 30722 msg/s
udp rx: 1811 MB/s 1320917 calls/s
udp tx: 1811 MB/s 30719 calls/s 30719 msg/s
udp rx: 1810 MB/s 1320606 calls/s
Summary over 4.000 seconds...
sum udp tx: 1839 MB/s 121868 calls (30467/s) 121868 msgs (30467/s)
Zerocopy acks: 121868 received 0 errors
udp gso timestamp audit
udp rx: 2231 MB/s 1626112 calls/s
udp tx: 2337 MB/s 39646 calls/s 39646 msg/s
udp rx: 2292 MB/s 1670400 calls/s
udp tx: 2365 MB/s 40122 calls/s 40122 msg/s
udp rx: 2287 MB/s 1666816 calls/s
udp tx: 2363 MB/s 40084 calls/s 40084 msg/s
udp rx: 2288 MB/s 1667840 calls/s
Summary over 4.000 seconds...
sum udp tx: 2412 MB/s 159818 calls (39954/s) 159818 msgs (39954/s)
Tx Timestamps: 159818 received 0 errors
udp gso zerocopy timestamp audit
udp rx: 1592 MB/s 1161479 calls/s
udp tx: 1624 MB/s 27560 calls/s 27560 msg/s
udp rx: 1657 MB/s 1208472 calls/s
udp tx: 1656 MB/s 28103 calls/s 28103 msg/s
udp rx: 1653 MB/s 1206064 calls/s
udp tx: 1653 MB/s 28047 calls/s 28047 msg/s
udp rx: 1648 MB/s 1202151 calls/s
Summary over 4.000 seconds...
sum udp tx: 1683 MB/s 111556 calls (27889/s) 111556 msgs (27889/s)
Tx Timestamps: 111556 received 0 errors
Zerocopy acks: 111556 received 0 errors
On Mon, May 27, 2019 at 5:30 PM Fred Klassen fklassen@appneta.com wrote:
Willem,
Thanks for spending so much time with me on this patch. I’m pretty new to this, so I know I am making lots of mistakes. I have been working on a v2 of the selftests in net-next, but want to review the layout of the report before I submit (see below).
Also, my v2 fix in net is still up for debate. In its current state, it meets my application’s requirements, but may not meet all of yours.
I gave more specific feedback on issues with it (referencing zerocopy and IP_TOS, say).
Also, it is safer to update only the relevant timestamp bits in tx_flags, rather than a blanket overwrite, given that some bits are already set in skb_segment. I have not checked whether this is absolutely necessary.
I am still open to suggestions, but so far I don’t have an alternate solution that doesn’t break what I need working.
Did you see my response yesterday? I can live with the first segment. Even if I don't think that it buys much in practice given xmit_more (and it does cost something, e.g., during requeueing).
I also have a question about submitting a fix in net and a related enhancement in net-next. I feel I should be referencing the fix in net in my net-next commit, but I don’t know how it should be done. Any suggestions?
It is not strictly necessary, but indeed often a nice to have. We generally reference by SHA1, so wait with submitting the test until the fix is merged. See also the ipv6 flowlabel test that I just sent for one example.
This would probably also be more useful as a regression test if it is in the form of a pass/fail test: if timestamps are requested and the total count is zero, then the feature is broken and the process should exit with an error.
I have it set up as a pass/fail test.
Great, thanks.
Below is a sample output of the test, including a failure on IPv6 TCP Zerocopy audit (a failure that may lead to a memory leak).
Can you elaborate on this suspected memory leak?
I wanted to review the report with you before I push up the v2 patch into net-next.
Are these extra tests what you were expecting? Is it OK that it doesn’t flow well?
Do you mean how the output looks? That seems fine.
Also, there is a failure about every 3rd time I run it, indicating that some TX or Zerocopy messages are lost. Is that OK?
No, that is not. These tests are run in a continuous test infrastructure. We should try hard to avoid flakiness.
If this intermittent failure is due to a real kernel bug, please move that part to a flag (or just comment out) to temporarily exclude it from continuous testing.
More commonly it is an issue with the test itself. My SO_TXTIME test from last week depends on timing, which has me somewhat worried when run across a wide variety of (likely virtualized) platforms. I purposely chose large timescales to minimize the risk.
On a related note, tests run as part of continuous testing should run as briefly as possible. Perhaps we need to reduce the time per run to accommodate for the new variants you are adding.
Summary over 4.000 seconds...
sum tcp tx: 6921 MB/s 458580 calls (114645/s) 458580 msgs (114645/s)
./udpgso_bench_tx: Unexpected number of Zerocopy completions: 458580 expected 458578 received
Is this the issue you're referring to? Good catch. Clearly this is a good test to have :) That is likely due to some timing issue in the test, e.g., not waiting long enough to harvest all completions. That is something I can look into after the code is merged.
On May 27, 2019, at 2:46 PM, Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
Also, my v2 fix in net is still up for debate. In its current state, it meets my application’s requirements, but may not meet all of yours.
I gave more specific feedback on issues with it (referencing zerocopy and IP_TOS, say).
Unfortunately I don’t have a very good email setup, and I found a bunch of your comments in my junk folder. That was on Saturday, and on Sunday I spent some time implementing your suggestions. I have not pushed the changes up yet.
I wanted to discuss whether or not to attach a buffer to the recvmsg(fd, &msg, MSG_ERRQUEUE). Without it, I have MSG_TRUNC errors in my msg_flags. Either I have to add a buffer, or ignore that error flag.
Also, it is safer to update only the relevant timestamp bits in tx_flags, rather than a blanket overwrite, given that some bits are already set in skb_segment. I have not checked whether this is absolutely necessary.
I agree. See tcp_fragment_tstamp().
I think this should work.
skb_shinfo(seg)->tx_flags |= (skb_shinfo(gso_skb)->tx_flags & SKBTX_ANY_TSTAMP);
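For reference, SKBTX_ANY_TSTAMP masks all timestamp-related bits, both software and hardware; as defined in include/linux/skbuff.h at the time:

/* include/linux/skbuff.h: the mask the masked OR above relies on */
#define SKBTX_ANY_SW_TSTAMP     (SKBTX_SW_TSTAMP | SKBTX_SCHED_TSTAMP)
#define SKBTX_ANY_TSTAMP        (SKBTX_HW_TSTAMP | SKBTX_ANY_SW_TSTAMP)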
I am still open to suggestions, but so far I don’t have an alternate solution that doesn’t break what I need working.
Did you see my response yesterday? I can live with the first segment. Even if I don't think that it buys much in practice given xmit_more (and it does cost something, e.g., during requeueing).
I’m sorry, I didn’t receive a response. Once again, I am struggling with crappy email setup. Hopefully as of today my junk mail filters are set up properly.
I’d like to see that comment. I have been wondering about xmit_more myself. I don’t think it changes anything for software timestamps, but it may with hardware timestamps.
I have service contracts with Intel and Mellanox. I can open up a ticket with them to see exactly when the timestamp is taken. I believe you mentioned before that this is vendor specific.
It is not strictly necessary, but indeed often a nice to have. We generally reference by SHA1, so wait with submitting the test until the fix is merged. See also the ipv6 flowlabel test that I just sent for one example.
Thanks. I will hold off with the test until I get a final fix in net, and I’ll use your example.
Below is a sample output of the test, including a failure on IPv6 TCP Zerocopy audit (a failure that may lead to a memory leak).
Can you elaborate on this suspected memory leak?
A user program cannot free a zerocopy buffer until it is reported as free. If zerocopy events are not reported, that could be a memory leak.
I may have a fix. I have added a -P option when I am running an audit. It doesn’t appear to affect performance, and since implementing it I have received all error messages expected for both timestamp and zerocopy.
I am still testing.
I wanted to review the report with you before I push up the v2 patch into net-next.
Are these extra tests what you were expecting? Is it OK that it doesn’t flow well?
Do you mean how the output looks? That seems fine.
Good. Thanks.
Also, there is a failure about every 3rd time I run it, indicating that some TX or Zerocopy messages are lost. Is that OK?
No, that is not. These tests are run in a continuous test infrastructure. We should try hard to avoid flakiness.
As per above comment, I think I removed the flakiness. I will run overnight to confirm.
If this intermittent failure is due to a real kernel bug, please move that part to a flag (or just comment out) to temporarily exclude it from continuous testing.
More commonly it is an issue with the test itself. My SO_TXTIME test from last week depends on timing, which has me somewhat worried when run across a wide variety of (likely virtualized) platforms. I purposely chose large timescales to minimize the risk.
On a related note, tests run as part of continuous testing should run as briefly as possible. Perhaps we need to reduce the time per run to accommodate for the new variants you are adding.
I could reduce testing from 4 to 2 seconds. Anything below that and I miss some reports. When results were flaky, I found I could reproduce them in as little as 1 second.
Summary over 4.000 seconds...
sum tcp tx: 6921 MB/s 458580 calls (114645/s) 458580 msgs (114645/s)
./udpgso_bench_tx: Unexpected number of Zerocopy completions: 458580 expected 458578 received
Is this the issue you're referring to? Good catch. Clearly this is a good test to have :) That is likely due to some timing issue in the test, e.g., not waiting long enough to harvest all completions. That is something I can look into after the code is merged.
Thanks.
Should the test have failed at this point? I did return an error(), but the script kept running.
As stated, I don’t want to push up until I have tested more fully, and the fix is accepted (which requires a v3). If you want to review what I have, I can push it up now with the understanding that I may still fine tune things.
On Mon, May 27, 2019 at 6:56 PM Fred Klassen fklassen@appneta.com wrote:
On May 27, 2019, at 2:46 PM, Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
Also, my v2 fix in net is still up for debate. In its current state, it meets my application’s requirements, but may not meet all of yours.
I gave more specific feedback on issues with it (referencing zerocopy and IP_TOS, say).
Unfortunately I don’t have a very good email setup, and I found a bunch of your comments in my junk folder. That was on Saturday, and on Sunday I spent some time implementing your suggestions. I have not pushed the changes up yet.
I wanted to discuss whether or not to attach a buffer to the recvmsg(fd, &msg, MSG_ERRQUEUE). Without it, I have MSG_TRUNC errors in my msg_flags. Either I have to add a buffer, or ignore that error flag.
Either sounds reasonable. It is an expected and well understood message if underprovisioning the receive data buffer.
Also, it is safer to update only the relevant timestamp bits in tx_flags, rather than a blanket overwrite, given that some bits are already set in skb_segment. I have not checked whether this is absolutely necessary.
I agree. See tcp_fragment_tstamp().
I think this should work.
skb_shinfo(seg)->tx_flags |= (skb_shinfo(gso_skb)->tx_flags & SKBTX_ANY_TSTAMP);
Agreed. It is more obviously correct. Only drawback is that the RMW is more expensive than a straight assignment.
I am still open to suggestions, but so far I don’t have an alternate solution that doesn’t break what I need working.
Did you see my response yesterday? I can live with the first segment. Even if I don't think that it buys much in practice given xmit_more (and it does cost something, e.g., during requeueing).
I’m sorry, I didn’t receive a response. Once again, I am struggling with crappy email setup. Hopefully as of today my junk mail filters are set up properly.
I’d like to see that comment.
The netdev list is archived and available through various websites, like lore.kernel.org/netdev . As well as the patches with comments at patchwork.ozlabs.org/project/netdev/list
I have been wondering about xmit_more myself. I don’t think it changes anything for software timestamps, but it may with hardware timestamps.
It arguably makes the software timestamp too early if taken on the first segment, as the NIC is only informed of all the new descriptors when the last segment is written and the doorbell is rung.
Can you elaborate on this suspected memory leak?
A user program cannot free a zerocopy buffer until it is reported as free. If zerocopy events are not reported, that could be a memory leak.
I may have a fix. I have added a -P option when I am running an audit. It doesn’t appear to affect performance, and since implementing it I have received all error messages expected for both timestamp and zerocopy.
I am still testing.
I see, a userspace leak from lack of completion notification.
If the issue is a few missing notifications at the end of the run, then perhaps cfg_waittime_ms is too short.
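For illustration, a deadline-bounded drain along those lines, assuming the test's gettimeofday_ms() and flush_errqueue() helpers ('expected' is a hypothetical count of outstanding completions, not a variable in the patch):

        unsigned long tdeadline = gettimeofday_ms() + cfg_waittime_ms;

        /* keep draining the error queue until the deadline passes or
         * every expected notification has been counted */
        while (stat_zcopies + stat_zcopy_errors < expected &&
               gettimeofday_ms() < tdeadline)
                flush_errqueue(fd, true);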
On a related note, tests run as part of continuous testing should run as briefly as possible. Perhaps we need to reduce the time per run to accommodate for the new variants you are adding.
I could reduce testing from 4 to 2 seconds. Anything below that and I miss some reports. When results were flaky, I found I could reproduce them in as little as 1 second.
Summary over 4.000 seconds...
sum tcp tx: 6921 MB/s 458580 calls (114645/s) 458580 msgs (114645/s)
./udpgso_bench_tx: Unexpected number of Zerocopy completions: 458580 expected 458578 received
Is this the issue you're referring to? Good catch. Clearly this is a good test to have :) That is likely due to some timing issue in the test, e.g., not waiting long enough to harvest all completions. That is something I can look into after the code is merged.
Thanks.
Should the test have failed at this point? I did return an error(), but the script kept running.
This should normally be cause for test failure, I think yes. Though it's fine to send the code for review and possibly even merge, so that I can take a look.
As stated, I don’t want to push up until I have tested more fully, and the fix is accepted (which requires a v3). If you want to review what I have, I can push it up now with the understanding that I may still fine tune things.
Sounds good, thanks.
On May 27, 2019, at 6:15 PM, Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
I wanted to discuss whether or not to attach a buffer to the recvmsg(fd, &msg, MSG_ERRQUEUE). Without it, I have MSG_TRUNC errors in my msg_flags. Either I have to add a buffer, or ignore that error flag.
Either sounds reasonable. It is an expected and well understood message if underprovisioning the receive data buffer.
I’ll stick with setting up buffers. It will fail if there are any bugs introduced in buffer copy routines.
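For illustration, provisioning both a data and a control buffer so the MSG_ERRQUEUE read is never truncated could look roughly like this (buffer sizes are illustrative, not taken from the patch):

        char control[CMSG_SPACE(sizeof(struct sock_extended_err)) +
                     CMSG_SPACE(sizeof(struct scm_timestamping))] = {0};
        char data[sizeof(struct ipv6hdr) + sizeof(struct udphdr) + 64];
        struct iovec entry = { .iov_base = data, .iov_len = sizeof(data) };
        struct msghdr msg = {
                .msg_iov        = &entry,
                .msg_iovlen     = 1,
                .msg_control    = control,
                .msg_controllen = sizeof(control),
        };
        int ret;

        /* read one notification; with both buffers provisioned,
         * MSG_TRUNC in msg_flags now indicates a real problem */
        ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
        if (ret >= 0 && (msg.msg_flags & MSG_TRUNC))
                error(1, 0, "errqueue: MSG_TRUNC despite data buffer");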
The netdev list is archived and available through various websites, like lore.kernel.org/netdev . As well as the patches with comments at patchwork.ozlabs.org/project/netdev/list
Much better. Sure beats hunting down lost emails.
I have been wondering about xmit_more myself. I don’t think it changes anything for software timestamps, but it may with hardware timestamps.
It arguably makes the software timestamp too early if taken on the first segment, as the NIC is only informed of all the new descriptors when the last segment is written and the doorbell is rung.
Totally makes sense. Possibly software TX timestamps can be improved by delaying them until just before the ring buffer is advanced. It would have to be updated in each driver. I may have a look at this once I complete this patch. Hopefully that one will be a bit smoother.
Can you elaborate on this suspected memory leak?
A user program cannot free a zerocopy buffer until it is reported as free. If zerocopy events are not reported, that could be a memory leak.
I may have a fix. I have added a -P option when I am running an audit. It doesn’t appear to affect performance, and since implementing it I have received all error messages expected for both timestamp and zerocopy.
I am still testing.
I see, a userspace leak from lack of completion notification.
If the issue is a few missing notifications at the end of the run, then perhaps cfg_waittime_ms is too short.
I’ll get back to you when I have tested this more thoroughly. Early results suggest that adding the -P poll() option has fixed it without any appreciable performance hit. I’ll share raw results with you, and we can make a final decision together.
Should the test have failed at this point? I did return an error(), but the script kept running.
This should normally be cause for test failure, I think yes. Though it's fine to send the code for review and possibly even merge, so that I can take a look.
Sounds like udpgso_bench.sh needs a 'set -e' to ensure it stops on the first error.
I have been wondering about xmit_more myself. I don’t think it changes anything for software timestamps, but it may with hardware timestamps.
It arguably makes the software timestamp too early if taken on the first segment, as the NIC is only informed of all the new descriptors when the last segment is written and the doorbell is rung.
Totally makes sense. Possibly software TX timestamps can be improved by delaying them until just before the ring buffer is advanced. It would have to be updated in each driver. I may have a look at this once I complete this patch. Hopefully that one will be a bit smoother.
How do you see that? The skb_tstamp_tx call currently is already the last action before ringing the doorbell, after setting up the descriptor. It cannot be set later.
The only issue specific to GSO is that xmit_more can forego this doorbell until the last segment. We want to complicate this logic with a special case based on tx_flags. A process that cares should either not use GSO, or the timestamp should be associated with the last segment as I've been arguing so far.
Can you elaborate on this suspected memory leak?
A user program cannot free a zerocopy buffer until it is reported as free. If zerocopy events are not reported, that could be a memory leak.
I may have a fix. I have added a -P option when I am running an audit. It doesn’t appear to affect performance, and since implementing it I have received all error messages expected for both timestamp and zerocopy.
I am still testing.
I see, a userspace leak from lack of completion notification.
If the issue is a few missing notifications at the end of the run, then perhaps cfg_waittime_ms is too short.
I’ll get back to you when I have tested this more thoroughly. Early results suggest that adding the -P poll() option has fixed it without any appreciable performance hit. I’ll share raw results with you, and we can make a final decision together.
In the main loop? It still is peculiar that notifications appear to go missing unless the process blocks waiting for them. Nothing in sock_zerocopy_callback or the queueing onto the error queue should cause drops, as far as I know.
Should the test have failed at this point? I did return an error(), but the script kept running.
This should normally be cause for test failure, I think yes. Though it's fine to send the code for review and possibly even merge, so that I can take a look.
Sounds like udpgso_bench.sh needs a 'set -e' to ensure it stops on the first error.
Indeed. Ideally even run all tests, but return error if any failed, like this recent patch
selftests/bpf: fail test_tunnel.sh if subtests fail https://patchwork.ozlabs.org/patch/1105221/
but that may be a lot of code churn and better left to a separate patch.
On May 28, 2019, at 8:08 AM, Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
I will push up the latest patches soon.
I did some testing and discovered that only the TCP audit tests failed. They failed much less often with poll enabled; roughly one run in 20 still failed. Therefore I commented out the TCP audit tests.
As for the other tests, this is what I got with poll() disabled…
udp gso zerocopy timestamp audit
udp rx: 1611 MB/s 1148129 calls/s
udp tx: 1659 MB/s 28146 calls/s 28146 msg/s
udp rx: 1686 MB/s 1201494 calls/s
udp tx: 1685 MB/s 28579 calls/s 28579 msg/s
udp rx: 1685 MB/s 1200402 calls/s
udp tx: 1683 MB/s 28552 calls/s 28552 msg/s
Summary over 3.000 seconds...
sum udp tx: 1716 MB/s 85277 calls (28425/s) 85277 msgs (28425/s)
Tx Timestamps: 85277 received 0 errors
Zerocopy acks: 85277 received 0 errors
Here you see that with poll() enabled, it is a bit slower, so I don’t have it enabled in udpgso_bench.sh …
udp gso zerocopy timestamp audit
udp rx: 1591 MB/s 1133945 calls/s
udp tx: 1613 MB/s 27358 calls/s 27358 msg/s
udp rx: 1644 MB/s 1171674 calls/s
udp tx: 1643 MB/s 27869 calls/s 27869 msg/s
udp rx: 1643 MB/s 1170666 calls/s
udp tx: 1641 MB/s 27845 calls/s 27845 msg/s
Summary over 3.000 seconds...
sum udp tx: 1671 MB/s 83072 calls (27690/s) 83072 msgs (27690/s)
Tx Timestamps: 83072 received 0 errors
Zerocopy acks: 83072 received 0 errors
You may be interested that I reduced the test length from 4 to 3 seconds but am still getting 3 reports per test. I picked up the extra report by changing 'if (tnow > treport)' to 'if (tnow >= treport)'.
The only issue specific to GSO is that xmit_more can forego this doorbell until the last segment. We want to complicate this logic with a special case based on tx_flags. A process that cares should either not use GSO, or the timestamp should be associated with the last segment as I've been arguing so far.
This is the area I was thinking of looking into. I’m not sure whether it will work or whether it will be too messy. It may be worth a little bit of digging to see if there is anything there. That will be down the road a bit.
I’ll get back to you when I have tested this more thoroughly. Early results suggest that adding the -P poll() option has fixed it without any appreciable performance hit. I’ll share raw results with you, and we can make a final decision together.
In the main loop? It still is peculiar that notifications appear to go missing unless the process blocks waiting for them. Nothing in sock_zerocopy_callback or the queueing onto the error queue should cause drops, as far as I know.
Now that I know the issue is only in TCP, I can speculate that all bytes are being reported, but done with fewer messages. It may warrant some investigation in case there is some kind of bug.
Indeed. Ideally even run all tests, but return error if any failed, like this recent patch
selftests/bpf: fail test_tunnel.sh if subtests fail https://patchwork.ozlabs.org/patch/1105221/
but that may be a lot of code churn and better left to a separate patch.
I like it. I have it coded up, and it seems to work well. I’ll make a separate commit in the patch set so we can yank it out if you feel it is too much.
On Tue, May 28, 2019 at 12:57 PM Fred Klassen fklassen@appneta.com wrote:
On May 28, 2019, at 8:08 AM, Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
I will push up latest patches soon.
I did some testing and discovered that only the TCP audit tests failed. They failed much less often with poll enabled; roughly one run in 20 still failed. Therefore I commented out the TCP audit tests.
Sounds good, thanks.
You may be interested that I reduced the test length from 4 to 3 seconds but am still getting 3 reports per test. I picked up the extra report by changing 'if (tnow > treport)' to 'if (tnow >= treport)'.
Nice!
The only issue specific to GSO is that xmit_more can forego this doorbell until the last segment. We want to complicate this logic with a special case based on tx_flags. A process that cares should either not use GSO, or the timestamp should be associated with the last segment as I've been arguing so far.
This is the area I was thinking of looking into. I’m not sure whether it will work or whether it will be too messy. It may be worth a little bit of digging to see if there is anything there. That will be down the road a bit.
Sorry, I meant we [do not (!)] want to complicate this logic. And definitely don't want to read skb_shinfo where that cacheline isn't accessed already.
Given xmit_more, do you still find the first segment the right one to move the timestamp tx_flags to in __udp_gso_segment?
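For concreteness, a hypothetical last-segment variant of the __udp_gso_segment() hunk might look like this (sketch only, not the submitted patch):

        struct sk_buff *last;

        /* walk the segment list returned by skb_segment() to its tail */
        for (last = segs; last->next; last = last->next)
                ;
        skb_shinfo(last)->tskey = skb_shinfo(gso_skb)->tskey;
        skb_shinfo(last)->tx_flags |=
                (skb_shinfo(gso_skb)->tx_flags & SKBTX_ANY_TSTAMP);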
I’ll get back to you when I have tested this more thoroughly. Early results suggest that adding the -P poll() option has fixed it without any appreciable performance hit. I’ll share raw results with you, and we can make a final decision together.
In the main loop? It still is peculiar that notifications appear to go missing unless the process blocks waiting for them. Nothing in sock_zerocopy_callback or the queueing onto the error queue should cause drops, as far as I know.
Now that I know the issue is only in TCP, I can speculate that all bytes are being reported, but done with fewer messages. It may warrant some investigation in case there is some kind of bug.
This would definitely still be a bug and should not happen. We have quite a bit of experience with TCP zerocopy and I have not run into this in practice, so I do think that it is somehow a test artifact.
Indeed. Ideally even run all tests, but return error if any failed, like this recent patch
selftests/bpf: fail test_tunnel.sh if subtests fail https://patchwork.ozlabs.org/patch/1105221/
but that may be a lot of code churn and better left to a separate patch.
I like it. I have it coded up, and it seems to work well. I’ll make a separate commit in the patch set so we can yank it out if you feel it is too much.
Great. Yes, it sounds like an independent improvement, in which case it is easier to review as a separate patch. If you already have it, by all means send it as part of the larger patchset.
Now that I know the issue is only in TCP, I can speculate that all bytes are being reported, but done with fewer messages. It may warrant some investigation in case there is some kind of bug.
This would definitely still be a bug and should not happen. We have quite a bit of experience with TCP zerocopy and I have not run into this in practice, so I do think that it is somehow a test artifact.
To be clear, I'm not saying that it is an artifact of your extensions. It's quite likely that the zerocopy benchmark was a bit flaky in that regard all along. No need to spend further time on that for this patchset.