Hello,
On Thu, 2021-07-01 at 20:31 -0400, Willem de Bruijn wrote:
On Thu, Jul 1, 2021 at 11:39 AM David Ahern <dsahern@gmail.com> wrote:
[ adding Paolo, author of 18f25dc39990 ]
On 7/1/21 4:47 AM, Matthias Treydte wrote:
Hello,
we recently upgraded the Linux kernel from 5.11.21 to 5.12.12 in our video stream receiver appliance and noticed compression artifacts on video streams that previously looked fine. We are receiving UDP multicast MPEG TS streams through an FFmpeg / libav layer, which does the socket and lower-level protocol handling. On affected kernels it floods the log with messages like:
[mpegts @ 0x7fa130000900] Packet corrupt (stream = 0, dts = 6870802195).
[mpegts @ 0x7fa11c000900] Packet corrupt (stream = 0, dts = 6870821068).
Bisection identified commit 18f25dc399901426dff61e676ba603ff52c666f7 as the one introducing the problem in the mainline kernel. It was backported to the 5.12 series as 450687386cd16d081b58cd7a342acff370a96078. Some random observations that may help in understanding what's going on:
- the problem exists in Linux 5.13
- reverting that commit on top of 5.13 makes the problem go away
- Linux 5.10.45 is fine
- no relevant output in dmesg
- can be reproduced on different hardware (Intel, AMD, different NICs, ...)
- we do use the bonding driver on the systems (but I have not yet verified that this is related)
- we do not use vxlan (mentioned in the commit message)
- the relevant code in FFmpeg identifying the packet corruption is here (sketched below):
https://github.com/FFmpeg/FFmpeg/blob/master/libavformat/mpegts.c#L2758
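
For context, the check behind those "Packet corrupt" messages is essentially an MPEG-TS continuity-counter test. A minimal sketch of the idea (not FFmpeg's actual code): every 188-byte TS packet carries a 4-bit counter in its fourth byte that must advance by one, per PID, whenever the packet carries a payload, so any jump means packets were lost or mangled in between.

#include <stdint.h>

/* Return non-zero if pkt's continuity counter follows last_cc.
 * pkt points to a full 188-byte TS packet starting at the 0x47
 * sync byte; last_cc is the previous counter seen for this PID. */
static int ts_cc_ok(uint8_t last_cc, const uint8_t *pkt)
{
    uint8_t cc = pkt[3] & 0x0f;      /* continuity_counter */
    int has_payload = pkt[3] & 0x10; /* adaptation_field_control payload bit */

    if (!has_payload)
        return 1;                    /* counter does not advance without payload */
    return cc == ((last_cc + 1) & 0x0f);
}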
And the bonding configuration:
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v5.10.45

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: enp2s0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

Slave Interface: enp2s0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 80:ee:73:XX:XX:XX
Slave queue ID: 0

Slave Interface: enp3s0
MII Status: down
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr: 80:ee:73:XX:XX:XX
Slave queue ID: 0
If there is anything else I can do to help track this down, please let me know.
Thank you for the report!
Based on the above, my wild guess is that the GRO layer is wrongly/unexpectedly aggregating some UDP packets. I do agree with Willem, though: the code identified by the bisection should not do anything harmful here, so I'm likely missing something.
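
If that guess is right, disabling GRO on the receiving interfaces should make the corruption go away, so a quick test would be (using the interface names from your bonding configuration above):

# ethtool -K bond0 gro off
# ethtool -K enp2s0 gro off

If the artifacts disappear with GRO off, that would point quite strongly at the aggregation theory.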
Could you please:
- tell how frequent the packet corruption is? Even a rough estimate of the frequency would help.
- provide the features exposed by the relevant devices: ethtool -k <nic name>
- check, if possible, how exactly the packets are corrupted: wrong size? bad checksum? something else?
- ideally, provide a short pcap trace comprising the problematic packets. That would be great!
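
For the trace, a plain capture on the receiving interface should do; something like the command below (interface name and packet count are just placeholders for your setup):

# tcpdump -i bond0 -c 2000 -s 0 -w mpegts.pcap udp

A few thousand packets around a corruption event would already tell a lot, in particular the sizes and checksums of the datagrams as they reach the host.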
Thanks!
Paolo