On Thu, Jul 1, 2021 at 11:39 AM David Ahern <dsahern@gmail.com> wrote:
[ adding Paolo, author of 18f25dc39990 ]
On 7/1/21 4:47 AM, Matthias Treydte wrote:
Hello,
we recently upgraded the Linux kernel from 5.11.21 to 5.12.12 in our video stream receiver appliance and noticed compression artifacts on video streams that previously looked fine. We are receiving UDP multicast MPEG TS streams through an FFmpeg / libav layer which does the socket and lower-level protocol handling. On affected kernels it floods the log with messages like
[mpegts @ 0x7fa130000900] Packet corrupt (stream = 0, dts = 6870802195).
[mpegts @ 0x7fa11c000900] Packet corrupt (stream = 0, dts = 6870821068).
Bisecting identified commit 18f25dc399901426dff61e676ba603ff52c666f7 as the one introducing the problem in the mainline kernel. It was backported to the 5.12 series in 450687386cd16d081b58cd7a342acff370a96078.
Some random observations which may help to understand what's going on:
- the problem exists in Linux 5.13
- reverting that commit on top of 5.13 makes the problem go away
- Linux 5.10.45 is fine
- no relevant output in dmesg
- can be reproduced on different hardware (Intel, AMD, different NICs, ...)
- we do use the bonding driver on the systems (but I did not yet verify that this is related)
- we do not use vxlan (mentioned in the commit message)
- the relevant code in FFmpeg identifying packet corruption is here:
https://github.com/FFmpeg/FFmpeg/blob/master/libavformat/mpegts.c#L2758
And the bonding configuration:
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v5.10.45

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: enp2s0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

Slave Interface: enp2s0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 80:ee:73:XX:XX:XX
Slave queue ID: 0

Slave Interface: enp3s0
MII Status: down
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr: 80:ee:73:XX:XX:XX
Slave queue ID: 0
If there is anything else I can do to help track this down, please let me know.
That library does not enable UDP_GRO. You do not have any UDP based tunnel devices (besides vxlan) configured, either, right?
Then no socket lookup should take place, so sk is NULL.
It is also unlikely that the device has either of NETIF_F_GRO_FRAGLIST or NETIF_F_GRO_UDP_FWD configured. This can be checked with `ethtool -k $DEV` (lowercase -k shows features, uppercase -K changes them), where they are shown as "rx-gro-list" and "rx-udp-gro-forwarding", respectively.
Then udp_gro_receive_segment is not called.
So this should just return the packet without applying any GRO.
I'm referring to this block of code in udp_gro_receive:
	if (!sk || !udp_sk(sk)->gro_receive) {
		if (skb->dev->features & NETIF_F_GRO_FRAGLIST)
			NAPI_GRO_CB(skb)->is_flist = sk ? !udp_sk(sk)->gro_enabled : 1;

		if ((!sk && (skb->dev->features & NETIF_F_GRO_UDP_FWD)) ||
		    (sk && udp_sk(sk)->gro_enabled) || NAPI_GRO_CB(skb)->is_flist)
			pp = call_gro_receive(udp_gro_receive_segment, head, skb);
		return pp;
	}
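To make that reasoning easier to check, here is a small userspace model of the block. It is only a sketch, not kernel code: struct model_sock, the MODEL_F_* bits and model_calls_segment() are hypothetical names made up for illustration, and it assumes is_flist is 0 before the block runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel feature bits; the real values differ. */
#define MODEL_F_GRO_FRAGLIST (UINT64_C(1) << 0)
#define MODEL_F_GRO_UDP_FWD  (UINT64_C(1) << 1)

/* Hypothetical model of the per-socket state the quoted block looks at. */
struct model_sock {
	bool has_gro_receive; /* udp_sk(sk)->gro_receive != NULL (tunnel socket) */
	bool gro_enabled;     /* udp_sk(sk)->gro_enabled (UDP_GRO set by the app) */
};

/*
 * Returns true when the quoted block would call udp_gro_receive_segment().
 * sk == NULL models "no socket lookup took place".
 */
bool model_calls_segment(const struct model_sock *sk, uint64_t features)
{
	bool is_flist = false; /* assumption: cleared before the block runs */

	/* Only the two modelled feature bits are meaningful here. */
	assert((features & ~(MODEL_F_GRO_FRAGLIST | MODEL_F_GRO_UDP_FWD)) == 0);

	/* A socket with its own gro_receive takes a path outside this block. */
	if (sk && sk->has_gro_receive)
		return false;

	if (features & MODEL_F_GRO_FRAGLIST)
		is_flist = sk ? !sk->gro_enabled : true;

	return (!sk && (features & MODEL_F_GRO_UDP_FWD)) ||
	       (sk && sk->gro_enabled) || is_flist;
}
```

With sk == NULL and neither feature bit set, which is the configuration assumed above, model_calls_segment(NULL, 0) is false, so the packet would pass through without UDP GRO being applied.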
I don't see what could be up.
One possible short-term workaround is to disable GRO. If this commit is implicated, that should fix it, at some obvious potential cost in CPU cycles.