This series fixes a bug in the complete phase of UDP in GRO, in which socket lookup fails due to using network_header when parsing encapsulated packets. The fix is to pass p_off parameter in *_gro_complete.
Next, the fields network_offset and inner_network_offset are added to napi_gro_cb, and are both set during the receive phase of GRO. This is then leveraged in the next commit to remove flush_id state from napi_gro_cb, and stateful code in {ipv6,inet}_gro_receive which may be unnecessarily complicated due to encapsulation support in GRO.
In addition, udpgro_fwd selftest is adjusted to include the socket lookup case for vxlan. This selftest will test its supposed functionality once local bind support is merged (https://lore.kernel.org/netdev/df300a49-7811-4126-a56a-a77100c8841b@gmail.co...).
v3 -> v4: - Fix code comment and commit message typos - v3: https://lore.kernel.org/all/f939c84a-2322-4393-a5b0-9b1e0be8ed8e@gmail.com/
v2 -> v3: - Use napi_gro_cb instead of skb->{offset} - v2: https://lore.kernel.org/netdev/2ce1600b-e733-448b-91ac-9d0ae2b866a4@gmail.co...
v1 -> v2: - Pass p_off in *_gro_complete to fix UDP bug - Remove more conditionals and memory fetches from inet_gro_flush - v1: https://lore.kernel.org/netdev/e1d22505-c5f8-4c02-a997-64248480338b@gmail.co...
Richard Gobert (4): net: gro: add p_off param in *_gro_complete selftests/net: add local address bind in vxlan selftest net: gro: add {inner_}network_offset to napi_gro_cb net: gro: move L3 flush checks to tcp_gro_receive
drivers/net/geneve.c | 7 +- drivers/net/vxlan/vxlan_core.c | 11 ++-- include/linux/etherdevice.h | 2 +- include/linux/netdevice.h | 3 +- include/linux/udp.h | 2 +- include/net/gro.h | 36 +++++++---- include/net/inet_common.h | 2 +- include/net/tcp.h | 6 +- include/net/udp.h | 8 +-- include/net/udp_tunnel.h | 2 +- net/8021q/vlan_core.c | 6 +- net/core/gro.c | 6 +- net/ethernet/eth.c | 5 +- net/ipv4/af_inet.c | 49 ++------------ net/ipv4/fou_core.c | 9 +-- net/ipv4/gre_offload.c | 6 +- net/ipv4/tcp_offload.c | 79 ++++++++++++++++++----- net/ipv4/udp.c | 3 +- net/ipv4/udp_offload.c | 26 ++++---- net/ipv6/ip6_offload.c | 41 +++++------- net/ipv6/tcpv6_offload.c | 7 +- net/ipv6/udp.c | 3 +- net/ipv6/udp_offload.c | 13 ++-- tools/testing/selftests/net/udpgro_fwd.sh | 10 ++- 24 files changed, 187 insertions(+), 155 deletions(-)
Commits a602456 ("udp: Add GRO functions to UDP socket") and 57c67ff ("udp: additional GRO support") introduce incorrect usage of {ip,ipv6}_hdr in the complete phase of gro. The functions always return skb->network_header, which in the case of encapsulated packets at the gro complete phase, is always set to the innermost L3 of the packet. That means that calling {ip,ipv6}_hdr for skbs which completed the GRO receive phase (both in gro_list and *_gro_complete) when parsing an encapsulated packet's _outer_ L3/L4 may return an unexpected value.
This incorrect usage leads to a bug in GRO's UDP socket lookup. udp{4,6}_lib_lookup_skb functions use ip_hdr/ipv6_hdr respectively. These *_hdr functions return network_header which will point to the innermost L3, resulting in the wrong offset being used in __udp{4,6}_lib_lookup with encapsulated packets.
To fix this issue p_off param is used in *_gro_complete to pass off the offset of the previous layer.
Reproduction example:
Endpoint configuration example (fou + local address bind)
# ip fou add port 6666 ipproto 4 # ip link add name tun1 type ipip remote 2.2.2.1 local 2.2.2.2 encap fou encap-dport 5555 encap-sport 6666 mode ipip # ip link set tun1 up # ip a add 1.1.1.2/24 dev tun1
Netperf TCP_STREAM result on net-next before patch is applied:
net-next main, GRO enabled: $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5 Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec
131072 16384 16384 5.28 2.37
net-next main, GRO disabled: $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5 Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec
131072 16384 16384 5.01 2745.06
patch applied, GRO enabled: $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5 Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec
131072 16384 16384 5.01 2877.38
Fixes: 57c67ff4bd92 ("udp: additional GRO support") Suggested-by: Eric Dumazet edumazet@google.com Signed-off-by: Richard Gobert richardbgobert@gmail.com --- drivers/net/geneve.c | 7 ++++--- drivers/net/vxlan/vxlan_core.c | 11 +++++++---- include/linux/etherdevice.h | 2 +- include/linux/netdevice.h | 3 ++- include/linux/udp.h | 2 +- include/net/gro.h | 11 ++++++----- include/net/inet_common.h | 2 +- include/net/tcp.h | 6 ++++-- include/net/udp.h | 8 ++++---- include/net/udp_tunnel.h | 2 +- net/8021q/vlan_core.c | 4 ++-- net/core/gro.c | 2 +- net/ethernet/eth.c | 4 ++-- net/ipv4/af_inet.c | 8 ++++---- net/ipv4/fou_core.c | 9 +++++---- net/ipv4/gre_offload.c | 5 +++-- net/ipv4/tcp_offload.c | 7 ++++--- net/ipv4/udp.c | 3 ++- net/ipv4/udp_offload.c | 26 ++++++++++++++------------ net/ipv6/ip6_offload.c | 22 ++++++++++++---------- net/ipv6/tcpv6_offload.c | 7 ++++--- net/ipv6/udp.c | 3 ++- net/ipv6/udp_offload.c | 13 +++++++------ 23 files changed, 93 insertions(+), 74 deletions(-)
diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c index 2f6739fe78af..9304cfe1c657 100644 --- a/drivers/net/geneve.c +++ b/drivers/net/geneve.c @@ -552,7 +552,7 @@ static struct sk_buff *geneve_gro_receive(struct sock *sk, }
static int geneve_gro_complete(struct sock *sk, struct sk_buff *skb, - int nhoff) + int p_off, int nhoff) { struct genevehdr *gh; struct packet_offload *ptype; @@ -566,11 +566,12 @@ static int geneve_gro_complete(struct sock *sk, struct sk_buff *skb,
/* since skb->encapsulation is set, eth_gro_complete() sets the inner mac header */ if (likely(type == htons(ETH_P_TEB))) - return eth_gro_complete(skb, nhoff + gh_len); + return eth_gro_complete(skb, p_off, nhoff + gh_len);
ptype = gro_find_complete_by_type(type); if (ptype) - err = ptype->callbacks.gro_complete(skb, nhoff + gh_len); + err = ptype->callbacks.gro_complete(skb, p_off, + nhoff + gh_len);
skb_set_inner_mac_header(skb, nhoff + gh_len);
diff --git a/drivers/net/vxlan/vxlan_core.c b/drivers/net/vxlan/vxlan_core.c index 3495591a5c29..ff92fa2065c9 100644 --- a/drivers/net/vxlan/vxlan_core.c +++ b/drivers/net/vxlan/vxlan_core.c @@ -767,15 +767,17 @@ static struct sk_buff *vxlan_gpe_gro_receive(struct sock *sk, return pp; }
-static int vxlan_gro_complete(struct sock *sk, struct sk_buff *skb, int nhoff) +static int vxlan_gro_complete(struct sock *sk, struct sk_buff *skb, + int p_off, int nhoff) { /* Sets 'skb->inner_mac_header' since we are always called with * 'skb->encapsulation' set. */ - return eth_gro_complete(skb, nhoff + sizeof(struct vxlanhdr)); + return eth_gro_complete(skb, p_off, nhoff + sizeof(struct vxlanhdr)); }
-static int vxlan_gpe_gro_complete(struct sock *sk, struct sk_buff *skb, int nhoff) +static int vxlan_gpe_gro_complete(struct sock *sk, struct sk_buff *skb, + int p_off, int nhoff) { struct vxlanhdr *vh = (struct vxlanhdr *)(skb->data + nhoff); const struct packet_offload *ptype; @@ -786,7 +788,8 @@ static int vxlan_gpe_gro_complete(struct sock *sk, struct sk_buff *skb, int nhof return err; ptype = gro_find_complete_by_type(protocol); if (ptype) - err = ptype->callbacks.gro_complete(skb, nhoff + sizeof(struct vxlanhdr)); + err = ptype->callbacks.gro_complete(skb, p_off, nhoff + + sizeof(struct vxlanhdr)); return err; }
diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h index 224645f17c33..b081b43d9686 100644 --- a/include/linux/etherdevice.h +++ b/include/linux/etherdevice.h @@ -64,7 +64,7 @@ struct net_device *devm_alloc_etherdev_mqs(struct device *dev, int sizeof_priv, #define devm_alloc_etherdev(dev, sizeof_priv) devm_alloc_etherdev_mqs(dev, sizeof_priv, 1, 1)
struct sk_buff *eth_gro_receive(struct list_head *head, struct sk_buff *skb); -int eth_gro_complete(struct sk_buff *skb, int nhoff); +int eth_gro_complete(struct sk_buff *skb, int p_off, int nhoff);
/* Reserved Ethernet Addresses per IEEE 802.1Q */ static const u8 eth_reserved_addr_base[ETH_ALEN] __aligned(2) = diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index cb37817d6382..b8773de720c8 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2707,7 +2707,8 @@ struct offload_callbacks { netdev_features_t features); struct sk_buff *(*gro_receive)(struct list_head *head, struct sk_buff *skb); - int (*gro_complete)(struct sk_buff *skb, int nhoff); + int (*gro_complete)(struct sk_buff *skb, int nhoff, + int thoff); };
struct packet_offload { diff --git a/include/linux/udp.h b/include/linux/udp.h index 3748e82b627b..a04d94a3b42f 100644 --- a/include/linux/udp.h +++ b/include/linux/udp.h @@ -82,7 +82,7 @@ struct udp_sock { struct sk_buff *skb); int (*gro_complete)(struct sock *sk, struct sk_buff *skb, - int nhoff); + int nhoff, int thoff);
/* udp_recvmsg try to use this before splicing sk_receive_queue */ struct sk_buff_head reader_queue ____cacheline_aligned_in_smp; diff --git a/include/net/gro.h b/include/net/gro.h index 50f1e403dbbb..ebead1d642b4 100644 --- a/include/net/gro.h +++ b/include/net/gro.h @@ -383,18 +383,18 @@ static inline void skb_gro_flush_final_remcsum(struct sk_buff *skb,
INDIRECT_CALLABLE_DECLARE(struct sk_buff *ipv6_gro_receive(struct list_head *, struct sk_buff *)); -INDIRECT_CALLABLE_DECLARE(int ipv6_gro_complete(struct sk_buff *, int)); +INDIRECT_CALLABLE_DECLARE(int ipv6_gro_complete(struct sk_buff *, int, int)); INDIRECT_CALLABLE_DECLARE(struct sk_buff *inet_gro_receive(struct list_head *, struct sk_buff *)); -INDIRECT_CALLABLE_DECLARE(int inet_gro_complete(struct sk_buff *, int)); +INDIRECT_CALLABLE_DECLARE(int inet_gro_complete(struct sk_buff *, int, int));
INDIRECT_CALLABLE_DECLARE(struct sk_buff *udp4_gro_receive(struct list_head *, struct sk_buff *)); -INDIRECT_CALLABLE_DECLARE(int udp4_gro_complete(struct sk_buff *, int)); +INDIRECT_CALLABLE_DECLARE(int udp4_gro_complete(struct sk_buff *, int, int));
INDIRECT_CALLABLE_DECLARE(struct sk_buff *udp6_gro_receive(struct list_head *, struct sk_buff *)); -INDIRECT_CALLABLE_DECLARE(int udp6_gro_complete(struct sk_buff *, int)); +INDIRECT_CALLABLE_DECLARE(int udp6_gro_complete(struct sk_buff *, int, int));
#define indirect_call_gro_receive_inet(cb, f2, f1, head, skb) \ ({ \ @@ -405,7 +405,8 @@ INDIRECT_CALLABLE_DECLARE(int udp6_gro_complete(struct sk_buff *, int));
struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb, struct udphdr *uh, struct sock *sk); -int udp_gro_complete(struct sk_buff *skb, int nhoff, udp_lookup_t lookup); +int udp_gro_complete(struct sk_buff *skb, int nhoff, int thoff, + udp_lookup_t lookup);
static inline struct udphdr *udp_gro_udphdr(struct sk_buff *skb) { diff --git a/include/net/inet_common.h b/include/net/inet_common.h index f50a644d87a9..605f917c830c 100644 --- a/include/net/inet_common.h +++ b/include/net/inet_common.h @@ -64,7 +64,7 @@ int inet_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len);
struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb); -int inet_gro_complete(struct sk_buff *skb, int nhoff); +int inet_gro_complete(struct sk_buff *skb, int nhoff, int thoff); struct sk_buff *inet_gso_segment(struct sk_buff *skb, netdev_features_t features);
diff --git a/include/net/tcp.h b/include/net/tcp.h index 6ae35199d3b3..f36f15b6fb49 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -2196,9 +2196,11 @@ void tcp_v4_destroy_sock(struct sock *sk); struct sk_buff *tcp_gso_segment(struct sk_buff *skb, netdev_features_t features); struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb); -INDIRECT_CALLABLE_DECLARE(int tcp4_gro_complete(struct sk_buff *skb, int thoff)); +INDIRECT_CALLABLE_DECLARE(int tcp4_gro_complete(struct sk_buff *skb, int nhoff, + int thoff)); INDIRECT_CALLABLE_DECLARE(struct sk_buff *tcp4_gro_receive(struct list_head *head, struct sk_buff *skb)); -INDIRECT_CALLABLE_DECLARE(int tcp6_gro_complete(struct sk_buff *skb, int thoff)); +INDIRECT_CALLABLE_DECLARE(int tcp6_gro_complete(struct sk_buff *skb, int nhoff, + int thoff)); INDIRECT_CALLABLE_DECLARE(struct sk_buff *tcp6_gro_receive(struct list_head *head, struct sk_buff *skb)); #ifdef CONFIG_INET void tcp_gro_complete(struct sk_buff *skb); diff --git a/include/net/udp.h b/include/net/udp.h index 488a6d2babcc..601d1c3b677a 100644 --- a/include/net/udp.h +++ b/include/net/udp.h @@ -166,8 +166,8 @@ static inline void udp_csum_pull_header(struct sk_buff *skb) UDP_SKB_CB(skb)->cscov -= sizeof(struct udphdr); }
-typedef struct sock *(*udp_lookup_t)(const struct sk_buff *skb, __be16 sport, - __be16 dport); +typedef struct sock *(*udp_lookup_t)(const struct sk_buff *skb, int nhoff, + __be16 sport, __be16 dport);
void udp_v6_early_demux(struct sk_buff *skb); INDIRECT_CALLABLE_DECLARE(int udpv6_rcv(struct sk_buff *)); @@ -301,7 +301,7 @@ struct sock *udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport, struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport, __be32 daddr, __be16 dport, int dif, int sdif, struct udp_table *tbl, struct sk_buff *skb); -struct sock *udp4_lib_lookup_skb(const struct sk_buff *skb, +struct sock *udp4_lib_lookup_skb(const struct sk_buff *skb, int nhoff, __be16 sport, __be16 dport); struct sock *udp6_lib_lookup(struct net *net, const struct in6_addr *saddr, __be16 sport, @@ -312,7 +312,7 @@ struct sock *__udp6_lib_lookup(struct net *net, const struct in6_addr *daddr, __be16 dport, int dif, int sdif, struct udp_table *tbl, struct sk_buff *skb); -struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb, +struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb, int nhoff, __be16 sport, __be16 dport); int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor);
diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h index d716214fe03d..a641392e70b0 100644 --- a/include/net/udp_tunnel.h +++ b/include/net/udp_tunnel.h @@ -75,7 +75,7 @@ typedef struct sk_buff *(*udp_tunnel_gro_receive_t)(struct sock *sk, struct list_head *head, struct sk_buff *skb); typedef int (*udp_tunnel_gro_complete_t)(struct sock *sk, struct sk_buff *skb, - int nhoff); + int nhoff, int thoff);
struct udp_tunnel_sock_cfg { void *sk_user_data; /* user data used by encap_rcv call back */ diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c index f00158234505..247704cf70af 100644 --- a/net/8021q/vlan_core.c +++ b/net/8021q/vlan_core.c @@ -510,7 +510,7 @@ static struct sk_buff *vlan_gro_receive(struct list_head *head, return pp; }
-static int vlan_gro_complete(struct sk_buff *skb, int nhoff) +static int vlan_gro_complete(struct sk_buff *skb, int p_off, int nhoff) { struct vlan_hdr *vhdr = (struct vlan_hdr *)(skb->data + nhoff); __be16 type = vhdr->h_vlan_encapsulated_proto; @@ -521,7 +521,7 @@ static int vlan_gro_complete(struct sk_buff *skb, int nhoff) if (ptype) err = INDIRECT_CALL_INET(ptype->callbacks.gro_complete, ipv6_gro_complete, inet_gro_complete, - skb, nhoff + sizeof(*vhdr)); + skb, p_off, nhoff + sizeof(*vhdr));
return err; } diff --git a/net/core/gro.c b/net/core/gro.c index ee30d4f0c038..167935396c50 100644 --- a/net/core/gro.c +++ b/net/core/gro.c @@ -251,7 +251,7 @@ static void napi_gro_complete(struct napi_struct *napi, struct sk_buff *skb)
err = INDIRECT_CALL_INET(ptype->callbacks.gro_complete, ipv6_gro_complete, inet_gro_complete, - skb, 0); + skb, 0, 0); break; } rcu_read_unlock(); diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c index 2edc8b796a4e..7515e6bcbb7d 100644 --- a/net/ethernet/eth.c +++ b/net/ethernet/eth.c @@ -453,7 +453,7 @@ struct sk_buff *eth_gro_receive(struct list_head *head, struct sk_buff *skb) } EXPORT_SYMBOL(eth_gro_receive);
-int eth_gro_complete(struct sk_buff *skb, int nhoff) +int eth_gro_complete(struct sk_buff *skb, int p_off, int nhoff) { struct ethhdr *eh = (struct ethhdr *)(skb->data + nhoff); __be16 type = eh->h_proto; @@ -467,7 +467,7 @@ int eth_gro_complete(struct sk_buff *skb, int nhoff) if (ptype != NULL) err = INDIRECT_CALL_INET(ptype->callbacks.gro_complete, ipv6_gro_complete, inet_gro_complete, - skb, nhoff + sizeof(*eh)); + skb, p_off, nhoff + sizeof(*eh));
return err; } diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 55bd72997b31..5a747d91cd0a 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1641,7 +1641,7 @@ int inet_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len) } EXPORT_SYMBOL(inet_recv_error);
-int inet_gro_complete(struct sk_buff *skb, int nhoff) +int inet_gro_complete(struct sk_buff *skb, int prior_off, int nhoff) { struct iphdr *iph = (struct iphdr *)(skb->data + nhoff); const struct net_offload *ops; @@ -1667,17 +1667,17 @@ int inet_gro_complete(struct sk_buff *skb, int nhoff) */ err = INDIRECT_CALL_2(ops->callbacks.gro_complete, tcp4_gro_complete, udp4_gro_complete, - skb, nhoff + sizeof(*iph)); + skb, nhoff, nhoff + sizeof(*iph));
out: return err; }
-static int ipip_gro_complete(struct sk_buff *skb, int nhoff) +static int ipip_gro_complete(struct sk_buff *skb, int prior_off, int nhoff) { skb->encapsulation = 1; skb_shinfo(skb)->gso_type |= SKB_GSO_IPXIP4; - return inet_gro_complete(skb, nhoff); + return inet_gro_complete(skb, prior_off, nhoff); }
int inet_ctl_sock_create(struct sock **sk, unsigned short family, diff --git a/net/ipv4/fou_core.c b/net/ipv4/fou_core.c index a8494f796dca..7cf214d0b96d 100644 --- a/net/ipv4/fou_core.c +++ b/net/ipv4/fou_core.c @@ -260,7 +260,7 @@ static struct sk_buff *fou_gro_receive(struct sock *sk, }
static int fou_gro_complete(struct sock *sk, struct sk_buff *skb, - int nhoff) + int p_off, int nhoff) { const struct net_offload __rcu **offloads; u8 proto = fou_from_sock(sk)->protocol; @@ -272,7 +272,7 @@ static int fou_gro_complete(struct sock *sk, struct sk_buff *skb, if (WARN_ON(!ops || !ops->callbacks.gro_complete)) goto out;
- err = ops->callbacks.gro_complete(skb, nhoff); + err = ops->callbacks.gro_complete(skb, p_off, nhoff);
skb_set_inner_mac_header(skb, nhoff);
@@ -445,7 +445,8 @@ static struct sk_buff *gue_gro_receive(struct sock *sk, return pp; }
-static int gue_gro_complete(struct sock *sk, struct sk_buff *skb, int nhoff) +static int gue_gro_complete(struct sock *sk, struct sk_buff *skb, + int p_off, int nhoff) { struct guehdr *guehdr = (struct guehdr *)(skb->data + nhoff); const struct net_offload __rcu **offloads; @@ -480,7 +481,7 @@ static int gue_gro_complete(struct sock *sk, struct sk_buff *skb, int nhoff) if (WARN_ON(!ops || !ops->callbacks.gro_complete)) goto out;
- err = ops->callbacks.gro_complete(skb, nhoff + guehlen); + err = ops->callbacks.gro_complete(skb, p_off, nhoff + guehlen);
skb_set_inner_mac_header(skb, nhoff + guehlen);
diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c index 5028c72d494a..d4520c3f7c09 100644 --- a/net/ipv4/gre_offload.c +++ b/net/ipv4/gre_offload.c @@ -233,7 +233,7 @@ static struct sk_buff *gre_gro_receive(struct list_head *head, return pp; }
-static int gre_gro_complete(struct sk_buff *skb, int nhoff) +static int gre_gro_complete(struct sk_buff *skb, int p_off, int nhoff) { struct gre_base_hdr *greh = (struct gre_base_hdr *)(skb->data + nhoff); struct packet_offload *ptype; @@ -253,7 +253,8 @@ static int gre_gro_complete(struct sk_buff *skb, int nhoff)
ptype = gro_find_complete_by_type(type); if (ptype) - err = ptype->callbacks.gro_complete(skb, nhoff + grehlen); + err = ptype->callbacks.gro_complete(skb, p_off, + nhoff + grehlen);
skb_set_inner_mac_header(skb, nhoff + grehlen);
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c index ebe4722bb020..a63af9a6b0f7 100644 --- a/net/ipv4/tcp_offload.c +++ b/net/ipv4/tcp_offload.c @@ -330,10 +330,11 @@ struct sk_buff *tcp4_gro_receive(struct list_head *head, struct sk_buff *skb) return tcp_gro_receive(head, skb); }
-INDIRECT_CALLABLE_SCOPE int tcp4_gro_complete(struct sk_buff *skb, int thoff) +INDIRECT_CALLABLE_SCOPE int tcp4_gro_complete(struct sk_buff *skb, int nhoff, + int thoff) { - const struct iphdr *iph = ip_hdr(skb); - struct tcphdr *th = tcp_hdr(skb); + const struct iphdr *iph = (const struct iphdr *)(skb->data + nhoff); + struct tcphdr *th = (struct tcphdr *)(skb->data + thoff);
th->check = ~tcp_v4_check(skb->len - thoff, iph->saddr, iph->daddr, 0); diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 661d0e0d273f..26d9397cd373 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -530,9 +530,10 @@ static inline struct sock *__udp4_lib_lookup_skb(struct sk_buff *skb, }
struct sock *udp4_lib_lookup_skb(const struct sk_buff *skb, + int nhoff, __be16 sport, __be16 dport) { - const struct iphdr *iph = ip_hdr(skb); + const struct iphdr *iph = (const struct iphdr *)(skb->data + nhoff); struct net *net = dev_net(skb->dev); int iif, sdif;
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c index b9880743765c..1814dbd9cea3 100644 --- a/net/ipv4/udp_offload.c +++ b/net/ipv4/udp_offload.c @@ -669,18 +669,19 @@ static int udp_gro_complete_segment(struct sk_buff *skb) return 0; }
-int udp_gro_complete(struct sk_buff *skb, int nhoff, +int udp_gro_complete(struct sk_buff *skb, int nhoff, int thoff, udp_lookup_t lookup) { - __be16 newlen = htons(skb->len - nhoff); - struct udphdr *uh = (struct udphdr *)(skb->data + nhoff); + struct udphdr *uh = (struct udphdr *)(skb->data + thoff); + __be16 newlen = htons(skb->len - thoff); struct sock *sk; int err;
uh->len = newlen;
sk = INDIRECT_CALL_INET(lookup, udp6_lib_lookup_skb, - udp4_lib_lookup_skb, skb, uh->source, uh->dest); + udp4_lib_lookup_skb, skb, nhoff, uh->source, + uh->dest); if (sk && udp_sk(sk)->gro_complete) { skb_shinfo(skb)->gso_type = uh->check ? SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL; @@ -694,8 +695,8 @@ int udp_gro_complete(struct sk_buff *skb, int nhoff, * functions to make them set up the inner offsets. */ skb->encapsulation = 1; - err = udp_sk(sk)->gro_complete(sk, skb, - nhoff + sizeof(struct udphdr)); + err = udp_sk(sk)->gro_complete(sk, skb, nhoff, + thoff + sizeof(struct udphdr)); } else { err = udp_gro_complete_segment(skb); } @@ -707,14 +708,15 @@ int udp_gro_complete(struct sk_buff *skb, int nhoff, } EXPORT_SYMBOL(udp_gro_complete);
-INDIRECT_CALLABLE_SCOPE int udp4_gro_complete(struct sk_buff *skb, int nhoff) +INDIRECT_CALLABLE_SCOPE int udp4_gro_complete(struct sk_buff *skb, int nhoff, + int thoff) { - const struct iphdr *iph = ip_hdr(skb); - struct udphdr *uh = (struct udphdr *)(skb->data + nhoff); + const struct iphdr *iph = (const struct iphdr *)(skb->data + nhoff); + struct udphdr *uh = (struct udphdr *)(skb->data + thoff);
/* do fraglist only if there is no outer UDP encap (or we already processed it) */ if (NAPI_GRO_CB(skb)->is_flist && !NAPI_GRO_CB(skb)->encap_mark) { - uh->len = htons(skb->len - nhoff); + uh->len = htons(skb->len - thoff);
skb_shinfo(skb)->gso_type |= (SKB_GSO_FRAGLIST|SKB_GSO_UDP_L4); skb_shinfo(skb)->gso_segs = NAPI_GRO_CB(skb)->count; @@ -731,10 +733,10 @@ INDIRECT_CALLABLE_SCOPE int udp4_gro_complete(struct sk_buff *skb, int nhoff) }
if (uh->check) - uh->check = ~udp_v4_check(skb->len - nhoff, iph->saddr, + uh->check = ~udp_v4_check(skb->len - thoff, iph->saddr, iph->daddr, 0);
- return udp_gro_complete(skb, nhoff, udp4_lib_lookup_skb); + return udp_gro_complete(skb, nhoff, thoff, udp4_lib_lookup_skb); }
int __init udpv4_offload_init(void) diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c index b41e35af69ea..10ddbbc0e46d 100644 --- a/net/ipv6/ip6_offload.c +++ b/net/ipv6/ip6_offload.c @@ -346,12 +346,14 @@ static struct sk_buff *ip4ip6_gro_receive(struct list_head *head, return inet_gro_receive(head, skb); }
-INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff) +INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, + int p_off, int nhoff) { const struct net_offload *ops; struct ipv6hdr *iph; int err = -ENOSYS; u32 payload_len; + int nhlen;
if (skb->encapsulation) { skb_set_inner_protocol(skb, cpu_to_be16(ETH_P_IPV6)); @@ -387,36 +389,36 @@ INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff) iph->payload_len = htons(payload_len); }
- nhoff += sizeof(*iph) + ipv6_exthdrs_len(iph, &ops); - if (WARN_ON(!ops || !ops->callbacks.gro_complete)) + nhlen = sizeof(*iph) + ipv6_exthdrs_len(iph, &ops); + if (WARN_ON_ONCE(!ops || !ops->callbacks.gro_complete)) goto out;
err = INDIRECT_CALL_L4(ops->callbacks.gro_complete, tcp6_gro_complete, - udp6_gro_complete, skb, nhoff); + udp6_gro_complete, skb, nhoff, nhoff + nhlen);
out: return err; }
-static int sit_gro_complete(struct sk_buff *skb, int nhoff) +static int sit_gro_complete(struct sk_buff *skb, int p_off, int nhoff) { skb->encapsulation = 1; skb_shinfo(skb)->gso_type |= SKB_GSO_IPXIP4; - return ipv6_gro_complete(skb, nhoff); + return ipv6_gro_complete(skb, p_off, nhoff); }
-static int ip6ip6_gro_complete(struct sk_buff *skb, int nhoff) +static int ip6ip6_gro_complete(struct sk_buff *skb, int p_off, int nhoff) { skb->encapsulation = 1; skb_shinfo(skb)->gso_type |= SKB_GSO_IPXIP6; - return ipv6_gro_complete(skb, nhoff); + return ipv6_gro_complete(skb, p_off, nhoff); }
-static int ip4ip6_gro_complete(struct sk_buff *skb, int nhoff) +static int ip4ip6_gro_complete(struct sk_buff *skb, int p_off, int nhoff) { skb->encapsulation = 1; skb_shinfo(skb)->gso_type |= SKB_GSO_IPXIP6; - return inet_gro_complete(skb, nhoff); + return inet_gro_complete(skb, p_off, nhoff); }
diff --git a/net/ipv6/tcpv6_offload.c b/net/ipv6/tcpv6_offload.c index 4b07d1e6c952..91a55372dcd4 100644 --- a/net/ipv6/tcpv6_offload.c +++ b/net/ipv6/tcpv6_offload.c @@ -27,10 +27,11 @@ struct sk_buff *tcp6_gro_receive(struct list_head *head, struct sk_buff *skb) return tcp_gro_receive(head, skb); }
-INDIRECT_CALLABLE_SCOPE int tcp6_gro_complete(struct sk_buff *skb, int thoff) +INDIRECT_CALLABLE_SCOPE int tcp6_gro_complete(struct sk_buff *skb, + int nhoff, int thoff) { - const struct ipv6hdr *iph = ipv6_hdr(skb); - struct tcphdr *th = tcp_hdr(skb); + const struct ipv6hdr *iph = (const struct ipv6hdr *)(skb->data + nhoff); + struct tcphdr *th = (struct tcphdr *)(skb->data + thoff);
th->check = ~tcp_v6_check(skb->len - thoff, &iph->saddr, &iph->daddr, 0); diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index 7c1e6469d091..2950e6d9a49c 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -270,9 +270,10 @@ static struct sock *__udp6_lib_lookup_skb(struct sk_buff *skb, }
struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb, + int nhoff, __be16 sport, __be16 dport) { - const struct ipv6hdr *iph = ipv6_hdr(skb); + const struct ipv6hdr *iph = (const struct ipv6hdr *)(skb->data + nhoff); struct net *net = dev_net(skb->dev); int iif, sdif;
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c index 312bcaeea96f..5ad04ef50426 100644 --- a/net/ipv6/udp_offload.c +++ b/net/ipv6/udp_offload.c @@ -162,14 +162,15 @@ struct sk_buff *udp6_gro_receive(struct list_head *head, struct sk_buff *skb) return NULL; }
-INDIRECT_CALLABLE_SCOPE int udp6_gro_complete(struct sk_buff *skb, int nhoff) +INDIRECT_CALLABLE_SCOPE int udp6_gro_complete(struct sk_buff *skb, int nhoff, + int thoff) { - const struct ipv6hdr *ipv6h = ipv6_hdr(skb); - struct udphdr *uh = (struct udphdr *)(skb->data + nhoff); + const struct ipv6hdr *ipv6h = (const struct ipv6hdr *)(skb->data + nhoff); + struct udphdr *uh = (struct udphdr *)(skb->data + thoff);
/* do fraglist only if there is no outer UDP encap (or we already processed it) */ if (NAPI_GRO_CB(skb)->is_flist && !NAPI_GRO_CB(skb)->encap_mark) { - uh->len = htons(skb->len - nhoff); + uh->len = htons(skb->len - thoff);
skb_shinfo(skb)->gso_type |= (SKB_GSO_FRAGLIST|SKB_GSO_UDP_L4); skb_shinfo(skb)->gso_segs = NAPI_GRO_CB(skb)->count; @@ -186,10 +187,10 @@ INDIRECT_CALLABLE_SCOPE int udp6_gro_complete(struct sk_buff *skb, int nhoff) }
if (uh->check) - uh->check = ~udp_v6_check(skb->len - nhoff, &ipv6h->saddr, + uh->check = ~udp_v6_check(skb->len - thoff, &ipv6h->saddr, &ipv6h->daddr, 0);
- return udp_gro_complete(skb, nhoff, udp6_lib_lookup_skb); + return udp_gro_complete(skb, nhoff, thoff, udp6_lib_lookup_skb); }
int __init udpv6_offload_init(void)
Add local address bind support to existing udpgro_fwd.sh vxlan selftest, to ensure UDP socket lookup in GRO is working.
Signed-off-by: Richard Gobert richardbgobert@gmail.com --- tools/testing/selftests/net/udpgro_fwd.sh | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/net/udpgro_fwd.sh b/tools/testing/selftests/net/udpgro_fwd.sh index 380cb15e942e..0541c5da571d 100755 --- a/tools/testing/selftests/net/udpgro_fwd.sh +++ b/tools/testing/selftests/net/udpgro_fwd.sh @@ -62,11 +62,13 @@ create_vxlan_endpoint() { local -r bm_rem_addr=$3 local -r vxlan_dev=$4 local -r vxlan_id=$5 + local -r bm_local_addr=$6 local -r vxlan_port=4789
ip -n $netns link set dev $bm_dev up ip -n $netns link add dev $vxlan_dev type vxlan id $vxlan_id \ - dstport $vxlan_port remote $bm_rem_addr + dstport $vxlan_port local $bm_local_addr \ + remote $bm_rem_addr ip -n $netns link set dev $vxlan_dev up }
@@ -77,11 +79,13 @@ create_vxlan_pair() {
for ns in $SRC $DST; do # note that 3 - $SRC == $DST and 3 - $DST == $SRC - create_vxlan_endpoint $BASE$ns veth$ns $BM_NET_V4$((3 - $ns)) vxlan$ns 4 + create_vxlan_endpoint $BASE$ns veth$ns $BM_NET_V4$((3 - $ns)) \ + vxlan$ns 4 $BM_NET_V4$ns ip -n $BASE$ns addr add dev vxlan$ns $OL_NET_V4$ns/24 done for ns in $SRC $DST; do - create_vxlan_endpoint $BASE$ns veth$ns $BM_NET_V6$((3 - $ns)) vxlan6$ns 6 + create_vxlan_endpoint $BASE$ns veth$ns $BM_NET_V6$((3 - $ns)) \ + vxlan6$ns 6 $BM_NET_V6$ns ip -n $BASE$ns addr add dev vxlan6$ns $OL_NET_V6$ns/24 nodad done
This patch adds network_offset and inner_network_offset to napi_gro_cb, and makes sure both are set correctly. In the common path there's only one write (skb_gro_reset_offset).
Signed-off-by: Richard Gobert richardbgobert@gmail.com --- include/net/gro.h | 18 ++++++++++++++++-- net/8021q/vlan_core.c | 2 ++ net/core/gro.c | 1 + net/ethernet/eth.c | 1 + net/ipv4/af_inet.c | 5 +---- net/ipv4/gre_offload.c | 1 + net/ipv6/ip6_offload.c | 8 ++++---- 7 files changed, 26 insertions(+), 10 deletions(-)
diff --git a/include/net/gro.h b/include/net/gro.h index ebead1d642b4..a1cc8e8c2ebd 100644 --- a/include/net/gro.h +++ b/include/net/gro.h @@ -87,6 +87,15 @@ struct napi_gro_cb {
/* used to support CHECKSUM_COMPLETE for tunneling protocols */ __wsum csum; + + /* L3 offsets */ + union { + struct { + u16 network_offset; + u16 inner_network_offset; + }; + u16 network_offsets[2]; + }; };
#define NAPI_GRO_CB(skb) ((struct napi_gro_cb *)(skb)->cb) @@ -172,12 +181,17 @@ static inline void *skb_gro_header(struct sk_buff *skb, unsigned int hlen, return ptr; }
+static inline int skb_gro_network_offset(const struct sk_buff *skb) +{ + return NAPI_GRO_CB(skb)->network_offsets[NAPI_GRO_CB(skb)->encap_mark]; +} + static inline void *skb_gro_network_header(const struct sk_buff *skb) { if (skb_gro_may_pull(skb, skb_gro_offset(skb))) - return skb_gro_header_fast(skb, skb_network_offset(skb)); + return skb_gro_header_fast(skb, skb_gro_network_offset(skb));
- return skb_network_header(skb); + return skb->data + skb_gro_network_offset(skb); }
static inline __wsum inet_gro_compute_pseudo(const struct sk_buff *skb, diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c index 247704cf70af..355cafe23329 100644 --- a/net/8021q/vlan_core.c +++ b/net/8021q/vlan_core.c @@ -478,6 +478,8 @@ static struct sk_buff *vlan_gro_receive(struct list_head *head, if (unlikely(!vhdr)) goto out;
+ NAPI_GRO_CB(skb)->network_offsets[NAPI_GRO_CB(skb)->encap_mark] = hlen; + type = vhdr->h_vlan_encapsulated_proto;
ptype = gro_find_receive_by_type(type); diff --git a/net/core/gro.c b/net/core/gro.c index 167935396c50..7c468ed805f4 100644 --- a/net/core/gro.c +++ b/net/core/gro.c @@ -370,6 +370,7 @@ static inline void skb_gro_reset_offset(struct sk_buff *skb, u32 nhoff) const skb_frag_t *frag0; unsigned int headlen;
+ NAPI_GRO_CB(skb)->network_offset = 0; NAPI_GRO_CB(skb)->data_offset = 0; headlen = skb_headlen(skb); NAPI_GRO_CB(skb)->frag0 = skb->data; diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c index 7515e6bcbb7d..e3eca605bcc7 100644 --- a/net/ethernet/eth.c +++ b/net/ethernet/eth.c @@ -441,6 +441,7 @@ struct sk_buff *eth_gro_receive(struct list_head *head, struct sk_buff *skb)
skb_gro_pull(skb, sizeof(*eh)); skb_gro_postpull_rcsum(skb, eh, sizeof(*eh)); + NAPI_GRO_CB(skb)->inner_network_offset = hlen;
pp = indirect_call_gro_receive_inet(ptype->callbacks.gro_receive, ipv6_gro_receive, inet_gro_receive, diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 5a747d91cd0a..6546bf376b24 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1568,10 +1568,6 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
NAPI_GRO_CB(skb)->is_atomic = !!(iph->frag_off & htons(IP_DF)); NAPI_GRO_CB(skb)->flush |= flush; - skb_set_network_header(skb, off); - /* The above will be needed by the transport layer if there is one - * immediately following this IP hdr. - */
/* Note : No need to call skb_gro_postpull_rcsum() here, * as we already checked checksum over ipv4 header was 0 @@ -1597,6 +1593,7 @@ static struct sk_buff *ipip_gro_receive(struct list_head *head, }
NAPI_GRO_CB(skb)->encap_mark = 1; + NAPI_GRO_CB(skb)->inner_network_offset = skb_gro_offset(skb);
return inet_gro_receive(head, skb); } diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c index d4520c3f7c09..ae596285d78c 100644 --- a/net/ipv4/gre_offload.c +++ b/net/ipv4/gre_offload.c @@ -224,6 +224,7 @@ static struct sk_buff *gre_gro_receive(struct list_head *head, /* Adjusted NAPI_GRO_CB(skb)->csum after skb_gro_pull()*/ skb_gro_postpull_rcsum(skb, greh, grehlen);
+ NAPI_GRO_CB(skb)->inner_network_offset = hlen; pp = call_gro_receive(ptype->callbacks.gro_receive, head, skb); flush = 0;
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c index 10ddbbc0e46d..ba41939537f2 100644 --- a/net/ipv6/ip6_offload.c +++ b/net/ipv6/ip6_offload.c @@ -67,7 +67,7 @@ static int ipv6_gro_pull_exthdrs(struct sk_buff *skb, int off, int proto) off += len; }
- skb_gro_pull(skb, off - skb_network_offset(skb)); + skb_gro_pull(skb, off - skb_gro_network_offset(skb)); return proto; }
@@ -236,8 +236,6 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head, if (unlikely(!iph)) goto out;
- skb_set_network_header(skb, off); - flush += ntohs(iph->payload_len) != skb->len - hlen;
proto = iph->nexthdr; @@ -259,7 +257,7 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head, NAPI_GRO_CB(skb)->proto = proto;
flush--; - nlen = skb_network_header_len(skb); + nlen = skb_gro_offset(skb) - off;
list_for_each_entry(p, head, list) { const struct ipv6hdr *iph2; @@ -327,6 +325,7 @@ static struct sk_buff *sit_ip6ip6_gro_receive(struct list_head *head, }
NAPI_GRO_CB(skb)->encap_mark = 1; + NAPI_GRO_CB(skb)->inner_network_offset = skb_gro_offset(skb);
return ipv6_gro_receive(head, skb); } @@ -342,6 +341,7 @@ static struct sk_buff *ip4ip6_gro_receive(struct list_head *head, }
NAPI_GRO_CB(skb)->encap_mark = 1; + NAPI_GRO_CB(skb)->inner_network_offset = skb_gro_offset(skb);
return inet_gro_receive(head, skb); }
{inet,ipv6}_gro_receive functions perform flush checks (ttl, flags, iph->id, ...) against all packets in a loop. These flush checks are used currently only in tcp flows in GRO.
These checks need to be done only once in tcp_gro_receive and only against the found p skb, since they only affect flush and not same_flow.
Levaraging the previous commit in the series, in which correct network header offsets are saved for both outer and inner network headers - allowing these checks to be done only once, in tcp_gro_receive. As a result, NAPI_GRO_CB(p)->flush is not used at all. In addition - flush_id checks are more declarative and contained in inet_gro_flush, thus removing the need for flush_id in napi_gro_cb.
This results in less parsing code for UDP flows and non-loop flush tests for TCP flows.
For example, running 40 IP/UDP netperf connections: ./super_netperf.sh 40 -H 1.1.1.2 -t UDP_STREAM -l 120
Running perf top for 90s we can see that relatively less time is spent on inet_gro_receive when GRO is not coalescing UDP:
net-next: 1.26% [kernel] [k] inet_gro_receive
patch applied: 0.85% [kernel] [k] inet_gro_receive
udpgro_bench.sh single connection GRO improvement: net-next: 0.76% [kernel] [k] inet_gro_receive
patch applied: 0.61% [kernel] [k] inet_gro_receive
Signed-off-by: Richard Gobert richardbgobert@gmail.com --- include/net/gro.h | 9 ++---- net/core/gro.c | 3 -- net/ipv4/af_inet.c | 36 --------------------- net/ipv4/tcp_offload.c | 72 ++++++++++++++++++++++++++++++++++-------- net/ipv6/ip6_offload.c | 11 ------- 5 files changed, 62 insertions(+), 69 deletions(-)
diff --git a/include/net/gro.h b/include/net/gro.h index a1cc8e8c2ebd..73e266194528 100644 --- a/include/net/gro.h +++ b/include/net/gro.h @@ -36,15 +36,15 @@ struct napi_gro_cb { /* This is non-zero if the packet cannot be merged with the new skb. */ u16 flush;
- /* Save the IP ID here and check when we get to the transport layer */ - u16 flush_id; - /* Number of segments aggregated. */ u16 count;
/* Used in ipv6_gro_receive() and foo-over-udp and esp-in-udp */ u16 proto;
+ /* used to support CHECKSUM_COMPLETE for tunneling protocols */ + __wsum csum; + /* Used in napi_gro_cb::free */ #define NAPI_GRO_FREE 1 #define NAPI_GRO_FREE_STOLEN_HEAD 2 @@ -85,9 +85,6 @@ struct napi_gro_cb { u8 is_flist:1; );
- /* used to support CHECKSUM_COMPLETE for tunneling protocols */ - __wsum csum; - /* L3 offsets */ union { struct { diff --git a/net/core/gro.c b/net/core/gro.c index 7c468ed805f4..f8cbc08197f7 100644 --- a/net/core/gro.c +++ b/net/core/gro.c @@ -329,8 +329,6 @@ static void gro_list_prepare(const struct list_head *head, list_for_each_entry(p, head, list) { unsigned long diffs;
- NAPI_GRO_CB(p)->flush = 0; - if (hash != skb_get_hash_raw(p)) { NAPI_GRO_CB(p)->same_flow = 0; continue; @@ -470,7 +468,6 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff sizeof(u32))); /* Avoid slow unaligned acc */ *(u32 *)&NAPI_GRO_CB(skb)->zeroed = 0; NAPI_GRO_CB(skb)->flush = skb_has_frag_list(skb); - NAPI_GRO_CB(skb)->is_atomic = 1; NAPI_GRO_CB(skb)->count = 1; if (unlikely(skb_is_gso(skb))) { NAPI_GRO_CB(skb)->count = skb_shinfo(skb)->gso_segs; diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 6546bf376b24..9d7fd79ee915 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1513,7 +1513,6 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
list_for_each_entry(p, head, list) { struct iphdr *iph2; - u16 flush_id;
if (!NAPI_GRO_CB(p)->same_flow) continue; @@ -1530,43 +1529,8 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb) NAPI_GRO_CB(p)->same_flow = 0; continue; } - - /* All fields must match except length and checksum. */ - NAPI_GRO_CB(p)->flush |= - (iph->ttl ^ iph2->ttl) | - (iph->tos ^ iph2->tos) | - ((iph->frag_off ^ iph2->frag_off) & htons(IP_DF)); - - NAPI_GRO_CB(p)->flush |= flush; - - /* We need to store of the IP ID check to be included later - * when we can verify that this packet does in fact belong - * to a given flow. - */ - flush_id = (u16)(id - ntohs(iph2->id)); - - /* This bit of code makes it much easier for us to identify - * the cases where we are doing atomic vs non-atomic IP ID - * checks. Specifically an atomic check can return IP ID - * values 0 - 0xFFFF, while a non-atomic check can only - * return 0 or 0xFFFF. - */ - if (!NAPI_GRO_CB(p)->is_atomic || - !(iph->frag_off & htons(IP_DF))) { - flush_id ^= NAPI_GRO_CB(p)->count; - flush_id = flush_id ? 0xFFFF : 0; - } - - /* If the previous IP ID value was based on an atomic - * datagram we can overwrite the value and ignore it. - */ - if (NAPI_GRO_CB(skb)->is_atomic) - NAPI_GRO_CB(p)->flush_id = flush_id; - else - NAPI_GRO_CB(p)->flush_id |= flush_id; }
- NAPI_GRO_CB(skb)->is_atomic = !!(iph->frag_off & htons(IP_DF)); NAPI_GRO_CB(skb)->flush |= flush;
/* Note : No need to call skb_gro_postpull_rcsum() here, diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c index a63af9a6b0f7..ed9947bb903d 100644 --- a/net/ipv4/tcp_offload.c +++ b/net/ipv4/tcp_offload.c @@ -178,6 +178,55 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb, return segs; }
+static int inet_gro_flush(const struct iphdr *iph, const struct iphdr *iph2, + struct sk_buff *p, u32 outer) +{ + const u32 id = ntohl(*(__be32 *)&iph->id); + const u32 id2 = ntohl(*(__be32 *)&iph2->id); + const int flush_id = ntohs(id >> 16) - ntohs(id2 >> 16); + const u16 count = NAPI_GRO_CB(p)->count; + const u32 df = id & IP_DF; + u32 is_atomic; + int flush; + + /* All fields must match except length and checksum. */ + flush = (iph->ttl ^ iph2->ttl) | (iph->tos ^ iph2->tos) | (df ^ (id2 & IP_DF)); + + /* When we receive our second frame we can make a decision on if we + * continue this flow as an atomic flow with a fixed ID or if we use + * an incrementing ID. + */ + if (count == 1) { + is_atomic = df && flush_id == 0; + NAPI_GRO_CB(p)->is_atomic = is_atomic; + } else { + is_atomic = df && NAPI_GRO_CB(p)->is_atomic; + } + + /* Ignore outer IP ID value if based on atomic datagram. */ + outer = (outer && df) - 1; + is_atomic--; + + return flush | ((flush_id ^ (count & is_atomic)) & outer); +} + +static int ipv6_gro_flush(const struct ipv6hdr *iph, const struct ipv6hdr *iph2) +{ + /* Version:4<Traffic_Class:8><Flow_Label:20> */ + __be32 first_word = *(__be32 *)iph ^ *(__be32 *)iph2; + + /* Flush if Traffic Class fields are different. */ + return (first_word & htonl(0x0FF00000)) | + (__force __be32)(iph->hop_limit ^ iph2->hop_limit); +} + +static int gro_network_flush(const void *nh, const void *nh2, + struct sk_buff *p, u32 outer) +{ + return (((struct iphdr *)nh)->version == 6) ? ipv6_gro_flush(nh, nh2) : + inet_gro_flush(nh, nh2, p, outer); +} + struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb) { struct sk_buff *pp = NULL; @@ -190,6 +239,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb) unsigned int mss = 1; unsigned int hlen; unsigned int off; + bool encap_mark; int flush = 1; int i;
@@ -232,9 +282,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb) goto out_check_final;
found: - /* Include the IP ID check below from the inner most IP hdr */ - flush = NAPI_GRO_CB(p)->flush; - flush |= (__force int)(flags & TCP_FLAG_CWR); + flush = (__force int)(flags & TCP_FLAG_CWR); flush |= (__force int)((flags ^ tcp_flag_word(th2)) & ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH)); flush |= (__force int)(th->ack_seq ^ th2->ack_seq); @@ -242,16 +290,14 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb) flush |= *(u32 *)((u8 *)th + i) ^ *(u32 *)((u8 *)th2 + i);
- /* When we receive our second frame we can made a decision on if we - * continue this flow as an atomic flow with a fixed ID or if we use - * an incrementing ID. - */ - if (NAPI_GRO_CB(p)->flush_id != 1 || - NAPI_GRO_CB(p)->count != 1 || - !NAPI_GRO_CB(p)->is_atomic) - flush |= NAPI_GRO_CB(p)->flush_id; - else - NAPI_GRO_CB(p)->is_atomic = false; + encap_mark = NAPI_GRO_CB(skb)->encap_mark; + for (i = 0; i <= encap_mark; i++) { + const u16 diff = off - NAPI_GRO_CB(skb)->network_offsets[i]; + + flush |= gro_network_flush((void *)th - diff, + (void *)th2 - diff, + p, i != encap_mark); + }
mss = skb_shinfo(p)->gso_size;
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c index ba41939537f2..c9a6bc1afc9a 100644 --- a/net/ipv6/ip6_offload.c +++ b/net/ipv6/ip6_offload.c @@ -288,19 +288,8 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head, nlen - sizeof(struct ipv6hdr))) goto not_same_flow; } - /* flush if Traffic Class fields are different */ - NAPI_GRO_CB(p)->flush |= !!((first_word & htonl(0x0FF00000)) | - (__force __be32)(iph->hop_limit ^ iph2->hop_limit)); - NAPI_GRO_CB(p)->flush |= flush; - - /* If the previous IP ID value was based on an atomic - * datagram we can overwrite the value and ignore it. - */ - if (NAPI_GRO_CB(skb)->is_atomic) - NAPI_GRO_CB(p)->flush_id = 0; }
- NAPI_GRO_CB(skb)->is_atomic = true; NAPI_GRO_CB(skb)->flush |= flush;
skb_gro_postpull_rcsum(skb, iph, nlen);
Richard Gobert wrote:
{inet,ipv6}_gro_receive functions perform flush checks (ttl, flags, iph->id, ...) against all packets in a loop. These flush checks are used currently only in tcp flows in GRO.
These checks need to be done only once in tcp_gro_receive and only against the found p skb, since they only affect flush and not same_flow.
Levaraging the previous commit in the series, in which correct network header offsets are saved for both outer and inner network headers - allowing these checks to be done only once, in tcp_gro_receive. As a result, NAPI_GRO_CB(p)->flush is not used at all. In addition - flush_id checks are more declarative and contained in inet_gro_flush, thus removing the need for flush_id in napi_gro_cb.
This results in less parsing code for UDP flows and non-loop flush tests for TCP flows.
For example, running 40 IP/UDP netperf connections: ./super_netperf.sh 40 -H 1.1.1.2 -t UDP_STREAM -l 120
Running perf top for 90s we can see that relatively less time is spent on inet_gro_receive when GRO is not coalescing UDP:
net-next: 1.26% [kernel] [k] inet_gro_receive
patch applied: 0.85% [kernel] [k] inet_gro_receive
udpgro_bench.sh single connection GRO improvement: net-next: 0.76% [kernel] [k] inet_gro_receive
patch applied: 0.61% [kernel] [k] inet_gro_receive
Signed-off-by: Richard Gobert richardbgobert@gmail.com
In v3 we discussed how the flush on network layer differences (like TTL or ToS) currently only affect the TCP GRO path, but should apply more broadly.
We agreed that it is fine to leave that to a separate patch series.
But seeing this patch, it introduces a lot of churn, but also makes it harder to address that issue for UDP, as it now moves network layer checks directly to the TCP code.
Willem de Bruijn wrote:
In v3 we discussed how the flush on network layer differences (like TTL or ToS) currently only affect the TCP GRO path, but should apply more broadly.
We agreed that it is fine to leave that to a separate patch series.
But seeing this patch, it introduces a lot of churn, but also makes it harder to address that issue for UDP, as it now moves network layer checks directly to the TCP code.
Currently the logic of flush_id is scattered in tcp_gro_receive and {inet,ipv6}_gro_receive with conditionals rewriting ->flush and ->flush_id, so IMO the code should be more concise when it's in one place - in addition to not doing checks against non relevant packets.
With this patch, the fix will probably be simple, most likely just calling gro_network_flush from skb_gro_receive or from the relevant flow in udp_gro_receive_segment. Since this bug fix should be simple and it being not relevant to the optimization, I'd like to solve it in another series and properly test that new flow. Do you agree?
Richard Gobert wrote:
Willem de Bruijn wrote:
In v3 we discussed how the flush on network layer differences (like TTL or ToS) currently only affect the TCP GRO path, but should apply more broadly.
We agreed that it is fine to leave that to a separate patch series.
But seeing this patch, it introduces a lot of churn, but also makes it harder to address that issue for UDP, as it now moves network layer checks directly to the TCP code.
Currently the logic of flush_id is scattered in tcp_gro_receive and {inet,ipv6}_gro_receive with conditionals rewriting ->flush and ->flush_id, so IMO the code should be more concise when it's in one place - in addition to not doing checks against non relevant packets.
With this patch, the fix will probably be simple, most likely just calling gro_network_flush from skb_gro_receive or from the relevant flow in udp_gro_receive_segment. Since this bug fix should be simple and it being not relevant to the optimization, I'd like to solve it in another series and properly test that new flow. Do you agree?
My main concern is moving this code to tcp_offload.c, if it likely soon will be moved elsewhere again.
Willem de Bruijn wrote:
My main concern is moving this code to tcp_offload.c, if it likely soon will be moved elsewhere again.
Got it. I'll move these functions to gro.c and post a v5 with Jakub's requests as well.
On Mon, 25 Mar 2024 19:25:43 +0100 Richard Gobert wrote:
- const u32 id = ntohl(*(__be32 *)&iph->id);
- const u32 id2 = ntohl(*(__be32 *)&iph2->id);
- const int flush_id = ntohs(id >> 16) - ntohs(id2 >> 16);
The endian conversions don't match types here. sparse is unhappy. If id is in host endian shouldn't it be htons(id >> 16) ? Also if you cast to a bitwise type you need __force
On Mon, Mar 25, 2024 at 7:27 PM Richard Gobert richardbgobert@gmail.com wrote:
{inet,ipv6}_gro_receive functions perform flush checks (ttl, flags, iph->id, ...) against all packets in a loop. These flush checks are used currently only in tcp flows in GRO.
I think this is a bug.
GRO should not aggregate packets if their ttl/tos fields do not match.
Eric Dumazet wrote:
On Mon, Mar 25, 2024 at 7:27 PM Richard Gobert richardbgobert@gmail.com wrote:
{inet,ipv6}_gro_receive functions perform flush checks (ttl, flags, iph->id, ...) against all packets in a loop. These flush checks are used currently only in tcp flows in GRO.
I think this is a bug.
GRO should not aggregate packets if their ttl/tos fields do not match.
AFAIU, the only UDP flow where ttl/flush_id need to be checked is when udp_gro_receive_segment calls skb_gro_receive - could you confirm / point out if there are any other flows to which these flush checks may be relevant?
As I've discussed with Willem in v3 I prefer to fix this bug in a separate series.
On Tue, Mar 26, 2024 at 3:43 PM Richard Gobert richardbgobert@gmail.com wrote:
Eric Dumazet wrote:
On Mon, Mar 25, 2024 at 7:27 PM Richard Gobert richardbgobert@gmail.com wrote:
{inet,ipv6}_gro_receive functions perform flush checks (ttl, flags, iph->id, ...) against all packets in a loop. These flush checks are used currently only in tcp flows in GRO.
I think this is a bug.
GRO should not aggregate packets if their ttl/tos fields do not match.
AFAIU, the only UDP flow where ttl/flush_id need to be checked is when udp_gro_receive_segment calls skb_gro_receive - could you confirm / point out if there are any other flows to which these flush checks may be relevant?
As I've discussed with Willem in v3 I prefer to fix this bug in a separate series.
I do not understand this patch 4/4 then.
Why bother moving stuff in net/ipv4/tcp_offload.c if we plan to move it back to where it belongs ?
Eric Dumazet wrote:
I do not understand this patch 4/4 then.
Why bother moving stuff in net/ipv4/tcp_offload.c if we plan to move it back to where it belongs ?
Willem also pointed that out, and I agree. I'll post a v5 and move this functionality to gro.c. Currently, gro_network_flush will be called from tcp_gro_receive and in a separate series I'll fix the bug by calling gro_network_flush in skb_gro_receive or adding it to udp_gro_receive_segment - whichever is better.
This patch is meaningful by itself - removing checks against non-relevant packets and making the flush/flush_id checks in a single place.
Richard Gobert wrote:
Eric Dumazet wrote:
I do not understand this patch 4/4 then.
Why bother moving stuff in net/ipv4/tcp_offload.c if we plan to move it back to where it belongs ?
Willem also pointed that out, and I agree. I'll post a v5 and move this functionality to gro.c. Currently, gro_network_flush will be called from tcp_gro_receive and in a separate series I'll fix the bug by calling gro_network_flush in skb_gro_receive or adding it to udp_gro_receive_segment - whichever is better.
This patch is meaningful by itself - removing checks against non-relevant packets and making the flush/flush_id checks in a single place.
One issue: if the do this move in net/next, then a later fix that relies on it cannot be backporter to older stable kernels.
Willem de Bruijn wrote:> One issue: if the do this move in net/next, then a later fix that
relies on it cannot be backporter to older stable kernels.
I understand. I can either add a first commit to this series which fixes the bug by checking ->flush and ->flush_id in udp_gro_receive_segment, or directly fix the bug in the 4th commit of this series in addition to the optimization - which can be backported as a whole.
What do you think?
Hi,
On Tue, 2024-03-26 at 16:02 +0100, Richard Gobert wrote:
This patch is meaningful by itself - removing checks against non-relevant packets and making the flush/flush_id checks in a single place.
I'm personally not sure this patch is a win. The code churn is significant. I understand this is for performance's sake, but I don't see the benefit???
The changelog shows that perf reports slightly lower figures for inet_gro_receive(). That is expected, as this patch move code out of such functio. What about inet_gro_flush()/tcp_gro_receive() where such code is moved?
Additionally the reported deltas is within noise level according to my personal experience with similar tests.
I think we are better off without this patch.
Paolo
Paolo Abeni wrote:
Hi,
On Tue, 2024-03-26 at 16:02 +0100, Richard Gobert wrote:
This patch is meaningful by itself - removing checks against non-relevant packets and making the flush/flush_id checks in a single place.
I'm personally not sure this patch is a win. The code churn is significant. I understand this is for performance's sake, but I don't see the benefit???
Could you clarify what do you mean by code churn?
This patch removes all use of p->flush and flush_id from the CB. The entire logic for L3 flush_id is scattered in tcp_gro_receive and {inet,ipv6}_gro_receive with conditionals rewriting ->flush, ->flush_id and ->is_atomic. Moving it to one place (gro_network_flush) should be more readable. (Personally, it took me a lot of time to understand the current logic of flush + flush_id + is_atomic)
The changelog shows that perf reports slightly lower figures for inet_gro_receive(). That is expected, as this patch move code out of such functio. What about inet_gro_flush()/tcp_gro_receive() where such code is moved?
Please consider the following 2 common scenarios:
1) Multiple packets in the GRO bucket - the common case with multiple packets in the bucket (i.e. running super_netperf TCP_STREAM) - each layer executes a for loop - going over each packet in the bucket. Specifically, L3 gro_receive loops over the bucket making flush,flush_id,is_atomic checks. For most packets in the bucket, these checks are not relevant. (possibly also dirtying cache lines with non-relevant p packets). Removing code in the for loop for this case is significant.
2) UDP/TCP streams which do not coalesce in GRO. This is the common case for regular UDP connections (i.e. running netperf UDP_STREAM). In this case, GRO is just overhead. Removing any code from these layers is good (shown in the first measurement of the commit message).
In the case of a single TCP connection - the amount of checks should be the same overall not causing any noticeable difference.
Additionally the reported deltas is within noise level according to my personal experience with similar tests.
I've tested the difference between net-next and this patch repetitively, which showed stable results each time. Is there any specific test you think would be helpful to show the result?
Thanks
On Tue, 2024-03-26 at 18:25 +0100, Richard Gobert wrote:
Paolo Abeni wrote:
Hi,
On Tue, 2024-03-26 at 16:02 +0100, Richard Gobert wrote:
This patch is meaningful by itself - removing checks against non-relevant packets and making the flush/flush_id checks in a single place.
I'm personally not sure this patch is a win. The code churn is significant. I understand this is for performance's sake, but I don't see the benefit???
Could you clarify what do you mean by code churn?
The diffstat of this patch is not negligible and touches very sensitive areas.
he changelog shows that perf reports slightly lower figures for inet_gro_receive(). That is expected, as this patch move code out of such functio. What about inet_gro_flush()/tcp_gro_receive() where such code is moved?
Please consider the following 2 common scenarios:
- Multiple packets in the GRO bucket - the common case with multiple packets in the bucket (i.e. running super_netperf TCP_STREAM) - each layer executes a for loop - going over each packet in the bucket. Specifically, L3 gro_receive loops over the bucket making flush,flush_id,is_atomic checks.
Only for packets with the same rx hash.
For most packets in the bucket, these checks are not relevant. (possibly also dirtying cache lines with non-relevant p packets). Removing code in the for loop for this case is significant.
- UDP/TCP streams which do not coalesce in GRO. This is the common case for regular UDP connections (i.e. running netperf UDP_STREAM). In this case, GRO is just overhead. Removing any code from these layers is good (shown in the first measurement of the commit message).
If UDP GRO is not enabled, there are no UDP packet staging in the UDP gro engine, the bucket list is empty.
Additionally the reported deltas is within noise level according to my personal experience with similar tests.
I've tested the difference between net-next and this patch repetitively, which showed stable results each time. Is there any specific test you think would be helpful to show the result?
Anything that show measurable gain.
Reporting the CPU utilization in the inet_gro_receive() function alone is not enough, as part of the load has been moved into gro_network_flush()/tcp_gro_receive().
Cheers,
Paolo
Paolo Abeni wrote:
On Tue, 2024-03-26 at 18:25 +0100, Richard Gobert wrote:
Paolo Abeni wrote:
Hi,
On Tue, 2024-03-26 at 16:02 +0100, Richard Gobert wrote:
This patch is meaningful by itself - removing checks against non-relevant packets and making the flush/flush_id checks in a single place.
I'm personally not sure this patch is a win. The code churn is significant. I understand this is for performance's sake, but I don't see the benefit???
Could you clarify what do you mean by code churn?
The diffstat of this patch is not negligible and touches very sensitive areas.
diff mainly touches flush/flush_id/is_atomic, the new code should be less complex. I agree this is sensitive as it is part of core GRO - I checked all relevant flows manually, but I can also create more tests and ensure that logic remains the same.
he changelog shows that perf reports slightly lower figures for inet_gro_receive(). That is expected, as this patch move code out of such functio. What about inet_gro_flush()/tcp_gro_receive() where such code is moved?
Please consider the following 2 common scenarios:
- Multiple packets in the GRO bucket - the common case with multiple packets in the bucket (i.e. running super_netperf TCP_STREAM) - each layer executes a for loop - going over each packet in the bucket. Specifically, L3 gro_receive loops over the bucket making flush,flush_id,is_atomic checks.
Only for packets with the same rx hash.
Right, but there are only 8 GRO buckets, so a collision can still happen on multiple concurrent streams.
For most packets in the bucket, these checks are not relevant. (possibly also dirtying cache lines with non-relevant p packets). Removing code in the for loop for this case is significant.
- UDP/TCP streams which do not coalesce in GRO. This is the common case for regular UDP connections (i.e. running netperf UDP_STREAM). In this case, GRO is just overhead. Removing any code from these layers is good (shown in the first measurement of the commit message).
If UDP GRO is not enabled, there are no UDP packet staging in the UDP gro engine, the bucket list is empty.
Additionally the reported deltas is within noise level according to my personal experience with similar tests.
I've tested the difference between net-next and this patch repetitively, which showed stable results each time. Is there any specific test you think would be helpful to show the result?
Anything that show measurable gain.
Reporting the CPU utilization in the inet_gro_receive() function alone is not enough, as part of the load has been moved into gro_network_flush()/tcp_gro_receive().
Got it, the numbers I reported were only relevant to UDP flows (so measuring perf top with -g flag showed the same improvement). I'll post in v5 numbers relevant to TCP as well.
Thanks
linux-kselftest-mirror@lists.linaro.org