Introduce SW acceleration for IPIP tunnels in the netfilter flowtable infrastructure. This series introduces basic infrastructure to accelerate other tunnel types (e.g. IP6IP6).
---
Changes in v7:
- Introduce sw acceleration for tx path of IPIP tunnels
- Rely on exact match during flowtable entry lookup
- Fix typos
- Link to v6: https://lore.kernel.org/r/20250818-nf-flowtable-ipip-v6-0-eda90442739c@kerne...

Changes in v6:
- Rebase on top of nf-next main branch
- Link to v5: https://lore.kernel.org/r/20250721-nf-flowtable-ipip-v5-0-0865af9e58c6@kerne...

Changes in v5:
- Rely on __ipv4_addr_hash() to compute the hash used as encap ID
- Remove unnecessary pskb_may_pull() in nf_flow_tuple_encap()
- Add nf_flow_ip4_encap_pop utility routine
- Link to v4: https://lore.kernel.org/r/20250718-nf-flowtable-ipip-v4-0-f8bb1c18b986@kerne...

Changes in v4:
- Use the hash value of the saddr, daddr and protocol of outer IP header as
  encapsulation id.
- Link to v3: https://lore.kernel.org/r/20250703-nf-flowtable-ipip-v3-0-880afd319b9f@kerne...

Changes in v3:
- Add outer IP header sanity checks
- Target nf-next tree instead of net-next
- Link to v2: https://lore.kernel.org/r/20250627-nf-flowtable-ipip-v2-0-c713003ce75b@kerne...

Changes in v2:
- Introduce IPIP flowtable selftest
- Link to v1: https://lore.kernel.org/r/20250623-nf-flowtable-ipip-v1-1-2853596e3941@kerne...

---
Lorenzo Bianconi (3):
      net: netfilter: Add IPIP flowtable rx sw acceleration
      net: netfilter: Add IPIP flowtable tx sw acceleration
      selftests: netfilter: nft_flowtable.sh: Add IPIP flowtable selftest

 include/linux/netdevice.h                          |  16 +++
 include/net/netfilter/nf_flow_table.h              |  26 +++++
 net/ipv4/ipip.c                                    |  29 +++++
 net/netfilter/nf_flow_table_core.c                 |  10 ++
 net/netfilter/nf_flow_table_ip.c                   | 118 ++++++++++++++++++++-
 net/netfilter/nft_flow_offload.c                   |  79 ++++++++++++--
 .../selftests/net/netfilter/nft_flowtable.sh       |  40 +++++++
 7 files changed, 307 insertions(+), 11 deletions(-)
---
base-commit: d1d7998df9d7d3ee20bcfc876065fa897b11506d
change-id: 20250623-nf-flowtable-ipip-1b3d7b08d067
Best regards,
Introduce sw acceleration for rx path of IPIP tunnels relying on the netfilter flowtable infrastructure. Subsequent patches will add sw acceleration for IPIP tunnels tx path. This series introduces basic infrastructure to accelerate other tunnel types (e.g. IP6IP6). IPIP rx sw acceleration can be tested by running the following scenario where the traffic is forwarded between two NICs (eth0 and eth1) and an IPIP tunnel is used to access a remote site (using eth1 as the underlay device):
ETH0 -- TUN0 <==> ETH1 -- [IP network] -- TUN1 (192.168.100.2)
$ip addr show
6: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:00:22:33:11:55 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.2/24 scope global eth0
       valid_lft forever preferred_lft forever
7: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:11:22:33:11:55 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.1/24 scope global eth1
       valid_lft forever preferred_lft forever
8: tun0@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ipip 192.168.1.1 peer 192.168.1.2
    inet 192.168.100.1/24 scope global tun0
       valid_lft forever preferred_lft forever

$ip route show
default via 192.168.100.2 dev tun0
192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.2
192.168.1.0/24 dev eth1 proto kernel scope link src 192.168.1.1
192.168.100.0/24 dev tun0 proto kernel scope link src 192.168.100.1

$nft list ruleset
table inet filter {
	flowtable ft {
		hook ingress priority filter
		devices = { eth0, eth1 }
	}

	chain forward {
		type filter hook forward priority filter; policy accept;
		meta l4proto { tcp, udp } flow add @ft
	}
}
Reproducing the scenario described above using veths I got the following results:
- TCP stream received from the IPIP tunnel:
  - net-next: (baseline)                  ~ 71Gbps
  - net-next + IPIP flowtable support:    ~101Gbps
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 include/linux/netdevice.h             | 13 +++++++++
 include/net/netfilter/nf_flow_table.h | 18 +++++++++++++
 net/ipv4/ipip.c                       | 25 +++++++++++++++++
 net/netfilter/nf_flow_table_core.c    |  8 ++++++
 net/netfilter/nf_flow_table_ip.c      | 51 +++++++++++++++++++++++++++++++++--
 net/netfilter/nft_flow_offload.c      | 29 +++++++++++++++++---
 6 files changed, 138 insertions(+), 6 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d1a687444b275d45d105e336d2ede264fd310f1b..183e2b8b0111da86c3c3e4eb1bfe8fdba433dad5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -874,6 +874,7 @@ enum net_device_path_type {
 	DEV_PATH_PPPOE,
 	DEV_PATH_DSA,
 	DEV_PATH_MTK_WDMA,
+	DEV_PATH_TUN,
 };
 
 struct net_device_path {
@@ -885,6 +886,18 @@ struct net_device_path {
 			__be16 proto;
 			u8 h_dest[ETH_ALEN];
 		} encap;
+		struct {
+			union {
+				struct in_addr src_v4;
+				struct in6_addr src_v6;
+			};
+			union {
+				struct in_addr dst_v4;
+				struct in6_addr dst_v6;
+			};
+
+			u8 l3_proto;
+		} tun;
 		struct {
 			enum {
 				DEV_PATH_BR_VLAN_KEEP,
diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index c003cd194fa2ae40545a196fcc74e83c2868b113..bf712ac3fd970012d5360ff11c17315338527438 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -106,6 +106,20 @@ enum flow_offload_xmit_type {
 };
 
 #define NF_FLOW_TABLE_ENCAP_MAX 2
+#define NF_FLOW_TABLE_TUN_MAX 2
+
+struct flow_offload_tunnel {
+	union {
+		struct in_addr src_v4;
+		struct in6_addr src_v6;
+	};
+	union {
+		struct in_addr dst_v4;
+		struct in6_addr dst_v6;
+	};
+
+	u8 l3_proto;
+};
 
 struct flow_offload_tuple {
 	union {
@@ -129,6 +143,7 @@ struct flow_offload_tuple {
 		u16 id;
 		__be16 proto;
 	} encap[NF_FLOW_TABLE_ENCAP_MAX];
+	struct flow_offload_tunnel tun[NF_FLOW_TABLE_TUN_MAX];
 
 	/* All members above are keys for lookups, see flow_offload_hash(). */
 	struct { } __hash;
@@ -136,6 +151,7 @@ struct flow_offload_tuple {
 	u8 dir:2,
 	   xmit_type:3,
 	   encap_num:2,
+	   tun_num:2,
 	   in_vlan_ingress:2;
 	u16 mtu;
 	union {
@@ -206,7 +222,9 @@ struct nf_flow_route {
 				u16 id;
 				__be16 proto;
 			} encap[NF_FLOW_TABLE_ENCAP_MAX];
+			struct flow_offload_tunnel tun[NF_FLOW_TABLE_TUN_MAX];
 			u8 num_encaps:2,
+			   num_tuns:2,
 			   ingress_vlans:2;
 		} in;
 		struct {
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index 3e03af073a1ccc3d7597a998a515b6cfdded40b5..ff95b1b9908e9f4ba4bff207a5bd2c5d5670215a 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -353,6 +353,30 @@ ipip_tunnel_ctl(struct net_device *dev, struct ip_tunnel_parm_kern *p, int cmd)
 	return ip_tunnel_ctl(dev, p, cmd);
 }
 
+static int ipip_fill_forward_path(struct net_device_path_ctx *ctx,
+				  struct net_device_path *path)
+{
+	struct ip_tunnel *tunnel = netdev_priv(ctx->dev);
+	const struct iphdr *tiph = &tunnel->parms.iph;
+	struct rtable *rt;
+
+	rt = ip_route_output(dev_net(ctx->dev), tiph->daddr, 0, 0, 0,
+			     RT_SCOPE_UNIVERSE);
+	if (IS_ERR(rt))
+		return PTR_ERR(rt);
+
+	path->type = DEV_PATH_TUN;
+	path->tun.src_v4.s_addr = tiph->saddr;
+	path->tun.dst_v4.s_addr = tiph->daddr;
+	path->tun.l3_proto = IPPROTO_IPIP;
+	path->dev = ctx->dev;
+
+	ctx->dev = rt->dst.dev;
+	ip_rt_put(rt);
+
+	return 0;
+}
+
 static const struct net_device_ops ipip_netdev_ops = {
 	.ndo_init = ipip_tunnel_init,
 	.ndo_uninit = ip_tunnel_uninit,
@@ -362,6 +386,7 @@ static const struct net_device_ops ipip_netdev_ops = {
 	.ndo_get_stats64 = dev_get_tstats64,
 	.ndo_get_iflink = ip_tunnel_get_iflink,
 	.ndo_tunnel_ctl = ipip_tunnel_ctl,
+	.ndo_fill_forward_path = ipip_fill_forward_path,
 };
 
 #define IPIP_FEATURES (NETIF_F_SG | \
diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index 9441ac3d8c1a2eac32142ac43151e3acebcd8cab..4e08abc9bca705db36cbe26cf176e8f946722e32 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -118,7 +118,15 @@ static int flow_offload_fill_route(struct flow_offload *flow,
 			flow_tuple->in_vlan_ingress |= BIT(j);
 		j++;
 	}
+
+	j = 0;
+	for (i = route->tuple[dir].in.num_tuns - 1; i >= 0; i--) {
+		flow_tuple->tun[j] = route->tuple[dir].in.tun[i];
+		j++;
+	}
+
 	flow_tuple->encap_num = route->tuple[dir].in.num_encaps;
+	flow_tuple->tun_num = route->tuple[dir].in.num_tuns;
 
 	switch (route->tuple[dir].xmit_type) {
 	case FLOW_OFFLOAD_XMIT_DIRECT:
diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index 8cd4cf7ae21120f1057c4fce5aaca4e3152ae76d..0309dbff6e6219843fa7ec16a523d288c8274b77 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -147,6 +147,7 @@ static void nf_flow_tuple_encap(struct sk_buff *skb,
 {
 	struct vlan_ethhdr *veth;
 	struct pppoe_hdr *phdr;
+	struct iphdr *iph;
 	int i = 0;
 
 	if (skb_vlan_tag_present(skb)) {
@@ -165,6 +166,14 @@ static void nf_flow_tuple_encap(struct sk_buff *skb,
 		tuple->encap[i].id = ntohs(phdr->sid);
 		tuple->encap[i].proto = skb->protocol;
 		break;
+	case htons(ETH_P_IP):
+		iph = (struct iphdr *)skb_network_header(skb);
+		if (iph->protocol == IPPROTO_IPIP) {
+			tuple->tun[i].src_v4.s_addr = iph->daddr;
+			tuple->tun[i].dst_v4.s_addr = iph->saddr;
+			tuple->tun[i].l3_proto = IPPROTO_IPIP;
+		}
+		break;
 	}
 }
 
@@ -277,6 +286,40 @@ static unsigned int nf_flow_xmit_xfrm(struct sk_buff *skb,
 	return NF_STOLEN;
 }
 
+static bool nf_flow_ip4_encap_proto(struct sk_buff *skb, u32 *psize)
+{
+	struct iphdr *iph;
+	u16 size;
+
+	if (!pskb_may_pull(skb, sizeof(*iph)))
+		return false;
+
+	iph = (struct iphdr *)skb_network_header(skb);
+	size = iph->ihl << 2;
+
+	if (ip_is_fragment(iph) || unlikely(ip_has_options(size)))
+		return false;
+
+	if (iph->ttl <= 1)
+		return false;
+
+	if (iph->protocol == IPPROTO_IPIP)
+		*psize += size;
+
+	return true;
+}
+
+static void nf_flow_ip4_encap_pop(struct sk_buff *skb)
+{
+	struct iphdr *iph = (struct iphdr *)skb_network_header(skb);
+
+	if (iph->protocol != IPPROTO_IPIP)
+		return;
+
+	skb_pull(skb, iph->ihl << 2);
+	skb_reset_network_header(skb);
+}
+
 static bool nf_flow_skb_encap_protocol(struct sk_buff *skb, __be16 proto,
 				       u32 *offset)
 {
@@ -284,6 +327,8 @@ static bool nf_flow_skb_encap_protocol(struct sk_buff *skb, __be16 proto,
 	__be16 inner_proto;
 
 	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+		return nf_flow_ip4_encap_proto(skb, offset);
 	case htons(ETH_P_8021Q):
 		if (!pskb_may_pull(skb, skb_mac_offset(skb) + sizeof(*veth)))
 			return false;
@@ -331,6 +376,9 @@ static void nf_flow_encap_pop(struct sk_buff *skb,
 			break;
 		}
 	}
+
+	if (skb->protocol == htons(ETH_P_IP))
+		nf_flow_ip4_encap_pop(skb);
 }
 
 static unsigned int nf_flow_queue_xmit(struct net *net, struct sk_buff *skb,
@@ -357,8 +405,7 @@ nf_flow_offload_lookup(struct nf_flowtable_ctx *ctx,
 {
 	struct flow_offload_tuple tuple = {};
 
-	if (skb->protocol != htons(ETH_P_IP) &&
-	    !nf_flow_skb_encap_protocol(skb, htons(ETH_P_IP), &ctx->offset))
+	if (!nf_flow_skb_encap_protocol(skb, htons(ETH_P_IP), &ctx->offset))
 		return NULL;
 
 	if (nf_flow_tuple_ip(ctx, skb, &tuple) < 0)
diff --git a/net/netfilter/nft_flow_offload.c b/net/netfilter/nft_flow_offload.c
index 14dd1c0698c3c9ec2241b358deb80976a8aa4a13..e30abe026dd2a37cae2eea56257033a48e71af7c 100644
--- a/net/netfilter/nft_flow_offload.c
+++ b/net/netfilter/nft_flow_offload.c
@@ -86,6 +86,8 @@ struct nft_forward_info {
 		__be16 proto;
 	} encap[NF_FLOW_TABLE_ENCAP_MAX];
 	u8 num_encaps;
+	struct flow_offload_tunnel tun[NF_FLOW_TABLE_TUN_MAX];
+	u8 num_tuns;
 	u8 ingress_vlans;
 	u8 h_source[ETH_ALEN];
 	u8 h_dest[ETH_ALEN];
@@ -108,6 +110,7 @@ static void nft_dev_path_info(const struct net_device_path_stack *stack,
 		case DEV_PATH_DSA:
 		case DEV_PATH_VLAN:
 		case DEV_PATH_PPPOE:
+		case DEV_PATH_TUN:
 			info->indev = path->dev;
 			if (is_zero_ether_addr(info->h_source))
 				memcpy(info->h_source, path->dev->dev_addr, ETH_ALEN);
@@ -120,15 +123,29 @@ static void nft_dev_path_info(const struct net_device_path_stack *stack,
 			}
 
 			/* DEV_PATH_VLAN and DEV_PATH_PPPOE */
-			if (info->num_encaps >= NF_FLOW_TABLE_ENCAP_MAX) {
+			if (info->num_encaps >= NF_FLOW_TABLE_ENCAP_MAX ||
+			    info->num_tuns >= NF_FLOW_TABLE_TUN_MAX) {
 				info->indev = NULL;
 				break;
 			}
 			if (!info->outdev)
 				info->outdev = path->dev;
-			info->encap[info->num_encaps].id = path->encap.id;
-			info->encap[info->num_encaps].proto = path->encap.proto;
-			info->num_encaps++;
+
+			if (path->type == DEV_PATH_TUN) {
+				info->tun[info->num_encaps].src_v6 =
+					path->tun.src_v6;
+				info->tun[info->num_encaps].dst_v6 =
+					path->tun.dst_v6;
+				info->tun[info->num_encaps].l3_proto =
+					path->tun.l3_proto;
+				info->num_tuns++;
+			} else {
+				info->encap[info->num_encaps].id =
+					path->encap.id;
+				info->encap[info->num_encaps].proto =
+					path->encap.proto;
+				info->num_encaps++;
+			}
 			if (path->type == DEV_PATH_PPPOE)
 				memcpy(info->h_dest, path->encap.h_dest, ETH_ALEN);
 			break;
@@ -207,7 +224,11 @@ static void nft_dev_forward_path(struct nf_flow_route *route,
 		route->tuple[!dir].in.encap[i].id = info.encap[i].id;
 		route->tuple[!dir].in.encap[i].proto = info.encap[i].proto;
 	}
+	for (i = 0; i < info.num_tuns; i++)
+		route->tuple[!dir].in.tun[i] = info.tun[i];
+
 	route->tuple[!dir].in.num_encaps = info.num_encaps;
+	route->tuple[!dir].in.num_tuns = info.num_tuns;
 	route->tuple[!dir].in.ingress_vlans = info.ingress_vlans;
 
 	if (info.xmit_type == FLOW_OFFLOAD_XMIT_DIRECT) {
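For readers following the rx path, the outer header checks done by nf_flow_ip4_encap_proto() and the address swap done in nf_flow_tuple_encap() can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the kernel code: struct outer_ip4, struct tun_key, outer_ip4_check() and tun_key_from_rx() are hypothetical names, the fields are kept in host byte order, and skb handling (pskb_may_pull() and friends) is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PROTO_IPIP	4	/* same value as IPPROTO_IPIP */
#define IP4_MF		0x2000	/* "more fragments" flag */
#define IP4_OFFMASK	0x1fff	/* fragment offset bits */
#define IP4_MIN_HLEN	20	/* header length without options */

/* Hypothetical host-byte-order view of the outer header fields we inspect. */
struct outer_ip4 {
	uint8_t  ihl;		/* header length in 32-bit words */
	uint8_t  ttl;
	uint8_t  protocol;
	uint16_t frag_off;	/* flags and fragment offset */
	uint32_t saddr, daddr;
};

/* Hypothetical stand-in for the IPv4 part of struct flow_offload_tunnel. */
struct tun_key {
	uint32_t src, dst;
	uint8_t  l3_proto;
};

/*
 * Return false if the packet has to stay on the regular forwarding path.
 * On success *offset grows by the outer header length, but only for IPIP.
 */
static bool outer_ip4_check(const struct outer_ip4 *iph, uint32_t *offset)
{
	uint32_t hlen;

	assert(iph != NULL && offset != NULL);
	hlen = (uint32_t)iph->ihl << 2;

	if (iph->frag_off & (IP4_MF | IP4_OFFMASK))
		return false;	/* fragments are not accelerated */
	if (hlen > IP4_MIN_HLEN)
		return false;	/* IP options present */
	if (iph->ttl <= 1)
		return false;	/* would expire when forwarded */
	if (iph->protocol == PROTO_IPIP)
		*offset += hlen;	/* inner header follows the outer one */
	return true;
}

/*
 * Build the lookup key from a received outer header. The addresses are
 * swapped as in the patch: a received packet carries the local tunnel
 * endpoint as daddr, while the key stores it as the source.
 */
static void tun_key_from_rx(const struct outer_ip4 *iph, struct tun_key *key)
{
	key->src = iph->daddr;
	key->dst = iph->saddr;
	key->l3_proto = iph->protocol;
}
```

A plain IPv4 packet (protocol other than IPIP) passes the check with the offset left untouched, which mirrors how the patch lets non-tunnelled ETH_P_IP traffic go through the same helper.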
Introduce sw acceleration for tx path of IPIP tunnels relying on the netfilter flowtable infrastructure. This patch introduces basic infrastructure to accelerate other tunnel types (e.g. IP6IP6). IPIP sw tx acceleration can be tested by running the following scenario where the traffic is forwarded between two NICs (eth0 and eth1) and an IPIP tunnel is used to access a remote site (using eth1 as the underlay device):
ETH0 -- TUN0 <==> ETH1 -- [IP network] -- TUN1 (192.168.100.2)
$ip addr show
6: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:00:22:33:11:55 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.2/24 scope global eth0
       valid_lft forever preferred_lft forever
7: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:11:22:33:11:55 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.1/24 scope global eth1
       valid_lft forever preferred_lft forever
8: tun0@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ipip 192.168.1.1 peer 192.168.1.2
    inet 192.168.100.1/24 scope global tun0
       valid_lft forever preferred_lft forever

$ip route show
default via 192.168.100.2 dev tun0
192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.2
192.168.1.0/24 dev eth1 proto kernel scope link src 192.168.1.1
192.168.100.0/24 dev tun0 proto kernel scope link src 192.168.100.1

$nft list ruleset
table inet filter {
	flowtable ft {
		hook ingress priority filter
		devices = { eth0, eth1 }
	}

	chain forward {
		type filter hook forward priority filter; policy accept;
		meta l4proto { tcp, udp } flow add @ft
	}
}
Reproducing the scenario described above using veths I got the following results:
- TCP stream transmitted into the IPIP tunnel:
  - net-next: (baseline)                  ~ 85Gbps
  - net-next + IPIP flowtable support:    ~100Gbps
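To put the reported numbers in relative terms, the gain over the baseline can be worked out with a one-line helper. This is just arithmetic on the approximate figures quoted in the commit messages of this series; gain_percent() is a hypothetical name used for illustration.

```c
#include <assert.h>

/* Relative throughput gain in percent; the baseline must be positive. */
static double gain_percent(double baseline_gbps, double accelerated_gbps)
{
	assert(baseline_gbps > 0.0);
	return (accelerated_gbps - baseline_gbps) / baseline_gbps * 100.0;
}
```

With the approximate figures above, gain_percent(85, 100) is roughly 17.6% for the tx path, and gain_percent(71, 101) is roughly 42% for the rx figures reported in the rx patch.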
Co-developed-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 include/linux/netdevice.h             |  3 ++
 include/net/netfilter/nf_flow_table.h |  8 +++++
 net/ipv4/ipip.c                       |  4 +++
 net/netfilter/nf_flow_table_core.c    |  2 ++
 net/netfilter/nf_flow_table_ip.c      | 67 +++++++++++++++++++++++++++++++++--
 net/netfilter/nft_flow_offload.c      | 50 ++++++++++++++++++++++++--
 6 files changed, 129 insertions(+), 5 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 183e2b8b0111da86c3c3e4eb1bfe8fdba433dad5..cac81dc481dc31cc6c0203895891d742e4bb5333 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -897,6 +897,9 @@ struct net_device_path {
 			};
 
 			u8 l3_proto;
+			u8 tos;
+			u8 ttl;
+			__be16 df;
 		} tun;
 		struct {
 			enum {
diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index bf712ac3fd970012d5360ff11c17315338527438..61812033c95fd78b2fc234b35ef20d2e33c08189 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -119,6 +119,9 @@ struct flow_offload_tunnel {
 	};
 
 	u8 l3_proto;
+	u8 tos;
+	u8 ttl;
+	__be16 df;
 };
 
 struct flow_offload_tuple {
@@ -158,6 +161,8 @@ struct flow_offload_tuple {
 		struct {
 			struct dst_entry *dst_cache;
 			u32 dst_cookie;
+			u8 tunnel_num;
+			struct flow_offload_tunnel tunnel;
 		};
 		struct {
 			u32 ifidx;
@@ -232,6 +237,9 @@ struct nf_flow_route {
 			u32 hw_ifindex;
 			u8 h_source[ETH_ALEN];
 			u8 h_dest[ETH_ALEN];
+
+			u8 num_tuns;
+			struct flow_offload_tunnel tun;
 		} out;
 		enum flow_offload_xmit_type xmit_type;
 	} tuple[FLOW_OFFLOAD_DIR_MAX];
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index ff95b1b9908e9f4ba4bff207a5bd2c5d5670215a..cfd73342fd40916ca508734060a9e4ddc76cd245 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -369,6 +369,10 @@ static int ipip_fill_forward_path(struct net_device_path_ctx *ctx,
 	path->tun.src_v4.s_addr = tiph->saddr;
 	path->tun.dst_v4.s_addr = tiph->daddr;
 	path->tun.l3_proto = IPPROTO_IPIP;
+	path->tun.tos = tiph->tos;
+	path->tun.ttl = tiph->ttl;
+	path->tun.df = tiph->frag_off;
+
 	path->dev = ctx->dev;
 
 	ctx->dev = rt->dst.dev;
diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index 4e08abc9bca705db36cbe26cf176e8f946722e32..90ef99ab5bba6125cdfd49de437766f76674e21a 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -124,9 +124,11 @@ static int flow_offload_fill_route(struct flow_offload *flow,
 		flow_tuple->tun[j] = route->tuple[dir].in.tun[i];
 		j++;
 	}
+	flow_tuple->tunnel = route->tuple[dir].out.tun;
 
 	flow_tuple->encap_num = route->tuple[dir].in.num_encaps;
 	flow_tuple->tun_num = route->tuple[dir].in.num_tuns;
+	flow_tuple->tunnel_num = route->tuple[dir].out.num_tuns;
 
 	switch (route->tuple[dir].xmit_type) {
 	case FLOW_OFFLOAD_XMIT_DIRECT:
diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index 0309dbff6e6219843fa7ec16a523d288c8274b77..a695070ea9331a1c1e96d6a44609b62529ac356d 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -428,6 +428,9 @@ static int nf_flow_offload_forward(struct nf_flowtable_ctx *ctx,
 	flow = container_of(tuplehash, struct flow_offload, tuplehash[dir]);
 
 	mtu = flow->tuplehash[dir].tuple.mtu + ctx->offset;
+	if (tuplehash->tuple.tunnel_num)
+		mtu -= sizeof(*iph);
+
 	if (unlikely(nf_flow_exceeds_mtu(skb, mtu)))
 		return 0;
 
@@ -461,6 +464,54 @@ static int nf_flow_offload_forward(struct nf_flowtable_ctx *ctx,
 	return 1;
 }
 
+static unsigned int nf_flow_add_tunnel_v4(struct net *net, struct sk_buff *skb,
+					  struct flow_offload *flow, int dir,
+					  const struct rtable *rt)
+{
+	struct iphdr *iph = (struct iphdr *)skb_network_header(skb);
+	u32 headroom = sizeof(struct iphdr);
+	u8 tos, ttl;
+	__be16 df;
+
+	if (iptunnel_handle_offloads(skb, SKB_GSO_IPXIP4))
+		return -1;
+
+	skb_set_inner_ipproto(skb, IPPROTO_IPIP);
+	headroom += LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len;
+	if (skb_cow_head(skb, headroom))
+		return -1;
+
+	skb_scrub_packet(skb, true);
+	skb_clear_hash_if_not_l4(skb);
+	memset(IPCB(skb), 0, sizeof(*IPCB(skb)));
+
+	/* Push down and install the IP header. */
+	skb_push(skb, sizeof(struct iphdr));
+	skb_reset_network_header(skb);
+
+	df = flow->tuplehash[dir].tuple.tunnel.df;
+	tos = ip_tunnel_ecn_encap(flow->tuplehash[dir].tuple.tunnel.tos,
+				  iph, skb);
+	ttl = flow->tuplehash[dir].tuple.tunnel.ttl;
+	if (!ttl)
+		ttl = iph->ttl;
+
+	iph = ip_hdr(skb);
+	iph->version = 4;
+	iph->ihl = sizeof(struct iphdr) >> 2;
+	iph->frag_off = ip_mtu_locked(&rt->dst) ? 0 : df;
+	iph->protocol = flow->tuplehash[dir].tuple.tunnel.l3_proto;
+	iph->tos = tos;
+	iph->daddr = flow->tuplehash[dir].tuple.tunnel.dst_v4.s_addr;
+	iph->saddr = flow->tuplehash[dir].tuple.tunnel.src_v4.s_addr;
+	iph->ttl = ttl;
+	iph->tot_len = htons(skb->len);
+	__ip_select_ident(net, iph, skb_shinfo(skb)->gso_segs ?: 1);
+	ip_send_check(iph);
+
+	return 0;
+}
+
 unsigned int
 nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
 			const struct nf_hook_state *state)
@@ -473,8 +524,8 @@ nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
 	};
 	struct flow_offload *flow;
 	struct net_device *outdev;
+	__be32 nexthop, daddr;
 	struct rtable *rt;
-	__be32 nexthop;
 	int ret;
 
 	tuplehash = nf_flow_offload_lookup(&ctx, flow_table, skb);
@@ -501,9 +552,21 @@ nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
 	switch (tuplehash->tuple.xmit_type) {
 	case FLOW_OFFLOAD_XMIT_NEIGH:
 		rt = dst_rtable(tuplehash->tuple.dst_cache);
+		if (tuplehash->tuple.tunnel_num) {
+			ret = nf_flow_add_tunnel_v4(state->net, skb, flow, dir,
+						    rt);
+			if (ret < 0) {
+				ret = NF_DROP;
+				flow_offload_teardown(flow);
+				break;
+			}
+			daddr = tuplehash->tuple.tunnel.dst_v4.s_addr;
+		} else {
+			daddr = flow->tuplehash[!dir].tuple.src_v4.s_addr;
+		}
 		outdev = rt->dst.dev;
 		skb->dev = outdev;
-		nexthop = rt_nexthop(rt, flow->tuplehash[!dir].tuple.src_v4.s_addr);
+		nexthop = rt_nexthop(rt, daddr);
 		skb_dst_set_noref(skb, &rt->dst);
 		neigh_xmit(NEIGH_ARP_TABLE, outdev, &nexthop, skb);
 		ret = NF_STOLEN;
diff --git a/net/netfilter/nft_flow_offload.c b/net/netfilter/nft_flow_offload.c
index e30abe026dd2a37cae2eea56257033a48e71af7c..137939573f0c3511cf3d7987a59fc269f1a31387 100644
--- a/net/netfilter/nft_flow_offload.c
+++ b/net/netfilter/nft_flow_offload.c
@@ -202,10 +202,46 @@ static bool nft_flowtable_find_dev(const struct net_device *dev,
 	return found;
 }
 
+static int nft_flow_tunnel_update_route(const struct nft_pktinfo *pkt,
+					struct nf_flow_route *route,
+					enum ip_conntrack_dir dir)
+{
+	struct dst_entry *tun_dst = NULL;
+	struct flowi fl = {};
+
+	switch (nft_pf(pkt)) {
+	case NFPROTO_IPV4:
+		fl.u.ip4.daddr = route->tuple[dir].out.tun.dst_v4.s_addr;
+		fl.u.ip4.saddr = route->tuple[dir].out.tun.src_v4.s_addr;
+		fl.u.ip4.flowi4_iif = nft_in(pkt)->ifindex;
+		fl.u.ip4.flowi4_dscp = ip4h_dscp(ip_hdr(pkt->skb));
+		fl.u.ip4.flowi4_mark = pkt->skb->mark;
+		fl.u.ip4.flowi4_flags = FLOWI_FLAG_ANYSRC;
+		break;
+	case NFPROTO_IPV6:
+		fl.u.ip6.daddr = route->tuple[dir].out.tun.dst_v6;
+		fl.u.ip6.saddr = route->tuple[dir].out.tun.src_v6;
+		fl.u.ip6.flowi6_iif = nft_in(pkt)->ifindex;
+		fl.u.ip6.flowlabel = ip6_flowinfo(ipv6_hdr(pkt->skb));
+		fl.u.ip6.flowi6_mark = pkt->skb->mark;
+		fl.u.ip6.flowi6_flags = FLOWI_FLAG_ANYSRC;
+		break;
+	}
+
+	nf_route(nft_net(pkt), &tun_dst, &fl, false, nft_pf(pkt));
+	if (!tun_dst)
+		return -ENOENT;
+
+	nft_default_forward_path(route, tun_dst, dir);
+
+	return 0;
+}
+
 static void nft_dev_forward_path(struct nf_flow_route *route,
 				 const struct nf_conn *ct,
 				 enum ip_conntrack_dir dir,
-				 struct nft_flowtable *ft)
+				 struct nft_flowtable *ft,
+				 const struct nft_pktinfo *pkt)
 {
 	const struct dst_entry *dst = route->tuple[dir].dst;
 	struct net_device_path_stack stack;
@@ -227,6 +263,14 @@ static void nft_dev_forward_path(struct nf_flow_route *route,
 	for (i = 0; i < info.num_tuns; i++)
 		route->tuple[!dir].in.tun[i] = info.tun[i];
 
+	/* Single encapsulation is supported for the moment. */
+	route->tuple[dir].out.num_tuns = info.num_tuns;
+	if (route->tuple[dir].out.num_tuns) {
+		route->tuple[dir].out.tun = info.tun[0];
+		if (nft_flow_tunnel_update_route(pkt, route, dir))
+			return;
+	}
+
 	route->tuple[!dir].in.num_encaps = info.num_encaps;
 	route->tuple[!dir].in.num_tuns = info.num_tuns;
 	route->tuple[!dir].in.ingress_vlans = info.ingress_vlans;
@@ -286,8 +330,8 @@ static int nft_flow_route(const struct nft_pktinfo *pkt,
 
 	if (route->tuple[dir].xmit_type == FLOW_OFFLOAD_XMIT_NEIGH &&
 	    route->tuple[!dir].xmit_type == FLOW_OFFLOAD_XMIT_NEIGH) {
-		nft_dev_forward_path(route, ct, dir, ft);
-		nft_dev_forward_path(route, ct, !dir, ft);
+		nft_dev_forward_path(route, ct, dir, ft, pkt);
+		nft_dev_forward_path(route, ct, !dir, ft, pkt);
 	}
 
 	return 0;
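As a rough illustration of what nf_flow_add_tunnel_v4() installs in front of the inner packet, the outer IPv4 header can be sketched on a plain byte buffer in userspace. This is a simplified sketch under stated assumptions rather than the kernel code: struct tun_params, ip4_csum() and build_outer_ip4() are hypothetical names, the ECN handling done by ip_tunnel_ecn_encap() is omitted (tos is copied as-is), and the IP ID is passed in by the caller instead of coming from __ip_select_ident().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OUTER_HDR_LEN	20	/* IPv4 header without options */
#define PROTO_IPIP	4	/* same value as IPPROTO_IPIP */

/* Hypothetical tunnel parameters, mirroring the fields cached by the patch. */
struct tun_params {
	uint8_t  src[4], dst[4];
	uint8_t  tos;
	uint8_t  ttl;		/* 0 means "inherit from the inner header" */
	uint16_t df;		/* host order, 0x4000 when DF is set */
};

/* RFC 1071 Internet checksum over an even-length header. */
static uint16_t ip4_csum(const uint8_t *hdr, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Fill hdr with the outer header for an inner packet of inner_len bytes. */
static void build_outer_ip4(uint8_t hdr[OUTER_HDR_LEN],
			    const struct tun_params *p, uint8_t inner_ttl,
			    uint16_t inner_len, uint16_t id)
{
	uint16_t tot_len = (uint16_t)(inner_len + OUTER_HDR_LEN);
	uint8_t ttl = p->ttl ? p->ttl : inner_ttl;
	uint16_t csum;

	assert(hdr != NULL && p != NULL);
	memset(hdr, 0, OUTER_HDR_LEN);
	hdr[0] = 0x45;				/* version 4, ihl 5 */
	hdr[1] = p->tos;
	hdr[2] = (uint8_t)(tot_len >> 8);
	hdr[3] = (uint8_t)(tot_len & 0xff);
	hdr[4] = (uint8_t)(id >> 8);
	hdr[5] = (uint8_t)(id & 0xff);
	hdr[6] = (uint8_t)(p->df >> 8);
	hdr[7] = (uint8_t)(p->df & 0xff);
	hdr[8] = ttl;
	hdr[9] = PROTO_IPIP;
	memcpy(&hdr[12], p->src, 4);
	memcpy(&hdr[16], p->dst, 4);
	/* Checksum is computed last, with the checksum field still zero. */
	csum = ip4_csum(hdr, OUTER_HDR_LEN);
	hdr[10] = (uint8_t)(csum >> 8);
	hdr[11] = (uint8_t)(csum & 0xff);
}
```

The extra 20 bytes added here are also why the patch subtracts sizeof(*iph) from the MTU before the nf_flow_exceeds_mtu() check when a tunnel is present. A receiver can validate the result by running ip4_csum() over the finished header, which yields 0 for an intact header.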
Introduce a specific selftest for IPIP flowtable SW acceleration in nft_flowtable.sh.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 .../selftests/net/netfilter/nft_flowtable.sh       | 40 ++++++++++++++++++++++
 1 file changed, 40 insertions(+)
diff --git a/tools/testing/selftests/net/netfilter/nft_flowtable.sh b/tools/testing/selftests/net/netfilter/nft_flowtable.sh
index 45832df982950c2164dcb6637497870f0d3daefe..e1434611464b3a8f5056e09a831180fa1bff7139 100755
--- a/tools/testing/selftests/net/netfilter/nft_flowtable.sh
+++ b/tools/testing/selftests/net/netfilter/nft_flowtable.sh
@@ -558,6 +558,44 @@ if ! test_tcp_forwarding_nat "$ns1" "$ns2" 1 ""; then
 	ip netns exec "$nsr1" nft list ruleset
 fi
 
+# IPIP tunnel test:
+# Add IPIP tunnel interfaces and check flowtable acceleration.
+test_ipip() {
+if ! ip -net "$nsr1" link add name tun0 type ipip \
+     local 192.168.10.1 remote 192.168.10.2 >/dev/null;then
+	echo "SKIP: could not add ipip tunnel"
+	[ "$ret" -eq 0 ] && ret=$ksft_skip
+	return
+fi
+ip -net "$nsr1" link set tun0 up
+ip -net "$nsr1" addr add 192.168.100.1/24 dev tun0
+ip netns exec "$nsr1" sysctl net.ipv4.conf.tun0.forwarding=1 > /dev/null
+
+ip -net "$nsr2" link add name tun0 type ipip local 192.168.10.2 remote 192.168.10.1
+ip -net "$nsr2" link set tun0 up
+ip -net "$nsr2" addr add 192.168.100.2/24 dev tun0
+ip netns exec "$nsr2" sysctl net.ipv4.conf.tun0.forwarding=1 > /dev/null
+
+ip -net "$nsr1" route change default via 192.168.100.2
+ip -net "$nsr2" route change default via 192.168.100.1
+ip -net "$ns2" route add default via 10.0.2.1
+
+ip netns exec "$nsr1" nft -a insert rule inet filter forward 'meta oif tun0 accept'
+ip netns exec "$nsr1" nft -a insert rule inet filter forward \
+	'meta oif "veth0" tcp sport 12345 ct mark set 1 flow add @f1 counter name routed_repl accept'
+
+if ! test_tcp_forwarding_nat "$ns1" "$ns2" 1 "IPIP tunnel"; then
+	echo "FAIL: flow offload for ns1/ns2 with IPIP tunnel" 1>&2
+	ip netns exec "$nsr1" nft list ruleset
+	ret=1
+fi
+
+# Restore the previous configuration
+ip -net "$nsr1" route change default via 192.168.10.2
+ip -net "$nsr2" route change default via 192.168.10.1
+ip -net "$ns2" route del default via 10.0.2.1
+}
+
 # Another test:
 # Add bridge interface br0 to Router1, with NAT enabled.
 test_bridge() {
@@ -643,6 +681,8 @@ ip -net "$nsr1" addr add dead:1::1/64 dev veth0 nodad
 ip -net "$nsr1" link set up dev veth0
 }
 
+test_ipip
+
 test_bridge
 
 KEY_SHA="0x"$(ps -af | sha1sum | cut -d " " -f 1)
Hi Lorenzo,
On Tue, Oct 21, 2025 at 07:48:17PM +0200, Lorenzo Bianconi wrote:
> Introduce SW acceleration for IPIP tunnels in the netfilter flowtable infrastructure. This series introduces basic infrastructure to accelerate other tunnel types (e.g. IP6IP6).
Would you be so kind to rebase this series on top of:
https://patchwork.ozlabs.org/project/netfilter-devel/list/?series=477081
That series should simplify the integration of your IPIP support.
Thanks.
> Hi Lorenzo,
Hi Pablo,
> On Tue, Oct 21, 2025 at 07:48:17PM +0200, Lorenzo Bianconi wrote:
> > Introduce SW acceleration for IPIP tunnels in the netfilter flowtable infrastructure. This series introduces basic infrastructure to accelerate other tunnel types (e.g. IP6IP6).
> Would you be so kind to rebase this series on top of:
> https://patchwork.ozlabs.org/project/netfilter-devel/list/?series=477081
> That series should simplify the integration of your IPIP support.
ack, sure. I will do in v8.
Regards, Lorenzo
> Thanks.