On Mon, Aug 18, 2025 at 11:07:33AM +0200, Lorenzo Bianconi wrote:
Introduce SW acceleration for IPIP tunnels in the netfilter flowtable infrastructure.
IPIP SW acceleration can be tested by running the following scenario, where the traffic is forwarded between two NICs (eth0 and eth1) and an IPIP tunnel is used to access a remote site (using eth1 as the underlay device):
ETH0 -- TUN0 <==> ETH1 -- [IP network] -- TUN1 (192.168.100.2)
$ip addr show
6: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:00:22:33:11:55 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.2/24 scope global eth0
       valid_lft forever preferred_lft forever
7: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:11:22:33:11:55 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.1/24 scope global eth1
       valid_lft forever preferred_lft forever
8: tun0@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ipip 192.168.1.1 peer 192.168.1.2
    inet 192.168.100.1/24 scope global tun0
       valid_lft forever preferred_lft forever
$ip route show
default via 192.168.100.2 dev tun0
192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.2
192.168.1.0/24 dev eth1 proto kernel scope link src 192.168.1.1
192.168.100.0/24 dev tun0 proto kernel scope link src 192.168.100.1
$nft list ruleset
table inet filter {
	flowtable ft {
		hook ingress priority filter
		devices = { eth0, eth1 }
	}

	chain forward {
		type filter hook forward priority filter; policy accept;
		meta l4proto { tcp, udp } flow add @ft
	}
}
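[ Not a request for changes, just a side note for anyone reproducing this setup: it is worth double-checking that established flows actually land in the flowtable rather than silently taking the slow path. Assuming conntrack-tools is installed, something like:

    # conntrack -L | grep OFFLOAD

should list the offloaded entries, which carry the [OFFLOAD] flag once traffic is flowing. ]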
Reproducing the scenario described above using veths, I got the following results:
- TCP stream transmitted into the IPIP tunnel:
- net-next: ~41Gbps
- net-next + IPIP flowtable support: ~40Gbps
I found this patch in one of my trees (see attachment), which explores tunnel integration for the tx path. There have been similar patches floating around on the mailing list for layer 2 encapsulation (e.g. pppoe and vlan); IIRC, for pppoe they claimed to accelerate tx as well.
Another aspect of this series is that I think it would be good to explore integration of other layer 3 tunnel protocols, rather than following an incremental approach.
More comments below.
- TCP stream received from the IPIP tunnel:
- net-next: ~35Gbps
- net-next + IPIP flowtable support: ~49Gbps
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
 include/linux/netdevice.h        |  1 +
 net/ipv4/ipip.c                  | 28 ++++++++++++++++++++
 net/netfilter/nf_flow_table_ip.c | 56 ++++++++++++++++++++++++++++++++++++++--
 net/netfilter/nft_flow_offload.c |  1 +
 4 files changed, 84 insertions(+), 2 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f3a3b761abfb1b883a970b04634c1ef3e7ee5407..0527a4e3d1fd512b564e47311f6ce3957b66298f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -874,6 +874,7 @@ enum net_device_path_type {
 	DEV_PATH_PPPOE,
 	DEV_PATH_DSA,
 	DEV_PATH_MTK_WDMA,
+	DEV_PATH_IPENCAP,
 };
 
 struct net_device_path {
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index 3e03af073a1ccc3d7597a998a515b6cfdded40b5..b7a3311bd061c341987380b5872caa8990d02e63 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -353,6 +353,33 @@ ipip_tunnel_ctl(struct net_device *dev, struct ip_tunnel_parm_kern *p, int cmd)
 	return ip_tunnel_ctl(dev, p, cmd);
 }
 
+static int ipip_fill_forward_path(struct net_device_path_ctx *ctx,
+				  struct net_device_path *path)
+{
+	struct ip_tunnel *tunnel = netdev_priv(ctx->dev);
+	const struct iphdr *tiph = &tunnel->parms.iph;
+	struct rtable *rt;
+
+	rt = ip_route_output(dev_net(ctx->dev), tiph->daddr, 0, 0, 0,
+			     RT_SCOPE_UNIVERSE);
+	if (IS_ERR(rt))
+		return PTR_ERR(rt);
+
+	path->type = DEV_PATH_IPENCAP;
+	path->dev = ctx->dev;
+	path->encap.proto = htons(ETH_P_IP);
+	/* Use the hash of outer header IP src and dst addresses as
+	 * encapsulation ID. This must be kept in sync with
+	 * nf_flow_tuple_encap().
+	 */
+	path->encap.id = __ipv4_addr_hash(tiph->saddr, ntohl(tiph->daddr));
This hash approach sounds reasonable, but I feel a bit uncomfortable with the idea that the flowtable bypasses the existing firewall policy _entirely_ while this does not provide a perfect match. The idea is that only the initial packets of a flow go through the policy; once the flow has been added to the flowtable, such firewall policy validation is circumvented.
To achieve a perfect match, this means more memory consumption to store the two IP addresses in the tuple, on top of the existing:
	struct {
		u16		id;
		__be16		proto;
	} encap[NF_FLOW_TABLE_ENCAP_MAX];
And possibly more information will need to be stored for other layer 3 tunnel protocols.
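Just to illustrate what I mean (a rough, untested sketch on my side, not something taken from this series; the member name is made up): a perfect match would store the outer addresses themselves in struct flow_offload_tuple instead of a hash, at the cost of a bigger tuple, along these lines:

	/* hypothetical addition, for discussion only: keep the outer IPv4
	 * addresses for DEV_PATH_IPENCAP so the flowtable lookup matches
	 * them exactly; an IPv6 counterpart would be needed for other
	 * layer 3 tunnel types.
	 */
	struct {
		__be32		saddr;
		__be32		daddr;
	} encap_ip;

with nf_flow_tuple_encap() filling these from the outer header on lookup, instead of comparing a 32-bit hash.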
While this hash trick looks like an interesting approach, I am ambivalent.
And one nitpick (typo) below...
+	ctx->dev = rt->dst.dev;
+	ip_rt_put(rt);
+
+	return 0;
+}
[...]
+static void nf_flow_ip4_ecanp_pop(struct sk_buff *skb)
_encap_pop ?