Hi Alexei,
(cc netfilter maintainers)
On Mon, Mar 06, 2023 at 08:17:20PM -0800, Alexei Starovoitov wrote:
> On Tue, Feb 28, 2023 at 3:17 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
> > > Have you considered to skb redirect to another netdev that does ip defrag? Like macvlan does it under some conditions. This can be generalized.
> >
> > I had not considered that yet. Are you suggesting adding a new passthrough netdev thing that'll defrag? I looked at the macvlan driver and it looks like it defrags to handle some multicast corner case.
>
> Something like that. A netdev that bpf prog can redirect to. It will consume ip frags and eventually will produce reassembled skb.
>
> The kernel ip_defrag logic has timeouts, counters, rhashtable with thresholds, etc. All of them are per netns. Just another ip_defrag_user will still share rhashtable with its limits. The kernel can even do icmp_send(). ip_defrag is not a kfunc. It's a big block with plenty of kernel wide side effects.
>
> I really don't think we can alloc_skb, copy_skb, and ip_defrag it. It messes with the stack too much. It's also not clear to me when skb is reassembled and how bpf sees it. "redirect into reassembling netdev" and attaching bpf prog to consume that skb is much cleaner imo. Maybe there are other ways to use ip_defrag, but certainly not like synchronous api helper.
I was giving the virtual netdev idea some thought this morning and I thought I'd give the netfilter approach a deeper look.
From my reading (I'll run some tests later) it looks like netfilter will defrag all ipv4/ipv6 packets in any netns with conntrack enabled. It appears to do so in NF_INET_PRE_ROUTING.
Unfortunately that does run after tc hooks. But fortunately with the new BPF netfilter hooks I think we can make defrag work outside of BPF kfuncs like you want. And the NF_INET_FORWARD hook works well for my router use case.
One thing we would need, though, is (probably kfunc) wrappers around nf_defrag_ipv4_enable() and nf_defrag_ipv6_enable() to ensure BPF progs are not transitively depending on defrag support from other netfilter modules.
The exact mechanism would probably need some thinking, as the above functions kinda rely on module_init() and module_exit() semantics. We cannot make the prog bump the refcnt every time it runs -- it would overflow. And it would be nice to automatically drop the refcnt when the prog is unloaded.
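
To make the lifetime I have in mind concrete, here is a rough userspace model (not kernel code). Everything in it is made up for illustration: struct netns_model, struct prog_link_model and the link_*() / defrag_*() helpers are hypothetical stand-ins, not existing kernel symbols. The only point is that the reference is taken once at attach time and dropped once at release time, while running the prog never touches the count:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for per-netns defrag state; not a kernel type. */
struct netns_model {
	unsigned int defrag_users;	/* attachments that currently need defrag */
	int hooks_registered;		/* 1 while the defrag hooks are installed */
};

/* Models the "enable" side: the first user installs the hooks. */
int defrag_enable(struct netns_model *net)
{
	if (net->defrag_users == 0)
		net->hooks_registered = 1;
	net->defrag_users++;
	return 0;
}

/* Models the "disable" side: the last user removes the hooks. */
void defrag_disable(struct netns_model *net)
{
	assert(net->defrag_users > 0);
	if (--net->defrag_users == 0)
		net->hooks_registered = 0;
}

/* Hypothetical attachment object that owns exactly one defrag reference. */
struct prog_link_model {
	struct netns_model *net;
	int wants_defrag;
	unsigned long runs;
};

/* Attach: take the reference once, no matter how often the prog runs. */
int link_attach(struct prog_link_model *link, struct netns_model *net,
		int wants_defrag)
{
	link->net = net;
	link->wants_defrag = wants_defrag;
	link->runs = 0;
	return wants_defrag ? defrag_enable(net) : 0;
}

/* Run: deliberately leaves the refcount alone, so it cannot overflow. */
void link_run(struct prog_link_model *link)
{
	link->runs++;
}

/* Release: drop the reference automatically when the attachment goes away. */
void link_release(struct prog_link_model *link)
{
	if (link->wants_defrag)
		defrag_disable(link->net);
	link->net = NULL;
}
```

With this shape, two attachments in the same netns hold two references, any number of link_run() calls leave the count at two, and the hooks only go away once both attachments have been released. Whether the reference should hang off the prog or off the attachment is part of what needs discussing.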
Once the netfilter prog type series lands I can get that discussion started. Unless Daniel feels strongly that we should continue with the approach in this patchset, I am leaning towards dropping it in favor of the netfilter approach.
Thanks,
Daniel