On Mon, Feb 27, 2023 at 5:57 PM Daniel Xu dxu@dxuuu.xyz wrote:
Hi Alexei,
On Mon, Feb 27, 2023 at 03:03:38PM -0800, Alexei Starovoitov wrote:
On Mon, Feb 27, 2023 at 12:51:02PM -0700, Daniel Xu wrote:
=== Context ===
In the context of a middlebox, fragmented packets are tricky to handle. The full 5-tuple of a packet is often only available in the first fragment, which makes enforcing consistent policy difficult. There are really only two stateless options, neither of which is very nice:
Enforce policy on first fragment and accept all subsequent fragments. This works but may let in certain attacks or allow data exfiltration.
Enforce policy on first fragment and drop all subsequent fragments. This does not really work b/c some protocols may rely on fragmentation. For example, DNS may rely on oversized UDP packets for large responses.
So stateful tracking is the only sane option. RFC 8900 [0] calls this out as well in section 6.3:
Middleboxes [...] should process IP fragments in a manner that is consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes must maintain state in order to achieve this goal.
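To make the first stateless option above concrete, here is a minimal tc BPF sketch (illustrative only, not taken from the patchset): policy runs on the first fragment, and any fragment with a non-zero offset is accepted blindly because its 5-tuple is unavailable.

// SPDX-License-Identifier: GPL-2.0
/* Illustrative only -- not part of the patchset. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Placeholder for a real policy check on the full 5-tuple. */
static __always_inline int policy_allows(const struct iphdr *iph)
{
	return 1;
}

SEC("tc")
int frag_naive_policy(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;
	struct ethhdr *eth = data;
	struct iphdr *iph;

	if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
		return TC_ACT_OK;

	iph = (void *)(eth + 1);
	if ((void *)(iph + 1) > data_end)
		return TC_ACT_OK;

	/* Non-first fragment: no L4 header, so option 1 just lets it pass. */
	if (iph->frag_off & bpf_htons(IP_OFFSET))
		return TC_ACT_OK;

	return policy_allows(iph) ? TC_ACT_OK : TC_ACT_SHOT;
}

char _license[] SEC("license") = "GPL";

Swapping the TC_ACT_OK in the non-first-fragment branch for TC_ACT_SHOT turns this into the second (drop) option, with the DNS-style breakage described above.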
=== BPF related bits ===
However, when policy is enforced through BPF, the prog is run before the kernel reassembles fragmented packets. This leaves BPF developers in an awkward place: implement reassembly (possibly poorly) or use a stateless method as described above.
Fortunately, the kernel has robust support for fragmented IP packets. This patchset wraps the existing defragmentation facilities in kfuncs so that BPF progs running on middleboxes can reassemble fragmented packets before applying policy.
=== Patchset details ===
This patchset is (hopefully) relatively straightforward from a BPF perspective. One thing I'd like to call out is the skb_copy()ing of the prog skb. I did this to maintain the invariant that the ctx remains valid after the prog has run. This is relevant b/c ip_defrag() and ip_check_defrag() may consume the skb if the skb is a fragment.
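As a sketch of how such a kfunc might be used from a tc program -- the name bpf_skb_ip_check_defrag() and its return convention below are invented for illustration and are not necessarily what the patchset exposes:

// SPDX-License-Identifier: GPL-2.0
/* Hypothetical usage sketch; the kfunc name/signature are invented. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

/* Assumed semantics: 0 = ctx now holds a full packet, >0 = fragment was
 * absorbed into reassembly state, <0 = error.
 */
extern int bpf_skb_ip_check_defrag(struct __sk_buff *skb) __ksym;

SEC("tc")
int defrag_then_police(struct __sk_buff *skb)
{
	int ret = bpf_skb_ip_check_defrag(skb);

	if (ret < 0)
		return TC_ACT_SHOT;
	if (ret > 0)
		/* A copy of the fragment was queued for reassembly; how the
		 * original should be handled depends on the final API.
		 */
		return TC_ACT_SHOT;

	/* Full packet: the complete 5-tuple is now available for policy. */
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";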
Instead of doing all that with an extra skb copy, can you hook the bpf prog after the networking stack has already handled ip defrag? What kind of middlebox are you building? Why does it have to run at the TC layer?
Unless I'm missing something, the only other relevant hooks would be socket hooks, right?
Unfortunately I don't think my use case can do that. We are running the kernel as a router, so no sockets are involved.
Are you using bpf_fib_lookup and populating the kernel routing table and doing everything on your own, including neigh?
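(For readers not familiar with the helper being referenced: making the forwarding decision from tc via bpf_fib_lookup() typically looks roughly like the sketch below; IPv6, MTU checks and neighbour-miss handling are omitted, and AF_INET is defined locally since it is not in the UAPI BPF headers.)

// SPDX-License-Identifier: GPL-2.0
/* Minimal sketch of the bpf_fib_lookup() + redirect pattern. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#ifndef AF_INET
#define AF_INET 2
#endif

SEC("tc")
int fib_forward(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;
	struct ethhdr *eth = data;
	struct iphdr *iph = data + sizeof(*eth);
	struct bpf_fib_lookup fib = {};
	int rc;

	if ((void *)(iph + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
		return TC_ACT_OK;

	fib.family      = AF_INET;
	fib.tos         = iph->tos;
	fib.l4_protocol = iph->protocol;
	fib.tot_len     = bpf_ntohs(iph->tot_len);
	fib.ipv4_src    = iph->saddr;
	fib.ipv4_dst    = iph->daddr;
	fib.ifindex     = skb->ingress_ifindex;

	rc = bpf_fib_lookup(skb, &fib, sizeof(fib), 0);
	if (rc != BPF_FIB_LKUP_RET_SUCCESS)
		return TC_ACT_OK;	/* fall back to the regular stack */

	/* The FIB/neigh result carries the MACs and egress ifindex. */
	__builtin_memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
	__builtin_memcpy(eth->h_source, fib.smac, ETH_ALEN);
	return bpf_redirect(fib.ifindex, 0);
}

char _license[] SEC("license") = "GPL";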
Have you considered an skb redirect to another netdev that does ip defrag? macvlan, for example, does it under some conditions. This can be generalized.
Recently Florian proposed allowing bpf progs to be called from all existing netfilter hooks. Could you pretend to do local delivery and hook in NF_INET_LOCAL_IN? I feel it would be so much cleaner if the stack did ip_defrag normally. The general issue of skb ownership between the bpf prog and the defrag logic isn't really solved with skb_copy. It's still an issue.
On 2/28/23 5:56 AM, Alexei Starovoitov wrote:
[...]
I do like this series and we would also use it for the Cilium case, so +1 on the tc BPF integration. Today in Cilium we have what Ed [0] hinted at in his earlier mail: we extract information from the first fragment and store the metadata in a BPF map, keyed by ipid, for subsequent packets [1], but limitations apply, e.g. service load-balancing won't work. Redirecting to a different device or moving higher up the stack is cumbersome, since we would then need to recirculate back into the tc BPF layer, which is where all the business logic lives and which also handles the regular (non-fragmented) path. Wrt skb ownership, can you elaborate on what the concrete issue is, exactly? Is there anything about this approach that comes to mind that could crash the kernel?
[0] https://lore.kernel.org/bpf/cf49a091-9b14-05b8-6a79-00e56f3019e1@gmail.com/
[1] https://github.com/cilium/cilium/pull/10264
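A rough sketch of the map behind that kind of stateless tracking (struct and map names here are invented and do not reflect Cilium's actual datapath code): the first fragment's L4 ports and verdict are cached keyed by (saddr, daddr, ipid, proto), and non-first fragments look the entry up instead of parsing L4.

// SPDX-License-Identifier: GPL-2.0
/* Illustrative only. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct frag_key {
	__be32 saddr;
	__be32 daddr;
	__be16 ipid;	/* IP identification field shared by all fragments */
	__u8   proto;
	__u8   pad;
};

struct frag_val {
	__be16 sport;
	__be16 dport;
	__u8   verdict;	/* policy decision taken on the first fragment */
	__u8   pad[3];
};

struct {
	__uint(type, BPF_MAP_TYPE_LRU_HASH);
	__uint(max_entries, 8192);
	__type(key, struct frag_key);
	__type(value, struct frag_val);
} frag_track SEC(".maps");

As noted above, this only works for pass/drop style policy; anything that has to rewrite the packet per L4 ports on every fragment, such as service load-balancing, cannot be reconstructed this way.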
Hi Alexei,
On Mon, Feb 27, 2023 at 08:56:38PM -0800, Alexei Starovoitov wrote:
[...]
Are you using bpf_fib_lookup and populating the kernel routing table and doing everything on your own, including neigh?
We're not doing any routing in BPF yet. All the routing manipulation has been done in iptables / netfilter so far. I'm not super familiar with the routing side, but from what I understand there is some relatively complicated stuff going on with BGP and ipsec tunnels at the moment. Not sure if that answers your question.
Have you considered an skb redirect to another netdev that does ip defrag? macvlan, for example, does it under some conditions. This can be generalized.
I had not considered that yet. Are you suggesting adding a new passthrough netdev that'll defrag? I looked at the macvlan driver and it looks like it defrags to handle some multicast corner case.
Recently Florian proposed allowing bpf progs to be called from all existing netfilter hooks. Could you pretend to do local delivery and hook in NF_INET_LOCAL_IN?
Does that work for forwarding cases? I'm reading through [0] and it seems to suggest that it'll only defrag for locally destined packets:
If the destination IP address matches the local NIC's IP address, the dst_input() function will bring the packet into ip_local_deliver(), which will defrag the packet and pass it to the NF_IP_LOCAL_IN hook
Faking local delivery seems kinda ugly -- maybe I don't know any clean ways.
[...]
[0]: https://kernelnewbies.org/Networking?action=AttachFile&do=get&target...
Thanks, Daniel
On Tue, Feb 28, 2023 at 3:17 PM Daniel Xu dxu@dxuuu.xyz wrote:
Have you considered an skb redirect to another netdev that does ip defrag? macvlan, for example, does it under some conditions. This can be generalized.
I had not considered that yet. Are you suggesting adding a new passthrough netdev that'll defrag? I looked at the macvlan driver and it looks like it defrags to handle some multicast corner case.
Something like that. A netdev that a bpf prog can redirect to. It will consume ip frags and eventually produce a reassembled skb.
The kernel ip_defrag logic has timeouts, counters, an rhashtable with thresholds, etc. All of them are per netns. Adding just another ip_defrag_user would still share the rhashtable and its limits. The kernel can even do icmp_send(). ip_defrag is not a kfunc. It's a big block with plenty of kernel-wide side effects. I really don't think we can alloc_skb, copy_skb, and ip_defrag it. It messes with the stack too much. It's also not clear to me when the skb is reassembled and how bpf sees it. "Redirect into a reassembling netdev" and attaching a bpf prog to consume that skb is much cleaner imo. Maybe there are other ways to use ip_defrag, but certainly not as a synchronous API helper.
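From the BPF side, the flow being described might look like the sketch below, assuming such a reassembling device existed; the device and its ifindex are placeholders, and only bpf_redirect() is a real helper.

// SPDX-License-Identifier: GPL-2.0
/* Sketch of the "redirect into a reassembling netdev" idea. The device
 * itself does not exist today; DEFRAG_DEV_IFINDEX is a placeholder.
 */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

const volatile __u32 DEFRAG_DEV_IFINDEX = 0;	/* set by the loader */

SEC("tc")
int punt_fragments(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;
	struct ethhdr *eth = data;
	struct iphdr *iph;

	if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
		return TC_ACT_OK;

	iph = (void *)(eth + 1);
	if ((void *)(iph + 1) > data_end)
		return TC_ACT_OK;

	/* MF set or non-zero offset => this is a fragment. Hand it to the
	 * hypothetical defrag device instead of processing it here.
	 */
	if (iph->frag_off & bpf_htons(IP_MF | IP_OFFSET))
		return bpf_redirect(DEFRAG_DEV_IFINDEX, 0);

	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";

A second prog attached where the reassembled skb re-enters would then carry the actual policy, which is exactly where the "when is the skb reassembled and how does bpf see it" question comes in.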
Hi Alexei,
(cc netfilter maintainers)
On Mon, Mar 06, 2023 at 08:17:20PM -0800, Alexei Starovoitov wrote:
[...]
I was giving the virtual netdev idea some thought this morning and I thought I'd give the netfilter approach a deeper look.
From my reading (I'll run some tests later) it looks like netfilter will defrag all ipv4/ipv6 packets in any netns with conntrack enabled. It appears to do so in NF_INET_PRE_ROUTING.
Unfortunately that runs after the tc hooks. But fortunately, with the new BPF netfilter hooks, I think we can make defrag work without BPF kfuncs, as you want. And the NF_IP_FORWARD hook works well for my router use case.
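If Florian's netfilter-hook series lands, a forward-hook program might look roughly like this; the context layout, section name, and attach details below are guesses made for illustration, not the final API.

// SPDX-License-Identifier: GPL-2.0
/* Speculative sketch; assumes the proposed netfilter program type
 * exposes the skb via its context.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define NF_DROP   0
#define NF_ACCEPT 1

SEC("netfilter")
int police_forward(struct bpf_nf_ctx *ctx)
{
	const struct sk_buff *skb = ctx->skb;

	if (!skb)
		return NF_ACCEPT;

	/* Attached at NF_INET_FORWARD: the conntrack defrag hook in
	 * NF_INET_PRE_ROUTING has already reassembled fragments (assuming
	 * defrag is enabled in this netns), so policy here sees the full
	 * 5-tuple.
	 */

	/* ... parse skb and apply policy ... */
	return NF_ACCEPT;
}

char _license[] SEC("license") = "GPL";

Hook number, protocol family, and priority would presumably be supplied when the link is created, which is also where the defrag-enable question below comes in.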
One thing we would need, though, is a set of (probably kfunc) wrappers around nf_defrag_ipv4_enable() and nf_defrag_ipv6_enable(), to ensure BPF progs do not transitively depend on defrag support from other netfilter modules.
The exact mechanism would probably need some thinking, as the above functions kinda rely on module_init() and module_exit() semantics. We cannot make the prog bump the refcnt every time it runs -- it would overflow. And it would be nice to automatically free the refcnt when prog is unloaded.
Once the netfilter prog type series lands I can get that discussion started. Unless Daniel feels strongly that we should continue with the approach in this patchset, I am leaning towards dropping it in favor of the netfilter approach.
Thanks, Daniel
Daniel Xu dxu@dxuuu.xyz wrote:
From my reading (I'll run some tests later) it looks like netfilter will defrag all ipv4/ipv6 packets in any netns with conntrack enabled. It appears to do so in NF_INET_PRE_ROUTING.
Yes, and output.
One thing we would need, though, is a set of (probably kfunc) wrappers around nf_defrag_ipv4_enable() and nf_defrag_ipv6_enable(), to ensure BPF progs do not transitively depend on defrag support from other netfilter modules.
The exact mechanism would probably need some thinking, as the above functions kinda rely on module_init() and module_exit() semantics. We cannot make the prog bump the refcnt every time it runs -- it would overflow. And it would be nice to automatically free the refcnt when prog is unloaded.
Probably add a flag attribute that is evaluated at BPF_LINK time, so progs can say they need defrag enabled. Same could be used to request conntrack enablement.
Will need some glue on netfilter side to handle DEFRAG=m, but we already have plenty of those.
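A rough kernel-side sketch of what that link-time glue could look like; the flag plumbing and function name here are invented, while nf_defrag_ipv4_enable() and nf_defrag_ipv6_enable() are the existing entry points being wrapped:

// SPDX-License-Identifier: GPL-2.0
/* Hypothetical glue for a "this link needs defrag" flag evaluated at
 * BPF_LINK_CREATE time; only the nf_defrag_*_enable() calls are real.
 */
#include <linux/errno.h>
#include <linux/netfilter.h>
#include <net/netfilter/ipv4/nf_defrag_ipv4.h>
#include <net/netfilter/ipv6/nf_defrag_ipv6.h>

static int bpf_nf_link_enable_defrag(struct net *net, u8 pf)
{
	switch (pf) {
	case NFPROTO_IPV4:
		/* Bumps the per-netns defrag user count and registers the
		 * defrag hooks if they are not already active.
		 */
		return nf_defrag_ipv4_enable(net);
	case NFPROTO_IPV6:
		return nf_defrag_ipv6_enable(net);
	default:
		return -EAFNOSUPPORT;
	}
}

Calling this once per link creation, and undoing it in the link's release callback, would also sidestep the per-invocation refcount overflow concern raised above.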
On Tue, Mar 7, 2023 at 12:11 PM Florian Westphal fw@strlen.de wrote:
[...]
Probably add a flag attribute that is evaluated at BPF_LINK time, so progs can say they need defrag enabled. Same could be used to request conntrack enablement.
Will need some glue on netfilter side to handle DEFRAG=m, but we already have plenty of those.
It all makes perfect sense to me. It's cleaner than a special netdevice. ipv4_conntrack_defrag() is pretty neat; I didn't know about it. If we can reuse it as-is, that would be ideal. Conceptually it fits perfectly. If we cannot reuse it (for whatever unlikely reason), I would argue that the TC hook should gain similar functionality.