Hi all,
This series updates the drv-net XDP program used by the new xdp.py selftest to use the bpf_dynptr APIs for packet access.
The selftest itself is unchanged.
The original program accessed packet headers directly via ctx->data/data_end, implicitly assuming headers are always in the linear region. That assumption is incorrect for multi-buffer XDP and does not hold across all drivers. For example, mlx5 with striding RQ can leave the linear area empty, causing the multi-buffer cases to fail.
Switching to bpf_xdp_load/store_bytes would work but always incurs copies. Instead, this series adopts bpf_dynptr, which provides safe, verifier-checked access across both linear and fragmented areas while avoiding copies.
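As an illustration of the difference, here is a userspace sketch (all names invented; this is a model, not the kernel implementation): a load_bytes-style accessor always copies, while a slice-style accessor returns a direct pointer whenever the requested range is contiguous, copying into the caller's buffer only as a fallback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of an XDP multi-buffer frame: a linear part plus one fragment. */
struct toy_frame {
	uint8_t *linear; size_t linear_len;
	uint8_t *frag;   size_t frag_len;
};

/* bpf_xdp_load_bytes-style access: always copies 'len' bytes into 'buf'. */
static int toy_load_bytes(const struct toy_frame *f, size_t off,
			  uint8_t *buf, size_t len)
{
	if (off + len > f->linear_len + f->frag_len)
		return -1;
	for (size_t i = 0; i < len; i++) {
		size_t pos = off + i;

		buf[i] = pos < f->linear_len ? f->linear[pos]
					     : f->frag[pos - f->linear_len];
	}
	return 0;
}

/* bpf_dynptr_slice-style access: returns a direct pointer when the
 * requested range lies within one contiguous area, and falls back to
 * copying into 'buf' only when the range straddles a boundary. */
static void *toy_slice(const struct toy_frame *f, size_t off,
		       uint8_t *buf, size_t len)
{
	if (off + len <= f->linear_len)
		return f->linear + off;                 /* zero-copy */
	if (off >= f->linear_len &&
	    off - f->linear_len + len <= f->frag_len)
		return f->frag + (off - f->linear_len); /* zero-copy */
	return toy_load_bytes(f, off, buf, len) ? NULL : buf; /* copy */
}
```

In the mlx5 striding-RQ case the linear part can be empty, but a slice over a range that sits entirely in one fragment still resolves to a direct pointer, so header access remains copy-free.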
Amery Hung has also proposed a series [1] that addresses the same issues in this program, using bpf_xdp_pull_data instead. My series is not intended to replace that work, but to explore another viable solution; each approach may be preferable under different circumstances.
In cases where the program does not return XDP_PASS, I believe dynptr has an advantage since it avoids an extra copy. Conversely, when the program returns XDP_PASS, bpf_xdp_pull_data may be preferable, as the copy will be performed in any case during skb creation.
It may make sense to split the work into two separate programs, allowing us to test both solutions independently. Alternatively, we can consider a combined approach, where the more fitting solution is applied for each use case. I welcome feedback on which direction would be most useful.
[1] https://lore.kernel.org/all/20250905173352.3759457-1-ameryhung@gmail.com/
Thanks!
Nimrod
Nimrod Oren (5):
  selftests: drv-net: Test XDP_TX with bpf_dynptr
  selftests: drv-net: Test XDP tail adjustment with bpf_dynptr
  selftests: drv-net: Test XDP head adjustment with bpf_dynptr
  selftests: drv-net: Adjust XDP header data with bpf_dynptr
  selftests: drv-net: Check XDP header data with bpf_dynptr
 .../selftests/net/lib/xdp_native.bpf.c | 219 ++++++++----------
 1 file changed, 96 insertions(+), 123 deletions(-)
Update xdp_mode_tx_handler to use bpf_dynptr_slice_rdwr instead of accessing the packet data directly.
This makes the test viable for drivers that do not store any packet data in the linear part when in multi-buffer mode.
While here, remove the unused local min() helper.
Signed-off-by: Nimrod Oren <noren@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
---
 .../selftests/net/lib/xdp_native.bpf.c | 49 +++++++++----------
 1 file changed, 23 insertions(+), 26 deletions(-)
diff --git a/tools/testing/selftests/net/lib/xdp_native.bpf.c b/tools/testing/selftests/net/lib/xdp_native.bpf.c
index 521ba38f2ddd..346cbba9afec 100644
--- a/tools/testing/selftests/net/lib/xdp_native.bpf.c
+++ b/tools/testing/selftests/net/lib/xdp_native.bpf.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0

 #include <stddef.h>
+#include <stdbool.h>
 #include <linux/bpf.h>
 #include <linux/in.h>
 #include <linux/if_ether.h>
@@ -9,6 +10,7 @@
 #include <linux/udp.h>
 #include <bpf/bpf_endian.h>
 #include <bpf/bpf_helpers.h>
+#include "../../bpf/bpf_kfuncs.h"

 #define MAX_ADJST_OFFSET 256
 #define MAX_PAYLOAD_LEN 5000
@@ -51,11 +53,6 @@ struct {
 	__type(value, __u64);
 } map_xdp_stats SEC(".maps");

-static __u32 min(__u32 a, __u32 b)
-{
-	return a < b ? a : b;
-}
-
 static void record_stats(struct xdp_md *ctx, __u32 stat_type)
 {
 	__u64 *count;
@@ -145,32 +142,33 @@ static void swap_machdr(void *data)

 static int xdp_mode_tx_handler(struct xdp_md *ctx, __u16 port)
 {
-	void *data_end = (void *)(long)ctx->data_end;
-	void *data = (void *)(long)ctx->data;
 	struct udphdr *udph = NULL;
-	struct ethhdr *eth = data;
+	struct ethhdr *eth = NULL;
+	struct bpf_dynptr ptr;

-	if (data + sizeof(*eth) > data_end)
+	bpf_dynptr_from_xdp(ctx, 0, &ptr);
+	eth = bpf_dynptr_slice_rdwr(&ptr, 0, NULL, sizeof(*eth));
+	if (!eth)
 		return XDP_PASS;

 	if (eth->h_proto == bpf_htons(ETH_P_IP)) {
-		struct iphdr *iph = data + sizeof(*eth);
-		__be32 tmp_ip = iph->saddr;
+		struct iphdr *iph = NULL;
+		__be32 tmp_ip;

-		if (iph + 1 > (struct iphdr *)data_end ||
-		    iph->protocol != IPPROTO_UDP)
+		iph = bpf_dynptr_slice_rdwr(&ptr, sizeof(*eth), NULL,
+					    sizeof(*iph));
+		if (!iph || iph->protocol != IPPROTO_UDP)
 			return XDP_PASS;

-		udph = data + sizeof(*iph) + sizeof(*eth);
-
-		if (udph + 1 > (struct udphdr *)data_end)
-			return XDP_PASS;
-		if (udph->dest != bpf_htons(port))
+		udph = bpf_dynptr_slice(&ptr, sizeof(*iph) + sizeof(*eth), NULL,
+					sizeof(*udph));
+		if (!udph || udph->dest != bpf_htons(port))
 			return XDP_PASS;

 		record_stats(ctx, STATS_RX);
 		swap_machdr((void *)eth);

+		tmp_ip = iph->saddr;
 		iph->saddr = iph->daddr;
 		iph->daddr = tmp_ip;

@@ -179,18 +177,17 @@ static int xdp_mode_tx_handler(struct xdp_md *ctx, __u16 port)
 		return XDP_TX;

 	} else if (eth->h_proto == bpf_htons(ETH_P_IPV6)) {
-		struct ipv6hdr *ipv6h = data + sizeof(*eth);
+		struct ipv6hdr *ipv6h = NULL;
 		struct in6_addr tmp_ipv6;

-		if (ipv6h + 1 > (struct ipv6hdr *)data_end ||
-		    ipv6h->nexthdr != IPPROTO_UDP)
+		ipv6h = bpf_dynptr_slice_rdwr(&ptr, sizeof(*eth), NULL,
+					      sizeof(*ipv6h));
+		if (!ipv6h || ipv6h->nexthdr != IPPROTO_UDP)
 			return XDP_PASS;

-		udph = data + sizeof(*ipv6h) + sizeof(*eth);
-
-		if (udph + 1 > (struct udphdr *)data_end)
-			return XDP_PASS;
-		if (udph->dest != bpf_htons(port))
+		udph = bpf_dynptr_slice(&ptr, sizeof(*ipv6h) + sizeof(*eth),
+					NULL, sizeof(*udph));
+		if (!udph || udph->dest != bpf_htons(port))
 			return XDP_PASS;

 		record_stats(ctx, STATS_RX);
Update xdp_adjst_tail_shrnk_data to use bpf_dynptr_slice to read the data at the end of the packet, instead of bpf_xdp_load_bytes.
This may avoid a copy by returning a direct pointer to the dynptr data.
Note: since bpf_dynptr_slice() requires a size that is constant at verification time, we use MAX_ADJST_OFFSET as the slice size, and update both the size check and the UDP checksum calculation accordingly.
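To spell out why the fixed-size slice still yields the right checksum, here is a userspace model (csum_add/csum_sub are invented stand-ins for what bpf_csum_diff does internally, not kernel code): passing the slice as 'from' with the full fixed size and as 'to' with the fixed size minus offset subtracts exactly the trailing offset bytes from the checksum state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement accumulate of big-endian 16-bit words (len even). */
static uint32_t csum_add(uint32_t csum, const uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i += 2)
		csum += (uint32_t)((buf[i] << 8) | buf[i + 1]);
	return csum;
}

/* Subtracting a block is adding the complement of each of its words. */
static uint32_t csum_sub(uint32_t csum, const uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i += 2)
		csum += (uint32_t)(~((buf[i] << 8) | buf[i + 1]) & 0xffff);
	return csum;
}

/* Same folding as csum_fold_helper() in the selftest. */
static uint16_t csum_fold(uint32_t csum)
{
	csum = (csum & 0xffff) + (csum >> 16);
	csum = (csum & 0xffff) + (csum >> 16);
	return (uint16_t)~csum;
}
```

Folding seed + csum(prefix) - csum(full) gives the same result as computing the checksum of the shortened payload directly, which is exactly what the bpf_csum_diff(data, sizeof(tmp_buff), data, sizeof(tmp_buff) - offset, udp_csum) call relies on.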
Signed-off-by: Nimrod Oren <noren@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
---
 .../selftests/net/lib/xdp_native.bpf.c | 20 ++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/net/lib/xdp_native.bpf.c b/tools/testing/selftests/net/lib/xdp_native.bpf.c
index 346cbba9afec..ff768fbc8606 100644
--- a/tools/testing/selftests/net/lib/xdp_native.bpf.c
+++ b/tools/testing/selftests/net/lib/xdp_native.bpf.c
@@ -272,10 +272,11 @@ static __u16 csum_fold_helper(__u32 csum)
 static int xdp_adjst_tail_shrnk_data(struct xdp_md *ctx, __u16 offset,
 				     __u32 hdr_len)
 {
-	char tmp_buff[MAX_ADJST_OFFSET];
-	__u32 buff_pos, udp_csum = 0;
+	char tmp_buff[MAX_ADJST_OFFSET] = {0};
+	__u32 buff_len, udp_csum = 0;
 	struct udphdr *udph = NULL;
-	__u32 buff_len;
+	struct bpf_dynptr ptr;
+	void *data = NULL;

 	udph = update_pkt(ctx, 0 - offset, &udp_csum);
 	if (!udph)
@@ -285,18 +286,19 @@ static int xdp_adjst_tail_shrnk_data(struct xdp_md *ctx, __u16 offset,

 	offset = (offset & 0x1ff) >= MAX_ADJST_OFFSET ? MAX_ADJST_OFFSET :
 		 offset & 0xff;
-	if (offset == 0)
-		return -1;

 	/* Make sure we have enough data to avoid eating the header */
-	if (buff_len - offset < hdr_len)
+	if (buff_len - sizeof(tmp_buff) < hdr_len)
 		return -1;

-	buff_pos = buff_len - offset;
-	if (bpf_xdp_load_bytes(ctx, buff_pos, tmp_buff, offset) < 0)
+	bpf_dynptr_from_xdp(ctx, 0, &ptr);
+	data = bpf_dynptr_slice(&ptr, buff_len - sizeof(tmp_buff), tmp_buff,
+				sizeof(tmp_buff));
+	if (!data)
 		return -1;

-	udp_csum = bpf_csum_diff((__be32 *)tmp_buff, offset, 0, 0, udp_csum);
+	udp_csum = bpf_csum_diff(data, sizeof(tmp_buff), data,
+				 sizeof(tmp_buff) - offset, udp_csum);
 	udph->check = (__u16)csum_fold_helper(udp_csum);

 	if (bpf_xdp_adjust_tail(ctx, 0 - offset) < 0)
Update xdp_adjst_head_shrnk_data to use bpf_dynptr_slice to read the data right after the headers, instead of bpf_xdp_load_bytes.
This may avoid a copy by returning a direct pointer to the dynptr data.
Also, use bpf_dynptr_read and bpf_dynptr_write to move the headers, instead of bpf_xdp_load_bytes and bpf_xdp_store_bytes.
Signed-off-by: Nimrod Oren <noren@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
---
 .../selftests/net/lib/xdp_native.bpf.c | 35 +++++++------------
 1 file changed, 12 insertions(+), 23 deletions(-)
diff --git a/tools/testing/selftests/net/lib/xdp_native.bpf.c b/tools/testing/selftests/net/lib/xdp_native.bpf.c
index ff768fbc8606..71172d32c529 100644
--- a/tools/testing/selftests/net/lib/xdp_native.bpf.c
+++ b/tools/testing/selftests/net/lib/xdp_native.bpf.c
@@ -398,10 +398,11 @@ static int xdp_adjst_tail(struct xdp_md *ctx, __u16 port)
 static int xdp_adjst_head_shrnk_data(struct xdp_md *ctx, __u64 hdr_len,
 				     __u32 offset)
 {
-	char tmp_buff[MAX_ADJST_OFFSET];
+	char tmp_buff[MAX_ADJST_OFFSET] = {0};
+	struct bpf_dynptr ptr;
 	struct udphdr *udph;
-	void *offset_ptr;
 	__u32 udp_csum = 0;
+	void *data = NULL;

 	/* Update the length information in the IP and UDP headers before
 	 * adjusting the headroom. This simplifies accessing the relevant
@@ -414,37 +415,25 @@ static int xdp_adjst_head_shrnk_data(struct xdp_md *ctx, __u64 hdr_len,
 	if (!udph)
 		return -1;

-	offset = (offset & 0x1ff) >= MAX_ADJST_OFFSET ? MAX_ADJST_OFFSET :
-		 offset & 0xff;
-	if (offset == 0)
-		return -1;
-
-	if (bpf_xdp_load_bytes(ctx, hdr_len, tmp_buff, offset) < 0)
+	bpf_dynptr_from_xdp(ctx, 0, &ptr);
+	data = bpf_dynptr_slice(&ptr, hdr_len, tmp_buff, sizeof(tmp_buff));
+	if (!data)
 		return -1;

-	udp_csum = bpf_csum_diff((__be32 *)tmp_buff, offset, 0, 0, udp_csum);
+	udp_csum = bpf_csum_diff(data, offset, 0, 0, udp_csum);

 	udph->check = (__u16)csum_fold_helper(udp_csum);

-	if (bpf_xdp_load_bytes(ctx, 0, tmp_buff, MAX_ADJST_OFFSET) < 0)
-		return -1;
-
-	if (bpf_xdp_adjust_head(ctx, offset) < 0)
-		return -1;
-
-	if (offset > MAX_ADJST_OFFSET)
-		return -1;
-
-	if (hdr_len > MAX_ADJST_OFFSET || hdr_len == 0)
-		return -1;
-
 	/* Added here to handle clang complain about negative value */
 	hdr_len = hdr_len & 0xff;

-	if (hdr_len == 0)
+	if (bpf_dynptr_read(tmp_buff, hdr_len, &ptr, 0, 0) < 0)
 		return -1;

-	if (bpf_xdp_store_bytes(ctx, 0, tmp_buff, hdr_len) < 0)
+	if (bpf_dynptr_write(&ptr, offset, tmp_buff, hdr_len, 0) < 0)
+		return -1;
+
+	if (bpf_xdp_adjust_head(ctx, offset) < 0)
 		return -1;

 	return 0;
On 9/9/25 1:52 AM, Nimrod Oren wrote:
> +	if (bpf_dynptr_read(tmp_buff, hdr_len, &ptr, 0, 0) < 0)
> 		return -1;
>
> -	if (bpf_xdp_store_bytes(ctx, 0, tmp_buff, hdr_len) < 0)
> +	if (bpf_dynptr_write(&ptr, offset, tmp_buff, hdr_len, 0) < 0)
> +		return -1;
> +
> +	if (bpf_xdp_adjust_head(ctx, offset) < 0)
> 		return -1;
>
> 	return 0;

Instead of bpf_dynptr_read() and then bpf_dynptr_write(), try to use bpf_dynptr_copy().

I think you have also noticed that Amery is modifying xdp_native.bpf.c to test linear access (i.e. data/data_end) together with the bpf_xdp_pull_data addition. I am not sure whether xdp_native.bpf.c wants to keep testing both (data/data_end and dynptr). The bpf selftests do want test coverage for both, and some selftests for that already exist.
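For what it's worth, the read+write pair amounts to a bounded move of the first hdr_len bytes to offset, which a single bpf_dynptr_copy() call (dst dynptr/offset, src dynptr/offset, size - assuming I have the signature right) could express without the bounce buffer. A userspace sketch of the equivalence, with invented helper names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ADJST_OFFSET 256

/* Two-step variant, mirroring the bpf_dynptr_read() + bpf_dynptr_write()
 * pair in the patch: stage the headers in a bounce buffer, then write
 * them back at 'offset'. */
static int move_headers_two_step(uint8_t *pkt, size_t pkt_len,
				 size_t hdr_len, size_t offset)
{
	uint8_t tmp[MAX_ADJST_OFFSET];

	if (hdr_len > sizeof(tmp) || offset + hdr_len > pkt_len)
		return -1;
	memcpy(tmp, pkt, hdr_len);           /* ~ bpf_dynptr_read()  */
	memmove(pkt + offset, tmp, hdr_len); /* ~ bpf_dynptr_write() */
	return 0;
}

/* One-step variant: what a single bpf_dynptr_copy() call would express,
 * copying directly from offset 0 to 'offset' with no bounce buffer. */
static int move_headers_one_step(uint8_t *pkt, size_t pkt_len,
				 size_t hdr_len, size_t offset)
{
	if (offset + hdr_len > pkt_len)
		return -1;
	memmove(pkt + offset, pkt, hdr_len);
	return 0;
}
```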
Update update_pkt to use bpf_dynptr_slice_rdwr instead of accessing the packet data directly.
This makes the test viable for drivers that do not store any packet data in the linear part when in multi-buffer mode.
Signed-off-by: Nimrod Oren <noren@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
---
 .../selftests/net/lib/xdp_native.bpf.c | 35 ++++++++++---------
 1 file changed, 19 insertions(+), 16 deletions(-)
diff --git a/tools/testing/selftests/net/lib/xdp_native.bpf.c b/tools/testing/selftests/net/lib/xdp_native.bpf.c
index 71172d32c529..ff63f572552b 100644
--- a/tools/testing/selftests/net/lib/xdp_native.bpf.c
+++ b/tools/testing/selftests/net/lib/xdp_native.bpf.c
@@ -208,38 +208,41 @@ static int xdp_mode_tx_handler(struct xdp_md *ctx, __u16 port)

 static void *update_pkt(struct xdp_md *ctx, __s16 offset, __u32 *udp_csum)
 {
-	void *data_end = (void *)(long)ctx->data_end;
-	void *data = (void *)(long)ctx->data;
 	struct udphdr *udph = NULL;
-	struct ethhdr *eth = data;
+	struct ethhdr *eth = NULL;
+	struct bpf_dynptr ptr;
 	__u32 len, len_new;

-	if (data + sizeof(*eth) > data_end)
+	bpf_dynptr_from_xdp(ctx, 0, &ptr);
+	eth = bpf_dynptr_slice(&ptr, 0, NULL, sizeof(*eth));
+	if (!eth)
 		return NULL;

 	if (eth->h_proto == bpf_htons(ETH_P_IP)) {
-		struct iphdr *iph = data + sizeof(*eth);
-		__u16 total_len;
-
-		if (iph + 1 > (struct iphdr *)data_end)
+		struct iphdr *iph = bpf_dynptr_slice_rdwr(&ptr, sizeof(*eth),
+							  NULL, sizeof(*iph));
+		if (!iph)
 			return NULL;

 		iph->tot_len = bpf_htons(bpf_ntohs(iph->tot_len) + offset);

-		udph = (void *)eth + sizeof(*iph) + sizeof(*eth);
-		if (!udph || udph + 1 > (struct udphdr *)data_end)
+		udph = bpf_dynptr_slice_rdwr(&ptr, sizeof(*iph) + sizeof(*eth),
+					     NULL, sizeof(*udph));
+		if (!udph)
 			return NULL;

 		len_new = bpf_htons(bpf_ntohs(udph->len) + offset);
 	} else if (eth->h_proto == bpf_htons(ETH_P_IPV6)) {
-		struct ipv6hdr *ipv6h = data + sizeof(*eth);
-		__u16 payload_len;
-
-		if (ipv6h + 1 > (struct ipv6hdr *)data_end)
+		struct ipv6hdr *ipv6h =
+			bpf_dynptr_slice_rdwr(&ptr, sizeof(*eth),
+					      NULL, sizeof(*ipv6h));
+		if (!ipv6h)
 			return NULL;

-		udph = (void *)eth + sizeof(*ipv6h) + sizeof(*eth);
-		if (!udph || udph + 1 > (struct udphdr *)data_end)
+		udph = bpf_dynptr_slice_rdwr(&ptr,
+					     sizeof(*ipv6h) + sizeof(*eth),
+					     NULL, sizeof(*udph));
+		if (!udph)
 			return NULL;

 		*udp_csum = ~((__u32)udph->check);
Update filter_udphdr to use bpf_dynptr_slice to read the packet headers instead of accessing them directly.
The function previously returned a pointer to the UDP header, which was then used to calculate Ethernet and IP header lengths by subtracting xdp->data. Since this only works when the UDP header is in the linear region, rework the function to return the total header length instead of a pointer. Rename filter_udphdr() to check_udphdr() to reflect the new behavior.
This makes the test viable for drivers that do not store any packet data in the linear part when in multi-buffer mode.
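As a sanity check on the new return value, here is a userspace sketch of the pointer arithmetic (toy constants standing in for the real sizeof(struct ethhdr)/iphdr/udphdr): for an IPv4/UDP packet the computed total header length is the familiar 14 + 20 + 8 = 42 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative header sizes; stand-ins for the kernel struct sizes. */
#define TOY_ETH_HLEN  14
#define TOY_IPV4_HLEN 20
#define TOY_UDP_HLEN   8

/* Mirrors the return expression in check_udphdr():
 * (void *)udph - (void *)eth + sizeof(*udph). */
static size_t toy_hdr_len(const char *eth, const char *udph)
{
	return (size_t)(udph - eth) + TOY_UDP_HLEN;
}
```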
Signed-off-by: Nimrod Oren <noren@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
---
 .../selftests/net/lib/xdp_native.bpf.c | 80 +++++++------------
 1 file changed, 31 insertions(+), 49 deletions(-)
diff --git a/tools/testing/selftests/net/lib/xdp_native.bpf.c b/tools/testing/selftests/net/lib/xdp_native.bpf.c
index ff63f572552b..6df5164e3791 100644
--- a/tools/testing/selftests/net/lib/xdp_native.bpf.c
+++ b/tools/testing/selftests/net/lib/xdp_native.bpf.c
@@ -63,53 +63,49 @@ static void record_stats(struct xdp_md *ctx, __u32 stat_type)
 	__sync_fetch_and_add(count, 1);
 }

-static struct udphdr *filter_udphdr(struct xdp_md *ctx, __u16 port)
+static __u32 check_udphdr(struct xdp_md *ctx, __u16 port)
 {
-	void *data_end = (void *)(long)ctx->data_end;
-	void *data = (void *)(long)ctx->data;
 	struct udphdr *udph = NULL;
-	struct ethhdr *eth = data;
+	struct ethhdr *eth = NULL;
+	struct bpf_dynptr ptr;

-	if (data + sizeof(*eth) > data_end)
-		return NULL;
+	bpf_dynptr_from_xdp(ctx, 0, &ptr);
+	eth = bpf_dynptr_slice(&ptr, 0, NULL, sizeof(*eth));
+	if (!eth)
+		return 0;

 	if (eth->h_proto == bpf_htons(ETH_P_IP)) {
-		struct iphdr *iph = data + sizeof(*eth);
+		struct iphdr *iph = bpf_dynptr_slice(&ptr, sizeof(*eth),
+						     NULL, sizeof(*iph));

-		if (iph + 1 > (struct iphdr *)data_end ||
-		    iph->protocol != IPPROTO_UDP)
-			return NULL;
+		if (!iph || iph->protocol != IPPROTO_UDP)
+			return 0;

-		udph = (void *)eth + sizeof(*iph) + sizeof(*eth);
-	} else if (eth->h_proto == bpf_htons(ETH_P_IPV6)) {
-		struct ipv6hdr *ipv6h = data + sizeof(*eth);
+		udph = bpf_dynptr_slice(&ptr, sizeof(*iph) + sizeof(*eth),
+					NULL, sizeof(*udph));
+	} else if (eth->h_proto == bpf_htons(ETH_P_IPV6)) {
+		struct ipv6hdr *ipv6h = bpf_dynptr_slice(&ptr, sizeof(*eth),
+							 NULL, sizeof(*ipv6h));

-		if (ipv6h + 1 > (struct ipv6hdr *)data_end ||
-		    ipv6h->nexthdr != IPPROTO_UDP)
-			return NULL;
+		if (!ipv6h || ipv6h->nexthdr != IPPROTO_UDP)
+			return 0;

-		udph = (void *)eth + sizeof(*ipv6h) + sizeof(*eth);
+		udph = bpf_dynptr_slice(&ptr, sizeof(*ipv6h) + sizeof(*eth),
+					NULL, sizeof(*udph));
 	} else {
-		return NULL;
+		return 0;
 	}

-	if (udph + 1 > (struct udphdr *)data_end)
-		return NULL;
-
-	if (udph->dest != bpf_htons(port))
-		return NULL;
+	if (!udph || udph->dest != bpf_htons(port))
+		return 0;

 	record_stats(ctx, STATS_RX);
-
-	return udph;
+	return (void *)udph - (void *)eth + sizeof(*udph);
 }

 static int xdp_mode_pass(struct xdp_md *ctx, __u16 port)
 {
-	struct udphdr *udph = NULL;
-
-	udph = filter_udphdr(ctx, port);
-	if (!udph)
+	if (!check_udphdr(ctx, port))
 		return XDP_PASS;

 	record_stats(ctx, STATS_PASS);
@@ -119,10 +115,7 @@ static int xdp_mode_pass(struct xdp_md *ctx, __u16 port)

 static int xdp_mode_drop_handler(struct xdp_md *ctx, __u16 port)
 {
-	struct udphdr *udph = NULL;
-
-	udph = filter_udphdr(ctx, port);
-	if (!udph)
+	if (!check_udphdr(ctx, port))
 		return XDP_PASS;

 	record_stats(ctx, STATS_DROP);
@@ -363,19 +356,14 @@ static int xdp_adjst_tail_grow_data(struct xdp_md *ctx, __u16 offset)

 static int xdp_adjst_tail(struct xdp_md *ctx, __u16 port)
 {
-	void *data = (void *)(long)ctx->data;
-	struct udphdr *udph = NULL;
-	__s32 *adjust_offset, *val;
+	__s32 *adjust_offset;
 	__u32 key, hdr_len;
-	void *offset_ptr;
-	__u8 tag;
 	int ret;

-	udph = filter_udphdr(ctx, port);
-	if (!udph)
+	hdr_len = check_udphdr(ctx, port);
+	if (!hdr_len)
 		return XDP_PASS;

-	hdr_len = (void *)udph - data + sizeof(struct udphdr);
 	key = XDP_ADJST_OFFSET;
 	adjust_offset = bpf_map_lookup_elem(&map_xdp_setup, &key);
 	if (!adjust_offset)
@@ -504,20 +492,14 @@ static int xdp_adjst_head_grow_data(struct xdp_md *ctx, __u64 hdr_len,

 static int xdp_head_adjst(struct xdp_md *ctx, __u16 port)
 {
-	void *data_end = (void *)(long)ctx->data_end;
-	void *data = (void *)(long)ctx->data;
-	struct udphdr *udph_ptr = NULL;
 	__u32 key, size, hdr_len;
 	__s32 *val;
 	int res;

-	/* Filter packets based on UDP port */
-	udph_ptr = filter_udphdr(ctx, port);
-	if (!udph_ptr)
+	hdr_len = check_udphdr(ctx, port);
+	if (!hdr_len)
 		return XDP_PASS;

-	hdr_len = (void *)udph_ptr - data + sizeof(struct udphdr);
-
 	key = XDP_ADJST_OFFSET;
 	val = bpf_map_lookup_elem(&map_xdp_setup, &key);
 	if (!val)
On Tue, 9 Sep 2025 11:52:31 +0300 Nimrod Oren wrote:
> In cases where the program does not return XDP_PASS, I believe dynptr
> has an advantage since it avoids an extra copy. Conversely, when the
> program returns XDP_PASS, bpf_xdp_pull_data may be preferable, as the
> copy will be performed in any case during skb creation.
>
> It may make sense to split the work into two separate programs,
> allowing us to test both solutions independently. Alternatively, we
> can consider a combined approach, where the more fitting solution is
> applied for each use case. I welcome feedback on which direction
> would be most useful.
Ideally we'd make the BPF code work in either mode. But not sure it's achievable given the verification complexity. Failing that we can have two BPF objects and parameterize the test cases.
It'd be neat if we could capture if the pull actually pulled anything for a given case, so that we only bother re-running with the dynptr when we'd end up with different packet geometry. But only if it doesn't make the test code too complex.