This patch set improves the documentation and selftests for XDP Rx metadata handling. The first patch clarifies the documentation around XDP metadata layout and METADATA_SIZE. The second patch enhances the BPF selftests to make XDP metadata handling more robust across different NICs.
Prior to this patch set, the XDP program might accidentally overwrite the device-reserved metadata.
V3: - update doc and commit msg accordingly.
V2: https://lore.kernel.org/netdev/20250702030349.3275368-1-yoong.siang.song@int... - unconditionally do bpf_xdp_adjust_meta with -XDP_METADATA_SIZE (Stanislav)
V1: https://lore.kernel.org/netdev/20250701042940.3272325-1-yoong.siang.song@int...
Song Yoong Siang (2): doc: enhance explanation of XDP Rx metadata layout and METADATA_SIZE selftests/bpf: Enhance XDP Rx metadata handling
Documentation/networking/xdp-rx-metadata.rst | 36 +++++++++++++++---- .../selftests/bpf/prog_tests/xdp_metadata.c | 2 +- .../selftests/bpf/progs/xdp_hw_metadata.c | 2 +- .../selftests/bpf/progs/xdp_metadata.c | 2 +- tools/testing/selftests/bpf/xdp_hw_metadata.c | 2 +- tools/testing/selftests/bpf/xdp_metadata.h | 7 ++++ 6 files changed, 41 insertions(+), 10 deletions(-)
Add diagram to show metadata layout of devices that utilize the data_meta area for their own purposes. Besides, enhance the documentation on selecting an appropriate METADATA_SIZE for XDP Rx metadata, ensuring it accommodates both device-reserved and custom metadata. It includes considerations for alignment and size constraints. The updated guidance helps users correctly allocate and access metadata in AF_XDP scenarios.
Signed-off-by: Song Yoong Siang yoong.siang.song@intel.com --- Documentation/networking/xdp-rx-metadata.rst | 36 ++++++++++++++++---- 1 file changed, 30 insertions(+), 6 deletions(-)
diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst index a6e0ece18be5..65a1a6e0f7a2 100644 --- a/Documentation/networking/xdp-rx-metadata.rst +++ b/Documentation/networking/xdp-rx-metadata.rst @@ -54,6 +54,19 @@ area in whichever format it chooses. Later consumers of the metadata will have to agree on the format by some out of band contract (like for the AF_XDP use case, see below).
+It is important to note that some devices may utilize the ``data_meta`` area for +their own purposes. For example, the IGC device utilizes ``IGC_TS_HDR_LEN`` +bytes of the ``data_meta`` area for receiving hardware timestamps. Therefore, +the XDP program should ensure that it does not overwrite any existing metadata. +The metadata layout of such device is depicted below:: + + +----------+-----------------+--------------------------+------+ + | headroom | custom metadata | device-reserved metadata | data | + +----------+-----------------+--------------------------+------+ + ^ ^ + | | + xdp_buff->data_meta xdp_buff->data + AF_XDP ======
@@ -69,12 +82,23 @@ descriptor does _not_ explicitly carry the size of the metadata).
Here is the ``AF_XDP`` consumer layout (note missing ``data_meta`` pointer)::
- +----------+-----------------+------+ - | headroom | custom metadata | data | - +----------+-----------------+------+ - ^ - | - rx_desc->address + |<--------------METADATA_SIZE--------------->| + +----------+-----------------+--------------------------+------+ + | headroom | custom metadata | device-reserved metadata | data | + +----------+-----------------+--------------------------+------+ + ^ + | + rx_desc->address + +It is crucial that the agreed ``METADATA_SIZE`` between the BPF program and the +final consumer is sufficient to accommodate both device-reserved metadata and +the data the BPF program needs to populate. + +``bpf_xdp_adjust_meta`` ensures that ``METADATA_SIZE`` is aligned to 4 bytes, +does not exceed 252 bytes, and leaves sufficient space for building the +xdp_frame. If these conditions are not met, it returns a negative error. In this +case, the BPF program should not proceed to populate data into the ``data_meta`` +area.
XDP_PASS ========
On 7/2/25 6:57 PM, Song Yoong Siang wrote: [...]
+It is important to note that some devices may utilize the ``data_meta`` area for +their own purposes. For example, the IGC device utilizes ``IGC_TS_HDR_LEN`` +bytes of the ``data_meta`` area for receiving hardware timestamps. Therefore, +the XDP program should ensure that it does not overwrite any existing metadata. +The metadata layout of such device is depicted below::
- +----------+-----------------+--------------------------+------+
- | headroom | custom metadata | device-reserved metadata | data |
- +----------+-----------------+--------------------------+------+
^ ^
| |
- xdp_buff->data_meta xdp_buff->data
Imho, this section is misleading to developers. Suppose you're a XDP program writer and you want to implement a generic native BPF program (independent of the underlying NIC). Does this mean, the expectation is to dig into driver code to gather whether or not a driver is prepopulating and how much of it? What are the implications if the data is overwritten? For example, in Cilium today we use the buffer described here as device-reserved metadata and override it. How will users know what breaks?
On Thursday, July 3, 2025 11:58 PM, Daniel Borkmann daniel@iogearbox.net wrote:
On 7/2/25 6:57 PM, Song Yoong Siang wrote: [...]
+It is important to note that some devices may utilize the ``data_meta`` area for +their own purposes. For example, the IGC device utilizes ``IGC_TS_HDR_LEN`` +bytes of the ``data_meta`` area for receiving hardware timestamps. Therefore, +the XDP program should ensure that it does not overwrite any existing metadata. +The metadata layout of such device is depicted below::
- +----------+-----------------+--------------------------+------+
- | headroom | custom metadata | device-reserved metadata | data |
- +----------+-----------------+--------------------------+------+
^ ^
| |
- xdp_buff->data_meta xdp_buff->data
Imho, this section is misleading to developers. Suppose you're a XDP program writer and you want to implement a generic native BPF program (independent of the underlying NIC). Does this mean, the expectation is to dig into driver code to gather whether or not a driver is prepopulating and how much of it? What are the implications if the data is overwritten? For example, in Cilium today we use the buffer described here as device-reserved metadata and override it. How will users know what breaks?
Thanks for your input.
A generic XDP program can always check the size of device-reserved metadata by "ctx->data - ctx->data_meta" and avoid overwrite it, as shown in code below in my v1 submission [1]. This requires driver to expose the metadata length used [2]. However, I dint have good justification for making the metadata length user-visible. So, I submitted this v3 to keep it simple. Any thoughts?
+ metalen_used = ctx->data - ctx->data_meta; + metalen_to_adjust = XDP_METADATA_SIZE - metalen_used; + if (metalen_to_adjust < (int)sizeof(struct xdp_meta)) + return XDP_DROP; + + ret = bpf_xdp_adjust_meta(ctx, -metalen_to_adjust);
[1] https://lore.kernel.org/netdev/20250701042940.3272325-3-yoong.siang.song@int... [2] https://lore.kernel.org/netdev/20250701080955.3273137-1-yoong.siang.song@int...
Introduce the XDP_METADATA_SIZE macro as a conservative measure to accommodate any metadata areas reserved by Ethernet devices.
Signed-off-by: Song Yoong Siang yoong.siang.song@intel.com --- tools/testing/selftests/bpf/prog_tests/xdp_metadata.c | 2 +- tools/testing/selftests/bpf/progs/xdp_hw_metadata.c | 2 +- tools/testing/selftests/bpf/progs/xdp_metadata.c | 2 +- tools/testing/selftests/bpf/xdp_hw_metadata.c | 2 +- tools/testing/selftests/bpf/xdp_metadata.h | 7 +++++++ 5 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c index 19f92affc2da..8d6c2633698b 100644 --- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c +++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c @@ -302,7 +302,7 @@ static int verify_xsk_metadata(struct xsk *xsk, bool sent_from_af_xdp)
/* custom metadata */
- meta = data - sizeof(struct xdp_meta); + meta = data - XDP_METADATA_SIZE;
if (!ASSERT_NEQ(meta->rx_timestamp, 0, "rx_timestamp")) return -1; diff --git a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c index 330ece2eabdb..3766f58d3486 100644 --- a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c +++ b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c @@ -72,7 +72,7 @@ int rx(struct xdp_md *ctx) return XDP_PASS; }
- err = bpf_xdp_adjust_meta(ctx, -(int)sizeof(struct xdp_meta)); + err = bpf_xdp_adjust_meta(ctx, -(int)XDP_METADATA_SIZE); if (err) { __sync_add_and_fetch(&pkts_fail, 1); return XDP_PASS; diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c index 09bb8a038d52..5cada85fe0f4 100644 --- a/tools/testing/selftests/bpf/progs/xdp_metadata.c +++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c @@ -73,7 +73,7 @@ int rx(struct xdp_md *ctx)
/* Reserve enough for all custom metadata. */
- ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(struct xdp_meta)); + ret = bpf_xdp_adjust_meta(ctx, -(int)XDP_METADATA_SIZE); if (ret != 0) return XDP_DROP;
diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c index 3d8de0d4c96a..a529d55d4ff4 100644 --- a/tools/testing/selftests/bpf/xdp_hw_metadata.c +++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c @@ -223,7 +223,7 @@ static void verify_xdp_metadata(void *data, clockid_t clock_id) { struct xdp_meta *meta;
- meta = data - sizeof(*meta); + meta = data - XDP_METADATA_SIZE;
if (meta->hint_valid & XDP_META_FIELD_RSS) printf("rx_hash: 0x%X with RSS type:0x%X\n", diff --git a/tools/testing/selftests/bpf/xdp_metadata.h b/tools/testing/selftests/bpf/xdp_metadata.h index 87318ad1117a..2dfd3bf5e7bb 100644 --- a/tools/testing/selftests/bpf/xdp_metadata.h +++ b/tools/testing/selftests/bpf/xdp_metadata.h @@ -50,3 +50,10 @@ struct xdp_meta { }; enum xdp_meta_field hint_valid; }; + +/* XDP_METADATA_SIZE must be at least the size of struct xdp_meta. An additional + * 32 bytes of padding is included as a conservative measure to accommodate any + * metadata areas reserved by Ethernet devices. If the device-reserved metadata + * exceeds 32 bytes, this value will need adjustment. + */ +#define XDP_METADATA_SIZE (sizeof(struct xdp_meta) + 32)
On 02/07/2025 18.57, Song Yoong Siang wrote:
Introduce the XDP_METADATA_SIZE macro as a conservative measure to accommodate any metadata areas reserved by Ethernet devices.
This seems like a sloppy workaround :-(
To me, the problem arise because AF_XDP is lacking the ability to communicate the size of the data_meta area. If we had this capability, then we could allow the IGC driver to take some of the space, have the BPF-prog expand it futher (bpf_xdp_adjust_meta) and then userspace AF_XDP would simply be able to see the size of the data_meta area, and apply the struct xdp_meta at right offset.
Signed-off-by: Song Yoong Siang yoong.siang.song@intel.com
tools/testing/selftests/bpf/prog_tests/xdp_metadata.c | 2 +- tools/testing/selftests/bpf/progs/xdp_hw_metadata.c | 2 +- tools/testing/selftests/bpf/progs/xdp_metadata.c | 2 +- tools/testing/selftests/bpf/xdp_hw_metadata.c | 2 +- tools/testing/selftests/bpf/xdp_metadata.h | 7 +++++++ 5 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c index 19f92affc2da..8d6c2633698b 100644 --- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c +++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c @@ -302,7 +302,7 @@ static int verify_xsk_metadata(struct xsk *xsk, bool sent_from_af_xdp) /* custom metadata */
- meta = data - sizeof(struct xdp_meta);
- meta = data - XDP_METADATA_SIZE;
if (!ASSERT_NEQ(meta->rx_timestamp, 0, "rx_timestamp")) return -1; diff --git a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c index 330ece2eabdb..3766f58d3486 100644 --- a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c +++ b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c @@ -72,7 +72,7 @@ int rx(struct xdp_md *ctx) return XDP_PASS; }
- err = bpf_xdp_adjust_meta(ctx, -(int)sizeof(struct xdp_meta));
- err = bpf_xdp_adjust_meta(ctx, -(int)XDP_METADATA_SIZE); if (err) { __sync_add_and_fetch(&pkts_fail, 1); return XDP_PASS;
diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c index 09bb8a038d52..5cada85fe0f4 100644 --- a/tools/testing/selftests/bpf/progs/xdp_metadata.c +++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c @@ -73,7 +73,7 @@ int rx(struct xdp_md *ctx) /* Reserve enough for all custom metadata. */
- ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(struct xdp_meta));
- ret = bpf_xdp_adjust_meta(ctx, -(int)XDP_METADATA_SIZE); if (ret != 0) return XDP_DROP;
diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c index 3d8de0d4c96a..a529d55d4ff4 100644 --- a/tools/testing/selftests/bpf/xdp_hw_metadata.c +++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c @@ -223,7 +223,7 @@ static void verify_xdp_metadata(void *data, clockid_t clock_id) { struct xdp_meta *meta;
- meta = data - sizeof(*meta);
- meta = data - XDP_METADATA_SIZE;
if (meta->hint_valid & XDP_META_FIELD_RSS) printf("rx_hash: 0x%X with RSS type:0x%X\n", diff --git a/tools/testing/selftests/bpf/xdp_metadata.h b/tools/testing/selftests/bpf/xdp_metadata.h index 87318ad1117a..2dfd3bf5e7bb 100644 --- a/tools/testing/selftests/bpf/xdp_metadata.h +++ b/tools/testing/selftests/bpf/xdp_metadata.h @@ -50,3 +50,10 @@ struct xdp_meta { }; enum xdp_meta_field hint_valid; };
+/* XDP_METADATA_SIZE must be at least the size of struct xdp_meta. An additional
- 32 bytes of padding is included as a conservative measure to accommodate any
- metadata areas reserved by Ethernet devices. If the device-reserved metadata
- exceeds 32 bytes, this value will need adjustment.
- */
+#define XDP_METADATA_SIZE (sizeof(struct xdp_meta) + 32)
On Friday, July 4, 2025 1:05 AM, Jesper Dangaard Brouer hawk@kernel.org wrote:
On 02/07/2025 18.57, Song Yoong Siang wrote:
Introduce the XDP_METADATA_SIZE macro as a conservative measure to accommodate any metadata areas reserved by Ethernet devices.
This seems like a sloppy workaround :-(
To me, the problem arise because AF_XDP is lacking the ability to communicate the size of the data_meta area. If we had this capability, then we could allow the IGC driver to take some of the space, have the BPF-prog expand it futher (bpf_xdp_adjust_meta) and then userspace AF_XDP would simply be able to see the size of the data_meta area, and apply the struct xdp_meta at right offset.
Thanks for your input.
I agree with you that the implementation will be simple if user application able to get the size of data_meta area. The intention of this patch set is to let developer aware of such limitations before we have a perfect solution.
Btw, do you got any suggestion on how to expose the metadata length? I not sure whether xdp_desc.options is a simple and good idea or not?
On 04/07/2025 03.17, Song, Yoong Siang wrote:
On Friday, July 4, 2025 1:05 AM, Jesper Dangaard Brouer hawk@kernel.org wrote:
On 02/07/2025 18.57, Song Yoong Siang wrote:
Introduce the XDP_METADATA_SIZE macro as a conservative measure to accommodate any metadata areas reserved by Ethernet devices.
This seems like a sloppy workaround :-(
To me, the problem arise because AF_XDP is lacking the ability to communicate the size of the data_meta area. If we had this capability, then we could allow the IGC driver to take some of the space, have the BPF-prog expand it futher (bpf_xdp_adjust_meta) and then userspace AF_XDP would simply be able to see the size of the data_meta area, and apply the struct xdp_meta at right offset.
Thanks for your input.
I agree with you that the implementation will be simple if user application able to get the size of data_meta area. The intention of this patch set is to let developer aware of such limitations before we have a perfect solution.
Btw, do you got any suggestion on how to expose the metadata length? I not sure whether xdp_desc.options is a simple and good idea or not?
That is a question to the AF_XDP maintainers... added them to this email.
/* Rx/Tx descriptor */ struct xdp_desc { __u64 addr; __u32 len; __u32 options; };
As far as I know, the xdp_desc.options field isn't used, right?
(Please AF_XDP experts, please verify below statements:) Something else we likely want to document: The available headroom in the AF_XDP frame. When accessing the metadata in userspace AF_XDP we do a negative offset from the UMEM packet pointer. IIRC on RX the available headroom will be either 255 or 192 bytes (depending on NIC drivers).
Slightly confusing when AF_XDP transmitting from userspace the UMEM headroom is default zero (XSK_UMEM__DEFAULT_FRAME_HEADROOM is zero). This is configurable via xsk_umem_config.frame_headroom, like I did in this example[1].
Maybe I did something wrong in[1], because I see that the new method is setting xsk_umem_config.tx_metadata_len + flag XDP_UMEM_TX_METADATA_LEN. This is nicely documented in [2]. How does this interact with setting xsk_umem_config.frame_headroom ?
[1] https://github.com/xdp-project/bpf-examples/blob/3f365af4be1fe6a0ef77e751ff9... [2] https://www.kernel.org/doc/html/v6.12/networking/xsk-tx-metadata.html
--Jesper
On 7/4/25 11:58 AM, Jesper Dangaard Brouer wrote:
On 04/07/2025 03.17, Song, Yoong Siang wrote:
On Friday, July 4, 2025 1:05 AM, Jesper Dangaard Brouer hawk@kernel.org wrote:
On 02/07/2025 18.57, Song Yoong Siang wrote:
Introduce the XDP_METADATA_SIZE macro as a conservative measure to accommodate any metadata areas reserved by Ethernet devices.
This seems like a sloppy workaround :-(
To me, the problem arise because AF_XDP is lacking the ability to communicate the size of the data_meta area. If we had this capability, then we could allow the IGC driver to take some of the space, have the BPF-prog expand it futher (bpf_xdp_adjust_meta) and then userspace AF_XDP would simply be able to see the size of the data_meta area, and apply the struct xdp_meta at right offset.
Thanks for your input.
I agree with you that the implementation will be simple if user application able to get the size of data_meta area. The intention of this patch set is to let developer aware of such limitations before we have a perfect solution.
Btw, do you got any suggestion on how to expose the metadata length? I not sure whether xdp_desc.options is a simple and good idea or not?
That is a question to the AF_XDP maintainers... added them to this email.
/* Rx/Tx descriptor */ struct xdp_desc { __u64 addr; __u32 len; __u32 options; };
As far as I know, the xdp_desc.options field isn't used, right?
The options holds flags, see also XDP_PKT_CONTD and XDP_TX_METADATA.
(Please AF_XDP experts, please verify below statements:) Something else we likely want to document: The available headroom in the AF_XDP frame. When accessing the metadata in userspace AF_XDP we do a negative offset from the UMEM packet pointer. IIRC on RX the available headroom will be either 255 or 192 bytes (depending on NIC drivers).
Slightly confusing when AF_XDP transmitting from userspace the UMEM headroom is default zero (XSK_UMEM__DEFAULT_FRAME_HEADROOM is zero). This is configurable via xsk_umem_config.frame_headroom, like I did in this example[1].
Maybe I did something wrong in[1], because I see that the new method is setting xsk_umem_config.tx_metadata_len + flag XDP_UMEM_TX_METADATA_LEN. This is nicely documented in [2]. How does this interact with setting xsk_umem_config.frame_headroom ?
If you request XDP_UMEM_TX_METADATA_LEN then on TX side you can fill struct xsk_tx_metadata before the start of packet data, that is, meta = data - sizeof(struct xsk_tx_metadata). The validity of the latter is indicated via desc->options |= XDP_TX_METADATA and then you fill meta->flags with things like XDP_TXMD_FLAGS_CHECKSUM to tell that the related fields are valid (ex. request.csum_start, request.csum_offset) and that you expect the driver to do the offload with this info. This is also what I mentioned in the other thread some time ago that imho it would make sense to have this also on RX side somewhat similar to virtio_net_hdr..
[1] https://github.com/xdp-project/bpf-examples/blob/3f365af4be1fe6a0ef77e751ff9... [2] https://www.kernel.org/doc/html/v6.12/networking/xsk-tx-metadata.html
--Jesper
On 07/03, Song Yoong Siang wrote:
This patch set improves the documentation and selftests for XDP Rx metadata handling. The first patch clarifies the documentation around XDP metadata layout and METADATA_SIZE. The second patch enhances the BPF selftests to make XDP metadata handling more robust across different NICs.
Prior to this patch set, the XDP program might accidentally overwrite the device-reserved metadata.
V3:
- update doc and commit msg accordingly.
V2: https://lore.kernel.org/netdev/20250702030349.3275368-1-yoong.siang.song@int...
- unconditionally do bpf_xdp_adjust_meta with -XDP_METADATA_SIZE (Stanislav)
V1: https://lore.kernel.org/netdev/20250701042940.3272325-1-yoong.siang.song@int...
Acked-by: Stanislav Fomichev sdf@fomichev.me
linux-kselftest-mirror@lists.linaro.org