[PATCH bpf-next v2 0/3] Add overwrite mode for bpf ring buffer

List overview All Threads
Download

newer

older

[PATCH v2]...

[PATCH net repost] selftests: net:...

Xu Kuohai

5 Sep 2025 5 Sep '25

3:06 p.m.

When the bpf ring buffer is full, new events can not be recorded util the consumer consumes some events to free space. This may cause critical events to be discarded, such as in fault diagnostic, where recent events are more critical than older ones.

So add ovewrite mode for bpf ring buffer. In this mode, the new event overwrites the oldest event when the buffer is full.

v2: - remove libbpf changes (Andrii) - update overwrite benchmark

v1: https://lore.kernel.org/bpf/20250804022101.2171981-1-xukuohai@huaweicloud.co...

Xu Kuohai (3): bpf: Add overwrite mode for bpf ring buffer selftests/bpf: Add test for overwrite ring buffer selftests/bpf/benchs: Add producer and overwrite bench for ring buffer

include/uapi/linux/bpf.h | 4 + kernel/bpf/ringbuf.c | 159 +++++++++++++++--- tools/include/uapi/linux/bpf.h | 4 + tools/testing/selftests/bpf/Makefile | 3 +- tools/testing/selftests/bpf/bench.c | 2 + .../selftests/bpf/benchs/bench_ringbufs.c | 95 ++++++++++- .../bpf/benchs/run_bench_ringbufs.sh | 4 + .../selftests/bpf/prog_tests/ringbuf.c | 74 ++++++++ .../selftests/bpf/progs/ringbuf_bench.c | 10 ++ .../bpf/progs/test_ringbuf_overwrite.c | 98 +++++++++++ 10 files changed, 418 insertions(+), 35 deletions(-) create mode 100644 tools/testing/selftests/bpf/progs/test_ringbuf_overwrite.c

-- 2.43.0

Show replies by date

Xu Kuohai

5 Sep 5 Sep

3:06 p.m.

New subject: [PATCH bpf-next v2 1/3] bpf: Add overwrite mode for bpf ring buffer

From: Xu Kuohai xukuohai@huawei.com

So add ovewrite mode for bpf ring buffer. In this mode, the new event overwrites the oldest event when the buffer is full.

The scheme is as follows:

1. producer_pos tracks the next position to write new data. When there is enough free space, producer simply moves producer_pos forward to make space for the new event.

2. To avoid waiting for consumer to free space when the buffer is full, a new variable overwrite_pos is introduced for producer. overwrite_pos tracks the next event to be overwritten (the oldest event committed) in the buffer. producer moves it forward to discard the oldest events when the buffer is full.

3. pending_pos tracks the oldest event under committing. producer ensures producers_pos never passes pending_pos when making space for new events. So multiple producers never write to the same position at the same time.

4. producer wakes up consumer every half a round ahead to give it a chance to retrieve data. However, for an overwrite-mode ring buffer, users typically only cares about the ring buffer snapshot before a fault occurs. In this case, the producer should commit data with BPF_RB_NO_WAKEUP flag to avoid unnecessary wakeups.

To make it clear, here are some example diagrams.

1. Let's say we have a ring buffer with size 4096.

At first, {producer,overwrite,pending,consumer}_pos are all set to 0

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | | +-----------------------------------------------------------------------+ ^ | | producer_pos = 0 overwrite_pos = 0 pending_pos = 0 consumer_pos = 0

2. Reserve event A, size 512.

There is enough free space, so A is allocated at offset 0 and producer_pos is moved to 512, the end of A. Since A is not submitted, the BUSY bit is set.

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | A | | | [BUSY] | | +-----------------------------------------------------------------------+ ^ ^ | | | | | producer_pos = 512 | overwrite_pos = 0 pending_pos = 0 consumer_pos = 0

3. Reserve event B, size 1024.

B is allocated at offset 512 with BUSY bit set, and producer_pos is moved to the end of B.

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | A | B | | | [BUSY] | [BUSY] | | +-----------------------------------------------------------------------+ ^ ^ | | | | | producer_pos = 1536 | overwrite_pos = 0 pending_pos = 0 consumer_pos = 0

4. Reserve event C, size 2048.

C is allocated at offset 1536 and producer_pos becomes 3584.

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | | A | B | C | | | [BUSY] | [BUSY] | [BUSY] | | +-----------------------------------------------------------------------+ ^ ^ | | | | | producer_pos = 3584 | overwrite_pos = 0 pending_pos = 0 consumer_pos = 0

5. Submit event A.

The BUSY bit of A is cleared. B becomes the oldest event under writing, so pending_pos is moved to 512, the start of B.

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | | A | B | C | | | | [BUSY] | [BUSY] | | +-----------------------------------------------------------------------+ ^ ^ ^ | | | | | | | pending_pos = 512 producer_pos = 3584 | overwrite_pos = 0 consumer_pos = 0

6. Submit event B.

The BUSY bit of B is cleared, and pending_pos is moved to the start of C, which is the oldest event under writing now.

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | | A | B | C | | | | | [BUSY] | | +-----------------------------------------------------------------------+ ^ ^ ^ | | | | | | | pending_pos = 1536 producer_pos = 3584 | overwrite_pos = 0 consumer_pos = 0

7. Reserve event D, size 1536 (3 * 512).

There are 2048 bytes not under writing between producer_pos and pending_pos, so D is allocated at offset 3584, and producer_pos is moved from 3584 to 5120.

Since event D will overwrite all bytes of event A and the begining 512 bytes of event B, overwrite_pos is moved to the start of event C, the oldest event that is not overwritten.

8. Reserve event E, size 1024.

Though there are 512 bytes not under writing between producer_pos and pending_pos, E can not be reserved, as it would overwrite the first 512 bytes of event C, which is still under writing.

9. Submit event C and D.

pending_pos is moved to the end of D.

The performance data for overwrite mode will be provided in a follow-up patch that adds overwrite mode benchs.

A sample of performance data for non-overwrite mode on an x86_64 and arm64 CPU, before and after this patch, is shown below. As we can see, no obvious performance regression occurs.

- x86_64 (AMD EPYC 9654)

Before:

Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 13.218 ± 0.039M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 15.684 ± 0.015M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 7.771 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 6.281 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 2.842 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 2.001 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 1.833 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 1.508 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 1.421 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 1.309 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 1.265 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 1.198 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 1.174 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 1.113 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 1.097 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 1.070 ± 0.002M/s (drops 0.000 ± 0.000M/s)

After:

Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 13.751 ± 0.673M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 15.592 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 7.776 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 6.463 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 2.883 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 2.017 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 1.816 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 1.512 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 1.396 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 1.303 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 1.267 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 1.210 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 1.181 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 1.136 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 1.090 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 1.091 ± 0.002M/s (drops 0.000 ± 0.000M/s)

- arm64 (HiSilicon Kunpeng 920)

Before:

Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 11.602 ± 0.423M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 9.599 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 6.669 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 4.806 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 3.856 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 3.368 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 3.210 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 3.003 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 2.944 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 2.863 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 2.819 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 2.887 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 2.837 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 2.787 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 2.738 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 2.700 ± 0.007M/s (drops 0.000 ± 0.000M/s)

After:

Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 11.614 ± 0.268M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 9.917 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 6.920 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 4.803 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 3.898 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 3.426 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 3.320 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 3.029 ± 0.013M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 3.068 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 2.890 ± 0.009M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 2.950 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 2.812 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 2.834 ± 0.009M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 2.803 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 2.766 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 2.754 ± 0.009M/s (drops 0.000 ± 0.000M/s)

Signed-off-by: Xu Kuohai xukuohai@huawei.com --- include/uapi/linux/bpf.h | 4 + kernel/bpf/ringbuf.c | 159 +++++++++++++++++++++++++++------ tools/include/uapi/linux/bpf.h | 4 + 3 files changed, 141 insertions(+), 26 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 233de8677382..d3b2fd2ae527 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1430,6 +1430,9 @@ enum {

/* Do not translate kernel bpf_arena pointers to user pointers */ BPF_F_NO_USER_CONV = (1U << 18), + +/* bpf ringbuf works in overwrite mode? */ + BPF_F_OVERWRITE = (1U << 19), };

/* Flags for BPF_PROG_QUERY. */ @@ -6215,6 +6218,7 @@ enum { BPF_RB_RING_SIZE = 1, BPF_RB_CONS_POS = 2, BPF_RB_PROD_POS = 3, + BPF_RB_OVER_POS = 4, };

/* BPF ring buffer constants */ diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c index 719d73299397..6ca41d01f187 100644 --- a/kernel/bpf/ringbuf.c +++ b/kernel/bpf/ringbuf.c @@ -13,7 +13,7 @@ #include <linux/btf_ids.h> #include <asm/rqspinlock.h>

-#define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE) +#define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE | BPF_F_OVERWRITE)

/* non-mmap()'able part of bpf_ringbuf (everything up to consumer page) */ #define RINGBUF_PGOFF \ @@ -27,7 +27,8 @@ struct bpf_ringbuf { wait_queue_head_t waitq; struct irq_work work; - u64 mask; + u64 mask:48; + u64 overwrite_mode:1; struct page **pages; int nr_pages; rqspinlock_t spinlock ____cacheline_aligned_in_smp; @@ -72,6 +73,7 @@ struct bpf_ringbuf { */ unsigned long consumer_pos __aligned(PAGE_SIZE); unsigned long producer_pos __aligned(PAGE_SIZE); + unsigned long overwrite_pos; /* to be overwritten in overwrite mode */ unsigned long pending_pos; char data[] __aligned(PAGE_SIZE); }; @@ -166,7 +168,8 @@ static void bpf_ringbuf_notify(struct irq_work *work) * considering that the maximum value of data_sz is (4GB - 1), there * will be no overflow, so just note the size limit in the comments. */ -static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node) +static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node, + int overwrite_mode) { struct bpf_ringbuf *rb;

@@ -183,17 +186,25 @@ static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node) rb->consumer_pos = 0; rb->producer_pos = 0; rb->pending_pos = 0; + rb->overwrite_mode = overwrite_mode;

return rb; }

static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr) { + int overwrite_mode = 0; struct bpf_ringbuf_map *rb_map;

if (attr->map_flags & ~RINGBUF_CREATE_FLAG_MASK) return ERR_PTR(-EINVAL);

+ if (attr->map_flags & BPF_F_OVERWRITE) { + if (attr->map_type == BPF_MAP_TYPE_USER_RINGBUF) + return ERR_PTR(-EINVAL); + overwrite_mode = 1; + } + if (attr->key_size || attr->value_size || !is_power_of_2(attr->max_entries) || !PAGE_ALIGNED(attr->max_entries)) @@ -205,7 +216,8 @@ static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)

bpf_map_init_from_attr(&rb_map->map, attr);

- rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node); + rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node, + overwrite_mode); if (!rb_map->rb) { bpf_map_area_free(rb_map); return ERR_PTR(-ENOMEM); @@ -295,11 +307,16 @@ static int ringbuf_map_mmap_user(struct bpf_map *map, struct vm_area_struct *vma

static unsigned long ringbuf_avail_data_sz(struct bpf_ringbuf *rb) { - unsigned long cons_pos, prod_pos; + unsigned long cons_pos, prod_pos, over_pos;

cons_pos = smp_load_acquire(&rb->consumer_pos); prod_pos = smp_load_acquire(&rb->producer_pos); - return prod_pos - cons_pos; + + if (likely(!rb->overwrite_mode)) + return prod_pos - cons_pos; + + over_pos = READ_ONCE(rb->overwrite_pos); + return min(prod_pos - max(cons_pos, over_pos), rb->mask + 1); }

static u32 ringbuf_total_data_sz(const struct bpf_ringbuf *rb) @@ -402,11 +419,43 @@ bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr) return (void*)((addr & PAGE_MASK) - off); }

+ +static bool bpf_ringbuf_has_space(const struct bpf_ringbuf *rb, + unsigned long new_prod_pos, + unsigned long cons_pos, + unsigned long pend_pos) +{ + /* no space if oldest not yet committed record until the newest + * record span more than (ringbuf_size - 1) + */ + if (new_prod_pos - pend_pos > rb->mask) + return false; + + /* ok, we have space in ovewrite mode */ + if (unlikely(rb->overwrite_mode)) + return true; + + /* no space if producer position advances more than (ringbuf_size - 1) + * ahead than consumer position when not in overwrite mode + */ + if (new_prod_pos - cons_pos > rb->mask) + return false; + + return true; +} + +static u32 ringbuf_round_up_hdr_len(u32 hdr_len) +{ + hdr_len &= ~BPF_RINGBUF_DISCARD_BIT; + return round_up(hdr_len + BPF_RINGBUF_HDR_SZ, 8); +} + static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size) { - unsigned long cons_pos, prod_pos, new_prod_pos, pend_pos, flags; + unsigned long flags; struct bpf_ringbuf_hdr *hdr; - u32 len, pg_off, tmp_size, hdr_len; + u32 len, pg_off, hdr_len; + unsigned long cons_pos, prod_pos, new_prod_pos, pend_pos, over_pos;

if (unlikely(size > RINGBUF_MAX_RECORD_SZ)) return NULL; @@ -429,24 +478,39 @@ static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size) hdr_len = READ_ONCE(hdr->len); if (hdr_len & BPF_RINGBUF_BUSY_BIT) break; - tmp_size = hdr_len & ~BPF_RINGBUF_DISCARD_BIT; - tmp_size = round_up(tmp_size + BPF_RINGBUF_HDR_SZ, 8); - pend_pos += tmp_size; + pend_pos += ringbuf_round_up_hdr_len(hdr_len); } rb->pending_pos = pend_pos;

- /* check for out of ringbuf space: - * - by ensuring producer position doesn't advance more than - * (ringbuf_size - 1) ahead - * - by ensuring oldest not yet committed record until newest - * record does not span more than (ringbuf_size - 1) - */ - if (new_prod_pos - cons_pos > rb->mask || - new_prod_pos - pend_pos > rb->mask) { + if (!bpf_ringbuf_has_space(rb, new_prod_pos, cons_pos, pend_pos)) { raw_res_spin_unlock_irqrestore(&rb->spinlock, flags); return NULL; }

+ /* In overwrite mode, move overwrite_pos to the next record to be + * overwritten if the ring buffer is full + */ + if (unlikely(rb->overwrite_mode)) { + over_pos = rb->overwrite_pos; + while (new_prod_pos - over_pos > rb->mask) { + hdr = (void *)rb->data + (over_pos & rb->mask); + hdr_len = READ_ONCE(hdr->len); + /* since pending_pos is the first record with BUSY + * bit set and overwrite_pos is never bigger than + * pending_pos, no need to check BUSY bit here. + */ + over_pos += ringbuf_round_up_hdr_len(hdr_len); + } + /* smp_store_release(&rb->producer_pos, new_prod_pos) at + * the end of the function ensures that when consumer sees + * the updated rb->producer_pos, it always sees the updated + * rb->overwrite_pos, so when consumer reads overwrite_pos + * after smp_load_acquire(r->producer_pos), the overwrite_pos + * will always be valid. + */ + WRITE_ONCE(rb->overwrite_pos, over_pos); + } + hdr = (void *)rb->data + (prod_pos & rb->mask); pg_off = bpf_ringbuf_rec_pg_off(rb, hdr); hdr->len = size | BPF_RINGBUF_BUSY_BIT; @@ -479,7 +543,50 @@ const struct bpf_func_proto bpf_ringbuf_reserve_proto = { .arg3_type = ARG_ANYTHING, };

-static void bpf_ringbuf_commit(void *sample, u64 flags, bool discard) +static __always_inline +bool ringbuf_should_wakeup(const struct bpf_ringbuf *rb, + unsigned long rec_pos, + unsigned long cons_pos, + u32 len, u64 flags) +{ + unsigned long rec_end; + + if (flags & BPF_RB_FORCE_WAKEUP) + return true; + + if (flags & BPF_RB_NO_WAKEUP) + return false; + + /* for non-overwrite mode, if consumer caught up and is waiting for + * our record, notify about new data availability + */ + if (likely(!rb->overwrite_mode)) + return cons_pos == rec_pos; + + /* for overwrite mode, to give the consumer a chance to catch up + * before being overwritten, wake up consumer every half a round + * ahead. + */ + rec_end = rec_pos + ringbuf_round_up_hdr_len(len); + + cons_pos &= (rb->mask >> 1); + rec_pos &= (rb->mask >> 1); + rec_end &= (rb->mask >> 1); + + if (cons_pos == rec_pos) + return true; + + if (rec_pos < cons_pos && cons_pos < rec_end) + return true; + + if (rec_end < rec_pos && (cons_pos > rec_pos || cons_pos < rec_end)) + return true; + + return false; +} + +static __always_inline +void bpf_ringbuf_commit(void *sample, u64 flags, bool discard) { unsigned long rec_pos, cons_pos; struct bpf_ringbuf_hdr *hdr; @@ -495,15 +602,10 @@ static void bpf_ringbuf_commit(void *sample, u64 flags, bool discard) /* update record header with correct final size prefix */ xchg(&hdr->len, new_len);

- /* if consumer caught up and is waiting for our record, notify about - * new data availability - */ rec_pos = (void *)hdr - (void *)rb->data; cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;

- if (flags & BPF_RB_FORCE_WAKEUP) - irq_work_queue(&rb->work); - else if (cons_pos == rec_pos && !(flags & BPF_RB_NO_WAKEUP)) + if (ringbuf_should_wakeup(rb, rec_pos, cons_pos, new_len, flags)) irq_work_queue(&rb->work); }

@@ -576,6 +678,8 @@ BPF_CALL_2(bpf_ringbuf_query, struct bpf_map *, map, u64, flags) return smp_load_acquire(&rb->consumer_pos); case BPF_RB_PROD_POS: return smp_load_acquire(&rb->producer_pos); + case BPF_RB_OVER_POS: + return READ_ONCE(rb->overwrite_pos); default: return 0; } @@ -749,6 +853,9 @@ BPF_CALL_4(bpf_user_ringbuf_drain, struct bpf_map *, map,

rb = container_of(map, struct bpf_ringbuf_map, map)->rb;

+ if (unlikely(rb->overwrite_mode)) + return -EOPNOTSUPP; + /* If another consumer is already consuming a sample, wait for them to finish. */ if (!atomic_try_cmpxchg(&rb->busy, &busy, 1)) return -EBUSY; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 233de8677382..d3b2fd2ae527 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -1430,6 +1430,9 @@ enum {

/* Do not translate kernel bpf_arena pointers to user pointers */ BPF_F_NO_USER_CONV = (1U << 18), + +/* bpf ringbuf works in overwrite mode? */ + BPF_F_OVERWRITE = (1U << 19), };

/* Flags for BPF_PROG_QUERY. */ @@ -6215,6 +6218,7 @@ enum { BPF_RB_RING_SIZE = 1, BPF_RB_CONS_POS = 2, BPF_RB_PROD_POS = 3, + BPF_RB_OVER_POS = 4, };

/* BPF ring buffer constants */

-- 2.43.0

Andrii Nakryiko

19 Sep 19 Sep

10:10 p.m.

New subject: [PATCH bpf-next v2 1/3] bpf: Add overwrite mode for bpf ring buffer

On Fri, Sep 5, 2025 at 8:13 AM Xu Kuohai xukuohai@huaweicloud.com wrote:

...

From: Xu Kuohai xukuohai@huawei.com

When the bpf ring buffer is full, new events can not be recorded util

typo: until

...

the consumer consumes some events to free space. This may cause critical events to be discarded, such as in fault diagnostic, where recent events are more critical than older ones.

So add ovewrite mode for bpf ring buffer. In this mode, the new event

overwrite, BPF

...

overwrites the oldest event when the buffer is full.

The scheme is as follows:

producer_pos tracks the next position to write new data. When there is enough free space, producer simply moves producer_pos forward to make space for the new event.

To avoid waiting for consumer to free space when the buffer is full, a new variable overwrite_pos is introduced for producer. overwrite_pos tracks the next event to be overwritten (the oldest event committed) in the buffer. producer moves it forward to discard the oldest events when the buffer is full.

pending_pos tracks the oldest event under committing. producer ensures

"under committing" is confusing. Oldest event to be committed?

...

producers_pos never passes pending_pos when making space for new events. So multiple producers never write to the same position at the same time.

producer wakes up consumer every half a round ahead to give it a chance to retrieve data. However, for an overwrite-mode ring buffer, users typically only cares about the ring buffer snapshot before a fault occurs. In this case, the producer should commit data with BPF_RB_NO_WAKEUP flag to avoid unnecessary wakeups.

To make it clear, here are some example diagrams.

Let's say we have a ring buffer with size 4096.

At first, {producer,overwrite,pending,consumer}_pos are all set to 0

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | | +-----------------------------------------------------------------------+ ^ | |

producer_pos = 0 overwrite_pos = 0 pending_pos = 0 consumer_pos = 0

Reserve event A, size 512.

There is enough free space, so A is allocated at offset 0 and producer_pos is moved to 512, the end of A. Since A is not submitted, the BUSY bit is set.

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | A | | | [BUSY] | | +-----------------------------------------------------------------------+ ^ ^ | | | | | producer_pos = 512 |

overwrite_pos = 0 pending_pos = 0 consumer_pos = 0

Reserve event B, size 1024.

B is allocated at offset 512 with BUSY bit set, and producer_pos is moved to the end of B.

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | A | B | | | [BUSY] | [BUSY] | | +-----------------------------------------------------------------------+ ^ ^ | | | | | producer_pos = 1536 |

overwrite_pos = 0 pending_pos = 0 consumer_pos = 0

Reserve event C, size 2048.

C is allocated at offset 1536 and producer_pos becomes 3584.

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | | A | B | C | | | [BUSY] | [BUSY] | [BUSY] | | +-----------------------------------------------------------------------+ ^ ^ | | | | | producer_pos = 3584 |

overwrite_pos = 0 pending_pos = 0 consumer_pos = 0

Submit event A.

The BUSY bit of A is cleared. B becomes the oldest event under writing, so

Now it's "under writing" :) To be committed? Or "pending committing" or just "pending", I guess. But not under anything, it just confuses readers. IMO.

...

pending_pos is moved to 512, the start of B.

0       512      1024    1536     2048     2560     3072     3584       4096
+-----------------------------------------------------------------------+
|        |                 |                                   |        |
|    A   |        B        |                 C                 |        |
|        |      [BUSY]     |               [BUSY]              |        |
+-----------------------------------------------------------------------+
^        ^                                                     ^
|        |                                                     |
|        |                                                     |
|   pending_pos = 512                                  producer_pos = 3584
|

overwrite_pos = 0 consumer_pos = 0

Submit event B.

The BUSY bit of B is cleared, and pending_pos is moved to the start of C, which is the oldest event under writing now.

ditto

...

0       512      1024    1536     2048     2560     3072     3584       4096
+-----------------------------------------------------------------------+
|        |                 |                                   |        |
|    A   |        B        |                 C                 |        |
|        |                 |               [BUSY]              |        |
+-----------------------------------------------------------------------+
^                          ^                                   ^
|                          |                                   |
|                          |                                   |
|                     pending_pos = 1536               producer_pos = 3584
|

overwrite_pos = 0 consumer_pos = 0

Reserve event D, size 1536 (3 * 512).

There are 2048 bytes not under writing between producer_pos and pending_pos, so D is allocated at offset 3584, and producer_pos is moved from 3584 to 5120.

Since event D will overwrite all bytes of event A and the begining 512 bytes

typo: beginning, but really "first 512 bytes" would be clearer

...

of event B, overwrite_pos is moved to the start of event C, the oldest event
that is not overwritten.

0       512      1024    1536     2048     2560     3072     3584       4096
+-----------------------------------------------------------------------+
|                 |        |                                   |        |
|      D End      |        |                 C                 | D Begin|
|      [BUSY]     |        |               [BUSY]              | [BUSY] |
+-----------------------------------------------------------------------+
^                 ^        ^
|                 |        |
|                 |   pending_pos = 1536
|                 |   overwrite_pos = 1536
|                 |
|             producer_pos=5120
|
consumer_pos = 0

Reserve event E, size 1024.

Though there are 512 bytes not under writing between producer_pos and pending_pos, E can not be reserved, as it would overwrite the first 512 bytes of event C, which is still under writing.

Submit event C and D.

pending_pos is moved to the end of D.

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | | D End | | C | D Begin| | | | | | +-----------------------------------------------------------------------+ ^ ^ ^ | | | | | overwrite_pos = 1536 | | | producer_pos=5120 | pending_pos=5120 |

consumer_pos = 0

The performance data for overwrite mode will be provided in a follow-up patch that adds overwrite mode benchs.

A sample of performance data for non-overwrite mode on an x86_64 and arm64 CPU, before and after this patch, is shown below. As we can see, no obvious performance regression occurs.

x86_64 (AMD EPYC 9654)

Before:

Ringbuf, multi-producer contention

rb-libbpf nr_prod 1 13.218 ± 0.039M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 15.684 ± 0.015M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 7.771 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 6.281 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 2.842 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 2.001 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 1.833 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 1.508 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 1.421 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 1.309 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 1.265 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 1.198 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 1.174 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 1.113 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 1.097 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 1.070 ± 0.002M/s (drops 0.000 ± 0.000M/s)

After:

Ringbuf, multi-producer contention

rb-libbpf nr_prod 1 13.751 ± 0.673M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 15.592 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 7.776 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 6.463 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 2.883 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 2.017 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 1.816 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 1.512 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 1.396 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 1.303 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 1.267 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 1.210 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 1.181 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 1.136 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 1.090 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 1.091 ± 0.002M/s (drops 0.000 ± 0.000M/s)

arm64 (HiSilicon Kunpeng 920)

Before:

Ringbuf, multi-producer contention

rb-libbpf nr_prod 1 11.602 ± 0.423M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 9.599 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 6.669 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 4.806 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 3.856 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 3.368 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 3.210 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 3.003 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 2.944 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 2.863 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 2.819 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 2.887 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 2.837 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 2.787 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 2.738 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 2.700 ± 0.007M/s (drops 0.000 ± 0.000M/s)

After:

Ringbuf, multi-producer contention

rb-libbpf nr_prod 1 11.614 ± 0.268M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 9.917 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 6.920 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 4.803 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 3.898 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 3.426 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 3.320 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 3.029 ± 0.013M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 3.068 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 2.890 ± 0.009M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 2.950 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 2.812 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 2.834 ± 0.009M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 2.803 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 2.766 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 2.754 ± 0.009M/s (drops 0.000 ± 0.000M/s)

Signed-off-by: Xu Kuohai xukuohai@huawei.com

include/uapi/linux/bpf.h | 4 + kernel/bpf/ringbuf.c | 159 +++++++++++++++++++++++++++------ tools/include/uapi/linux/bpf.h | 4 + 3 files changed, 141 insertions(+), 26 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 233de8677382..d3b2fd2ae527 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1430,6 +1430,9 @@ enum {

/* Do not translate kernel bpf_arena pointers to user pointers */ BPF_F_NO_USER_CONV = (1U << 18),

+/* bpf ringbuf works in overwrite mode? */
  BPF_F_OVERWRITE         = (1U << 19),

let's call it BPF_F_RB_OVERWRITE as this is ringbuf-specific? And use imperative voice in the comment:

/* Enable BPF ringbuf overwrite mode */

...

};

/* Flags for BPF_PROG_QUERY. */ @@ -6215,6 +6218,7 @@ enum { BPF_RB_RING_SIZE = 1, BPF_RB_CONS_POS = 2, BPF_RB_PROD_POS = 3,
  BPF_RB_OVER_POS = 4,

nit: BPF_RB_OVERWITE_POS?

...

};

/* BPF ring buffer constants */ diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c index 719d73299397..6ca41d01f187 100644 --- a/kernel/bpf/ringbuf.c +++ b/kernel/bpf/ringbuf.c @@ -13,7 +13,7 @@ #include <linux/btf_ids.h> #include <asm/rqspinlock.h>

-#define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE) +#define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE | BPF_F_OVERWRITE)

/* non-mmap()'able part of bpf_ringbuf (everything up to consumer page) */ #define RINGBUF_PGOFF \ @@ -27,7 +27,8 @@ struct bpf_ringbuf { wait_queue_head_t waitq; struct irq_work work;
  u64 mask;
  u64 mask:48;
  u64 overwrite_mode:1;

Please, don't touch the mask field, it's a very hot field, no need to make it a bit field. Just add a separate bool for overwrite_mode.

...

    struct page **pages;
    int nr_pages;
    rqspinlock_t spinlock ____cacheline_aligned_in_smp;
@@ -72,6 +73,7 @@ struct bpf_ringbuf { */ unsigned long consumer_pos __aligned(PAGE_SIZE); unsigned long producer_pos __aligned(PAGE_SIZE);
  unsigned long overwrite_pos;  /* to be overwritten in overwrite mode */

Not a really precise comment, IMO. This is a position pointing to after the last overwritten record, no?

...

    unsigned long pending_pos;
    char data[] __aligned(PAGE_SIZE);
}; @@ -166,7 +168,8 @@ static void bpf_ringbuf_notify(struct irq_work *work)

considering that the maximum value of data_sz is (4GB - 1), there

will be no overflow, so just note the size limit in the comments.

*/ -static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node) +static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node,
                                       int overwrite_mode)
{ struct bpf_ringbuf *rb;

@@ -183,17 +186,25 @@ static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node) rb->consumer_pos = 0; rb->producer_pos = 0; rb->pending_pos = 0;
  rb->overwrite_mode = overwrite_mode;

  return rb;
}

static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr) {
  int overwrite_mode = 0;
  struct bpf_ringbuf_map *rb_map;

  if (attr->map_flags & ~RINGBUF_CREATE_FLAG_MASK)
          return ERR_PTR(-EINVAL);
  if (attr->map_flags & BPF_F_OVERWRITE) {
          if (attr->map_type == BPF_MAP_TYPE_USER_RINGBUF)
                  return ERR_PTR(-EINVAL);
          overwrite_mode = 1;
  }
  if (attr->key_size || attr->value_size ||
      !is_power_of_2(attr->max_entries) ||
      !PAGE_ALIGNED(attr->max_entries))
@@ -205,7 +216,8 @@ static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
    bpf_map_init_from_attr(&rb_map->map, attr);
  rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node);
  rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node,
                                 overwrite_mode);

keep on single line, it fits under 100 characters

...

    if (!rb_map->rb) {
            bpf_map_area_free(rb_map);
            return ERR_PTR(-ENOMEM);

@@ -295,11 +307,16 @@ static int ringbuf_map_mmap_user(struct bpf_map *map, struct vm_area_struct *vma

static unsigned long ringbuf_avail_data_sz(struct bpf_ringbuf *rb) {

```
  unsigned long cons_pos, prod_pos;
```

  unsigned long cons_pos, prod_pos, over_pos;

  cons_pos = smp_load_acquire(&rb->consumer_pos);
  prod_pos = smp_load_acquire(&rb->producer_pos);

```
  return prod_pos - cons_pos;
```

```
  if (likely(!rb->overwrite_mode))
```
```
          return prod_pos - cons_pos;
```

nit: invert the condition to unlikely and handle that special case in a nested if, moving "over_pos" inside the if itself

...

  over_pos = READ_ONCE(rb->overwrite_pos);

  return min(prod_pos - max(cons_pos, over_pos), rb->mask + 1);

I'm trying to understand why you need to min with `rb->mask + 1`, can you please elaborate? And also, at least for consistency, use smp_load_acquire() for overwrite_pos?

...

}

static u32 ringbuf_total_data_sz(const struct bpf_ringbuf *rb) @@ -402,11 +419,43 @@ bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr) return (void*)((addr & PAGE_MASK) - off); }

+static bool bpf_ringbuf_has_space(const struct bpf_ringbuf *rb,

                            unsigned long new_prod_pos,

                            unsigned long cons_pos,

                            unsigned long pend_pos)

  /* no space if oldest not yet committed record until the newest

   * record span more than (ringbuf_size - 1)

```
   */
```

  if (new_prod_pos - pend_pos > rb->mask)

```
          return false;
```

  /* ok, we have space in ovewrite mode */

typo: overwrite

...

```
  if (unlikely(rb->overwrite_mode))
```
```
          return true;
```

  /* no space if producer position advances more than (ringbuf_size - 1)

   * ahead than consumer position when not in overwrite mode

typo: ahead of consumer position

...

```
   */
```

  if (new_prod_pos - cons_pos > rb->mask)

```
          return false;
```
```
  return true;
```

+static u32 ringbuf_round_up_hdr_len(u32 hdr_len)

use consistent naming, if you have bpf_ringbuf_has_space, then this should have been bpf_ringbuf_round_up_len() or something like that.

...

```
  hdr_len &= ~BPF_RINGBUF_DISCARD_BIT;
```

  return round_up(hdr_len + BPF_RINGBUF_HDR_SZ, 8);

static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size) {

  unsigned long cons_pos, prod_pos, new_prod_pos, pend_pos, flags;

  unsigned long flags;
  struct bpf_ringbuf_hdr *hdr;

```
  u32 len, pg_off, tmp_size, hdr_len;
```

```
  u32 len, pg_off, hdr_len;
```

  unsigned long cons_pos, prod_pos, new_prod_pos, pend_pos, over_pos;

100 character line length limit, just add over_pos to original single line declaration

...

    if (unlikely(size > RINGBUF_MAX_RECORD_SZ))
            return NULL;

@@ -429,24 +478,39 @@ static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size) hdr_len = READ_ONCE(hdr->len); if (hdr_len & BPF_RINGBUF_BUSY_BIT) break;

          tmp_size = hdr_len & ~BPF_RINGBUF_DISCARD_BIT;

          tmp_size = round_up(tmp_size + BPF_RINGBUF_HDR_SZ, 8);

```
          pend_pos += tmp_size;
```

          pend_pos += ringbuf_round_up_hdr_len(hdr_len);
  }
  rb->pending_pos = pend_pos;

```
  /* check for out of ringbuf space:
```

   * - by ensuring producer position doesn't advance more than

```
   *   (ringbuf_size - 1) ahead
```

   * - by ensuring oldest not yet committed record until newest

   *   record does not span more than (ringbuf_size - 1)

```
   */
```

  if (new_prod_pos - cons_pos > rb->mask ||

      new_prod_pos - pend_pos > rb->mask) {

  if (!bpf_ringbuf_has_space(rb, new_prod_pos, cons_pos, pend_pos)) {
          raw_res_spin_unlock_irqrestore(&rb->spinlock, flags);
          return NULL;
  }

  /* In overwrite mode, move overwrite_pos to the next record to be

   * overwritten if the ring buffer is full

```
   */
```

hm... here I think the important point is that we search for the next record boundary until which we need to overwrite data such that it fits newly reserved record. "next record to be overwritten" isn't that important (we might never need to overwrite it). Important are those aspects of a) staying on record boundary and b) consuming enough records to reserve the new one.

Can you please update the comment to mention the above points?

...

```
  if (unlikely(rb->overwrite_mode)) {
```

          over_pos = rb->overwrite_pos;

          while (new_prod_pos - over_pos > rb->mask) {

                  hdr = (void *)rb->data + (over_pos & rb->mask);

                  hdr_len = READ_ONCE(hdr->len);

                  /* since pending_pos is the first record with BUSY

                   * bit set and overwrite_pos is never bigger than

                   * pending_pos, no need to check BUSY bit here.

```
                   */
```

honestly, this comment just confused me by implying that BUSY bit might be important (and set) here. But in reality, we are just overwriting already committed data which can't have BUSY bit set. It would be more helpful to mention that bpf_ringbuf_has_space() check above made sure we are not going to step over record that is being actively worked on by some other producer.

...

                  over_pos += ringbuf_round_up_hdr_len(hdr_len);

```
          }
```

          /* smp_store_release(&rb->producer_pos, new_prod_pos) at

           * the end of the function ensures that when consumer sees

           * the updated rb->producer_pos, it always sees the updated

           * rb->overwrite_pos, so when consumer reads overwrite_pos

           * after smp_load_acquire(r->producer_pos), the overwrite_pos

```
           * will always be valid.
```
```
           */
```

          WRITE_ONCE(rb->overwrite_pos, over_pos);

```
  }
```

  hdr = (void *)rb->data + (prod_pos & rb->mask);
  pg_off = bpf_ringbuf_rec_pg_off(rb, hdr);
  hdr->len = size | BPF_RINGBUF_BUSY_BIT;

@@ -479,7 +543,50 @@ const struct bpf_func_proto bpf_ringbuf_reserve_proto = { .arg3_type = ARG_ANYTHING, };

-static void bpf_ringbuf_commit(void *sample, u64 flags, bool discard) +static __always_inline +bool ringbuf_should_wakeup(const struct bpf_ringbuf *rb,

consistent naming: bpf_ringbuf_should_wakeup

...

                     unsigned long rec_pos,

                     unsigned long cons_pos,

                     u32 len, u64 flags)

```
  unsigned long rec_end;
```
```
  if (flags & BPF_RB_FORCE_WAKEUP)
```
```
          return true;
```
```
  if (flags & BPF_RB_NO_WAKEUP)
```
```
          return false;
```

  /* for non-overwrite mode, if consumer caught up and is waiting for

   * our record, notify about new data availability

```
   */
```
```
  if (likely(!rb->overwrite_mode))
```
```
          return cons_pos == rec_pos;
```

  /* for overwrite mode, to give the consumer a chance to catch up

   * before being overwritten, wake up consumer every half a round

```
   * ahead.
```
```
   */
```

  rec_end = rec_pos + ringbuf_round_up_hdr_len(len);

```
  cons_pos &= (rb->mask >> 1);
```
```
  rec_pos &= (rb->mask >> 1);
```
```
  rec_end &= (rb->mask >> 1);
```
```
  if (cons_pos == rec_pos)
```
```
          return true;
```

  if (rec_pos < cons_pos && cons_pos < rec_end)

```
          return true;
```

  if (rec_end < rec_pos && (cons_pos > rec_pos || cons_pos < rec_end))

```
          return true;
```

hm... ok, let's discuss this. Why do we need to do some half-round heuristic for overwrite mode? If a consumer is falling behind it should be actively trying to catch up and they don't need notification (that's the non-overwrite mode logic already).

So there is more to this than a brief comment you left, can you please elaborate?

pw-bot: cr

...

  return false;
+}

+static __always_inline

we didn't have always_inline before, any strong reason to add it now?

...

+void bpf_ringbuf_commit(void *sample, u64 flags, bool discard) { unsigned long rec_pos, cons_pos; struct bpf_ringbuf_hdr *hdr; @@ -495,15 +602,10 @@ static void bpf_ringbuf_commit(void *sample, u64 flags, bool discard) /* update record header with correct final size prefix */ xchg(&hdr->len, new_len);
  /* if consumer caught up and is waiting for our record, notify about
   * new data availability
   */
  rec_pos = (void *)hdr - (void *)rb->data;
  cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;
  if (flags & BPF_RB_FORCE_WAKEUP)
          irq_work_queue(&rb->work);
  else if (cons_pos == rec_pos && !(flags & BPF_RB_NO_WAKEUP))
  if (ringbuf_should_wakeup(rb, rec_pos, cons_pos, new_len, flags))
          irq_work_queue(&rb->work);
}

@@ -576,6 +678,8 @@ BPF_CALL_2(bpf_ringbuf_query, struct bpf_map *, map, u64, flags) return smp_load_acquire(&rb->consumer_pos); case BPF_RB_PROD_POS: return smp_load_acquire(&rb->producer_pos);
  case BPF_RB_OVER_POS:
          return READ_ONCE(rb->overwrite_pos);

do the smp_load_acquire() here just like with all other positions?

...

    default:
            return 0;
    }

@@ -749,6 +853,9 @@ BPF_CALL_4(bpf_user_ringbuf_drain, struct bpf_map *, map,

    rb = container_of(map, struct bpf_ringbuf_map, map)->rb;

```
  if (unlikely(rb->overwrite_mode))
```
```
          return -EOPNOTSUPP;
```

why this check? We don't allow rb->overwrite_mode to be set for user ringbuf, no?

...

  /* If another consumer is already consuming a sample, wait for them to finish. */
  if (!atomic_try_cmpxchg(&rb->busy, &busy, 1))
          return -EBUSY;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 233de8677382..d3b2fd2ae527 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -1430,6 +1430,9 @@ enum {

/* Do not translate kernel bpf_arena pointers to user pointers */ BPF_F_NO_USER_CONV = (1U << 18),

+/* bpf ringbuf works in overwrite mode? */
  BPF_F_OVERWRITE         = (1U << 19),
};

/* Flags for BPF_PROG_QUERY. */ @@ -6215,6 +6218,7 @@ enum { BPF_RB_RING_SIZE = 1, BPF_RB_CONS_POS = 2, BPF_RB_PROD_POS = 3,
  BPF_RB_OVER_POS = 4,
};

/* BPF ring buffer constants */

2.43.0

Xu Kuohai

29 Sep 29 Sep

2:22 a.m.

New subject: [PATCH bpf-next v2 1/3] bpf: Add overwrite mode for bpf ring buffer

On 9/20/2025 6:10 AM, Andrii Nakryiko wrote:

...

On Fri, Sep 5, 2025 at 8:13 AM Xu Kuohai xukuohai@huaweicloud.com wrote:

...
From: Xu Kuohai xukuohai@huawei.com

When the bpf ring buffer is full, new events can not be recorded util

typo: until

ACK

...

...
the consumer consumes some events to free space. This may cause critical events to be discarded, such as in fault diagnostic, where recent events are more critical than older ones.

So add ovewrite mode for bpf ring buffer. In this mode, the new event

overwrite, BPF

ACK

...

...
overwrites the oldest event when the buffer is full.

The scheme is as follows:

producer_pos tracks the next position to write new data. When there is enough free space, producer simply moves producer_pos forward to make space for the new event.

To avoid waiting for consumer to free space when the buffer is full, a new variable overwrite_pos is introduced for producer. overwrite_pos tracks the next event to be overwritten (the oldest event committed) in the buffer. producer moves it forward to discard the oldest events when the buffer is full.

pending_pos tracks the oldest event under committing. producer ensures

"under committing" is confusing. Oldest event to be committed?

Yes, 'the oldest event to be committed'. Thanks!

...

...
producers_pos never passes pending_pos when making space for new events.
So multiple producers never write to the same position at the same time.
producer wakes up consumer every half a round ahead to give it a chance to retrieve data. However, for an overwrite-mode ring buffer, users typically only cares about the ring buffer snapshot before a fault occurs. In this case, the producer should commit data with BPF_RB_NO_WAKEUP flag to avoid unnecessary wakeups.

To make it clear, here are some example diagrams.

Let's say we have a ring buffer with size 4096.

At first, {producer,overwrite,pending,consumer}_pos are all set to 0

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | | +-----------------------------------------------------------------------+ ^ | |

producer_pos = 0 overwrite_pos = 0 pending_pos = 0 consumer_pos = 0

Reserve event A, size 512.

There is enough free space, so A is allocated at offset 0 and producer_pos is moved to 512, the end of A. Since A is not submitted, the BUSY bit is set.

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | A | | | [BUSY] | | +-----------------------------------------------------------------------+ ^ ^ | | | | | producer_pos = 512 |

overwrite_pos = 0 pending_pos = 0 consumer_pos = 0

Reserve event B, size 1024.

B is allocated at offset 512 with BUSY bit set, and producer_pos is moved to the end of B.

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | A | B | | | [BUSY] | [BUSY] | | +-----------------------------------------------------------------------+ ^ ^ | | | | | producer_pos = 1536 |

overwrite_pos = 0 pending_pos = 0 consumer_pos = 0

Reserve event C, size 2048.

C is allocated at offset 1536 and producer_pos becomes 3584.

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | | A | B | C | | | [BUSY] | [BUSY] | [BUSY] | | +-----------------------------------------------------------------------+ ^ ^ | | | | | producer_pos = 3584 |

overwrite_pos = 0 pending_pos = 0 consumer_pos = 0

Submit event A.

The BUSY bit of A is cleared. B becomes the oldest event under writing, so
Now it's "under writing" :) To be committed? Or "pending committing" or just "pending", I guess. But not under anything, it just confuses readers. IMO.

Once again, 'oldest event to be committed'.

I should check it with an AI agent first.

...

 pending_pos is moved to 512, the start of B.

 0       512      1024    1536     2048     2560     3072     3584       4096
 +-----------------------------------------------------------------------+
 |        |                 |                                   |        |
 |    A   |        B        |                 C                 |        |
 |        |      [BUSY]     |               [BUSY]              |        |
 +-----------------------------------------------------------------------+
 ^        ^                                                     ^
 |        |                                                     |
 |        |                                                     |
 |   pending_pos = 512                                  producer_pos = 3584
 |

overwrite_pos = 0 consumer_pos = 0

Submit event B.

The BUSY bit of B is cleared, and pending_pos is moved to the start of C, which is the oldest event under writing now.

ditto

Again and again :(

...

...
 0       512      1024    1536     2048     2560     3072     3584       4096
 +-----------------------------------------------------------------------+
 |        |                 |                                   |        |
 |    A   |        B        |                 C                 |        |
 |        |                 |               [BUSY]              |        |
 +-----------------------------------------------------------------------+
 ^                          ^                                   ^
 |                          |                                   |
 |                          |                                   |
 |                     pending_pos = 1536               producer_pos = 3584
 |
overwrite_pos = 0 consumer_pos = 0

Reserve event D, size 1536 (3 * 512).

There are 2048 bytes not under writing between producer_pos and pending_pos, so D is allocated at offset 3584, and producer_pos is moved from 3584 to 5120.

Since event D will overwrite all bytes of event A and the begining 512 bytes
typo: beginning, but really "first 512 bytes" would be clearer

OK, I’ll switch to 'first 512 bytes' for clarity.

...

...
 of event B, overwrite_pos is moved to the start of event C, the oldest event
 that is not overwritten.

 0       512      1024    1536     2048     2560     3072     3584       4096
 +-----------------------------------------------------------------------+
 |                 |        |                                   |        |
 |      D End      |        |                 C                 | D Begin|
 |      [BUSY]     |        |               [BUSY]              | [BUSY] |
 +-----------------------------------------------------------------------+
 ^                 ^        ^
 |                 |        |
 |                 |   pending_pos = 1536
 |                 |   overwrite_pos = 1536
 |                 |
 |             producer_pos=5120
 |
consumer_pos = 0

Reserve event E, size 1024.

Though there are 512 bytes not under writing between producer_pos and pending_pos, E can not be reserved, as it would overwrite the first 512 bytes of event C, which is still under writing.

Submit event C and D.

pending_pos is moved to the end of D.

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | | D End | | C | D Begin| | | | | | +-----------------------------------------------------------------------+ ^ ^ ^ | | | | | overwrite_pos = 1536 | | | producer_pos=5120 | pending_pos=5120 |

consumer_pos = 0

The performance data for overwrite mode will be provided in a follow-up patch that adds overwrite mode benchs.

A sample of performance data for non-overwrite mode on an x86_64 and arm64 CPU, before and after this patch, is shown below. As we can see, no obvious performance regression occurs.

x86_64 (AMD EPYC 9654)

Before:

Ringbuf, multi-producer contention

rb-libbpf nr_prod 1 13.218 ± 0.039M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 15.684 ± 0.015M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 7.771 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 6.281 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 2.842 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 2.001 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 1.833 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 1.508 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 1.421 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 1.309 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 1.265 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 1.198 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 1.174 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 1.113 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 1.097 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 1.070 ± 0.002M/s (drops 0.000 ± 0.000M/s)

After:

Ringbuf, multi-producer contention

rb-libbpf nr_prod 1 13.751 ± 0.673M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 15.592 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 7.776 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 6.463 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 2.883 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 2.017 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 1.816 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 1.512 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 1.396 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 1.303 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 1.267 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 1.210 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 1.181 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 1.136 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 1.090 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 1.091 ± 0.002M/s (drops 0.000 ± 0.000M/s)

arm64 (HiSilicon Kunpeng 920)

Before:

Ringbuf, multi-producer contention

rb-libbpf nr_prod 1 11.602 ± 0.423M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 9.599 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 6.669 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 4.806 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 3.856 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 3.368 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 3.210 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 3.003 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 2.944 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 2.863 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 2.819 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 2.887 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 2.837 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 2.787 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 2.738 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 2.700 ± 0.007M/s (drops 0.000 ± 0.000M/s)

After:

Ringbuf, multi-producer contention

rb-libbpf nr_prod 1 11.614 ± 0.268M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 9.917 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 6.920 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 4.803 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 3.898 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 3.426 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 3.320 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 3.029 ± 0.013M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 3.068 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 2.890 ± 0.009M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 2.950 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 2.812 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 2.834 ± 0.009M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 2.803 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 2.766 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 2.754 ± 0.009M/s (drops 0.000 ± 0.000M/s)

Signed-off-by: Xu Kuohai xukuohai@huawei.com

include/uapi/linux/bpf.h | 4 + kernel/bpf/ringbuf.c | 159 +++++++++++++++++++++++++++------ tools/include/uapi/linux/bpf.h | 4 + 3 files changed, 141 insertions(+), 26 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 233de8677382..d3b2fd2ae527 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1430,6 +1430,9 @@ enum {

/* Do not translate kernel bpf_arena pointers to user pointers */ BPF_F_NO_USER_CONV = (1U << 18),

+/* bpf ringbuf works in overwrite mode? */
  BPF_F_OVERWRITE         = (1U << 19),
let's call it BPF_F_RB_OVERWRITE as this is ringbuf-specific? And use imperative voice in the comment:

/* Enable BPF ringbuf overwrite mode */

...

...
};

/* Flags for BPF_PROG_QUERY. */ @@ -6215,6 +6218,7 @@ enum { BPF_RB_RING_SIZE = 1, BPF_RB_CONS_POS = 2, BPF_RB_PROD_POS = 3,
  BPF_RB_OVER_POS = 4,
nit: BPF_RB_OVERWITE_POS?

ACK

...

...
};

/* BPF ring buffer constants */ diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c index 719d73299397..6ca41d01f187 100644 --- a/kernel/bpf/ringbuf.c +++ b/kernel/bpf/ringbuf.c @@ -13,7 +13,7 @@ #include <linux/btf_ids.h> #include <asm/rqspinlock.h>

-#define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE) +#define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE | BPF_F_OVERWRITE)

/* non-mmap()'able part of bpf_ringbuf (everything up to consumer page) */ #define RINGBUF_PGOFF \ @@ -27,7 +27,8 @@ struct bpf_ringbuf { wait_queue_head_t waitq; struct irq_work work;
  u64 mask;
  u64 mask:48;
  u64 overwrite_mode:1;
Please, don't touch the mask field, it's a very hot field, no need to make it a bit field. Just add a separate bool for overwrite_mode.

ACK

...

...
     struct page **pages;
     int nr_pages;
     rqspinlock_t spinlock ____cacheline_aligned_in_smp;
@@ -72,6 +73,7 @@ struct bpf_ringbuf { */ unsigned long consumer_pos __aligned(PAGE_SIZE); unsigned long producer_pos __aligned(PAGE_SIZE);
  unsigned long overwrite_pos;  /* to be overwritten in overwrite mode */
Not a really precise comment, IMO. This is a position pointing to after the last overwritten record, no?

Yes, It’s actually the position after the last overwritten record. I'll update the comment for clarity.

...

...
     unsigned long pending_pos;
     char data[] __aligned(PAGE_SIZE);
}; @@ -166,7 +168,8 @@ static void bpf_ringbuf_notify(struct irq_work *work)

considering that the maximum value of data_sz is (4GB - 1), there

will be no overflow, so just note the size limit in the comments.

*/ -static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node) +static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node,
                                       int overwrite_mode)
{ struct bpf_ringbuf *rb;
@@ -183,17 +186,25 @@ static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node) rb->consumer_pos = 0; rb->producer_pos = 0; rb->pending_pos = 0;
  rb->overwrite_mode = overwrite_mode;

   return rb;
}

static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr) {
  int overwrite_mode = 0;
   struct bpf_ringbuf_map *rb_map;

   if (attr->map_flags & ~RINGBUF_CREATE_FLAG_MASK)
           return ERR_PTR(-EINVAL);
  if (attr->map_flags & BPF_F_OVERWRITE) {
          if (attr->map_type == BPF_MAP_TYPE_USER_RINGBUF)
                  return ERR_PTR(-EINVAL);
          overwrite_mode = 1;
  }
   if (attr->key_size || attr->value_size ||
       !is_power_of_2(attr->max_entries) ||
       !PAGE_ALIGNED(attr->max_entries))
@@ -205,7 +216,8 @@ static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
     bpf_map_init_from_attr(&rb_map->map, attr);
  rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node);
  rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node,
                                 overwrite_mode);
keep on single line, it fits under 100 characters

...

...
     if (!rb_map->rb) {
             bpf_map_area_free(rb_map);
             return ERR_PTR(-ENOMEM);
@@ -295,11 +307,16 @@ static int ringbuf_map_mmap_user(struct bpf_map *map, struct vm_area_struct *vma

static unsigned long ringbuf_avail_data_sz(struct bpf_ringbuf *rb) {
  unsigned long cons_pos, prod_pos;
  unsigned long cons_pos, prod_pos, over_pos;

   cons_pos = smp_load_acquire(&rb->consumer_pos);
   prod_pos = smp_load_acquire(&rb->producer_pos);
  return prod_pos - cons_pos;
  if (likely(!rb->overwrite_mode))
          return prod_pos - cons_pos;
nit: invert the condition to unlikely and handle that special case in a nested if, moving "over_pos" inside the if itself

...

...
  over_pos = READ_ONCE(rb->overwrite_pos);
  return min(prod_pos - max(cons_pos, over_pos), rb->mask + 1);
I'm trying to understand why you need to min with `rb->mask + 1`, can you please elaborate?

We need the min because rb->producer_pos and rb->overwrite_pos are read at different times. During this gap, a fast producer may wrap once or more, making over_pos larger than prod_pos.

...

And also, at least for consistency, use smp_load_acquire() for overwrite_pos?

Using READ_ONCE here is to stay symmetric with __bpf_ringbuf_reserve(), where overwrite_pos is WRITE_ONCE first, followed by smp_store_release(producer_pos). So here we do smp_load_acquire(producer_pos) first, then READ_ONCE(overwrite_pos) to ensure a consistent view of the ring buffer.

For consistency when reading consumer_pos and producer_pos, I’m fine with switching READ_ONCE to smp_load_acquire for overwrite_pos.

...

...
}

static u32 ringbuf_total_data_sz(const struct bpf_ringbuf *rb) @@ -402,11 +419,43 @@ bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr) return (void*)((addr & PAGE_MASK) - off); }

+static bool bpf_ringbuf_has_space(const struct bpf_ringbuf *rb,
                            unsigned long new_prod_pos,
                            unsigned long cons_pos,
                            unsigned long pend_pos)
+{
  /* no space if oldest not yet committed record until the newest
   * record span more than (ringbuf_size - 1)
   */
  if (new_prod_pos - pend_pos > rb->mask)
          return false;
  /* ok, we have space in ovewrite mode */
typo: overwrite

...

```
  if (unlikely(rb->overwrite_mode))
```
```
          return true;
```

  /* no space if producer position advances more than (ringbuf_size - 1)

   * ahead than consumer position when not in overwrite mode

typo: ahead of consumer position

...

...
   */
  if (new_prod_pos - cons_pos > rb->mask)
          return false;
  return true;
+}

+static u32 ringbuf_round_up_hdr_len(u32 hdr_len)
use consistent naming, if you have bpf_ringbuf_has_space, then this should have been bpf_ringbuf_round_up_len() or something like that.

OK, will add "bpf_" prefix

...

```
  hdr_len &= ~BPF_RINGBUF_DISCARD_BIT;
```

  return round_up(hdr_len + BPF_RINGBUF_HDR_SZ, 8);

static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size) {

  unsigned long cons_pos, prod_pos, new_prod_pos, pend_pos, flags;

  unsigned long flags;
   struct bpf_ringbuf_hdr *hdr;

```
  u32 len, pg_off, tmp_size, hdr_len;
```

```
  u32 len, pg_off, hdr_len;
```

  unsigned long cons_pos, prod_pos, new_prod_pos, pend_pos, over_pos;

100 character line length limit, just add over_pos to original single line declaration

...

...
     if (unlikely(size > RINGBUF_MAX_RECORD_SZ))
             return NULL;
@@ -429,24 +478,39 @@ static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size) hdr_len = READ_ONCE(hdr->len); if (hdr_len & BPF_RINGBUF_BUSY_BIT) break;
          tmp_size = hdr_len & ~BPF_RINGBUF_DISCARD_BIT;
          tmp_size = round_up(tmp_size + BPF_RINGBUF_HDR_SZ, 8);
          pend_pos += tmp_size;
          pend_pos += ringbuf_round_up_hdr_len(hdr_len);
   }
   rb->pending_pos = pend_pos;
  /* check for out of ringbuf space:
   * - by ensuring producer position doesn't advance more than
   *   (ringbuf_size - 1) ahead
   * - by ensuring oldest not yet committed record until newest
   *   record does not span more than (ringbuf_size - 1)
   */
  if (new_prod_pos - cons_pos > rb->mask ||
      new_prod_pos - pend_pos > rb->mask) {
  if (!bpf_ringbuf_has_space(rb, new_prod_pos, cons_pos, pend_pos)) {
           raw_res_spin_unlock_irqrestore(&rb->spinlock, flags);
           return NULL;
   }
  /* In overwrite mode, move overwrite_pos to the next record to be
   * overwritten if the ring buffer is full
   */
hm... here I think the important point is that we search for the next record boundary until which we need to overwrite data such that it fits newly reserved record. "next record to be overwritten" isn't that important (we might never need to overwrite it). Important are those aspects of a) staying on record boundary and b) consuming enough records to reserve the new one.

Can you please update the comment to mention the above points?

Sure, I'll update the comment to:

In overwrite mode, advance overwrite_pos when the ring buffer is full. The key points are to stay on record boundaries and consume enough records to fit the new one.

...

...
  if (unlikely(rb->overwrite_mode)) {
          over_pos = rb->overwrite_pos;
          while (new_prod_pos - over_pos > rb->mask) {
                  hdr = (void *)rb->data + (over_pos & rb->mask);
                  hdr_len = READ_ONCE(hdr->len);
                  /* since pending_pos is the first record with BUSY
                   * bit set and overwrite_pos is never bigger than
                   * pending_pos, no need to check BUSY bit here.
                   */
honestly, this comment just confused me by implying that BUSY bit might be important (and set) here. But in reality, we are just overwriting already committed data which can't have BUSY bit set. It would be more helpful to mention that bpf_ringbuf_has_space() check above made sure we are not going to step over record that is being actively worked on by some other producer.

Sorry for the confusion, and thanks for the clarification. I’ll update the comment to:

The bpf_ringbuf_has_space() check above ensures we won’t step over a record currently being worked on by another producer.

...

...
                  over_pos += ringbuf_round_up_hdr_len(hdr_len);
          }
          /* smp_store_release(&rb->producer_pos, new_prod_pos) at
           * the end of the function ensures that when consumer sees
           * the updated rb->producer_pos, it always sees the updated
           * rb->overwrite_pos, so when consumer reads overwrite_pos
           * after smp_load_acquire(r->producer_pos), the overwrite_pos
           * will always be valid.
           */
          WRITE_ONCE(rb->overwrite_pos, over_pos);
  }
   hdr = (void *)rb->data + (prod_pos & rb->mask);
   pg_off = bpf_ringbuf_rec_pg_off(rb, hdr);
   hdr->len = size | BPF_RINGBUF_BUSY_BIT;
@@ -479,7 +543,50 @@ const struct bpf_func_proto bpf_ringbuf_reserve_proto = { .arg3_type = ARG_ANYTHING, };

-static void bpf_ringbuf_commit(void *sample, u64 flags, bool discard) +static __always_inline +bool ringbuf_should_wakeup(const struct bpf_ringbuf *rb,
consistent naming: bpf_ringbuf_should_wakeup

...
                     unsigned long rec_pos,
                     unsigned long cons_pos,
                     u32 len, u64 flags)
+{
  unsigned long rec_end;
  if (flags & BPF_RB_FORCE_WAKEUP)
          return true;
  if (flags & BPF_RB_NO_WAKEUP)
          return false;
  /* for non-overwrite mode, if consumer caught up and is waiting for
   * our record, notify about new data availability
   */
  if (likely(!rb->overwrite_mode))
          return cons_pos == rec_pos;
  /* for overwrite mode, to give the consumer a chance to catch up
   * before being overwritten, wake up consumer every half a round
   * ahead.
   */
  rec_end = rec_pos + ringbuf_round_up_hdr_len(len);
  cons_pos &= (rb->mask >> 1);
  rec_pos &= (rb->mask >> 1);
  rec_end &= (rb->mask >> 1);
  if (cons_pos == rec_pos)
          return true;
  if (rec_pos < cons_pos && cons_pos < rec_end)
          return true;
  if (rec_end < rec_pos && (cons_pos > rec_pos || cons_pos < rec_end))
          return true;
hm... ok, let's discuss this. Why do we need to do some half-round heuristic for overwrite mode? If a consumer is falling behind it should be actively trying to catch up and they don't need notification (that's the non-overwrite mode logic already).

So there is more to this than a brief comment you left, can you please elaborate?

pw-bot: cr

...
  return false;
+}

+static __always_inline
we didn't have always_inline before, any strong reason to add it now?

...
+void bpf_ringbuf_commit(void *sample, u64 flags, bool discard) { unsigned long rec_pos, cons_pos; struct bpf_ringbuf_hdr *hdr; @@ -495,15 +602,10 @@ static void bpf_ringbuf_commit(void *sample, u64 flags, bool discard) /* update record header with correct final size prefix */ xchg(&hdr->len, new_len);
  /* if consumer caught up and is waiting for our record, notify about
   * new data availability
   */
   rec_pos = (void *)hdr - (void *)rb->data;
   cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;
  if (flags & BPF_RB_FORCE_WAKEUP)
          irq_work_queue(&rb->work);
  else if (cons_pos == rec_pos && !(flags & BPF_RB_NO_WAKEUP))
  if (ringbuf_should_wakeup(rb, rec_pos, cons_pos, new_len, flags))
           irq_work_queue(&rb->work);
}
@@ -576,6 +678,8 @@ BPF_CALL_2(bpf_ringbuf_query, struct bpf_map *, map, u64, flags) return smp_load_acquire(&rb->consumer_pos); case BPF_RB_PROD_POS: return smp_load_acquire(&rb->producer_pos);
  case BPF_RB_OVER_POS:
          return READ_ONCE(rb->overwrite_pos);
do the smp_load_acquire() here just like with all other positions?

...
     default:
             return 0;
     }
@@ -749,6 +853,9 @@ BPF_CALL_4(bpf_user_ringbuf_drain, struct bpf_map *, map,
     rb = container_of(map, struct bpf_ringbuf_map, map)->rb;
  if (unlikely(rb->overwrite_mode))
          return -EOPNOTSUPP;
why this check? We don't allow rb->overwrite_mode to be set for user ringbuf, no?

...
   /* If another consumer is already consuming a sample, wait for them to finish. */
   if (!atomic_try_cmpxchg(&rb->busy, &busy, 1))
           return -EBUSY;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 233de8677382..d3b2fd2ae527 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -1430,6 +1430,9 @@ enum {

/* Do not translate kernel bpf_arena pointers to user pointers */ BPF_F_NO_USER_CONV = (1U << 18),

+/* bpf ringbuf works in overwrite mode? */
  BPF_F_OVERWRITE         = (1U << 19),
};

/* Flags for BPF_PROG_QUERY. */
@@ -6215,6 +6218,7 @@ enum { BPF_RB_RING_SIZE = 1, BPF_RB_CONS_POS = 2, BPF_RB_PROD_POS = 3,
  BPF_RB_OVER_POS = 4,
};

/* BPF ring buffer constants */
-- 2.43.0

Xu Kuohai

2:26 a.m.

New subject: [PATCH bpf-next v2 1/3] bpf: Add overwrite mode for bpf ring buffer

On 9/29/2025 10:22 AM, Xu Kuohai wrote:

...

On 9/20/2025 6:10 AM, Andrii Nakryiko wrote:

...
On Fri, Sep 5, 2025 at 8:13 AM Xu Kuohai xukuohai@huaweicloud.com wrote:

...
From: Xu Kuohai xukuohai@huawei.com

When the bpf ring buffer is full, new events can not be recorded util

typo: until

ACK

[...]

oops, I wasn’t done with my reply and hit send by mistake. Please ignore it, thanks!

Xu Kuohai

1:41 p.m.

New subject: [PATCH bpf-next v2 1/3] bpf: Add overwrite mode for bpf ring buffer

On 9/20/2025 6:10 AM, Andrii Nakryiko wrote:

...

On Fri, Sep 5, 2025 at 8:13 AM Xu Kuohai xukuohai@huaweicloud.com wrote:

...
From: Xu Kuohai xukuohai@huawei.com

When the bpf ring buffer is full, new events can not be recorded util

typo: until

ACK

...

...
the consumer consumes some events to free space. This may cause critical events to be discarded, such as in fault diagnostic, where recent events are more critical than older ones.

So add ovewrite mode for bpf ring buffer. In this mode, the new event

overwrite, BPF

ACK

...

...
overwrites the oldest event when the buffer is full.

The scheme is as follows:

producer_pos tracks the next position to write new data. When there is enough free space, producer simply moves producer_pos forward to make space for the new event.

To avoid waiting for consumer to free space when the buffer is full, a new variable overwrite_pos is introduced for producer. overwrite_pos tracks the next event to be overwritten (the oldest event committed) in the buffer. producer moves it forward to discard the oldest events when the buffer is full.

pending_pos tracks the oldest event under committing. producer ensures

"under committing" is confusing. Oldest event to be committed?

Yes, 'the oldest event to be committed'. Thanks!

...

...
producers_pos never passes pending_pos when making space for new events.
So multiple producers never write to the same position at the same time.
producer wakes up consumer every half a round ahead to give it a chance to retrieve data. However, for an overwrite-mode ring buffer, users typically only cares about the ring buffer snapshot before a fault occurs. In this case, the producer should commit data with BPF_RB_NO_WAKEUP flag to avoid unnecessary wakeups.

To make it clear, here are some example diagrams.

Let's say we have a ring buffer with size 4096.

At first, {producer,overwrite,pending,consumer}_pos are all set to 0

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | | +-----------------------------------------------------------------------+ ^ | |

producer_pos = 0 overwrite_pos = 0 pending_pos = 0 consumer_pos = 0

Reserve event A, size 512.

There is enough free space, so A is allocated at offset 0 and producer_pos is moved to 512, the end of A. Since A is not submitted, the BUSY bit is set.

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | A | | | [BUSY] | | +-----------------------------------------------------------------------+ ^ ^ | | | | | producer_pos = 512 |

overwrite_pos = 0 pending_pos = 0 consumer_pos = 0

Reserve event B, size 1024.

B is allocated at offset 512 with BUSY bit set, and producer_pos is moved to the end of B.

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | A | B | | | [BUSY] | [BUSY] | | +-----------------------------------------------------------------------+ ^ ^ | | | | | producer_pos = 1536 |

overwrite_pos = 0 pending_pos = 0 consumer_pos = 0

Reserve event C, size 2048.

C is allocated at offset 1536 and producer_pos becomes 3584.

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | | A | B | C | | | [BUSY] | [BUSY] | [BUSY] | | +-----------------------------------------------------------------------+ ^ ^ | | | | | producer_pos = 3584 |

overwrite_pos = 0 pending_pos = 0 consumer_pos = 0

Submit event A.

The BUSY bit of A is cleared. B becomes the oldest event under writing, so
Now it's "under writing" :) To be committed? Or "pending committing" or just "pending", I guess. But not under anything, it just confuses readers. IMO.

Once again, 'oldest event to be committed'.

I should check it with an AI agent first.

...

 pending_pos is moved to 512, the start of B.

 0       512      1024    1536     2048     2560     3072     3584       4096
 +-----------------------------------------------------------------------+
 |        |                 |                                   |        |
 |    A   |        B        |                 C                 |        |
 |        |      [BUSY]     |               [BUSY]              |        |
 +-----------------------------------------------------------------------+
 ^        ^                                                     ^
 |        |                                                     |
 |        |                                                     |
 |   pending_pos = 512                                  producer_pos = 3584
 |

overwrite_pos = 0 consumer_pos = 0

Submit event B.

The BUSY bit of B is cleared, and pending_pos is moved to the start of C, which is the oldest event under writing now.

ditto

Again and again :(

...

...
 0       512      1024    1536     2048     2560     3072     3584       4096
 +-----------------------------------------------------------------------+
 |        |                 |                                   |        |
 |    A   |        B        |                 C                 |        |
 |        |                 |               [BUSY]              |        |
 +-----------------------------------------------------------------------+
 ^                          ^                                   ^
 |                          |                                   |
 |                          |                                   |
 |                     pending_pos = 1536               producer_pos = 3584
 |
overwrite_pos = 0 consumer_pos = 0

Reserve event D, size 1536 (3 * 512).

There are 2048 bytes not under writing between producer_pos and pending_pos, so D is allocated at offset 3584, and producer_pos is moved from 3584 to 5120.

Since event D will overwrite all bytes of event A and the begining 512 bytes
typo: beginning, but really "first 512 bytes" would be clearer

OK, I’ll switch to 'first 512 bytes' for clarity.

...

...
 of event B, overwrite_pos is moved to the start of event C, the oldest event
 that is not overwritten.

 0       512      1024    1536     2048     2560     3072     3584       4096
 +-----------------------------------------------------------------------+
 |                 |        |                                   |        |
 |      D End      |        |                 C                 | D Begin|
 |      [BUSY]     |        |               [BUSY]              | [BUSY] |
 +-----------------------------------------------------------------------+
 ^                 ^        ^
 |                 |        |
 |                 |   pending_pos = 1536
 |                 |   overwrite_pos = 1536
 |                 |
 |             producer_pos=5120
 |
consumer_pos = 0

Reserve event E, size 1024.

Though there are 512 bytes not under writing between producer_pos and pending_pos, E can not be reserved, as it would overwrite the first 512 bytes of event C, which is still under writing.

Submit event C and D.

pending_pos is moved to the end of D.

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | | D End | | C | D Begin| | | | | | +-----------------------------------------------------------------------+ ^ ^ ^ | | | | | overwrite_pos = 1536 | | | producer_pos=5120 | pending_pos=5120 |

consumer_pos = 0

The performance data for overwrite mode will be provided in a follow-up patch that adds overwrite mode benchs.

A sample of performance data for non-overwrite mode on an x86_64 and arm64 CPU, before and after this patch, is shown below. As we can see, no obvious performance regression occurs.

x86_64 (AMD EPYC 9654)

Before:

Ringbuf, multi-producer contention

rb-libbpf nr_prod 1 13.218 ± 0.039M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 15.684 ± 0.015M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 7.771 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 6.281 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 2.842 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 2.001 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 1.833 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 1.508 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 1.421 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 1.309 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 1.265 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 1.198 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 1.174 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 1.113 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 1.097 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 1.070 ± 0.002M/s (drops 0.000 ± 0.000M/s)

After:

Ringbuf, multi-producer contention

rb-libbpf nr_prod 1 13.751 ± 0.673M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 15.592 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 7.776 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 6.463 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 2.883 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 2.017 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 1.816 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 1.512 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 1.396 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 1.303 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 1.267 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 1.210 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 1.181 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 1.136 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 1.090 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 1.091 ± 0.002M/s (drops 0.000 ± 0.000M/s)

arm64 (HiSilicon Kunpeng 920)

Before:

Ringbuf, multi-producer contention

rb-libbpf nr_prod 1 11.602 ± 0.423M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 9.599 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 6.669 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 4.806 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 3.856 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 3.368 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 3.210 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 3.003 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 2.944 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 2.863 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 2.819 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 2.887 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 2.837 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 2.787 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 2.738 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 2.700 ± 0.007M/s (drops 0.000 ± 0.000M/s)

After:

Ringbuf, multi-producer contention

rb-libbpf nr_prod 1 11.614 ± 0.268M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 9.917 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 6.920 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 4.803 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 3.898 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 3.426 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 3.320 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 3.029 ± 0.013M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 3.068 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 2.890 ± 0.009M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 2.950 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 2.812 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 2.834 ± 0.009M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 2.803 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 2.766 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 2.754 ± 0.009M/s (drops 0.000 ± 0.000M/s)

Signed-off-by: Xu Kuohai xukuohai@huawei.com

include/uapi/linux/bpf.h | 4 + kernel/bpf/ringbuf.c | 159 +++++++++++++++++++++++++++------ tools/include/uapi/linux/bpf.h | 4 + 3 files changed, 141 insertions(+), 26 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 233de8677382..d3b2fd2ae527 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1430,6 +1430,9 @@ enum {

/* Do not translate kernel bpf_arena pointers to user pointers */ BPF_F_NO_USER_CONV = (1U << 18),

+/* bpf ringbuf works in overwrite mode? */
  BPF_F_OVERWRITE         = (1U << 19),
let's call it BPF_F_RB_OVERWRITE as this is ringbuf-specific? And use imperative voice in the comment:

/* Enable BPF ringbuf overwrite mode */

OK, will call it BPF_F_RB_OVERWRITE and use this comment.

...

...
};

/* Flags for BPF_PROG_QUERY. */ @@ -6215,6 +6218,7 @@ enum { BPF_RB_RING_SIZE = 1, BPF_RB_CONS_POS = 2, BPF_RB_PROD_POS = 3,
  BPF_RB_OVER_POS = 4,
nit: BPF_RB_OVERWITE_POS?

...

...
};

/* BPF ring buffer constants */ diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c index 719d73299397..6ca41d01f187 100644 --- a/kernel/bpf/ringbuf.c +++ b/kernel/bpf/ringbuf.c @@ -13,7 +13,7 @@ #include <linux/btf_ids.h> #include <asm/rqspinlock.h>

-#define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE) +#define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE | BPF_F_OVERWRITE)

/* non-mmap()'able part of bpf_ringbuf (everything up to consumer page) */ #define RINGBUF_PGOFF \ @@ -27,7 +27,8 @@ struct bpf_ringbuf { wait_queue_head_t waitq; struct irq_work work;
  u64 mask;
  u64 mask:48;
  u64 overwrite_mode:1;
Please, don't touch the mask field, it's a very hot field, no need to make it a bit field. Just add a separate bool for overwrite_mode.

...

...
     struct page **pages;
     int nr_pages;
     rqspinlock_t spinlock ____cacheline_aligned_in_smp;
@@ -72,6 +73,7 @@ struct bpf_ringbuf { */ unsigned long consumer_pos __aligned(PAGE_SIZE); unsigned long producer_pos __aligned(PAGE_SIZE);
  unsigned long overwrite_pos;  /* to be overwritten in overwrite mode */
Not a really precise comment, IMO. This is a position pointing to after the last overwritten record, no?

Yes, It’s actually the position after the last overwritten record. I'll update the comment for clarity.

...

...
     unsigned long pending_pos;
     char data[] __aligned(PAGE_SIZE);
}; @@ -166,7 +168,8 @@ static void bpf_ringbuf_notify(struct irq_work *work)

considering that the maximum value of data_sz is (4GB - 1), there

will be no overflow, so just note the size limit in the comments.

*/ -static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node) +static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node,
                                       int overwrite_mode)
{ struct bpf_ringbuf *rb;
@@ -183,17 +186,25 @@ static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node) rb->consumer_pos = 0; rb->producer_pos = 0; rb->pending_pos = 0;
  rb->overwrite_mode = overwrite_mode;

   return rb;
}

static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr) {
  int overwrite_mode = 0;
   struct bpf_ringbuf_map *rb_map;

   if (attr->map_flags & ~RINGBUF_CREATE_FLAG_MASK)
           return ERR_PTR(-EINVAL);
  if (attr->map_flags & BPF_F_OVERWRITE) {
          if (attr->map_type == BPF_MAP_TYPE_USER_RINGBUF)
                  return ERR_PTR(-EINVAL);
          overwrite_mode = 1;
  }
   if (attr->key_size || attr->value_size ||
       !is_power_of_2(attr->max_entries) ||
       !PAGE_ALIGNED(attr->max_entries))
@@ -205,7 +216,8 @@ static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
     bpf_map_init_from_attr(&rb_map->map, attr);
  rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node);
  rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node,
                                 overwrite_mode);
keep on single line, it fits under 100 characters

...

...
     if (!rb_map->rb) {
             bpf_map_area_free(rb_map);
             return ERR_PTR(-ENOMEM);
@@ -295,11 +307,16 @@ static int ringbuf_map_mmap_user(struct bpf_map *map, struct vm_area_struct *vma

static unsigned long ringbuf_avail_data_sz(struct bpf_ringbuf *rb) {
  unsigned long cons_pos, prod_pos;
  unsigned long cons_pos, prod_pos, over_pos;

   cons_pos = smp_load_acquire(&rb->consumer_pos);
   prod_pos = smp_load_acquire(&rb->producer_pos);
  return prod_pos - cons_pos;
  if (likely(!rb->overwrite_mode))
          return prod_pos - cons_pos;
nit: invert the condition to unlikely and handle that special case in a nested if, moving "over_pos" inside the if itself

...

...
  over_pos = READ_ONCE(rb->overwrite_pos);
  return min(prod_pos - max(cons_pos, over_pos), rb->mask + 1);
I'm trying to understand why you need to min with `rb->mask + 1`, can you please elaborate?

We need the min because rb->producer_pos and rb->overwrite_pos are read at different times. During this gap, a fast producer may wrap once or more, making over_pos larger than prod_pos.

...

And also, at least for consistency, use smp_load_acquire() for overwrite_pos?

For consistency when reading consumer_pos and producer_pos, I’m fine with switching READ_ONCE to smp_load_acquire for overwrite_pos.

...

...
}

static u32 ringbuf_total_data_sz(const struct bpf_ringbuf *rb) @@ -402,11 +419,43 @@ bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr) return (void*)((addr & PAGE_MASK) - off); }

+static bool bpf_ringbuf_has_space(const struct bpf_ringbuf *rb,
                            unsigned long new_prod_pos,
                            unsigned long cons_pos,
                            unsigned long pend_pos)
+{
  /* no space if oldest not yet committed record until the newest
   * record span more than (ringbuf_size - 1)
   */
  if (new_prod_pos - pend_pos > rb->mask)
          return false;
  /* ok, we have space in ovewrite mode */
typo: overwrite

...

```
  if (unlikely(rb->overwrite_mode))
```
```
          return true;
```

  /* no space if producer position advances more than (ringbuf_size - 1)

   * ahead than consumer position when not in overwrite mode

typo: ahead of consumer position

...

...
   */
  if (new_prod_pos - cons_pos > rb->mask)
          return false;
  return true;
+}

+static u32 ringbuf_round_up_hdr_len(u32 hdr_len)
use consistent naming, if you have bpf_ringbuf_has_space, then this should have been bpf_ringbuf_round_up_len() or something like that.

OK, will add "bpf_" prefix

...

```
  hdr_len &= ~BPF_RINGBUF_DISCARD_BIT;
```

  return round_up(hdr_len + BPF_RINGBUF_HDR_SZ, 8);

static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size) {

  unsigned long cons_pos, prod_pos, new_prod_pos, pend_pos, flags;

  unsigned long flags;
   struct bpf_ringbuf_hdr *hdr;

```
  u32 len, pg_off, tmp_size, hdr_len;
```

```
  u32 len, pg_off, hdr_len;
```

  unsigned long cons_pos, prod_pos, new_prod_pos, pend_pos, over_pos;

100 character line length limit, just add over_pos to original single line declaration

...

...
     if (unlikely(size > RINGBUF_MAX_RECORD_SZ))
             return NULL;
@@ -429,24 +478,39 @@ static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size) hdr_len = READ_ONCE(hdr->len); if (hdr_len & BPF_RINGBUF_BUSY_BIT) break;
          tmp_size = hdr_len & ~BPF_RINGBUF_DISCARD_BIT;
          tmp_size = round_up(tmp_size + BPF_RINGBUF_HDR_SZ, 8);
          pend_pos += tmp_size;
          pend_pos += ringbuf_round_up_hdr_len(hdr_len);
   }
   rb->pending_pos = pend_pos;
  /* check for out of ringbuf space:
   * - by ensuring producer position doesn't advance more than
   *   (ringbuf_size - 1) ahead
   * - by ensuring oldest not yet committed record until newest
   *   record does not span more than (ringbuf_size - 1)
   */
  if (new_prod_pos - cons_pos > rb->mask ||
      new_prod_pos - pend_pos > rb->mask) {
  if (!bpf_ringbuf_has_space(rb, new_prod_pos, cons_pos, pend_pos)) {
           raw_res_spin_unlock_irqrestore(&rb->spinlock, flags);
           return NULL;
   }
  /* In overwrite mode, move overwrite_pos to the next record to be
   * overwritten if the ring buffer is full
   */
hm... here I think the important point is that we search for the next record boundary until which we need to overwrite data such that it fits newly reserved record. "next record to be overwritten" isn't that important (we might never need to overwrite it). Important are those aspects of a) staying on record boundary and b) consuming enough records to reserve the new one.

Can you please update the comment to mention the above points?

Sure, I'll update the comment to:

In overwrite mode, advance overwrite_pos when the ring buffer is full. The key points are to stay on record boundaries and consume enough records to fit the new one.

...

...
  if (unlikely(rb->overwrite_mode)) {
          over_pos = rb->overwrite_pos;
          while (new_prod_pos - over_pos > rb->mask) {
                  hdr = (void *)rb->data + (over_pos & rb->mask);
                  hdr_len = READ_ONCE(hdr->len);
                  /* since pending_pos is the first record with BUSY
                   * bit set and overwrite_pos is never bigger than
                   * pending_pos, no need to check BUSY bit here.
                   */
honestly, this comment just confused me by implying that BUSY bit might be important (and set) here. But in reality, we are just overwriting already committed data which can't have BUSY bit set. It would be more helpful to mention that bpf_ringbuf_has_space() check above made sure we are not going to step over record that is being actively worked on by some other producer.

Sorry for the confusion, and thanks for the clarification. I’ll update the comment to:

The bpf_ringbuf_has_space() check above ensures we won’t step over a record currently being worked on by another producer.

...

                  over_pos += ringbuf_round_up_hdr_len(hdr_len);

```
          }
```

          /* smp_store_release(&rb->producer_pos, new_prod_pos) at

           * the end of the function ensures that when consumer sees

           * the updated rb->producer_pos, it always sees the updated

           * rb->overwrite_pos, so when consumer reads overwrite_pos

           * after smp_load_acquire(r->producer_pos), the overwrite_pos

```
           * will always be valid.
```
```
           */
```

          WRITE_ONCE(rb->overwrite_pos, over_pos);

```
  }
```

   hdr = (void *)rb->data + (prod_pos & rb->mask);
   pg_off = bpf_ringbuf_rec_pg_off(rb, hdr);
   hdr->len = size | BPF_RINGBUF_BUSY_BIT;

@@ -479,7 +543,50 @@ const struct bpf_func_proto bpf_ringbuf_reserve_proto = { .arg3_type = ARG_ANYTHING, };

-static void bpf_ringbuf_commit(void *sample, u64 flags, bool discard) +static __always_inline +bool ringbuf_should_wakeup(const struct bpf_ringbuf *rb,

consistent naming: bpf_ringbuf_should_wakeup

ACK

...

                     unsigned long rec_pos,

                     unsigned long cons_pos,

                     u32 len, u64 flags)

```
  unsigned long rec_end;
```
```
  if (flags & BPF_RB_FORCE_WAKEUP)
```
```
          return true;
```
```
  if (flags & BPF_RB_NO_WAKEUP)
```
```
          return false;
```

  /* for non-overwrite mode, if consumer caught up and is waiting for

   * our record, notify about new data availability

```
   */
```
```
  if (likely(!rb->overwrite_mode))
```
```
          return cons_pos == rec_pos;
```

  /* for overwrite mode, to give the consumer a chance to catch up

   * before being overwritten, wake up consumer every half a round

```
   * ahead.
```
```
   */
```

  rec_end = rec_pos + ringbuf_round_up_hdr_len(len);

```
  cons_pos &= (rb->mask >> 1);
```
```
  rec_pos &= (rb->mask >> 1);
```
```
  rec_end &= (rb->mask >> 1);
```
```
  if (cons_pos == rec_pos)
```
```
          return true;
```

  if (rec_pos < cons_pos && cons_pos < rec_end)

```
          return true;
```

  if (rec_end < rec_pos && (cons_pos > rec_pos || cons_pos < rec_end))

```
          return true;
```

So there is more to this than a brief comment you left, can you please elaborate?

The half-round wakeup was originally intended to work with libbpf in the v1 version. In that version, libbpf used a retry loop to safely copy data from the ring buffer that hadn’t been overwritten. By waking the consumer once every half round, there was always a period where the consumer and producer did not overlap, which helped reduce the number of retries.

...

pw-bot: cr

...
  return false;
+}

+static __always_inline
we didn't have always_inline before, any strong reason to add it now?

I just wanted to avoid introducing any performance regression. Before this patch, bpf_ringbuf_commit() was automatically inlined by the compiler, but after the patch it wasn’t, so I added always_inline explicitly to keep it inlined.

...

...
+void bpf_ringbuf_commit(void *sample, u64 flags, bool discard) { unsigned long rec_pos, cons_pos; struct bpf_ringbuf_hdr *hdr; @@ -495,15 +602,10 @@ static void bpf_ringbuf_commit(void *sample, u64 flags, bool discard) /* update record header with correct final size prefix */ xchg(&hdr->len, new_len);
  /* if consumer caught up and is waiting for our record, notify about
   * new data availability
   */
   rec_pos = (void *)hdr - (void *)rb->data;
   cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;
  if (flags & BPF_RB_FORCE_WAKEUP)
          irq_work_queue(&rb->work);
  else if (cons_pos == rec_pos && !(flags & BPF_RB_NO_WAKEUP))
  if (ringbuf_should_wakeup(rb, rec_pos, cons_pos, new_len, flags))
           irq_work_queue(&rb->work);
}
@@ -576,6 +678,8 @@ BPF_CALL_2(bpf_ringbuf_query, struct bpf_map *, map, u64, flags) return smp_load_acquire(&rb->consumer_pos); case BPF_RB_PROD_POS: return smp_load_acquire(&rb->producer_pos);
  case BPF_RB_OVER_POS:
          return READ_ONCE(rb->overwrite_pos);
do the smp_load_acquire() here just like with all other positions?

...

...
     default:
             return 0;
     }
@@ -749,6 +853,9 @@ BPF_CALL_4(bpf_user_ringbuf_drain, struct bpf_map *, map,
     rb = container_of(map, struct bpf_ringbuf_map, map)->rb;
  if (unlikely(rb->overwrite_mode))
          return -EOPNOTSUPP;
why this check? We don't allow rb->overwrite_mode to be set for user ringbuf, no?

Yeah, ringbuf_map_alloc disallows creating user-space ring buffers in overwrite mode, so this check is not needed. I’ll drop it in the next version.

...

...
   /* If another consumer is already consuming a sample, wait for them to finish. */
   if (!atomic_try_cmpxchg(&rb->busy, &busy, 1))
           return -EBUSY;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 233de8677382..d3b2fd2ae527 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -1430,6 +1430,9 @@ enum {

/* Do not translate kernel bpf_arena pointers to user pointers */ BPF_F_NO_USER_CONV = (1U << 18),

+/* bpf ringbuf works in overwrite mode? */
  BPF_F_OVERWRITE         = (1U << 19),
};

/* Flags for BPF_PROG_QUERY. */
@@ -6215,6 +6218,7 @@ enum { BPF_RB_RING_SIZE = 1, BPF_RB_CONS_POS = 2, BPF_RB_PROD_POS = 3,
  BPF_RB_OVER_POS = 4,
};

/* BPF ring buffer constants */
-- 2.43.0

Andrii Nakryiko

6 Oct 6 Oct

10:10 p.m.

New subject: [PATCH bpf-next v2 1/3] bpf: Add overwrite mode for bpf ring buffer

On Mon, Sep 29, 2025 at 6:41 AM Xu Kuohai xukuohai@huaweicloud.com wrote:

...

On 9/20/2025 6:10 AM, Andrii Nakryiko wrote:

...
On Fri, Sep 5, 2025 at 8:13 AM Xu Kuohai xukuohai@huaweicloud.com wrote:

...
From: Xu Kuohai xukuohai@huawei.com

When the bpf ring buffer is full, new events can not be recorded util

typo: until

ACK

...
...
the consumer consumes some events to free space. This may cause critical events to be discarded, such as in fault diagnostic, where recent events are more critical than older ones.

So add ovewrite mode for bpf ring buffer. In this mode, the new event

overwrite, BPF

ACK

...
...
overwrites the oldest event when the buffer is full.

The scheme is as follows:

producer_pos tracks the next position to write new data. When there is enough free space, producer simply moves producer_pos forward to make space for the new event.

To avoid waiting for consumer to free space when the buffer is full, a new variable overwrite_pos is introduced for producer. overwrite_pos tracks the next event to be overwritten (the oldest event committed) in the buffer. producer moves it forward to discard the oldest events when the buffer is full.

pending_pos tracks the oldest event under committing. producer ensures

"under committing" is confusing. Oldest event to be committed?

Yes, 'the oldest event to be committed'. Thanks!

...
...
producers_pos never passes pending_pos when making space for new events.
So multiple producers never write to the same position at the same time.
producer wakes up consumer every half a round ahead to give it a chance to retrieve data. However, for an overwrite-mode ring buffer, users typically only cares about the ring buffer snapshot before a fault occurs. In this case, the producer should commit data with BPF_RB_NO_WAKEUP flag to avoid unnecessary wakeups.

To make it clear, here are some example diagrams.

Let's say we have a ring buffer with size 4096.

At first, {producer,overwrite,pending,consumer}_pos are all set to 0

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | | +-----------------------------------------------------------------------+ ^ | |

producer_pos = 0 overwrite_pos = 0 pending_pos = 0 consumer_pos = 0

Reserve event A, size 512.

There is enough free space, so A is allocated at offset 0 and producer_pos is moved to 512, the end of A. Since A is not submitted, the BUSY bit is set.

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | A | | | [BUSY] | | +-----------------------------------------------------------------------+ ^ ^ | | | | | producer_pos = 512 |

overwrite_pos = 0 pending_pos = 0 consumer_pos = 0

Reserve event B, size 1024.

B is allocated at offset 512 with BUSY bit set, and producer_pos is moved to the end of B.

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | A | B | | | [BUSY] | [BUSY] | | +-----------------------------------------------------------------------+ ^ ^ | | | | | producer_pos = 1536 |

overwrite_pos = 0 pending_pos = 0 consumer_pos = 0

Reserve event C, size 2048.

C is allocated at offset 1536 and producer_pos becomes 3584.

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | | A | B | C | | | [BUSY] | [BUSY] | [BUSY] | | +-----------------------------------------------------------------------+ ^ ^ | | | | | producer_pos = 3584 |

overwrite_pos = 0 pending_pos = 0 consumer_pos = 0

Submit event A.

The BUSY bit of A is cleared. B becomes the oldest event under writing, so
Now it's "under writing" :) To be committed? Or "pending committing" or just "pending", I guess. But not under anything, it just confuses readers. IMO.
Once again, 'oldest event to be committed'.

I should check it with an AI agent first.

...
...
 pending_pos is moved to 512, the start of B.

 0       512      1024    1536     2048     2560     3072     3584       4096
 +-----------------------------------------------------------------------+
 |        |                 |                                   |        |
 |    A   |        B        |                 C                 |        |
 |        |      [BUSY]     |               [BUSY]              |        |
 +-----------------------------------------------------------------------+
 ^        ^                                                     ^
 |        |                                                     |
 |        |                                                     |
 |   pending_pos = 512                                  producer_pos = 3584
 |
overwrite_pos = 0 consumer_pos = 0

Submit event B.

The BUSY bit of B is cleared, and pending_pos is moved to the start of C, which is the oldest event under writing now.
ditto
Again and again :(

...
...
 0       512      1024    1536     2048     2560     3072     3584       4096
 +-----------------------------------------------------------------------+
 |        |                 |                                   |        |
 |    A   |        B        |                 C                 |        |
 |        |                 |               [BUSY]              |        |
 +-----------------------------------------------------------------------+
 ^                          ^                                   ^
 |                          |                                   |
 |                          |                                   |
 |                     pending_pos = 1536               producer_pos = 3584
 |
overwrite_pos = 0 consumer_pos = 0

Reserve event D, size 1536 (3 * 512).

There are 2048 bytes not under writing between producer_pos and pending_pos, so D is allocated at offset 3584, and producer_pos is moved from 3584 to 5120.

Since event D will overwrite all bytes of event A and the begining 512 bytes
typo: beginning, but really "first 512 bytes" would be clearer
OK, I’ll switch to 'first 512 bytes' for clarity.

...
...
 of event B, overwrite_pos is moved to the start of event C, the oldest event
 that is not overwritten.

 0       512      1024    1536     2048     2560     3072     3584       4096
 +-----------------------------------------------------------------------+
 |                 |        |                                   |        |
 |      D End      |        |                 C                 | D Begin|
 |      [BUSY]     |        |               [BUSY]              | [BUSY] |
 +-----------------------------------------------------------------------+
 ^                 ^        ^
 |                 |        |
 |                 |   pending_pos = 1536
 |                 |   overwrite_pos = 1536
 |                 |
 |             producer_pos=5120
 |
consumer_pos = 0

Reserve event E, size 1024.

Though there are 512 bytes not under writing between producer_pos and pending_pos, E can not be reserved, as it would overwrite the first 512 bytes of event C, which is still under writing.

Submit event C and D.

pending_pos is moved to the end of D.

0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | | D End | | C | D Begin| | | | | | +-----------------------------------------------------------------------+ ^ ^ ^ | | | | | overwrite_pos = 1536 | | | producer_pos=5120 | pending_pos=5120 |

consumer_pos = 0

The performance data for overwrite mode will be provided in a follow-up patch that adds overwrite mode benchs.

A sample of performance data for non-overwrite mode on an x86_64 and arm64 CPU, before and after this patch, is shown below. As we can see, no obvious performance regression occurs.

[...]

...

...
...
Signed-off-by: Xu Kuohai xukuohai@huawei.com

include/uapi/linux/bpf.h | 4 + kernel/bpf/ringbuf.c | 159 +++++++++++++++++++++++++++------ tools/include/uapi/linux/bpf.h | 4 + 3 files changed, 141 insertions(+), 26 deletions(-)

[...]

...

...
...
  over_pos = READ_ONCE(rb->overwrite_pos);
  return min(prod_pos - max(cons_pos, over_pos), rb->mask + 1);
I'm trying to understand why you need to min with `rb->mask + 1`, can you please elaborate?
We need the min because rb->producer_pos and rb->overwrite_pos are read at different times. During this gap, a fast producer may wrap once or more, making over_pos larger than prod_pos.

what if you read overwrite_pos before reading producer_pos? Then it can't be larger than producer_pos and available data would be producer_pos - max(consumer_pos, overwrite_pos)? would that work?

...

...
And also, at least for consistency, use smp_load_acquire() for overwrite_pos?

Using READ_ONCE here is to stay symmetric with __bpf_ringbuf_reserve(), where overwrite_pos is WRITE_ONCE first, followed by smp_store_release(producer_pos). So here we do smp_load_acquire(producer_pos) first, then READ_ONCE(overwrite_pos) to ensure a consistent view of the ring buffer.

For consistency when reading consumer_pos and producer_pos, I’m fine with switching READ_ONCE to smp_load_acquire for overwrite_pos.

I'm not sure it matters much, but this function is called outside of rb->spinlock, while __bpf_ringbuf_reserve() does hold a lock while doing that WRITE_ONCE(). So it might not make any difference, but I have mild preference for smp_load_acquire() here.

...

...
...
}

static u32 ringbuf_total_data_sz(const struct bpf_ringbuf *rb) @@ -402,11 +419,43 @@ bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr) return (void*)((addr & PAGE_MASK) - off); }

[...]

...

...
...
  /* In overwrite mode, move overwrite_pos to the next record to be
   * overwritten if the ring buffer is full
   */
hm... here I think the important point is that we search for the next record boundary until which we need to overwrite data such that it fits newly reserved record. "next record to be overwritten" isn't that important (we might never need to overwrite it). Important are those aspects of a) staying on record boundary and b) consuming enough records to reserve the new one.

Can you please update the comment to mention the above points?
Sure, I'll update the comment to:

In overwrite mode, advance overwrite_pos when the ring buffer is full. The key points are to stay on record boundaries and consume enough records to fit the new one.

[...]

...

...
...
                     unsigned long rec_pos,
                     unsigned long cons_pos,
                     u32 len, u64 flags)
+{
  unsigned long rec_end;
  if (flags & BPF_RB_FORCE_WAKEUP)
          return true;
  if (flags & BPF_RB_NO_WAKEUP)
          return false;
  /* for non-overwrite mode, if consumer caught up and is waiting for
   * our record, notify about new data availability
   */
  if (likely(!rb->overwrite_mode))
          return cons_pos == rec_pos;
  /* for overwrite mode, to give the consumer a chance to catch up
   * before being overwritten, wake up consumer every half a round
   * ahead.
   */
  rec_end = rec_pos + ringbuf_round_up_hdr_len(len);
  cons_pos &= (rb->mask >> 1);
  rec_pos &= (rb->mask >> 1);
  rec_end &= (rb->mask >> 1);
  if (cons_pos == rec_pos)
          return true;
  if (rec_pos < cons_pos && cons_pos < rec_end)
          return true;
  if (rec_end < rec_pos && (cons_pos > rec_pos || cons_pos < rec_end))
          return true;
hm... ok, let's discuss this. Why do we need to do some half-round heuristic for overwrite mode? If a consumer is falling behind it should be actively trying to catch up and they don't need notification (that's the non-overwrite mode logic already).

So there is more to this than a brief comment you left, can you please elaborate?
The half-round wakeup was originally intended to work with libbpf in the v1 version. In that version, libbpf used a retry loop to safely copy data from the ring buffer that hadn’t been overwritten. By waking the consumer once every half round, there was always a period where the consumer and producer did not overlap, which helped reduce the number of retries.

I can't say I completely grok the logic here, but do you think we should still keep this half-round wakeup? It looks like an arbitrary heuristic, so I'd rather not have it.

...

...
pw-bot: cr

...
  return false;
+}

+static __always_inline
we didn't have always_inline before, any strong reason to add it now?
I just wanted to avoid introducing any performance regression. Before this patch, bpf_ringbuf_commit() was automatically inlined by the compiler, but after the patch it wasn’t, so I added always_inline explicitly to keep it inlined.

how big of a difference was it in benchmarks? It's generally frowned upon using __always_inline without a good reason.

[...]

Xu Kuohai

10 Oct 10 Oct

7:07 a.m.

New subject: [PATCH bpf-next v2 1/3] bpf: Add overwrite mode for bpf ring buffer

On 10/7/2025 6:10 AM, Andrii Nakryiko wrote:

[...]

...

...
...
...
  over_pos = READ_ONCE(rb->overwrite_pos);
  return min(prod_pos - max(cons_pos, over_pos), rb->mask + 1);
I'm trying to understand why you need to min with `rb->mask + 1`, can you please elaborate?
We need the min because rb->producer_pos and rb->overwrite_pos are read at different times. During this gap, a fast producer may wrap once or more, making over_pos larger than prod_pos.
what if you read overwrite_pos before reading producer_pos? Then it can't be larger than producer_pos and available data would be producer_pos - max(consumer_pos, overwrite_pos)? would that work?

No, it won’t work. Between reading overwrite_pos and producer_pos, producer on a different CPU may have already moved producer_pos forward by more than one ring buffer size, causing prod_pos - max(cons_pos, over_pos) to exceed the ring buffer size.

...

...
...
And also, at least for consistency, use smp_load_acquire() for overwrite_pos?

Using READ_ONCE here is to stay symmetric with __bpf_ringbuf_reserve(), where overwrite_pos is WRITE_ONCE first, followed by smp_store_release(producer_pos). So here we do smp_load_acquire(producer_pos) first, then READ_ONCE(overwrite_pos) to ensure a consistent view of the ring buffer.

For consistency when reading consumer_pos and producer_pos, I’m fine with switching READ_ONCE to smp_load_acquire for overwrite_pos.

I'm not sure it matters much, but this function is called outside of rb->spinlock, while __bpf_ringbuf_reserve() does hold a lock while doing that WRITE_ONCE(). So it might not make any difference, but I have mild preference for smp_load_acquire() here.

OK, I'll switch to smp_load_acquire.

...

...
...
...
}

static u32 ringbuf_total_data_sz(const struct bpf_ringbuf *rb) @@ -402,11 +419,43 @@ bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr) return (void*)((addr & PAGE_MASK) - off); }

[...]

...
...
...
  /* In overwrite mode, move overwrite_pos to the next record to be
   * overwritten if the ring buffer is full
   */
hm... here I think the important point is that we search for the next record boundary until which we need to overwrite data such that it fits newly reserved record. "next record to be overwritten" isn't that important (we might never need to overwrite it). Important are those aspects of a) staying on record boundary and b) consuming enough records to reserve the new one.

Can you please update the comment to mention the above points?
Sure, I'll update the comment to:

In overwrite mode, advance overwrite_pos when the ring buffer is full. The key points are to stay on record boundaries and consume enough records to fit the new one.
ok

[...]

...
...
...
                     unsigned long rec_pos,
                     unsigned long cons_pos,
                     u32 len, u64 flags)
+{
  unsigned long rec_end;
  if (flags & BPF_RB_FORCE_WAKEUP)
          return true;
  if (flags & BPF_RB_NO_WAKEUP)
          return false;
  /* for non-overwrite mode, if consumer caught up and is waiting for
   * our record, notify about new data availability
   */
  if (likely(!rb->overwrite_mode))
          return cons_pos == rec_pos;
  /* for overwrite mode, to give the consumer a chance to catch up
   * before being overwritten, wake up consumer every half a round
   * ahead.
   */
  rec_end = rec_pos + ringbuf_round_up_hdr_len(len);
  cons_pos &= (rb->mask >> 1);
  rec_pos &= (rb->mask >> 1);
  rec_end &= (rb->mask >> 1);
  if (cons_pos == rec_pos)
          return true;
  if (rec_pos < cons_pos && cons_pos < rec_end)
          return true;
  if (rec_end < rec_pos && (cons_pos > rec_pos || cons_pos < rec_end))
          return true;
hm... ok, let's discuss this. Why do we need to do some half-round heuristic for overwrite mode? If a consumer is falling behind it should be actively trying to catch up and they don't need notification (that's the non-overwrite mode logic already).

So there is more to this than a brief comment you left, can you please elaborate?
The half-round wakeup was originally intended to work with libbpf in the v1 version. In that version, libbpf used a retry loop to safely copy data from the ring buffer that hadn’t been overwritten. By waking the consumer once every half round, there was always a period where the consumer and producer did not overlap, which helped reduce the number of retries.
I can't say I completely grok the logic here, but do you think we should still keep this half-round wakeup? It looks like an arbitrary heuristic, so I'd rather not have it.

Sure, since the related libbpf code is no longer present, I’ll remove this logic in the next version.

...

...
...
pw-bot: cr

...
  return false;
+}

+static __always_inline
we didn't have always_inline before, any strong reason to add it now?
I just wanted to avoid introducing any performance regression. Before this patch, bpf_ringbuf_commit() was automatically inlined by the compiler, but after the patch it wasn’t, so I added always_inline explicitly to keep it inlined.
how big of a difference was it in benchmarks? It's generally frowned upon using __always_inline without a good reason.

The difference is not noticeable on my arm64 test machine, but it is on my amd machine.

Below is the benchmark data on AMD EPYC 9654, with and without always_inline attribute.

- With always_inline

Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 13.070 ± 0.158M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 15.440 ± 0.017M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 7.860 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 6.444 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 3.788 ± 0.005M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 2.802 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 2.560 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 2.227 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 2.141 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 1.960 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 1.913 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 1.854 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 1.818 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 1.779 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 1.758 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 1.812 ± 0.003M/s (drops 0.000 ± 0.000M/s)

- Without always_inline

Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 10.550 ± 0.032M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 14.661 ± 0.024M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 7.616 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 6.476 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 3.806 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 2.814 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 2.608 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 2.337 ± 0.005M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 2.270 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 1.977 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 1.921 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 1.862 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 1.827 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 1.912 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 1.860 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 1.824 ± 0.001M/s (drops 0.000 ± 0.000M/s)

When nr_prod=1, the performance regression is significant, dropping from 13.070 ± 0.158 M/s with always_inline to 10.550 ± 0.032 M/s without it.

However, since the half-round wakeup logic will be removed in the next version, the changes to bpf_ringbuf_commit, including always_inline, will also be removed.

...

[...]

Andrii Nakryiko

13 Oct 13 Oct

11:22 p.m.

New subject: [PATCH bpf-next v2 1/3] bpf: Add overwrite mode for bpf ring buffer

On Fri, Oct 10, 2025 at 12:07 AM Xu Kuohai xukuohai@huaweicloud.com wrote:

...

On 10/7/2025 6:10 AM, Andrii Nakryiko wrote:

[...]

...
...
...
...
  over_pos = READ_ONCE(rb->overwrite_pos);
  return min(prod_pos - max(cons_pos, over_pos), rb->mask + 1);
I'm trying to understand why you need to min with `rb->mask + 1`, can you please elaborate?
We need the min because rb->producer_pos and rb->overwrite_pos are read at different times. During this gap, a fast producer may wrap once or more, making over_pos larger than prod_pos.
what if you read overwrite_pos before reading producer_pos? Then it can't be larger than producer_pos and available data would be producer_pos - max(consumer_pos, overwrite_pos)? would that work?
No, it won’t work. Between reading overwrite_pos and producer_pos, producer on a different CPU may have already moved producer_pos forward by more than one ring buffer size, causing prod_pos - max(cons_pos, over_pos) to exceed the ring buffer size.

True, but that was the case with this function before as well. ringbuf_avail_data_sz() is giving an estimate, we just need to make sure to not return a negative value. We didn't artificially cap the return to ring buf size before, why starting now? All of this calculation is done outside of the lock anyways, so it can never be relied upon for exactness.

...

...
...
...
And also, at least for consistency, use smp_load_acquire() for overwrite_pos?

Using READ_ONCE here is to stay symmetric with __bpf_ringbuf_reserve(), where overwrite_pos is WRITE_ONCE first, followed by smp_store_release(producer_pos). So here we do smp_load_acquire(producer_pos) first, then READ_ONCE(overwrite_pos) to ensure a consistent view of the ring buffer.

For consistency when reading consumer_pos and producer_pos, I’m fine with switching READ_ONCE to smp_load_acquire for overwrite_pos.

I'm not sure it matters much, but this function is called outside of rb->spinlock, while __bpf_ringbuf_reserve() does hold a lock while doing that WRITE_ONCE(). So it might not make any difference, but I have mild preference for smp_load_acquire() here.

OK, I'll switch to smp_load_acquire.

...
...
...
...
}

static u32 ringbuf_total_data_sz(const struct bpf_ringbuf *rb) @@ -402,11 +419,43 @@ bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr) return (void*)((addr & PAGE_MASK) - off); }

[...]

...
...
...
  /* In overwrite mode, move overwrite_pos to the next record to be
   * overwritten if the ring buffer is full
   */
hm... here I think the important point is that we search for the next record boundary until which we need to overwrite data such that it fits newly reserved record. "next record to be overwritten" isn't that important (we might never need to overwrite it). Important are those aspects of a) staying on record boundary and b) consuming enough records to reserve the new one.

Can you please update the comment to mention the above points?
Sure, I'll update the comment to:

In overwrite mode, advance overwrite_pos when the ring buffer is full. The key points are to stay on record boundaries and consume enough records to fit the new one.
ok

[...]

...
...
...
                     unsigned long rec_pos,
                     unsigned long cons_pos,
                     u32 len, u64 flags)
+{
  unsigned long rec_end;
  if (flags & BPF_RB_FORCE_WAKEUP)
          return true;
  if (flags & BPF_RB_NO_WAKEUP)
          return false;
  /* for non-overwrite mode, if consumer caught up and is waiting for
   * our record, notify about new data availability
   */
  if (likely(!rb->overwrite_mode))
          return cons_pos == rec_pos;
  /* for overwrite mode, to give the consumer a chance to catch up
   * before being overwritten, wake up consumer every half a round
   * ahead.
   */
  rec_end = rec_pos + ringbuf_round_up_hdr_len(len);
  cons_pos &= (rb->mask >> 1);
  rec_pos &= (rb->mask >> 1);
  rec_end &= (rb->mask >> 1);
  if (cons_pos == rec_pos)
          return true;
  if (rec_pos < cons_pos && cons_pos < rec_end)
          return true;
  if (rec_end < rec_pos && (cons_pos > rec_pos || cons_pos < rec_end))
          return true;
hm... ok, let's discuss this. Why do we need to do some half-round heuristic for overwrite mode? If a consumer is falling behind it should be actively trying to catch up and they don't need notification (that's the non-overwrite mode logic already).

So there is more to this than a brief comment you left, can you please elaborate?
The half-round wakeup was originally intended to work with libbpf in the v1 version. In that version, libbpf used a retry loop to safely copy data from the ring buffer that hadn’t been overwritten. By waking the consumer once every half round, there was always a period where the consumer and producer did not overlap, which helped reduce the number of retries.
I can't say I completely grok the logic here, but do you think we should still keep this half-round wakeup? It looks like an arbitrary heuristic, so I'd rather not have it.
Sure, since the related libbpf code is no longer present, I’ll remove this logic in the next version.

...
...
...
pw-bot: cr

...
  return false;
+}

+static __always_inline
we didn't have always_inline before, any strong reason to add it now?
I just wanted to avoid introducing any performance regression. Before this patch, bpf_ringbuf_commit() was automatically inlined by the compiler, but after the patch it wasn’t, so I added always_inline explicitly to keep it inlined.
how big of a difference was it in benchmarks? It's generally frowned upon using __always_inline without a good reason.
The difference is not noticeable on my arm64 test machine, but it is on my amd machine.

Below is the benchmark data on AMD EPYC 9654, with and without always_inline attribute.

With always_inline

Ringbuf, multi-producer contention

rb-libbpf nr_prod 1 13.070 ± 0.158M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 15.440 ± 0.017M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 7.860 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 6.444 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 3.788 ± 0.005M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 2.802 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 2.560 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 2.227 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 2.141 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 1.960 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 1.913 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 1.854 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 1.818 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 1.779 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 1.758 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 1.812 ± 0.003M/s (drops 0.000 ± 0.000M/s)

Without always_inline

Ringbuf, multi-producer contention

rb-libbpf nr_prod 1 10.550 ± 0.032M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 14.661 ± 0.024M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 7.616 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 6.476 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 3.806 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 2.814 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 2.608 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 2.337 ± 0.005M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 2.270 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 1.977 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 1.921 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 1.862 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 1.827 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 1.912 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 1.860 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 1.824 ± 0.001M/s (drops 0.000 ± 0.000M/s)

When nr_prod=1, the performance regression is significant, dropping from 13.070 ± 0.158 M/s with always_inline to 10.550 ± 0.032 M/s without it.

However, since the half-round wakeup logic will be removed in the next version, the changes to bpf_ringbuf_commit, including always_inline, will also be removed.

Ok, thanks for the data! Please send an updated version of the code, and let's have another round of review, thanks!

...

...
[...]

Xu Kuohai

14 Oct 14 Oct

1:35 p.m.

New subject: [PATCH bpf-next v2 1/3] bpf: Add overwrite mode for bpf ring buffer

On 10/14/2025 7:22 AM, Andrii Nakryiko wrote:

...

On Fri, Oct 10, 2025 at 12:07 AM Xu Kuohai xukuohai@huaweicloud.com wrote:

...
On 10/7/2025 6:10 AM, Andrii Nakryiko wrote:

[...]

...
...
...
...
  over_pos = READ_ONCE(rb->overwrite_pos);
  return min(prod_pos - max(cons_pos, over_pos), rb->mask + 1);
I'm trying to understand why you need to min with `rb->mask + 1`, can you please elaborate?
We need the min because rb->producer_pos and rb->overwrite_pos are read at different times. During this gap, a fast producer may wrap once or more, making over_pos larger than prod_pos.
what if you read overwrite_pos before reading producer_pos? Then it can't be larger than producer_pos and available data would be producer_pos - max(consumer_pos, overwrite_pos)? would that work?
No, it won’t work. Between reading overwrite_pos and producer_pos, producer on a different CPU may have already moved producer_pos forward by more than one ring buffer size, causing prod_pos - max(cons_pos, over_pos) to exceed the ring buffer size.
True, but that was the case with this function before as well. ringbuf_avail_data_sz() is giving an estimate, we just need to make sure to not return a negative value. We didn't artificially cap the return to ring buf size before, why starting now? All of this calculation is done outside of the lock anyways, so it can never be relied upon for exactness.

Makes sense, will switch to producer_pos - max(consumer_pos, overwrite_pos).

...

...
...
...
...
And also, at least for consistency, use smp_load_acquire() for overwrite_pos?

Using READ_ONCE here is to stay symmetric with __bpf_ringbuf_reserve(), where overwrite_pos is WRITE_ONCE first, followed by smp_store_release(producer_pos). So here we do smp_load_acquire(producer_pos) first, then READ_ONCE(overwrite_pos) to ensure a consistent view of the ring buffer.

For consistency when reading consumer_pos and producer_pos, I’m fine with switching READ_ONCE to smp_load_acquire for overwrite_pos.

I'm not sure it matters much, but this function is called outside of rb->spinlock, while __bpf_ringbuf_reserve() does hold a lock while doing that WRITE_ONCE(). So it might not make any difference, but I have mild preference for smp_load_acquire() here.

OK, I'll switch to smp_load_acquire.

...
...
...
...
}

static u32 ringbuf_total_data_sz(const struct bpf_ringbuf *rb)
@@ -402,11 +419,43 @@ bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr) return (void*)((addr & PAGE_MASK) - off); }
[...]

...
...
...
  /* In overwrite mode, move overwrite_pos to the next record to be
   * overwritten if the ring buffer is full
   */
hm... here I think the important point is that we search for the next record boundary until which we need to overwrite data such that it fits newly reserved record. "next record to be overwritten" isn't that important (we might never need to overwrite it). Important are those aspects of a) staying on record boundary and b) consuming enough records to reserve the new one.

Can you please update the comment to mention the above points?
Sure, I'll update the comment to:

In overwrite mode, advance overwrite_pos when the ring buffer is full. The key points are to stay on record boundaries and consume enough records to fit the new one.
ok

[...]

...
...
...
                     unsigned long rec_pos,
                     unsigned long cons_pos,
                     u32 len, u64 flags)
+{
  unsigned long rec_end;
  if (flags & BPF_RB_FORCE_WAKEUP)
          return true;
  if (flags & BPF_RB_NO_WAKEUP)
          return false;
  /* for non-overwrite mode, if consumer caught up and is waiting for
   * our record, notify about new data availability
   */
  if (likely(!rb->overwrite_mode))
          return cons_pos == rec_pos;
  /* for overwrite mode, to give the consumer a chance to catch up
   * before being overwritten, wake up consumer every half a round
   * ahead.
   */
  rec_end = rec_pos + ringbuf_round_up_hdr_len(len);
  cons_pos &= (rb->mask >> 1);
  rec_pos &= (rb->mask >> 1);
  rec_end &= (rb->mask >> 1);
  if (cons_pos == rec_pos)
          return true;
  if (rec_pos < cons_pos && cons_pos < rec_end)
          return true;
  if (rec_end < rec_pos && (cons_pos > rec_pos || cons_pos < rec_end))
          return true;
hm... ok, let's discuss this. Why do we need to do some half-round heuristic for overwrite mode? If a consumer is falling behind it should be actively trying to catch up and they don't need notification (that's the non-overwrite mode logic already).

So there is more to this than a brief comment you left, can you please elaborate?
The half-round wakeup was originally intended to work with libbpf in the v1 version. In that version, libbpf used a retry loop to safely copy data from the ring buffer that hadn’t been overwritten. By waking the consumer once every half round, there was always a period where the consumer and producer did not overlap, which helped reduce the number of retries.
I can't say I completely grok the logic here, but do you think we should still keep this half-round wakeup? It looks like an arbitrary heuristic, so I'd rather not have it.
Sure, since the related libbpf code is no longer present, I’ll remove this logic in the next version.

...
...
...
pw-bot: cr

...
  return false;
+}

+static __always_inline
we didn't have always_inline before, any strong reason to add it now?
I just wanted to avoid introducing any performance regression. Before this patch, bpf_ringbuf_commit() was automatically inlined by the compiler, but after the patch it wasn’t, so I added always_inline explicitly to keep it inlined.
how big of a difference was it in benchmarks? It's generally frowned upon using __always_inline without a good reason.
The difference is not noticeable on my arm64 test machine, but it is on my amd machine.

Below is the benchmark data on AMD EPYC 9654, with and without always_inline attribute.

With always_inline

Ringbuf, multi-producer contention

rb-libbpf nr_prod 1 13.070 ± 0.158M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 15.440 ± 0.017M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 7.860 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 6.444 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 3.788 ± 0.005M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 2.802 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 2.560 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 2.227 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 2.141 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 1.960 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 1.913 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 1.854 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 1.818 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 1.779 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 1.758 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 1.812 ± 0.003M/s (drops 0.000 ± 0.000M/s)

Without always_inline

Ringbuf, multi-producer contention

rb-libbpf nr_prod 1 10.550 ± 0.032M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 14.661 ± 0.024M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 7.616 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 6.476 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 3.806 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 2.814 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 2.608 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 2.337 ± 0.005M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 2.270 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 1.977 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 1.921 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 1.862 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 1.827 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 1.912 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 1.860 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 1.824 ± 0.001M/s (drops 0.000 ± 0.000M/s)

When nr_prod=1, the performance regression is significant, dropping from 13.070 ± 0.158 M/s with always_inline to 10.550 ± 0.032 M/s without it.

However, since the half-round wakeup logic will be removed in the next version, the changes to bpf_ringbuf_commit, including always_inline, will also be removed.
Ok, thanks for the data! Please send an updated version of the code, and let's have another round of review, thanks!

Sure, I’ll send the next version later this week.

...

...
...
[...]

Xu Kuohai

5 Sep 5 Sep

3:06 p.m.

New subject: [PATCH bpf-next v2 2/3] selftests/bpf: Add test for overwrite ring buffer

From: Xu Kuohai xukuohai@huawei.com

Add test for overwiret mode ring buffer. The test creates a bpf ring buffer in overwrite mode, then repeatlly reserves and commits data to check if the ring buffer works as expected both before and after overwrite happens.

Signed-off-by: Xu Kuohai xukuohai@huawei.com --- tools/testing/selftests/bpf/Makefile | 3 +- .../selftests/bpf/prog_tests/ringbuf.c | 74 ++++++++++++++ .../bpf/progs/test_ringbuf_overwrite.c | 98 +++++++++++++++++++ 3 files changed, 174 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/bpf/progs/test_ringbuf_overwrite.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index 11d2a368db3e..e6c18e201555 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -499,7 +499,8 @@ LINKED_SKELS := test_static_linked.skel.h linked_funcs.skel.h \ LSKELS := fentry_test.c fexit_test.c fexit_sleep.c atomics.c \ trace_printk.c trace_vprintk.c map_ptr_kern.c \ core_kern.c core_kern_overflow.c test_ringbuf.c \ - test_ringbuf_n.c test_ringbuf_map_key.c test_ringbuf_write.c + test_ringbuf_n.c test_ringbuf_map_key.c test_ringbuf_write.c \ + test_ringbuf_overwrite.c

# Generate both light skeleton and libbpf skeleton for these LSKELS_EXTRA := test_ksyms_module.c test_ksyms_weak.c kfunc_call_test.c \ diff --git a/tools/testing/selftests/bpf/prog_tests/ringbuf.c b/tools/testing/selftests/bpf/prog_tests/ringbuf.c index d1e4cb28a72c..205a51c725a7 100644 --- a/tools/testing/selftests/bpf/prog_tests/ringbuf.c +++ b/tools/testing/selftests/bpf/prog_tests/ringbuf.c @@ -17,6 +17,7 @@ #include "test_ringbuf_n.lskel.h" #include "test_ringbuf_map_key.lskel.h" #include "test_ringbuf_write.lskel.h" +#include "test_ringbuf_overwrite.lskel.h"

#define EDONE 7777

@@ -497,6 +498,77 @@ static void ringbuf_map_key_subtest(void) test_ringbuf_map_key_lskel__destroy(skel_map_key); }

+static void ringbuf_overwrite_mode_subtest(void) +{ + unsigned long size, len1, len2, len3, len4, len5; + unsigned long expect_avail_data, expect_prod_pos, expect_over_pos; + struct test_ringbuf_overwrite_lskel *skel; + int err; + + skel = test_ringbuf_overwrite_lskel__open(); + if (!ASSERT_OK_PTR(skel, "skel_open")) + return; + + size = 0x1000; + len1 = 0x800; + len2 = 0x400; + len3 = size - len1 - len2 - BPF_RINGBUF_HDR_SZ * 3; /* 0x3e8 */ + len4 = len3 - 8; /* 0x3e0 */ + len5 = len3; /* retry with len3 */ + + skel->maps.ringbuf.max_entries = size; + skel->rodata->LEN1 = len1; + skel->rodata->LEN2 = len2; + skel->rodata->LEN3 = len3; + skel->rodata->LEN4 = len4; + skel->rodata->LEN5 = len5; + + skel->bss->pid = getpid(); + + err = test_ringbuf_overwrite_lskel__load(skel); + if (!ASSERT_OK(err, "skel_load")) + goto cleanup; + + err = test_ringbuf_overwrite_lskel__attach(skel); + if (!ASSERT_OK(err, "skel_attach")) + goto cleanup; + + syscall(__NR_getpgid); + + ASSERT_EQ(skel->bss->reserve1_fail, 0, "reserve 1"); + ASSERT_EQ(skel->bss->reserve2_fail, 0, "reserve 2"); + ASSERT_EQ(skel->bss->reserve3_fail, 1, "reserve 3"); + ASSERT_EQ(skel->bss->reserve4_fail, 0, "reserve 4"); + ASSERT_EQ(skel->bss->reserve5_fail, 0, "reserve 5"); + + CHECK(skel->bss->ring_size != size, + "check_ring_size", "exp %lu, got %lu\n", + size, skel->bss->ring_size); + + expect_avail_data = len2 + len4 + len5 + 3 * BPF_RINGBUF_HDR_SZ; + CHECK(skel->bss->avail_data != expect_avail_data, + "check_avail_size", "exp %lu, got %lu\n", + expect_avail_data, skel->bss->avail_data); + + CHECK(skel->bss->cons_pos != 0, + "check_cons_pos", "exp 0, got %lu\n", + skel->bss->cons_pos); + + expect_prod_pos = len1 + len2 + len4 + len5 + 4 * BPF_RINGBUF_HDR_SZ; + CHECK(skel->bss->prod_pos != expect_prod_pos, + "check_prod_pos", "exp %lu, got %lu\n", + expect_prod_pos, skel->bss->prod_pos); + + expect_over_pos = len1 + BPF_RINGBUF_HDR_SZ; + CHECK(skel->bss->over_pos != expect_over_pos, + "check_over_pos", "exp %lu, got %lu\n", + (unsigned long)expect_over_pos, skel->bss->over_pos); + + test_ringbuf_overwrite_lskel__detach(skel); +cleanup: + test_ringbuf_overwrite_lskel__destroy(skel); +} + void test_ringbuf(void) { if (test__start_subtest("ringbuf")) @@ -507,4 +579,6 @@ void test_ringbuf(void) ringbuf_map_key_subtest(); if (test__start_subtest("ringbuf_write")) ringbuf_write_subtest(); + if (test__start_subtest("ringbuf_overwrite_mode")) + ringbuf_overwrite_mode_subtest(); } diff --git a/tools/testing/selftests/bpf/progs/test_ringbuf_overwrite.c b/tools/testing/selftests/bpf/progs/test_ringbuf_overwrite.c new file mode 100644 index 000000000000..da89ba12a75c --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_ringbuf_overwrite.c @@ -0,0 +1,98 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (C) 2025. Huawei Technologies Co., Ltd */ + +#include <linux/bpf.h> +#include <bpf/bpf_helpers.h> +#include "bpf_misc.h" + +char _license[] SEC("license") = "GPL"; + +struct { + __uint(type, BPF_MAP_TYPE_RINGBUF); + __uint(map_flags, BPF_F_OVERWRITE); +} ringbuf SEC(".maps"); + +int pid; + +const volatile unsigned long LEN1; +const volatile unsigned long LEN2; +const volatile unsigned long LEN3; +const volatile unsigned long LEN4; +const volatile unsigned long LEN5; + +long reserve1_fail = 0; +long reserve2_fail = 0; +long reserve3_fail = 0; +long reserve4_fail = 0; +long reserve5_fail = 0; + +unsigned long avail_data = 0; +unsigned long ring_size = 0; +unsigned long cons_pos = 0; +unsigned long prod_pos = 0; +unsigned long over_pos = 0; + +SEC("fentry/" SYS_PREFIX "sys_getpgid") +int test_overwrite_ringbuf(void *ctx) +{ + char *rec1, *rec2, *rec3, *rec4, *rec5; + int cur_pid = bpf_get_current_pid_tgid() >> 32; + + if (cur_pid != pid) + return 0; + + rec1 = bpf_ringbuf_reserve(&ringbuf, LEN1, 0); + if (!rec1) { + reserve1_fail = 1; + return 0; + } + + rec2 = bpf_ringbuf_reserve(&ringbuf, LEN2, 0); + if (!rec2) { + bpf_ringbuf_discard(rec1, 0); + reserve2_fail = 1; + return 0; + } + + rec3 = bpf_ringbuf_reserve(&ringbuf, LEN3, 0); + /* expect failure */ + if (!rec3) { + reserve3_fail = 1; + } else { + bpf_ringbuf_discard(rec1, 0); + bpf_ringbuf_discard(rec2, 0); + bpf_ringbuf_discard(rec3, 0); + return 0; + } + + rec4 = bpf_ringbuf_reserve(&ringbuf, LEN4, 0); + if (!rec4) { + reserve4_fail = 1; + bpf_ringbuf_discard(rec1, 0); + bpf_ringbuf_discard(rec2, 0); + return 0; + } + + bpf_ringbuf_submit(rec1, 0); + bpf_ringbuf_submit(rec2, 0); + bpf_ringbuf_submit(rec4, 0); + + rec5 = bpf_ringbuf_reserve(&ringbuf, LEN5, 0); + if (!rec5) { + reserve5_fail = 1; + return 0; + } + + for (int i = 0; i < LEN3; i++) + rec5[i] = 0xdd; + + bpf_ringbuf_submit(rec5, 0); + + ring_size = bpf_ringbuf_query(&ringbuf, BPF_RB_RING_SIZE); + avail_data = bpf_ringbuf_query(&ringbuf, BPF_RB_AVAIL_DATA); + cons_pos = bpf_ringbuf_query(&ringbuf, BPF_RB_CONS_POS); + prod_pos = bpf_ringbuf_query(&ringbuf, BPF_RB_PROD_POS); + over_pos = bpf_ringbuf_query(&ringbuf, BPF_RB_OVER_POS); + + return 0; +}

-- 2.43.0

Andrii Nakryiko

19 Sep 19 Sep

10:10 p.m.

New subject: [PATCH bpf-next v2 2/3] selftests/bpf: Add test for overwrite ring buffer

On Fri, Sep 5, 2025 at 8:13 AM Xu Kuohai xukuohai@huaweicloud.com wrote:

...

From: Xu Kuohai xukuohai@huawei.com

Add test for overwiret mode ring buffer. The test creates a bpf ring buffer in overwrite mode, then repeatlly reserves and commits data to check if the ring buffer works as expected both before and after overwrite happens.

Signed-off-by: Xu Kuohai xukuohai@huawei.com

tools/testing/selftests/bpf/Makefile | 3 +- .../selftests/bpf/prog_tests/ringbuf.c | 74 ++++++++++++++ .../bpf/progs/test_ringbuf_overwrite.c | 98 +++++++++++++++++++ 3 files changed, 174 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/bpf/progs/test_ringbuf_overwrite.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index 11d2a368db3e..e6c18e201555 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -499,7 +499,8 @@ LINKED_SKELS := test_static_linked.skel.h linked_funcs.skel.h \ LSKELS := fentry_test.c fexit_test.c fexit_sleep.c atomics.c \ trace_printk.c trace_vprintk.c map_ptr_kern.c \ core_kern.c core_kern_overflow.c test_ringbuf.c \
  test_ringbuf_n.c test_ringbuf_map_key.c test_ringbuf_write.c
  test_ringbuf_n.c test_ringbuf_map_key.c test_ringbuf_write.c    \
  test_ringbuf_overwrite.c
# Generate both light skeleton and libbpf skeleton for these LSKELS_EXTRA := test_ksyms_module.c test_ksyms_weak.c kfunc_call_test.c \ diff --git a/tools/testing/selftests/bpf/prog_tests/ringbuf.c b/tools/testing/selftests/bpf/prog_tests/ringbuf.c index d1e4cb28a72c..205a51c725a7 100644 --- a/tools/testing/selftests/bpf/prog_tests/ringbuf.c +++ b/tools/testing/selftests/bpf/prog_tests/ringbuf.c @@ -17,6 +17,7 @@ #include "test_ringbuf_n.lskel.h" #include "test_ringbuf_map_key.lskel.h" #include "test_ringbuf_write.lskel.h" +#include "test_ringbuf_overwrite.lskel.h"

#define EDONE 7777

@@ -497,6 +498,77 @@ static void ringbuf_map_key_subtest(void) test_ringbuf_map_key_lskel__destroy(skel_map_key); }

+static void ringbuf_overwrite_mode_subtest(void) +{
  unsigned long size, len1, len2, len3, len4, len5;
  unsigned long expect_avail_data, expect_prod_pos, expect_over_pos;
  struct test_ringbuf_overwrite_lskel *skel;
  int err;
  skel = test_ringbuf_overwrite_lskel__open();
  if (!ASSERT_OK_PTR(skel, "skel_open"))
          return;
  size = 0x1000;
  len1 = 0x800;
  len2 = 0x400;
  len3 = size - len1 - len2 - BPF_RINGBUF_HDR_SZ * 3; /* 0x3e8 */
  len4 = len3 - 8; /* 0x3e0 */
  len5 = len3; /* retry with len3 */
  skel->maps.ringbuf.max_entries = size;
  skel->rodata->LEN1 = len1;
  skel->rodata->LEN2 = len2;
  skel->rodata->LEN3 = len3;
  skel->rodata->LEN4 = len4;
  skel->rodata->LEN5 = len5;
  skel->bss->pid = getpid();
  err = test_ringbuf_overwrite_lskel__load(skel);
  if (!ASSERT_OK(err, "skel_load"))
          goto cleanup;
  err = test_ringbuf_overwrite_lskel__attach(skel);
  if (!ASSERT_OK(err, "skel_attach"))
          goto cleanup;
  syscall(__NR_getpgid);
  ASSERT_EQ(skel->bss->reserve1_fail, 0, "reserve 1");
  ASSERT_EQ(skel->bss->reserve2_fail, 0, "reserve 2");
  ASSERT_EQ(skel->bss->reserve3_fail, 1, "reserve 3");
  ASSERT_EQ(skel->bss->reserve4_fail, 0, "reserve 4");
  ASSERT_EQ(skel->bss->reserve5_fail, 0, "reserve 5");
  CHECK(skel->bss->ring_size != size,
        "check_ring_size", "exp %lu, got %lu\n",
        size, skel->bss->ring_size);

we don't use legacy CHECK() macros anymore, please use only ASSERT_xxx

...

  expect_avail_data = len2 + len4 + len5 + 3 * BPF_RINGBUF_HDR_SZ;

  CHECK(skel->bss->avail_data != expect_avail_data,

        "check_avail_size", "exp %lu, got %lu\n",

        expect_avail_data, skel->bss->avail_data);

```
  CHECK(skel->bss->cons_pos != 0,
```

        "check_cons_pos", "exp 0, got %lu\n",

```
        skel->bss->cons_pos);
```

  expect_prod_pos = len1 + len2 + len4 + len5 + 4 * BPF_RINGBUF_HDR_SZ;

  CHECK(skel->bss->prod_pos != expect_prod_pos,

        "check_prod_pos", "exp %lu, got %lu\n",

        expect_prod_pos, skel->bss->prod_pos);

  expect_over_pos = len1 + BPF_RINGBUF_HDR_SZ;

  CHECK(skel->bss->over_pos != expect_over_pos,

        "check_over_pos", "exp %lu, got %lu\n",

        (unsigned long)expect_over_pos, skel->bss->over_pos);

  test_ringbuf_overwrite_lskel__detach(skel);

+cleanup:

  test_ringbuf_overwrite_lskel__destroy(skel);

void test_ringbuf(void) { if (test__start_subtest("ringbuf")) @@ -507,4 +579,6 @@ void test_ringbuf(void) ringbuf_map_key_subtest(); if (test__start_subtest("ringbuf_write")) ringbuf_write_subtest();

  if (test__start_subtest("ringbuf_overwrite_mode"))

          ringbuf_overwrite_mode_subtest();

} diff --git a/tools/testing/selftests/bpf/progs/test_ringbuf_overwrite.c b/tools/testing/selftests/bpf/progs/test_ringbuf_overwrite.c new file mode 100644 index 000000000000..da89ba12a75c --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_ringbuf_overwrite.c @@ -0,0 +1,98 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (C) 2025. Huawei Technologies Co., Ltd */

+#include <linux/bpf.h> +#include <bpf/bpf_helpers.h> +#include "bpf_misc.h"

+char _license[] SEC("license") = "GPL";

+struct {

```
  __uint(type, BPF_MAP_TYPE_RINGBUF);
```
```
  __uint(map_flags, BPF_F_OVERWRITE);
```

+} ringbuf SEC(".maps");

+int pid;

+const volatile unsigned long LEN1; +const volatile unsigned long LEN2; +const volatile unsigned long LEN3; +const volatile unsigned long LEN4; +const volatile unsigned long LEN5;

+long reserve1_fail = 0; +long reserve2_fail = 0; +long reserve3_fail = 0; +long reserve4_fail = 0; +long reserve5_fail = 0;

+unsigned long avail_data = 0; +unsigned long ring_size = 0; +unsigned long cons_pos = 0; +unsigned long prod_pos = 0; +unsigned long over_pos = 0;

+SEC("fentry/" SYS_PREFIX "sys_getpgid") +int test_overwrite_ringbuf(void *ctx) +{

  char *rec1, *rec2, *rec3, *rec4, *rec5;

  int cur_pid = bpf_get_current_pid_tgid() >> 32;

```
  if (cur_pid != pid)
```
```
          return 0;
```

  rec1 = bpf_ringbuf_reserve(&ringbuf, LEN1, 0);

```
  if (!rec1) {
```
```
          reserve1_fail = 1;
```
```
          return 0;
```
```
  }
```

  rec2 = bpf_ringbuf_reserve(&ringbuf, LEN2, 0);

```
  if (!rec2) {
```

          bpf_ringbuf_discard(rec1, 0);

```
          reserve2_fail = 1;
```
```
          return 0;
```
```
  }
```

  rec3 = bpf_ringbuf_reserve(&ringbuf, LEN3, 0);

```
  /* expect failure */
```
```
  if (!rec3) {
```
```
          reserve3_fail = 1;
```
```
  } else {
```

          bpf_ringbuf_discard(rec1, 0);

          bpf_ringbuf_discard(rec2, 0);

          bpf_ringbuf_discard(rec3, 0);

```
          return 0;
```
```
  }
```

  rec4 = bpf_ringbuf_reserve(&ringbuf, LEN4, 0);

```
  if (!rec4) {
```
```
          reserve4_fail = 1;
```

          bpf_ringbuf_discard(rec1, 0);

          bpf_ringbuf_discard(rec2, 0);

```
          return 0;
```
```
  }
```
```
  bpf_ringbuf_submit(rec1, 0);
```
```
  bpf_ringbuf_submit(rec2, 0);
```
```
  bpf_ringbuf_submit(rec4, 0);
```

  rec5 = bpf_ringbuf_reserve(&ringbuf, LEN5, 0);

```
  if (!rec5) {
```
```
          reserve5_fail = 1;
```
```
          return 0;
```
```
  }
```
```
  for (int i = 0; i < LEN3; i++)
```
```
          rec5[i] = 0xdd;
```
```
  bpf_ringbuf_submit(rec5, 0);
```

  ring_size = bpf_ringbuf_query(&ringbuf, BPF_RB_RING_SIZE);

  avail_data = bpf_ringbuf_query(&ringbuf, BPF_RB_AVAIL_DATA);

  cons_pos = bpf_ringbuf_query(&ringbuf, BPF_RB_CONS_POS);

  prod_pos = bpf_ringbuf_query(&ringbuf, BPF_RB_PROD_POS);

  over_pos = bpf_ringbuf_query(&ringbuf, BPF_RB_OVER_POS);

```
  return 0;
```

+}

2.43.0

Xu Kuohai

29 Sep 29 Sep

1:41 p.m.

New subject: [PATCH bpf-next v2 2/3] selftests/bpf: Add test for overwrite ring buffer

On 9/20/2025 6:10 AM, Andrii Nakryiko wrote:

...

On Fri, Sep 5, 2025 at 8:13 AM Xu Kuohai xukuohai@huaweicloud.com wrote:

...

[...]

...

```
  CHECK(skel->bss->ring_size != size,
```

        "check_ring_size", "exp %lu, got %lu\n",

```
        size, skel->bss->ring_size);
```

we don't use legacy CHECK() macros anymore, please use only ASSERT_xxx

OK, will do

Xu Kuohai

5 Sep 5 Sep

3:06 p.m.

New subject: [PATCH bpf-next v2 3/3] selftests/bpf/benchs: Add producer and overwrite bench for ring buffer

From: Xu Kuohai xukuohai@huawei.com

Add rb-prod test for bpf ring buffer to bench producer performance without counsumer thread. And add --rb-overwrite option to bench ring buffer in overwrite mode.

For reference, below are bench numbers collected from x86_64 and arm64 CPUs.

- AMD EPYC 9654 (x86_64)

Ringbuf, overwrite mode with multi-producer contention, no consumer =================================================================== rb-prod nr_prod 1 32.295 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 2 9.591 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 3 8.895 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 4 9.206 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 8 9.220 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 12 4.595 ± 0.022M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 16 4.348 ± 0.016M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 20 3.957 ± 0.017M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 24 3.787 ± 0.014M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 28 3.603 ± 0.011M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 32 3.707 ± 0.011M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 36 3.562 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 40 3.616 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 44 3.598 ± 0.016M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 48 3.555 ± 0.014M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 52 3.463 ± 0.020M/s (drops 0.000 ± 0.000M/s)

- HiSilicon Kunpeng 920 (arm64)

Ringbuf, overwrite mode with multi-producer contention, no consumer =================================================================== rb-prod nr_prod 1 14.687 ± 0.058M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 2 22.263 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 3 5.736 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 4 4.934 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 8 4.661 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 12 3.753 ± 0.013M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 16 3.706 ± 0.018M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 20 3.660 ± 0.015M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 24 3.610 ± 0.016M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 28 3.238 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 32 3.270 ± 0.018M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 36 2.892 ± 0.021M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 40 2.995 ± 0.018M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 44 2.830 ± 0.019M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 48 2.877 ± 0.015M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 52 2.814 ± 0.015M/s (drops 0.000 ± 0.000M/s)

Signed-off-by: Xu Kuohai xukuohai@huawei.com --- tools/testing/selftests/bpf/bench.c | 2 + .../selftests/bpf/benchs/bench_ringbufs.c | 95 +++++++++++++++++-- .../bpf/benchs/run_bench_ringbufs.sh | 4 + .../selftests/bpf/progs/ringbuf_bench.c | 10 ++ 4 files changed, 103 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c index bd29bb2e6cb5..a98063f6436a 100644 --- a/tools/testing/selftests/bpf/bench.c +++ b/tools/testing/selftests/bpf/bench.c @@ -541,6 +541,7 @@ extern const struct bench bench_trig_uretprobe_multi_nop5;

extern const struct bench bench_rb_libbpf; extern const struct bench bench_rb_custom; +extern const struct bench bench_rb_prod; extern const struct bench bench_pb_libbpf; extern const struct bench bench_pb_custom; extern const struct bench bench_bloom_lookup; @@ -617,6 +618,7 @@ static const struct bench *benchs[] = { /* ringbuf/perfbuf benchmarks */ &bench_rb_libbpf, &bench_rb_custom, + &bench_rb_prod, &bench_pb_libbpf, &bench_pb_custom, &bench_bloom_lookup, diff --git a/tools/testing/selftests/bpf/benchs/bench_ringbufs.c b/tools/testing/selftests/bpf/benchs/bench_ringbufs.c index e1ee979e6acc..6d58479fac91 100644 --- a/tools/testing/selftests/bpf/benchs/bench_ringbufs.c +++ b/tools/testing/selftests/bpf/benchs/bench_ringbufs.c @@ -19,6 +19,7 @@ static struct { int ringbuf_sz; /* per-ringbuf, in bytes */ bool ringbuf_use_output; /* use slower output API */ int perfbuf_sz; /* per-CPU size, in pages */ + bool overwrite; } args = { .back2back = false, .batch_cnt = 500, @@ -27,6 +28,7 @@ static struct { .ringbuf_sz = 512 * 1024, .ringbuf_use_output = false, .perfbuf_sz = 128, + .overwrite = false, };

enum { @@ -35,6 +37,7 @@ enum { ARG_RB_BATCH_CNT = 2002, ARG_RB_SAMPLED = 2003, ARG_RB_SAMPLE_RATE = 2004, + ARG_RB_OVERWRITE = 2005, };

static const struct argp_option opts[] = { @@ -43,6 +46,7 @@ static const struct argp_option opts[] = { { "rb-batch-cnt", ARG_RB_BATCH_CNT, "CNT", 0, "Set BPF-side record batch count"}, { "rb-sampled", ARG_RB_SAMPLED, NULL, 0, "Notification sampling"}, { "rb-sample-rate", ARG_RB_SAMPLE_RATE, "RATE", 0, "Notification sample rate"}, + { "rb-overwrite", ARG_RB_OVERWRITE, NULL, 0, "Overwrite mode"}, {}, };

@@ -72,6 +76,9 @@ static error_t parse_arg(int key, char *arg, struct argp_state *state) argp_usage(state); } break; + case ARG_RB_OVERWRITE: + args.overwrite = true; + break; default: return ARGP_ERR_UNKNOWN; } @@ -95,8 +102,30 @@ static inline void bufs_trigger_batch(void)

static void bufs_validate(void) { - if (env.consumer_cnt != 1) { - fprintf(stderr, "rb-libbpf benchmark needs one consumer!\n"); + bool bench_prod = !strcmp(env.bench_name, "rb-prod"); + + if (args.overwrite && !bench_prod) { + fprintf(stderr, "overwite mode only works with benchmakr rb-prod!\n"); + exit(1); + } + + if (bench_prod && env.consumer_cnt != 0) { + fprintf(stderr, "rb-prod benchmark does not need consumer!\n"); + exit(1); + } + + if (bench_prod && args.back2back) { + fprintf(stderr, "back-to-back mode makes no sense for rb-prod!\n"); + exit(1); + } + + if (bench_prod && args.sampled) { + fprintf(stderr, "sampling mode makes no sense for rb-prod!\n"); + exit(1); + } + + if (!bench_prod && env.consumer_cnt != 1) { + fprintf(stderr, "benchmarks excluding rb-prod need one consumer!\n"); exit(1); }

@@ -132,8 +161,10 @@ static void ringbuf_libbpf_measure(struct bench_res *res) res->drops = atomic_swap(&ctx->skel->bss->dropped, 0); }

-static struct ringbuf_bench *ringbuf_setup_skeleton(void) +static struct ringbuf_bench *ringbuf_setup_skeleton(int bench_prod) { + __u32 flags; + struct bpf_map *ringbuf; struct ringbuf_bench *skel;

setup_libbpf(); @@ -146,12 +177,19 @@ static struct ringbuf_bench *ringbuf_setup_skeleton(void)

skel->rodata->batch_cnt = args.batch_cnt; skel->rodata->use_output = args.ringbuf_use_output ? 1 : 0; + skel->rodata->bench_prod = bench_prod;

if (args.sampled) /* record data + header take 16 bytes */ skel->rodata->wakeup_data_size = args.sample_rate * 16;

- bpf_map__set_max_entries(skel->maps.ringbuf, args.ringbuf_sz); + ringbuf = skel->maps.ringbuf; + if (args.overwrite) { + flags = bpf_map__map_flags(ringbuf) | BPF_F_OVERWRITE; + bpf_map__set_map_flags(ringbuf, flags); + } + + bpf_map__set_max_entries(ringbuf, args.ringbuf_sz);

if (ringbuf_bench__load(skel)) { fprintf(stderr, "failed to load skeleton\n"); @@ -171,10 +209,13 @@ static void ringbuf_libbpf_setup(void) { struct ringbuf_libbpf_ctx *ctx = &ringbuf_libbpf_ctx; struct bpf_link *link; + int map_fd;

- ctx->skel = ringbuf_setup_skeleton(); - ctx->ringbuf = ring_buffer__new(bpf_map__fd(ctx->skel->maps.ringbuf), - buf_process_sample, NULL, NULL); + ctx->skel = ringbuf_setup_skeleton(0); + + map_fd = bpf_map__fd(ctx->skel->maps.ringbuf); + ctx->ringbuf = ring_buffer__new(map_fd, buf_process_sample, + NULL, NULL); if (!ctx->ringbuf) { fprintf(stderr, "failed to create ringbuf\n"); exit(1); @@ -232,7 +273,7 @@ static void ringbuf_custom_setup(void) void *tmp; int err;

- ctx->skel = ringbuf_setup_skeleton(); + ctx->skel = ringbuf_setup_skeleton(0);

ctx->epoll_fd = epoll_create1(EPOLL_CLOEXEC); if (ctx->epoll_fd < 0) { @@ -277,6 +318,33 @@ static void ringbuf_custom_setup(void) } }

+/* RINGBUF-PRODUCER benchmark */ +static struct ringbuf_prod_ctx { + struct ringbuf_bench *skel; +} ringbuf_prod_ctx; + +static void ringbuf_prod_measure(struct bench_res *res) +{ + struct ringbuf_prod_ctx *ctx = &ringbuf_prod_ctx; + + res->hits = atomic_swap(&ctx->skel->bss->hits, 0); + res->drops = atomic_swap(&ctx->skel->bss->dropped, 0); +} + +static void ringbuf_prod_setup(void) +{ + struct ringbuf_prod_ctx *ctx = &ringbuf_prod_ctx; + struct bpf_link *link; + + ctx->skel = ringbuf_setup_skeleton(1); + + link = bpf_program__attach(ctx->skel->progs.bench_ringbuf); + if (!link) { + fprintf(stderr, "failed to attach program!\n"); + exit(1); + } +} + #define RINGBUF_BUSY_BIT (1 << 31) #define RINGBUF_DISCARD_BIT (1 << 30) #define RINGBUF_META_LEN 8 @@ -540,6 +608,17 @@ const struct bench bench_rb_custom = { .report_final = hits_drops_report_final, };

+const struct bench bench_rb_prod = { + .name = "rb-prod", + .argp = &bench_ringbufs_argp, + .validate = bufs_validate, + .setup = ringbuf_prod_setup, + .producer_thread = bufs_sample_producer, + .measure = ringbuf_prod_measure, + .report_progress = hits_drops_report_progress, + .report_final = hits_drops_report_final, +}; + const struct bench bench_pb_libbpf = { .name = "pb-libbpf", .argp = &bench_ringbufs_argp, diff --git a/tools/testing/selftests/bpf/benchs/run_bench_ringbufs.sh b/tools/testing/selftests/bpf/benchs/run_bench_ringbufs.sh index 91e3567962ff..84ae66beb0ec 100755 --- a/tools/testing/selftests/bpf/benchs/run_bench_ringbufs.sh +++ b/tools/testing/selftests/bpf/benchs/run_bench_ringbufs.sh @@ -49,3 +49,7 @@ for b in 1 2 3 4 8 12 16 20 24 28 32 36 40 44 48 52; do summarize "rb-libbpf nr_prod $b" "$($RUN_RB_BENCH -p$b --rb-batch-cnt 50 rb-libbpf)" done

+header "Ringbuf, overwrite mode with multi-producer contention, no consumer" +for b in 1 2 3 4 8 12 16 20 24 28 32 36 40 44 48 52; do + summarize "rb-prod nr_prod $b" "$($RUN_BENCH -p$b --rb-batch-cnt 50 --rb-overwrite rb-prod)" +done diff --git a/tools/testing/selftests/bpf/progs/ringbuf_bench.c b/tools/testing/selftests/bpf/progs/ringbuf_bench.c index 6a468496f539..c55282ba4038 100644 --- a/tools/testing/selftests/bpf/progs/ringbuf_bench.c +++ b/tools/testing/selftests/bpf/progs/ringbuf_bench.c @@ -14,9 +14,11 @@ struct {

const volatile int batch_cnt = 0; const volatile long use_output = 0; +const volatile long bench_prod = 0;

long sample_val = 42; long dropped __attribute__((aligned(128))) = 0; +long hits __attribute__((aligned(128))) = 0;

const volatile long wakeup_data_size = 0;

@@ -24,6 +26,9 @@ static __always_inline long get_flags() { long sz;

+ if (bench_prod) + return BPF_RB_NO_WAKEUP; + if (!wakeup_data_size) return 0;

@@ -47,6 +52,8 @@ int bench_ringbuf(void *ctx) *sample = sample_val; flags = get_flags(); bpf_ringbuf_submit(sample, flags); + if (bench_prod) + __sync_add_and_fetch(&hits, 1); } } } else { @@ -55,6 +62,9 @@ int bench_ringbuf(void *ctx) if (bpf_ringbuf_output(&ringbuf, &sample_val, sizeof(sample_val), flags)) __sync_add_and_fetch(&dropped, 1); + else if (bench_prod) + __sync_add_and_fetch(&hits, 1); + } } return 0;

-- 2.43.0

Andrii Nakryiko

19 Sep 19 Sep

10:15 p.m.

New subject: [PATCH bpf-next v2 3/3] selftests/bpf/benchs: Add producer and overwrite bench for ring buffer

On Fri, Sep 5, 2025 at 8:13 AM Xu Kuohai xukuohai@huaweicloud.com wrote:

...

From: Xu Kuohai xukuohai@huawei.com

Add rb-prod test for bpf ring buffer to bench producer performance without counsumer thread. And add --rb-overwrite option to bench ring buffer in overwrite mode.

For reference, below are bench numbers collected from x86_64 and arm64 CPUs.

AMD EPYC 9654 (x86_64)

Ringbuf, overwrite mode with multi-producer contention, no consumer

rb-prod nr_prod 1 32.295 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 2 9.591 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 3 8.895 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 4 9.206 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 8 9.220 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 12 4.595 ± 0.022M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 16 4.348 ± 0.016M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 20 3.957 ± 0.017M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 24 3.787 ± 0.014M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 28 3.603 ± 0.011M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 32 3.707 ± 0.011M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 36 3.562 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 40 3.616 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 44 3.598 ± 0.016M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 48 3.555 ± 0.014M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 52 3.463 ± 0.020M/s (drops 0.000 ± 0.000M/s)

HiSilicon Kunpeng 920 (arm64)

Ringbuf, overwrite mode with multi-producer contention, no consumer

rb-prod nr_prod 1 14.687 ± 0.058M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 2 22.263 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 3 5.736 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 4 4.934 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 8 4.661 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 12 3.753 ± 0.013M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 16 3.706 ± 0.018M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 20 3.660 ± 0.015M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 24 3.610 ± 0.016M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 28 3.238 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 32 3.270 ± 0.018M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 36 2.892 ± 0.021M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 40 2.995 ± 0.018M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 44 2.830 ± 0.019M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 48 2.877 ± 0.015M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 52 2.814 ± 0.015M/s (drops 0.000 ± 0.000M/s)

Signed-off-by: Xu Kuohai xukuohai@huawei.com

tools/testing/selftests/bpf/bench.c | 2 + .../selftests/bpf/benchs/bench_ringbufs.c | 95 +++++++++++++++++-- .../bpf/benchs/run_bench_ringbufs.sh | 4 + .../selftests/bpf/progs/ringbuf_bench.c | 10 ++ 4 files changed, 103 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c index bd29bb2e6cb5..a98063f6436a 100644 --- a/tools/testing/selftests/bpf/bench.c +++ b/tools/testing/selftests/bpf/bench.c @@ -541,6 +541,7 @@ extern const struct bench bench_trig_uretprobe_multi_nop5;

extern const struct bench bench_rb_libbpf; extern const struct bench bench_rb_custom; +extern const struct bench bench_rb_prod; extern const struct bench bench_pb_libbpf; extern const struct bench bench_pb_custom; extern const struct bench bench_bloom_lookup; @@ -617,6 +618,7 @@ static const struct bench *benchs[] = { /* ringbuf/perfbuf benchmarks */ &bench_rb_libbpf, &bench_rb_custom,
  &bench_rb_prod,
  &bench_pb_libbpf,
  &bench_pb_custom,
  &bench_bloom_lookup,
diff --git a/tools/testing/selftests/bpf/benchs/bench_ringbufs.c b/tools/testing/selftests/bpf/benchs/bench_ringbufs.c index e1ee979e6acc..6d58479fac91 100644 --- a/tools/testing/selftests/bpf/benchs/bench_ringbufs.c +++ b/tools/testing/selftests/bpf/benchs/bench_ringbufs.c @@ -19,6 +19,7 @@ static struct { int ringbuf_sz; /* per-ringbuf, in bytes */ bool ringbuf_use_output; /* use slower output API */ int perfbuf_sz; /* per-CPU size, in pages */
  bool overwrite;
} args = { .back2back = false, .batch_cnt = 500, @@ -27,6 +28,7 @@ static struct { .ringbuf_sz = 512 * 1024, .ringbuf_use_output = false, .perfbuf_sz = 128,
  .overwrite = false,
};

enum { @@ -35,6 +37,7 @@ enum { ARG_RB_BATCH_CNT = 2002, ARG_RB_SAMPLED = 2003, ARG_RB_SAMPLE_RATE = 2004,
  ARG_RB_OVERWRITE = 2005,
};

static const struct argp_option opts[] = { @@ -43,6 +46,7 @@ static const struct argp_option opts[] = { { "rb-batch-cnt", ARG_RB_BATCH_CNT, "CNT", 0, "Set BPF-side record batch count"}, { "rb-sampled", ARG_RB_SAMPLED, NULL, 0, "Notification sampling"}, { "rb-sample-rate", ARG_RB_SAMPLE_RATE, "RATE", 0, "Notification sample rate"},
  { "rb-overwrite", ARG_RB_OVERWRITE, NULL, 0, "Overwrite mode"},
  {},
};

@@ -72,6 +76,9 @@ static error_t parse_arg(int key, char *arg, struct argp_state *state) argp_usage(state); } break;
  case ARG_RB_OVERWRITE:
          args.overwrite = true;
          break;
  default:
          return ARGP_ERR_UNKNOWN;
  }
@@ -95,8 +102,30 @@ static inline void bufs_trigger_batch(void)

static void bufs_validate(void) {
  if (env.consumer_cnt != 1) {
          fprintf(stderr, "rb-libbpf benchmark needs one consumer!\n");
  bool bench_prod = !strcmp(env.bench_name, "rb-prod");
  if (args.overwrite && !bench_prod) {
          fprintf(stderr, "overwite mode only works with benchmakr rb-prod!\n");
          exit(1);
  }
  if (bench_prod && env.consumer_cnt != 0) {
          fprintf(stderr, "rb-prod benchmark does not need consumer!\n");
          exit(1);
  }
  if (bench_prod && args.back2back) {
          fprintf(stderr, "back-to-back mode makes no sense for rb-prod!\n");
          exit(1);
  }
  if (bench_prod && args.sampled) {
          fprintf(stderr, "sampling mode makes no sense for rb-prod!\n");
          exit(1);
  }
  if (!bench_prod && env.consumer_cnt != 1) {
          fprintf(stderr, "benchmarks excluding rb-prod need one consumer!\n");
          exit(1);
  }
@@ -132,8 +161,10 @@ static void ringbuf_libbpf_measure(struct bench_res *res) res->drops = atomic_swap(&ctx->skel->bss->dropped, 0); }

-static struct ringbuf_bench *ringbuf_setup_skeleton(void) +static struct ringbuf_bench *ringbuf_setup_skeleton(int bench_prod)

int because C doesn't support bool?...

but really, do we need another benchmark just to set overwritable mode?... can't you adapt existing benchmarks to optionally set overwritable mode?

(and please drop sdf@google.com from CC for the next revision, that email doesn't exist anymore)

[...]

Xu Kuohai

29 Sep 29 Sep

1:41 p.m.

New subject: [PATCH bpf-next v2 3/3] selftests/bpf/benchs: Add producer and overwrite bench for ring buffer

On 9/20/2025 6:15 AM, Andrii Nakryiko wrote:

...

On Fri, Sep 5, 2025 at 8:13 AM Xu Kuohai xukuohai@huaweicloud.com wrote:

...

[...]

...

...
-static struct ringbuf_bench *ringbuf_setup_skeleton(void) +static struct ringbuf_bench *ringbuf_setup_skeleton(int bench_prod)

int because C doesn't support bool?...

No special reason, just wrote it like that. I’ll switch it to bool

...

but really, do we need another benchmark just to set overwritable mode?... can't you adapt existing benchmarks to optionally set overwritable mode?

OK, I’ll adapt the existing benchmarks to support it.

...

(and please drop sdf@google.com from CC for the next revision, that email doesn't exist anymore)

...

[...]

days inactive

126

days old

linux-kselftest-mirror@lists.linaro.org

15 comments

participants

tags (0)

participants (2)

Andrii Nakryiko
Xu Kuohai