On 10/7/2025 6:10 AM, Andrii Nakryiko wrote:
[...]
	over_pos = READ_ONCE(rb->overwrite_pos);
	return min(prod_pos - max(cons_pos, over_pos), rb->mask + 1);
I'm trying to understand why you need to min with `rb->mask + 1`, can you please elaborate?
We need the min because rb->producer_pos and rb->overwrite_pos are read at different times. During this gap, a fast producer may wrap once or more, making over_pos larger than prod_pos; the subtraction would then underflow, so we clamp the result to the ring buffer size (rb->mask + 1).
what if you read overwrite_pos before reading producer_pos? Then it can't be larger than producer_pos and available data would be producer_pos - max(consumer_pos, overwrite_pos)? would that work?
No, it won’t work. Between reading overwrite_pos and producer_pos, a producer on a different CPU may have already moved producer_pos forward by more than one ring buffer size, causing prod_pos - max(cons_pos, over_pos) to exceed the ring buffer size.
And also, at least for consistency, use smp_load_acquire() for overwrite_pos?
READ_ONCE is used here to stay symmetric with __bpf_ringbuf_reserve(), where overwrite_pos is written with WRITE_ONCE first, followed by smp_store_release(producer_pos). So here we do smp_load_acquire(producer_pos) first, then READ_ONCE(overwrite_pos), to ensure a consistent view of the ring buffer.
For consistency when reading consumer_pos and producer_pos, I’m fine with switching READ_ONCE to smp_load_acquire for overwrite_pos.
I'm not sure it matters much, but this function is called outside of rb->spinlock, while __bpf_ringbuf_reserve() does hold a lock while doing that WRITE_ONCE(). So it might not make any difference, but I have mild preference for smp_load_acquire() here.
OK, I'll switch to smp_load_acquire.
}
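For illustration, here is a minimal sketch of the available-data calculation with the reading order discussed above, after switching overwrite_pos to smp_load_acquire. The function and field names (ringbuf_avail_data_sz, overwrite_mode, overwrite_pos) follow the existing code and this patch, but the body below is only a sketch, not the exact patch code:

static unsigned long ringbuf_avail_data_sz(struct bpf_ringbuf *rb)
{
	unsigned long cons_pos, prod_pos, over_pos;

	cons_pos = smp_load_acquire(&rb->consumer_pos);
	/* pairs with smp_store_release(producer_pos) in __bpf_ringbuf_reserve() */
	prod_pos = smp_load_acquire(&rb->producer_pos);

	if (!rb->overwrite_mode)
		return prod_pos - cons_pos;

	/* overwrite_pos is published before producer_pos on the reserve side,
	 * so reading it after producer_pos yields an equally new or newer value
	 */
	over_pos = smp_load_acquire(&rb->overwrite_pos);

	/* clamp to the ring size: a fast producer may wrap between the two
	 * reads above, which would otherwise make the subtraction underflow
	 */
	return min(prod_pos - max(cons_pos, over_pos), rb->mask + 1);
}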
static u32 ringbuf_total_data_sz(const struct bpf_ringbuf *rb)

@@ -402,11 +419,43 @@ bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr)
	return (void*)((addr & PAGE_MASK) - off);
}
[...]
/* In overwrite mode, move overwrite_pos to the next record to be
* overwritten if the ring buffer is full
*/
hm... here I think the important point is that we search for the next record boundary up to which we need to overwrite data so that the newly reserved record fits. "next record to be overwritten" isn't that important (we might never need to overwrite it). What matters is a) staying on record boundaries and b) consuming enough records to reserve the new one.
Can you please update the comment to mention the above points?
Sure, I'll update the comment to:
In overwrite mode, advance overwrite_pos when the ring buffer is full. The key points are to stay on record boundaries and consume enough records to fit the new one.
ok
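To make those two points concrete, here is a rough sketch of the advance (my illustration only, not the actual patch code; new_prod_pos stands for the producer position after reserving the new record, and hdr->len is assumed to hold the plain record length, while the real header also carries busy/discard flag bits that would need masking):

	struct bpf_ringbuf_hdr *hdr;
	unsigned long over_pos = rb->overwrite_pos;

	/* consume whole records until the newly reserved record fits */
	while (new_prod_pos - over_pos > ringbuf_total_data_sz(rb)) {
		hdr = (void *)rb->data + (over_pos & rb->mask);
		/* jump to the next record boundary */
		over_pos += ringbuf_round_up_hdr_len(hdr->len);
	}
	WRITE_ONCE(rb->overwrite_pos, over_pos);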
[...]
				    unsigned long rec_pos,
				    unsigned long cons_pos,
				    u32 len, u64 flags)
+{
	unsigned long rec_end;

	if (flags & BPF_RB_FORCE_WAKEUP)
		return true;

	if (flags & BPF_RB_NO_WAKEUP)
		return false;

	/* for non-overwrite mode, if consumer caught up and is waiting for
	 * our record, notify about new data availability
	 */
	if (likely(!rb->overwrite_mode))
		return cons_pos == rec_pos;

	/* for overwrite mode, to give the consumer a chance to catch up
	 * before being overwritten, wake up consumer every half a round
	 * ahead.
	 */
	rec_end = rec_pos + ringbuf_round_up_hdr_len(len);
	cons_pos &= (rb->mask >> 1);
	rec_pos &= (rb->mask >> 1);
	rec_end &= (rb->mask >> 1);
	if (cons_pos == rec_pos)
		return true;
	if (rec_pos < cons_pos && cons_pos < rec_end)
		return true;
	if (rec_end < rec_pos && (cons_pos > rec_pos || cons_pos < rec_end))
		return true;
hm... ok, let's discuss this. Why do we need to do some half-round heuristic for overwrite mode? If a consumer is falling behind, it should be actively trying to catch up and doesn't need a notification (that's already the non-overwrite mode logic).
So there is more to this than a brief comment you left, can you please elaborate?
The half-round wakeup was originally intended to work with libbpf in the v1 version. In that version, libbpf used a retry loop to safely copy data from the ring buffer that hadn’t been overwritten. By waking the consumer once every half round, there was always a period where the consumer and producer did not overlap, which helped reduce the number of retries.
I can't say I completely grok the logic here, but do you think we should still keep this half-round wakeup? It looks like an arbitrary heuristic, so I'd rather not have it.
Sure, since the related libbpf code is no longer present, I’ll remove this logic in the next version.
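Just to spell out what dropping the heuristic might leave behind, a sketch of the simplified decision (my guess at the shape of the next version, not actual code; the function name is made up since the real name isn't shown above):

static bool bpf_ringbuf_should_notify(unsigned long rec_pos,
				      unsigned long cons_pos,
				      u64 flags)
{
	if (flags & BPF_RB_FORCE_WAKEUP)
		return true;
	if (flags & BPF_RB_NO_WAKEUP)
		return false;
	/* notify only if the consumer is waiting right at our record */
	return cons_pos == rec_pos;
}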
pw-bot: cr
	return false;
+}
+static __always_inline
we didn't have always_inline before, any strong reason to add it now?
I just wanted to avoid introducing any performance regression. Before this patch, bpf_ringbuf_commit() was automatically inlined by the compiler, but after the patch it wasn’t, so I added always_inline explicitly to keep it inlined.
how big of a difference was it in benchmarks? It's generally frowned upon using __always_inline without a good reason.
The difference is not noticeable on my arm64 test machine, but it is on my AMD machine.
Below is the benchmark data on AMD EPYC 9654, with and without the always_inline attribute.
- With always_inline
Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1   13.070 ± 0.158M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 2   15.440 ± 0.017M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3    7.860 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4    6.444 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8    3.788 ± 0.005M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12   2.802 ± 0.007M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16   2.560 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20   2.227 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 24   2.141 ± 0.007M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 28   1.960 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 32   1.913 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 36   1.854 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 40   1.818 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 44   1.779 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 48   1.758 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 52   1.812 ± 0.003M/s (drops 0.000 ± 0.000M/s)
- Without always_inline
Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1   10.550 ± 0.032M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 2   14.661 ± 0.024M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3    7.616 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4    6.476 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8    3.806 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12   2.814 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16   2.608 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20   2.337 ± 0.005M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 24   2.270 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 28   1.977 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 32   1.921 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 36   1.862 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 40   1.827 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 44   1.912 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 48   1.860 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 52   1.824 ± 0.001M/s (drops 0.000 ± 0.000M/s)
When nr_prod=1, the performance regression is significant, dropping from 13.070 ± 0.158 M/s with always_inline to 10.550 ± 0.032 M/s without it.
However, since the half-round wakeup logic will be removed in the next version, the changes to bpf_ringbuf_commit, including always_inline, will also be removed.
[...]