On Mon, Sep 29, 2025 at 6:41 AM Xu Kuohai xukuohai@huaweicloud.com wrote:
On 9/20/2025 6:10 AM, Andrii Nakryiko wrote:
On Fri, Sep 5, 2025 at 8:13 AM Xu Kuohai xukuohai@huaweicloud.com wrote:
From: Xu Kuohai xukuohai@huawei.com
When the bpf ring buffer is full, new events can not be recorded util
typo: until
ACK
the consumer consumes some events to free space. This may cause critical events to be discarded, such as in fault diagnosis, where recent events are more critical than older ones.
So add ovewrite mode for bpf ring buffer. In this mode, the new event
overwrite, BPF
ACK
overwrites the oldest event when the buffer is full.
The scheme is as follows:
producer_pos tracks the next position to write new data. When there is enough free space, the producer simply moves producer_pos forward to make space for the new event.

To avoid waiting for the consumer to free space when the buffer is full, a new variable overwrite_pos is introduced for the producer. overwrite_pos tracks the next event to be overwritten (the oldest committed event) in the buffer. The producer moves it forward to discard the oldest events when the buffer is full.
pending_pos tracks the oldest event under committing. producer ensures
"under committing" is confusing. Oldest event to be committed?
Yes, 'the oldest event to be committed'. Thanks!
producer_pos never passes pending_pos when making space for new events. So multiple producers never write to the same position at the same time.
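To make the bookkeeping concrete, here is a rough sketch of the overwrite-mode space accounting (illustration only, not the patch code: the helper name and the exact checks are made up for this sketch, total_len is assumed to already include the record header and be rounded up, and the caller is assumed to hold the producer lock):

/* Rough sketch of the space accounting described above; ringbuf_make_room
 * is a made-up name, not the function added by this patch.
 */
static bool ringbuf_make_room(struct bpf_ringbuf *rb, u64 prod_pos,
                              u64 pend_pos, u32 total_len)
{
        u64 over_pos = rb->overwrite_pos;

        /* discard whole committed records until the new record fits */
        while (prod_pos + total_len - over_pos > rb->mask + 1) {
                struct bpf_ringbuf_hdr *hdr;
                u32 len;

                /* never pass pending_pos: that record is not committed
                 * yet and must not be overwritten
                 */
                if (over_pos >= pend_pos)
                        return false;

                hdr = (void *)rb->data + (over_pos & rb->mask);
                len = hdr->len & ~(BPF_RINGBUF_BUSY_BIT | BPF_RINGBUF_DISCARD_BIT);
                /* stay on record boundaries: skip the oldest record whole */
                over_pos += round_up(len + BPF_RINGBUF_HDR_SZ, 8);
        }

        WRITE_ONCE(rb->overwrite_pos, over_pos);
        return true;
}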
- producer wakes up consumer every half a round ahead to give it a chance to retrieve data. However, for an overwrite-mode ring buffer, users typically only care about the ring buffer snapshot before a fault occurs. In this case, the producer should commit data with the BPF_RB_NO_WAKEUP flag to avoid unnecessary wakeups.
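For illustration, the producer side of a BPF program would then look roughly like this (a sketch only: the map, struct and section names are made up, and how the ring buffer is switched into overwrite mode is not shown since that part of the uapi isn't quoted here; BPF_RB_NO_WAKEUP and the helpers are the existing ring buffer API):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

struct event {
        __u64 ts;
        __u32 cpu;
};

struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 4096);
} events SEC(".maps");

SEC("tp/sched/sched_switch")
int record_event(void *ctx)
{
        struct event *e;

        e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
        if (!e)
                return 0;

        e->ts = bpf_ktime_get_ns();
        e->cpu = bpf_get_smp_processor_id();

        /* no wakeup on submit: the consumer only looks at the snapshot
         * after a fault, so waking it up for every record is wasted work
         */
        bpf_ringbuf_submit(e, BPF_RB_NO_WAKEUP);
        return 0;
}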
To make it clear, here are some example diagrams.
Let's say we have a ring buffer with size 4096.
At first, {producer,overwrite,pending,consumer}_pos are all set to 0
0       512     1024    1536    2048    2560    3072    3584    4096
+---------------------------------------------------------------+
|                                                               |
|                                                               |
+---------------------------------------------------------------+
^
|
producer_pos  = 0
overwrite_pos = 0
pending_pos   = 0
consumer_pos  = 0
Reserve event A, size 512.
There is enough free space, so A is allocated at offset 0 and producer_pos is moved to 512, the end of A. Since A is not submitted, the BUSY bit is set.
0       512     1024    1536    2048    2560    3072    3584    4096
+-------+-------------------------------------------------------+
|   A   |                                                       |
| [BUSY]|                                                       |
+-------+-------------------------------------------------------+
^       ^
|       |
|       producer_pos = 512
|
overwrite_pos = 0
pending_pos   = 0
consumer_pos  = 0
Reserve event B, size 1024.
B is allocated at offset 512 with BUSY bit set, and producer_pos is moved to the end of B.
0       512     1024    1536    2048    2560    3072    3584    4096
+-------+---------------+---------------------------------------+
|   A   |       B       |                                       |
| [BUSY]|     [BUSY]    |                                       |
+-------+---------------+---------------------------------------+
^                       ^
|                       |
|                       producer_pos = 1536
|
overwrite_pos = 0
pending_pos   = 0
consumer_pos  = 0
Reserve event C, size 2048.
C is allocated at offset 1536 and producer_pos becomes 3584.
0       512     1024    1536    2048    2560    3072    3584    4096
+-------+---------------+-------------------------------+-------+
|   A   |       B       |               C               |       |
| [BUSY]|     [BUSY]    |             [BUSY]            |       |
+-------+---------------+-------------------------------+-------+
^                                                       ^
|                                                       |
|                                                       producer_pos = 3584
|
overwrite_pos = 0
pending_pos   = 0
consumer_pos  = 0
Submit event A.
The BUSY bit of A is cleared. B becomes the oldest event under writing, so
Now it's "under writing" :) To be committed? Or "pending committing" or just "pending", I guess. But not under anything, it just confuses readers. IMO.
Once again, 'oldest event to be committed'.
I should check it with an AI agent first.
pending_pos is moved to 512, the start of B.

0       512     1024    1536    2048    2560    3072    3584    4096
+-------+---------------+-------------------------------+-------+
|   A   |       B       |               C               |       |
|       |     [BUSY]    |             [BUSY]            |       |
+-------+---------------+-------------------------------+-------+
^       ^                                               ^
|       |                                               |
|       pending_pos = 512                               producer_pos = 3584
|
overwrite_pos = 0
consumer_pos  = 0
Submit event B.
The BUSY bit of B is cleared, and pending_pos is moved to the start of C, which is the oldest event under writing now.
ditto
Again and again :(
0       512     1024    1536    2048    2560    3072    3584    4096
+-------+---------------+-------------------------------+-------+
|   A   |       B       |               C               |       |
|       |               |             [BUSY]            |       |
+-------+---------------+-------------------------------+-------+
^                       ^                               ^
|                       |                               |
|                       pending_pos = 1536              producer_pos = 3584
|
overwrite_pos = 0
consumer_pos  = 0
Reserve event D, size 1536 (3 * 512).
There are 2048 bytes not under writing between producer_pos and pending_pos, so D is allocated at offset 3584, and producer_pos is moved from 3584 to 5120.
Since event D will overwrite all bytes of event A and the begining 512 bytes
typo: beginning, but really "first 512 bytes" would be clearer
OK, I’ll switch to 'first 512 bytes' for clarity.
of event B, overwrite_pos is moved to the start of event C, the oldest event that is not overwritten.

0       512     1024    1536    2048    2560    3072    3584    4096
+---------------+-------+-------------------------------+-------+
|     D End     |       |               C               |D Begin|
|     [BUSY]    |       |             [BUSY]            | [BUSY]|
+---------------+-------+-------------------------------+-------+
^               ^       ^
|               |       |
|               |       pending_pos   = 1536
|               |       overwrite_pos = 1536
|               |
|               producer_pos = 5120
|
consumer_pos = 0
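To spell out the arithmetic of this step: before reserving D, the space not under writing is pending_pos + 4096 - producer_pos = 1536 + 4096 - 3584 = 2048 bytes, so D (1536 bytes) fits. After the reservation, producer_pos = 3584 + 1536 = 5120, so everything in the ring before offset 5120 - 4096 = 1024 is overwritten; since offset 1024 falls in the middle of B, overwrite_pos must advance to the next record boundary, 1536, the start of C.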
Reserve event E, size 1024.
Though there are 512 bytes not under writing between producer_pos and pending_pos, E can not be reserved, as it would overwrite the first 512 bytes of event C, which is still under writing.
Submit event C and D.
pending_pos is moved to the end of D.
0       512     1024    1536    2048    2560    3072    3584    4096
+---------------+-------+-------------------------------+-------+
|     D End     |       |               C               |D Begin|
|               |       |                               |       |
+---------------+-------+-------------------------------+-------+
^               ^       ^
|               |       |
|               |       overwrite_pos = 1536
|               |
|               producer_pos = 5120
|               pending_pos  = 5120
|
consumer_pos = 0
The performance data for overwrite mode will be provided in a follow-up patch that adds overwrite-mode benchmarks.
A sample of performance data for non-overwrite mode on x86_64 and arm64 CPUs, before and after this patch, is shown below. As we can see, no obvious performance regression occurs.
[...]
Signed-off-by: Xu Kuohai xukuohai@huawei.com
 include/uapi/linux/bpf.h       |   4 +
 kernel/bpf/ringbuf.c           | 159 +++++++++++++++++++++++++++------
 tools/include/uapi/linux/bpf.h |   4 +
 3 files changed, 141 insertions(+), 26 deletions(-)
[...]
over_pos = READ_ONCE(rb->overwrite_pos);
return min(prod_pos - max(cons_pos, over_pos), rb->mask + 1);
I'm trying to understand why you need to min with `rb->mask + 1`, can you please elaborate?
We need the min because rb->producer_pos and rb->overwrite_pos are read at different times. During this gap, a fast producer may wrap once or more, making over_pos larger than prod_pos.
what if you read overwrite_pos before reading producer_pos? Then it can't be larger than producer_pos and available data would be producer_pos - max(consumer_pos, overwrite_pos)? would that work?
And also, at least for consistency, use smp_load_acquire() for overwrite_pos?
Using READ_ONCE here is to stay symmetric with __bpf_ringbuf_reserve(), where overwrite_pos is written with WRITE_ONCE() first, followed by smp_store_release() on producer_pos. So here we do smp_load_acquire() on producer_pos first, then READ_ONCE() on overwrite_pos, to ensure a consistent view of the ring buffer.
For consistency when reading consumer_pos and producer_pos, I’m fine with switching READ_ONCE to smp_load_acquire for overwrite_pos.
I'm not sure it matters much, but this function is called outside of rb->spinlock, while __bpf_ringbuf_reserve() does hold a lock while doing that WRITE_ONCE(). So it might not make any difference, but I have mild preference for smp_load_acquire() here.
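Something along these lines is what I have in mind, just a sketch of the suggested read ordering for the availability calculation, not a final version; whether the min() clamp can then be dropped is exactly the question above:

        cons_pos = smp_load_acquire(&rb->consumer_pos);
        over_pos = smp_load_acquire(&rb->overwrite_pos);
        /* producer_pos is read last, so the value we see is at least as
         * fresh as over_pos and over_pos can't be observed ahead of it
         */
        prod_pos = smp_load_acquire(&rb->producer_pos);
        return prod_pos - max(cons_pos, over_pos);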
}
static u32 ringbuf_total_data_sz(const struct bpf_ringbuf *rb)
@@ -402,11 +419,43 @@ bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr)
        return (void*)((addr & PAGE_MASK) - off);
}
[...]
/* In overwrite mode, move overwrite_pos to the next record to be
* overwritten if the ring buffer is full
*/
hm... here I think the important point is that we search for the next record boundary up to which we need to overwrite data so that the newly reserved record fits. "next record to be overwritten" isn't that important (we might never need to overwrite it). The important aspects are a) staying on record boundaries and b) consuming enough records to fit the new one.
Can you please update the comment to mention the above points?
Sure, I'll update the comment to:
In overwrite mode, advance overwrite_pos when the ring buffer is full. The key points are to stay on record boundaries and consume enough records to fit the new one.
ok
[...]
                             unsigned long rec_pos,
                             unsigned long cons_pos,
                             u32 len, u64 flags)
{
        unsigned long rec_end;

        if (flags & BPF_RB_FORCE_WAKEUP)
                return true;

        if (flags & BPF_RB_NO_WAKEUP)
                return false;

        /* for non-overwrite mode, if consumer caught up and is waiting for
         * our record, notify about new data availability
         */
        if (likely(!rb->overwrite_mode))
                return cons_pos == rec_pos;

        /* for overwrite mode, to give the consumer a chance to catch up
         * before being overwritten, wake up consumer every half a round
         * ahead.
         */
        rec_end = rec_pos + ringbuf_round_up_hdr_len(len);
        cons_pos &= (rb->mask >> 1);
        rec_pos &= (rb->mask >> 1);
        rec_end &= (rb->mask >> 1);

        if (cons_pos == rec_pos)
                return true;

        if (rec_pos < cons_pos && cons_pos < rec_end)
                return true;

        if (rec_end < rec_pos && (cons_pos > rec_pos || cons_pos < rec_end))
                return true;
hm... ok, let's discuss this. Why do we need to do some half-round heuristic for overwrite mode? If a consumer is falling behind, it should be actively trying to catch up and doesn't need a notification (that's the non-overwrite mode logic already).
So there is more to this than a brief comment you left, can you please elaborate?
The half-round wakeup was originally intended to work with libbpf in the v1 version. In that version, libbpf used a retry loop to safely copy data from the ring buffer that hadn’t been overwritten. By waking the consumer once every half round, there was always a period where the consumer and producer did not overlap, which helped reduce the number of retries.
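To make the arithmetic concrete: with a 4096-byte buffer, mask >> 1 is 2047, so cons_pos, rec_pos and rec_end in the snippet above are all folded into [0, 2048). A wakeup is sent whenever the folded range of the just-committed record covers the folded consumer position, which happens roughly once per 2048 produced bytes, i.e. once every half round, regardless of how far behind the consumer actually is.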
I can't say I completely grok the logic here, but do you think we should still keep this half-round wakeup? It looks like an arbitrary heuristic, so I'd rather not have it.
pw-bot: cr
        return false;
}
+static __always_inline
we didn't have always_inline before, any strong reason to add it now?
I just wanted to avoid introducing any performance regression. Before this patch, bpf_ringbuf_commit() was automatically inlined by the compiler, but after the patch it wasn’t, so I added always_inline explicitly to keep it inlined.
how big of a difference was it in benchmarks? It's generally frowned upon using __always_inline without a good reason.
[...]