On 21/04/2025 10:58 pm, Yabin Cui wrote:
For PMUs like ARM ETM/ETE, contiguous AUX buffers are unnecessary and increase memory fragmentation.
This patch introduces PERF_PMU_CAP_AUX_NON_CONTIGUOUS_PAGES, allowing PMUs to request non-contiguous pages for their AUX buffers.
Signed-off-by: Yabin Cui yabinc@google.com
include/linux/perf_event.h | 1 + kernel/events/ring_buffer.c | 6 ++++++ 2 files changed, 7 insertions(+)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 0069ba6866a4..26ca35d6a9f2 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -301,6 +301,7 @@ struct perf_event_pmu_context; #define PERF_PMU_CAP_AUX_OUTPUT 0x0080 #define PERF_PMU_CAP_EXTENDED_HW_TYPE 0x0100 #define PERF_PMU_CAP_AUX_PAUSE 0x0200 +#define PERF_PMU_CAP_AUX_NON_CONTIGUOUS_PAGES 0x0400 /**
- pmu::scope
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c index 5130b119d0ae..87f42f4e8edc 100644 --- a/kernel/events/ring_buffer.c +++ b/kernel/events/ring_buffer.c @@ -710,6 +710,12 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event, max_order = ilog2(nr_pages); watermark = 0; }
- /*
* When the PMU doesn't prefer contiguous AUX buffer pages, favor
* low-order allocations to reduce memory fragmentation.
*/
- if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NON_CONTIGUOUS_PAGES)
max_order = 0;
/* * kcalloc_node() is unable to allocate buffer if the size is larger
Hi Yabin,
I was wondering if this is just the opposite of PERF_PMU_CAP_AUX_NO_SG, and that order 0 should be used by default for all devices to solve the issue you describe. Because we already have PERF_PMU_CAP_AUX_NO_SG for devices that need contiguous pages. Then I found commit 5768402fd9c6 ("perf/ring_buffer: Use high order allocations for AUX buffers optimistically") that explains that the current allocation strategy is an optimization.
Your change seems to decide that for certain devices we want to optimize for fragmentation rather than performance. If these are rarely used features specifically when looking at performance should we not continue to optimize for performance? Or at least make it user configurable?
Thanks James
* James Clark james.clark@linaro.org wrote:
On 21/04/2025 10:58 pm, Yabin Cui wrote:
For PMUs like ARM ETM/ETE, contiguous AUX buffers are unnecessary and increase memory fragmentation.
This patch introduces PERF_PMU_CAP_AUX_NON_CONTIGUOUS_PAGES, allowing PMUs to request non-contiguous pages for their AUX buffers.
Signed-off-by: Yabin Cui yabinc@google.com
include/linux/perf_event.h | 1 + kernel/events/ring_buffer.c | 6 ++++++ 2 files changed, 7 insertions(+)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 0069ba6866a4..26ca35d6a9f2 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -301,6 +301,7 @@ struct perf_event_pmu_context; #define PERF_PMU_CAP_AUX_OUTPUT 0x0080 #define PERF_PMU_CAP_EXTENDED_HW_TYPE 0x0100 #define PERF_PMU_CAP_AUX_PAUSE 0x0200 +#define PERF_PMU_CAP_AUX_NON_CONTIGUOUS_PAGES 0x0400 /**
- pmu::scope
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c index 5130b119d0ae..87f42f4e8edc 100644 --- a/kernel/events/ring_buffer.c +++ b/kernel/events/ring_buffer.c @@ -710,6 +710,12 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event, max_order = ilog2(nr_pages); watermark = 0; }
- /*
* When the PMU doesn't prefer contiguous AUX buffer pages, favor
* low-order allocations to reduce memory fragmentation.
*/
- if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NON_CONTIGUOUS_PAGES)
/*max_order = 0;
- kcalloc_node() is unable to allocate buffer if the size is larger
Hi Yabin,
I was wondering if this is just the opposite of PERF_PMU_CAP_AUX_NO_SG, and that order 0 should be used by default for all devices to solve the issue you describe. Because we already have PERF_PMU_CAP_AUX_NO_SG for devices that need contiguous pages. Then I found commit 5768402fd9c6 ("perf/ring_buffer: Use high order allocations for AUX buffers optimistically") that explains that the current allocation strategy is an optimization.
Your change seems to decide that for certain devices we want to optimize for fragmentation rather than performance. If these are rarely used features specifically when looking at performance should we not continue to optimize for performance? Or at least make it user configurable?
So there seems to be 3 categories:
- 1) Must have physically contiguous AUX buffers, it's a hardware ABI. (PERF_PMU_CAP_AUX_NO_SG for Intel BTS and PT.)
- 2) Would be nice to have continguous AUX buffers, for a bit more performance.
- 3) Doesn't really care.
So we do have #1, and it appears Yabin's usecase is #3?
I strongly suspect that #2 and #3 are mostly the same in practice, and that we don't really need a lot of differentiation and complexity here, just the AUX_NO_SG flag that must have a max-order allocation - all other cases should allocate the AUX buffer in a default-nice, MM-friendly way.
Thanks,
Ingo
On Tue, Apr 22, 2025 at 02:49:54PM +0200, Ingo Molnar wrote:
[...]
Hi Yabin,
I was wondering if this is just the opposite of PERF_PMU_CAP_AUX_NO_SG, and that order 0 should be used by default for all devices to solve the issue you describe. Because we already have PERF_PMU_CAP_AUX_NO_SG for devices that need contiguous pages. Then I found commit 5768402fd9c6 ("perf/ring_buffer: Use high order allocations for AUX buffers optimistically") that explains that the current allocation strategy is an optimization.
Your change seems to decide that for certain devices we want to optimize for fragmentation rather than performance. If these are rarely used features specifically when looking at performance should we not continue to optimize for performance? Or at least make it user configurable?
So there seems to be 3 categories:
- Must have physically contiguous AUX buffers, it's a hardware ABI. (PERF_PMU_CAP_AUX_NO_SG for Intel BTS and PT.)
- Would be nice to have continguous AUX buffers, for a bit more performance.
- Doesn't really care.
So we do have #1, and it appears Yabin's usecase is #3?
In Yabin's case, the AUX buffer work as a bounce buffer. The hardware trace data is copied by a driver from low level's contiguous buffer to the AUX buffer.
In this case we cannot benefit much from continguous AUX buffers.
Thanks, Leo