Based on the discussion [1], this patch series refines the memory barriers for the AUX ring buffer.
Patches 01 ~ 04 address the barrier usage in the kernel. The first patch adds a clear comment on how to use the barriers between the trace data store and the aux_head store; it asks drivers to make sure the data is visible. Patches 02 ~ 04 refine the drivers for barriers after the data store.
Patches 05 ~ 07 fix and clean up the memory barriers in the perf tool for the AUX ring buffer.
Since 64-bit atomicity is not guaranteed in a 32-bit perf binary, the last patch reports an error and lets perf exit directly in this case.
I have tested the patches on an Arm64 Juno platform.
[1] https://lore.kernel.org/patchwork/patch/1431867/
Leo Yan (8):
  perf/ring_buffer: Add comment for barriers on AUX ring buffer
  coresight: tmc-etr: Add barrier after updating AUX ring buffer
  coresight: tmc-etf: Add comment for store ordering
  perf/x86: Add barrier after updating bts
  perf auxtrace: Change to use SMP memory barriers
  perf auxtrace: Drop legacy __sync functions
  perf auxtrace: Use WRITE_ONCE() for updating aux_tail
  perf record: Directly bail out for compat case
 arch/x86/events/intel/bts.c                 |  3 +++
 .../hwtracing/coresight/coresight-tmc-etf.c |  6 +++++
 .../hwtracing/coresight/coresight-tmc-etr.c |  8 ++++++
 kernel/events/ring_buffer.c                 |  9 +++++++
 tools/perf/builtin-record.c                 | 17 ++++++++++++
 tools/perf/util/auxtrace.h                  | 27 +++----------------
 6 files changed, 47 insertions(+), 23 deletions(-)
The AUX ring buffer applies almost the same barriers as the perf ring buffer, but there is an exception in the ordering between writing the AUX trace data and updating user_page::aux_head.
This patch adds a comment on how to use the barriers for the AUX ring buffer, and asks drivers to flush the trace data into the AUX ring buffer prior to updating user_page::aux_head.
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 kernel/events/ring_buffer.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 52868716ec35..5cf6579be05e 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -509,6 +509,15 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size)
 		perf_event_aux_event(handle->event, aux_head, size, handle->aux_flags);

+	/*
+	 * See perf_output_put_handle(), AUX ring buffer applies the same
+	 * barrier pairing as the perf ring buffer; except for B, since
+	 * AUX ring buffer is written by hardware trace, we cannot simply
+	 * use the generic memory barrier (like smp_wmb()) prior to update
+	 * user_page::aux_head, the hardware trace driver takes the
+	 * responsibility to ensure the trace data has been flushed into
+	 * the AUX buffer before calling perf_aux_output_end().
+	 */
	WRITE_ONCE(rb->user_page->aux_head, rb->aux_head);
	if (rb_need_aux_wakeup(rb))
		wakeup = true;
On Wed, Jun 02, 2021 at 06:30:00PM +0800, Leo Yan wrote:
The AUX ring buffer applies almost the same barriers as the perf ring buffer, but there is an exception in the ordering between writing the AUX trace data and updating user_page::aux_head.
This patch adds a comment on how to use the barriers for the AUX ring buffer, and asks drivers to flush the trace data into the AUX ring buffer prior to updating user_page::aux_head.
Signed-off-by: Leo Yan leo.yan@linaro.org
Acked-by: Peter Zijlstra (Intel) peterz@infradead.org
 kernel/events/ring_buffer.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 52868716ec35..5cf6579be05e 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -509,6 +509,15 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size)
 		perf_event_aux_event(handle->event, aux_head, size, handle->aux_flags);

+	/*
+	 * See perf_output_put_handle(), AUX ring buffer applies the same
+	 * barrier pairing as the perf ring buffer; except for B, since
+	 * AUX ring buffer is written by hardware trace, we cannot simply
+	 * use the generic memory barrier (like smp_wmb()) prior to update
+	 * user_page::aux_head, the hardware trace driver takes the
+	 * responsibility to ensure the trace data has been flushed into
+	 * the AUX buffer before calling perf_aux_output_end().
+	 */
	WRITE_ONCE(rb->user_page->aux_head, rb->aux_head);
	if (rb_need_aux_wakeup(rb))
		wakeup = true;
--
2.25.1
Since a memory barrier is required between the AUX trace data store and the aux_head store, and the AUX trace data is filled with memcpy(), it's sufficient to use smp_wmb() to ensure the trace data is visible prior to updating aux_head.
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 drivers/hwtracing/coresight/coresight-tmc-etr.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/hwtracing/coresight/coresight-tmc-etr.c b/drivers/hwtracing/coresight/coresight-tmc-etr.c
index acdb59e0e661..713205db15a1 100644
--- a/drivers/hwtracing/coresight/coresight-tmc-etr.c
+++ b/drivers/hwtracing/coresight/coresight-tmc-etr.c
@@ -1563,6 +1563,14 @@ tmc_update_etr_buffer(struct coresight_device *csdev,
	 */
	if (etr_perf->snapshot)
		handle->head += size;
+
+	/*
+	 * It requires the ordering between the AUX trace data and aux_head
+	 * store, below smp_wmb() ensures the AUX trace data is visible prior
+	 * to updating aux_head.
+	 */
+	smp_wmb();
+
 out:
	/*
	 * Don't set the TRUNCATED flag in snapshot mode because 1) the
The AUX ring buffer requires ordering between the trace data store and the aux_head store. Since CS_LOCK() contains the memory barrier mb(), and mb() is a more conservative barrier than smp_wmb() on Arm32/Arm64, there is no need to add any explicit barrier.
Add a comment to clarify the barrier usage for ETF.
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 drivers/hwtracing/coresight/coresight-tmc-etf.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/hwtracing/coresight/coresight-tmc-etf.c b/drivers/hwtracing/coresight/coresight-tmc-etf.c
index 45b85edfc690..9a42ee689921 100644
--- a/drivers/hwtracing/coresight/coresight-tmc-etf.c
+++ b/drivers/hwtracing/coresight/coresight-tmc-etf.c
@@ -553,6 +553,12 @@ static unsigned long tmc_update_etf_buffer(struct coresight_device *csdev,
	if (buf->snapshot)
		handle->head += to_read;

+	/*
+	 * AUX ring buffer requires to use memory barrier to separate the trace
+	 * data store and aux_head store, because CS_LOCK() contains mb() which
+	 * gives more heavy barrier than smp_wmb(), it's not necessary to
+	 * explicitly invoke any barrier.
+	 */
	CS_LOCK(drvdata->base);
 out:
	spin_unlock_irqrestore(&drvdata->spinlock, flags);
Add barrier wmb() to separate the AUX data store and aux_head store.
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 arch/x86/events/intel/bts.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 6320d2cfd9d3..4a015d160bc5 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -209,6 +209,9 @@ static void bts_update(struct bts_ctx *bts)
	} else {
		local_set(&buf->data_size, head);
	}
+
+	/* The WMB separates data store and aux_head store matches. */
+	wmb();
 }
static int
On Wed, Jun 02, 2021 at 06:30:03PM +0800, Leo Yan wrote:
Add barrier wmb() to separate the AUX data store and aux_head store.
Signed-off-by: Leo Yan leo.yan@linaro.org
 arch/x86/events/intel/bts.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 6320d2cfd9d3..4a015d160bc5 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -209,6 +209,9 @@ static void bts_update(struct bts_ctx *bts)
	} else {
		local_set(&buf->data_size, head);
	}
+
+	/* The WMB separates data store and aux_head store matches. */
+	wmb();
Alexander, do we indeed need an MFENCE here? Or is the BTS hardware coherent, in which case a compiler barrier would be sufficient?
The kernel and the userspace tool can access the AUX ring buffer head and tail from different CPUs, thus the SMP class of barriers is required on SMP systems.
This patch replaces the mb() and rmb() barriers with SMP barriers.
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 tools/perf/util/auxtrace.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index a4fbb33b7245..42b7ef811bde 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -444,7 +444,7 @@ static inline u64 auxtrace_mmap__read_snapshot_head(struct auxtrace_mmap *mm)
	u64 head = READ_ONCE(pc->aux_head);

	/* Ensure all reads are done after we read the head */
-	rmb();
+	smp_rmb();
	return head;
 }
@@ -458,7 +458,7 @@ static inline u64 auxtrace_mmap__read_head(struct auxtrace_mmap *mm)
 #endif

	/* Ensure all reads are done after we read the head */
-	rmb();
+	smp_rmb();
	return head;
 }
@@ -470,7 +470,7 @@ static inline void auxtrace_mmap__write_tail(struct auxtrace_mmap *mm, u64 tail)
 #endif

	/* Ensure all reads are done before we write the tail out */
-	mb();
+	smp_mb();
 #if BITS_PER_LONG == 64 || !defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT)
	pc->aux_tail = tail;
 #else
On 2/06/21 1:30 pm, Leo Yan wrote:
The kernel and the userspace tool can access the AUX ring buffer head and tail from different CPUs, thus the SMP class of barriers is required on SMP systems.
This patch replaces the mb() and rmb() barriers with SMP barriers.
Signed-off-by: Leo Yan leo.yan@linaro.org
Acked-by: Adrian Hunter adrian.hunter@intel.com
 tools/perf/util/auxtrace.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index a4fbb33b7245..42b7ef811bde 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -444,7 +444,7 @@ static inline u64 auxtrace_mmap__read_snapshot_head(struct auxtrace_mmap *mm)
	u64 head = READ_ONCE(pc->aux_head);

	/* Ensure all reads are done after we read the head */
-	rmb();
+	smp_rmb();
	return head;
 }
@@ -458,7 +458,7 @@ static inline u64 auxtrace_mmap__read_head(struct auxtrace_mmap *mm)
 #endif

	/* Ensure all reads are done after we read the head */
-	rmb();
+	smp_rmb();
	return head;
 }
@@ -470,7 +470,7 @@ static inline void auxtrace_mmap__write_tail(struct auxtrace_mmap *mm, u64 tail)
 #endif

	/* Ensure all reads are done before we write the tail out */
-	mb();
+	smp_mb();
 #if BITS_PER_LONG == 64 || !defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT)
	pc->aux_tail = tail;
 #else
On Wed, Jun 02, 2021 at 06:30:04PM +0800, Leo Yan wrote:
The kernel and the userspace tool can access the AUX ring buffer head and tail from different CPUs, thus the SMP class of barriers is required on SMP systems.
This patch replaces the mb() and rmb() barriers with SMP barriers.
Signed-off-by: Leo Yan leo.yan@linaro.org
Acked-by: Peter Zijlstra (Intel) peterz@infradead.org
Em Mon, Jun 07, 2021 at 05:29:55PM +0200, Peter Zijlstra escreveu:
On Wed, Jun 02, 2021 at 06:30:04PM +0800, Leo Yan wrote:
The kernel and the userspace tool can access the AUX ring buffer head and tail from different CPUs, thus the SMP class of barriers is required on SMP systems.
This patch replaces the mb() and rmb() barriers with SMP barriers.
Signed-off-by: Leo Yan leo.yan@linaro.org
Acked-by: Peter Zijlstra (Intel) peterz@infradead.org
Thanks, applied.
- Arnaldo
The main purpose of using the __sync built-in functions is to support compat mode, i.e. 32-bit perf with a 64-bit kernel. But using these built-in functions might cause a couple of potential issues.
Firstly, the __sync functions were originally introduced for the Intel Itanium processor [1], and they are not guaranteed to be supported on all 32-bit architectures; they have since become legacy functions.
Secondly, as Peter pointed out, there is a logic issue in auxtrace_mmap__write_tail(): it does a cmpxchg with 0 values to load old_tail, and then executes a further cmpxchg with old_tail to write the new tail. If aux_tail is transiently assigned '0' in the middle of these loops, this can confuse the AUX buffer handling if the kernel fetches the temporary value '0'.
Considering the __sync functions cannot really fix 64-bit value atomicity on 32-bit architectures, this patch drops them.
Credits to Peter for detailed analysis.
[1] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html#g_t_005f_005...
Suggested-by: Peter Zijlstra peterz@infradead.org
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 tools/perf/util/auxtrace.h | 19 -------------------
 1 file changed, 19 deletions(-)

diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index 42b7ef811bde..e625bc76cdde 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -432,12 +432,6 @@ struct auxtrace_cache;

 #ifdef HAVE_AUXTRACE_SUPPORT

-/*
- * In snapshot mode the mmapped page is read-only which makes using
- * __sync_val_compare_and_swap() problematic. However, snapshot mode expects
- * the buffer is not updated while the snapshot is made (e.g. Intel PT disables
- * the event) so there is not a race anyway.
- */
 static inline u64 auxtrace_mmap__read_snapshot_head(struct auxtrace_mmap *mm)
 {
	struct perf_event_mmap_page *pc = mm->userpg;
@@ -451,11 +445,7 @@ static inline u64 auxtrace_mmap__read_snapshot_head(struct auxtrace_mmap *mm)
 static inline u64 auxtrace_mmap__read_head(struct auxtrace_mmap *mm)
 {
	struct perf_event_mmap_page *pc = mm->userpg;
-#if BITS_PER_LONG == 64 || !defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT)
	u64 head = READ_ONCE(pc->aux_head);
-#else
-	u64 head = __sync_val_compare_and_swap(&pc->aux_head, 0, 0);
-#endif

	/* Ensure all reads are done after we read the head */
	smp_rmb();
@@ -465,19 +455,10 @@ static inline u64 auxtrace_mmap__read_head(struct auxtrace_mmap *mm)
 static inline void auxtrace_mmap__write_tail(struct auxtrace_mmap *mm, u64 tail)
 {
	struct perf_event_mmap_page *pc = mm->userpg;
-#if BITS_PER_LONG != 64 && defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT)
-	u64 old_tail;
-#endif

	/* Ensure all reads are done before we write the tail out */
	smp_mb();
-#if BITS_PER_LONG == 64 || !defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT)
	pc->aux_tail = tail;
-#else
-	do {
-		old_tail = __sync_val_compare_and_swap(&pc->aux_tail, 0, 0);
-	} while (!__sync_bool_compare_and_swap(&pc->aux_tail, old_tail, tail));
-#endif
 }

 int auxtrace_mmap__mmap(struct auxtrace_mmap *mm,
On 2/06/21 1:30 pm, Leo Yan wrote:
The main purpose of using the __sync built-in functions is to support compat mode, i.e. 32-bit perf with a 64-bit kernel. But using these built-in functions might cause a couple of potential issues.
Firstly, the __sync functions were originally introduced for the Intel Itanium processor [1], and they are not guaranteed to be supported on all 32-bit architectures; they have since become legacy functions.
Secondly, as Peter pointed out, there is a logic issue in auxtrace_mmap__write_tail(): it does a cmpxchg with 0 values to load old_tail, and then executes a further cmpxchg with old_tail to write the new tail. If aux_tail is transiently assigned '0' in the middle of these loops, this can confuse the AUX buffer handling if the kernel fetches the temporary value '0'.
Considering the __sync functions cannot really fix 64-bit value atomicity on 32-bit architectures, this patch drops them.
Credits to Peter for detailed analysis.
[1] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html#g_t_005f_005...
Suggested-by: Peter Zijlstra peterz@infradead.org Signed-off-by: Leo Yan leo.yan@linaro.org
 tools/perf/util/auxtrace.h | 19 -------------------
 1 file changed, 19 deletions(-)
diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index 42b7ef811bde..e625bc76cdde 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -432,12 +432,6 @@ struct auxtrace_cache;

 #ifdef HAVE_AUXTRACE_SUPPORT

-/*
- * In snapshot mode the mmapped page is read-only which makes using
- * __sync_val_compare_and_swap() problematic. However, snapshot mode expects
- * the buffer is not updated while the snapshot is made (e.g. Intel PT disables
- * the event) so there is not a race anyway.
- */
 static inline u64 auxtrace_mmap__read_snapshot_head(struct auxtrace_mmap *mm)
 {
	struct perf_event_mmap_page *pc = mm->userpg;
@@ -451,11 +445,7 @@ static inline u64 auxtrace_mmap__read_snapshot_head(struct auxtrace_mmap *mm)
 static inline u64 auxtrace_mmap__read_head(struct auxtrace_mmap *mm)
 {
	struct perf_event_mmap_page *pc = mm->userpg;
-#if BITS_PER_LONG == 64 || !defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT)
The test and setup for HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT is not needed anymore either.
	u64 head = READ_ONCE(pc->aux_head);
-#else
-	u64 head = __sync_val_compare_and_swap(&pc->aux_head, 0, 0);
-#endif

	/* Ensure all reads are done after we read the head */
	smp_rmb();
@@ -465,19 +455,10 @@ static inline u64 auxtrace_mmap__read_head(struct auxtrace_mmap *mm)
 static inline void auxtrace_mmap__write_tail(struct auxtrace_mmap *mm, u64 tail)
 {
	struct perf_event_mmap_page *pc = mm->userpg;
-#if BITS_PER_LONG != 64 && defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT)
-	u64 old_tail;
-#endif

	/* Ensure all reads are done before we write the tail out */
	smp_mb();
-#if BITS_PER_LONG == 64 || !defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT)
	pc->aux_tail = tail;
-#else
-	do {
-		old_tail = __sync_val_compare_and_swap(&pc->aux_tail, 0, 0);
-	} while (!__sync_bool_compare_and_swap(&pc->aux_tail, old_tail, tail));
-#endif
 }

 int auxtrace_mmap__mmap(struct auxtrace_mmap *mm,
Hi Adrian,
On Wed, Jun 02, 2021 at 01:47:42PM +0300, Adrian Hunter wrote:
[...]
@@ -451,11 +445,7 @@ static inline u64 auxtrace_mmap__read_snapshot_head(struct auxtrace_mmap *mm) static inline u64 auxtrace_mmap__read_head(struct auxtrace_mmap *mm) { struct perf_event_mmap_page *pc = mm->userpg; -#if BITS_PER_LONG == 64 || !defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT)
The test and setup for HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT is not needed anymore either.
Yes, I think there are two files that should be cleaned up:
Makefile.config util/auxtrace.c
If I have still missed anything, please let me know. (I remember there was a test case for __sync_xxx_compare_and_swap, but I cannot find it now, which is why I am concerned I might have missed something.)
Thanks for review, Leo
On 2/06/21 2:16 pm, Leo Yan wrote:
Hi Adrian,
On Wed, Jun 02, 2021 at 01:47:42PM +0300, Adrian Hunter wrote:
[...]
@@ -451,11 +445,7 @@ static inline u64 auxtrace_mmap__read_snapshot_head(struct auxtrace_mmap *mm) static inline u64 auxtrace_mmap__read_head(struct auxtrace_mmap *mm) { struct perf_event_mmap_page *pc = mm->userpg; -#if BITS_PER_LONG == 64 || !defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT)
The test and setup for HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT is not needed anymore either.
Yes, I think there are two files that should be cleaned up:
Makefile.config util/auxtrace.c
If I have still missed anything, please let me know. (I remember there was a test case for __sync_xxx_compare_and_swap, but I cannot find it now, which is why I am concerned I might have missed something.)
tools/build/feature/test-sync-compare-and-swap.c
On Wed, Jun 02, 2021 at 02:21:05PM +0300, Adrian Hunter wrote:
On 2/06/21 2:16 pm, Leo Yan wrote:
Hi Adrian,
On Wed, Jun 02, 2021 at 01:47:42PM +0300, Adrian Hunter wrote:
[...]
@@ -451,11 +445,7 @@ static inline u64 auxtrace_mmap__read_snapshot_head(struct auxtrace_mmap *mm) static inline u64 auxtrace_mmap__read_head(struct auxtrace_mmap *mm) { struct perf_event_mmap_page *pc = mm->userpg; -#if BITS_PER_LONG == 64 || !defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT)
The test and setup for HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT is not needed anymore either.
Yes, I think there are two files that should be cleaned up:
Makefile.config util/auxtrace.c
If I have still missed anything, please let me know. (I remember there was a test case for __sync_xxx_compare_and_swap, but I cannot find it now, which is why I am concerned I might have missed something.)
tools/build/feature/test-sync-compare-and-swap.c
Thanks a lot!
Use WRITE_ONCE() for updating aux_tail, so as to avoid unexpected memory behaviour.
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 tools/perf/util/auxtrace.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index e625bc76cdde..abc4282f5272 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -458,7 +458,7 @@ static inline void auxtrace_mmap__write_tail(struct auxtrace_mmap *mm, u64 tail)

	/* Ensure all reads are done before we write the tail out */
	smp_mb();
-	pc->aux_tail = tail;
+	WRITE_ONCE(pc->aux_tail, tail);
 }

 int auxtrace_mmap__mmap(struct auxtrace_mmap *mm,
int auxtrace_mmap__mmap(struct auxtrace_mmap *mm,
On 2/06/21 1:30 pm, Leo Yan wrote:
Use WRITE_ONCE() for updating aux_tail, so as to avoid unexpected memory behaviour.
Signed-off-by: Leo Yan leo.yan@linaro.org
Acked-by: Adrian Hunter adrian.hunter@intel.com
 tools/perf/util/auxtrace.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index e625bc76cdde..abc4282f5272 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -458,7 +458,7 @@ static inline void auxtrace_mmap__write_tail(struct auxtrace_mmap *mm, u64 tail)

	/* Ensure all reads are done before we write the tail out */
	smp_mb();
-	pc->aux_tail = tail;
+	WRITE_ONCE(pc->aux_tail, tail);
 }

 int auxtrace_mmap__mmap(struct auxtrace_mmap *mm,
On Wed, Jun 02, 2021 at 06:30:06PM +0800, Leo Yan wrote:
Use WRITE_ONCE() for updating aux_tail, so as to avoid unexpected memory behaviour.
Signed-off-by: Leo Yan leo.yan@linaro.org
Acked-by: Peter Zijlstra (Intel) peterz@infradead.org
 tools/perf/util/auxtrace.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index e625bc76cdde..abc4282f5272 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -458,7 +458,7 @@ static inline void auxtrace_mmap__write_tail(struct auxtrace_mmap *mm, u64 tail)

	/* Ensure all reads are done before we write the tail out */
	smp_mb();
-	pc->aux_tail = tail;
+	WRITE_ONCE(pc->aux_tail, tail);
 }

 int auxtrace_mmap__mmap(struct auxtrace_mmap *mm,
--
2.25.1
Em Mon, Jun 07, 2021 at 05:31:01PM +0200, Peter Zijlstra escreveu:
On Wed, Jun 02, 2021 at 06:30:06PM +0800, Leo Yan wrote:
Use WRITE_ONCE() for updating aux_tail, so as to avoid unexpected memory behaviour.
Signed-off-by: Leo Yan leo.yan@linaro.org
Acked-by: Peter Zijlstra (Intel) peterz@infradead.org
Leo, this one is dependent on 6/8, so I will wait for a resubmission; keeping 5/8 though, as it was Acked and applies cleanly, in perf/core.
 tools/perf/util/auxtrace.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index e625bc76cdde..abc4282f5272 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -458,7 +458,7 @@ static inline void auxtrace_mmap__write_tail(struct auxtrace_mmap *mm, u64 tail)

	/* Ensure all reads are done before we write the tail out */
	smp_mb();
-	pc->aux_tail = tail;
+	WRITE_ONCE(pc->aux_tail, tail);
 }

 int auxtrace_mmap__mmap(struct auxtrace_mmap *mm,
--
2.25.1
Hi Arnaldo,
On Tue, Jun 08, 2021 at 02:04:30PM -0300, Arnaldo Carvalho de Melo wrote:
Em Mon, Jun 07, 2021 at 05:31:01PM +0200, Peter Zijlstra escreveu:
On Wed, Jun 02, 2021 at 06:30:06PM +0800, Leo Yan wrote:
Use WRITE_ONCE() for updating aux_tail, so as to avoid unexpected memory behaviour.
Signed-off-by: Leo Yan leo.yan@linaro.org
Acked-by: Peter Zijlstra (Intel) peterz@infradead.org
Leo, this one is dependent on 6/8, so I will wait for a resubmission; keeping 5/8 though, as it was Acked and applies cleanly, in perf/core.
Yeah, will respin for patches 6/8 and 7/8.
Thanks for merging patch 5/8.
Leo
Since 64-bit atomicity is not guaranteed in 32-bit perf, directly report the error and bail out in this case.
For now this only applies to x86_64 and Arm64 platforms.
Suggested-by: Adrian Hunter adrian.hunter@intel.com
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 tools/perf/builtin-record.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 3337b5f93336..f47e298281f7 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -74,6 +74,7 @@
 #include <linux/zalloc.h>
 #include <linux/bitmap.h>
 #include <sys/time.h>
+#include <sys/utsname.h>

 struct switch_output {
	bool enabled;
@@ -848,6 +849,22 @@ static int record__mmap_evlist(struct record *rec,
			       opts->auxtrace_sample_mode;
	char msg[512];

+#ifndef __LP64__
+	struct utsname uts;
+	int ret;
+
+	ret = uname(&uts);
+	if (ret < 0)
+		return ret;
+
+	if (!strncmp(uts.machine, "x86_64", 6) ||
+	    !strncmp(uts.machine, "aarch64", 7) ||
+	    !strncmp(uts.machine, "arm64", 5)) {
+		pr_err("Error, 32-bit perf cannot record from a 64-bit kernel.\n"
+		       "Please use a 64-bit version of perf instead.\n");
+		return -ENOTSUP;
+	}
+#endif
+
	if (opts->affinity != PERF_AFFINITY_SYS)
		cpu__setup_cpunode_map();
On 2/06/21 1:30 pm, Leo Yan wrote:
Since 64-bit atomicity is not guaranteed in 32-bit perf, directly report the error and bail out in this case.
For now this only applies to x86_64 and Arm64 platforms.
Suggested-by: Adrian Hunter adrian.hunter@intel.com
Maybe we can do better for the compat case.
We can assume the upper 32-bits change very seldom, and always increase. So for the 'read' case:
	u64 first, second, last;
	u64 mask = (u64)((u32)-1) << 32;

	do {
		first = READ_ONCE(pc->aux_head);
		rmb();
		second = READ_ONCE(pc->aux_head);
		rmb();
		last = READ_ONCE(pc->aux_head);
	} while ((first & mask) != (last & mask));

	return second;
For the write case, we can cause a fatal error only if the new tail has non-zero upper 32-bits. That gives up to 4GiB of data before aborting:
	if (tail & mask)
		return -1;
	smp_mb();
	WRITE_ONCE(pc->aux_tail, tail);
Signed-off-by: Leo Yan leo.yan@linaro.org
 tools/perf/builtin-record.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 3337b5f93336..f47e298281f7 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -74,6 +74,7 @@
 #include <linux/zalloc.h>
 #include <linux/bitmap.h>
 #include <sys/time.h>
+#include <sys/utsname.h>

 struct switch_output {
	bool enabled;
@@ -848,6 +849,22 @@ static int record__mmap_evlist(struct record *rec,
			       opts->auxtrace_sample_mode;
	char msg[512];

+#ifndef __LP64__
+	struct utsname uts;
+	int ret;
+
+	ret = uname(&uts);
+	if (ret < 0)
+		return ret;
+
+	if (!strncmp(uts.machine, "x86_64", 6) ||
+	    !strncmp(uts.machine, "aarch64", 7) ||
+	    !strncmp(uts.machine, "arm64", 5)) {
+		pr_err("Error, 32-bit perf cannot record from a 64-bit kernel.\n"
+		       "Please use a 64-bit version of perf instead.\n");
+		return -ENOTSUP;
+	}
+#endif
+
	if (opts->affinity != PERF_AFFINITY_SYS)
		cpu__setup_cpunode_map();
Hi Adrain,
On Wed, Jun 02, 2021 at 02:18:47PM +0300, Adrian Hunter wrote:
On 2/06/21 1:30 pm, Leo Yan wrote:
Since 64-bit atomicity is not guaranteed in 32-bit perf, directly report the error and bail out in this case.
For now this only applies to x86_64 and Arm64 platforms.
Suggested-by: Adrian Hunter adrian.hunter@intel.com
Maybe we can do better for the compat case.
We can assume the upper 32-bits change very seldom, and always increase. So for the 'read' case:
	u64 first, second, last;
	u64 mask = (u64)((u32)-1) << 32;

	do {
		first = READ_ONCE(pc->aux_head);
		rmb();
		second = READ_ONCE(pc->aux_head);
		rmb();
		last = READ_ONCE(pc->aux_head);
	} while ((first & mask) != (last & mask));

	return second;
For the write case, we can cause a fatal error only if the new tail has non-zero upper 32-bits. That gives up to 4GiB of data before aborting:
	if (tail & mask)
		return -1;
	smp_mb();
	WRITE_ONCE(pc->aux_tail, tail);
It seems to me that it's pointless to support only a 64-bit aux_head while keeping aux_tail 32-bit. I understand this can be helpful for snapshot mode, which only uses aux_head, but it still fails to support the normal case where the AUX ring buffer uses a 64-bit head/tail.
Thanks, Leo
On 2/06/21 3:38 pm, Leo Yan wrote:
Hi Adrain,
On Wed, Jun 02, 2021 at 02:18:47PM +0300, Adrian Hunter wrote:
On 2/06/21 1:30 pm, Leo Yan wrote:
Since 64-bit atomicity is not guaranteed in 32-bit perf, directly report the error and bail out in this case.
For now this only applies to x86_64 and Arm64 platforms.
Suggested-by: Adrian Hunter adrian.hunter@intel.com
Maybe we can do better for the compat case.
We can assume the upper 32-bits change very seldom, and always increase. So for the 'read' case:
	u64 first, second, last;
	u64 mask = (u64)((u32)-1) << 32;

	do {
		first = READ_ONCE(pc->aux_head);
		rmb();
		second = READ_ONCE(pc->aux_head);
		rmb();
		last = READ_ONCE(pc->aux_head);
	} while ((first & mask) != (last & mask));

	return second;
For the write case, we can cause a fatal error only if the new tail has non-zero upper 32-bits. That gives up to 4GiB of data before aborting:
	if (tail & mask)
		return -1;
	smp_mb();
	WRITE_ONCE(pc->aux_tail, tail);
It seems to me that it's pointless to support only a 64-bit aux_head while keeping aux_tail 32-bit. I understand this can be helpful for snapshot mode, which only uses aux_head, but it still fails to support the normal case where the AUX ring buffer uses a 64-bit head/tail.
I am not sure why you say it is pointless. 'perf record' would still be able to capture up to 4GiB of data. Do you mean you usually capture more than 4GiB of data?
I was thinking we would separate out the compat case:
#if BITS_PER_LONG == 32
	if (kernel_is_64_bit)
		return compat_auxtrace_mmap__[read_head/write_tail]();
#endif
So the non-compat cases would not be affected.
On Mon, Jun 07, 2021 at 01:23:43PM +0300, Adrian Hunter wrote:
On 2/06/21 3:38 pm, Leo Yan wrote:
Hi Adrain,
On Wed, Jun 02, 2021 at 02:18:47PM +0300, Adrian Hunter wrote:
On 2/06/21 1:30 pm, Leo Yan wrote:
Since 64-bit atomicity is not guaranteed in 32-bit perf, directly report the error and bail out in this case.
For now this only applies to x86_64 and Arm64 platforms.
Suggested-by: Adrian Hunter adrian.hunter@intel.com
Maybe we can do better for the compat case.
We can assume the upper 32-bits change very seldom, and always increase. So for the 'read' case:
	u64 first, second, last;
	u64 mask = (u64)((u32)-1) << 32;

	do {
		first = READ_ONCE(pc->aux_head);
		rmb();
		second = READ_ONCE(pc->aux_head);
		rmb();
		last = READ_ONCE(pc->aux_head);
	} while ((first & mask) != (last & mask));

	return second;
For the write case, we can cause a fatal error only if the new tail has non-zero upper 32-bits. That gives up to 4GiB of data before aborting:
	if (tail & mask)
		return -1;
	smp_mb();
	WRITE_ONCE(pc->aux_tail, tail);
It seems to me that it's pointless to support only a 64-bit aux_head while keeping aux_tail 32-bit. I understand this can be helpful for snapshot mode, which only uses aux_head, but it still fails to support the normal case where the AUX ring buffer uses a 64-bit head/tail.
I am not sure why you say it is pointless. 'perf record' would still be able to capture up to 4GiB of data. Do you mean you usually capture more than 4GiB of data?
Okay, understood. We can support 32-bit perf in compat mode when the trace data is less than 4GiB.
I was thinking we would separate out the compat case:
#if BITS_PER_LONG == 32
	if (kernel_is_64_bit)
		return compat_auxtrace_mmap__[read_head/write_tail]();
#endif
So the non-compat cases would not be affected.
Because I don't want to introduce complexity for reading/writing the head and tail, and we also need to handle the same issue for the perf ring buffer, how about the change below?
The main idea of the change below is to allow perf to run normally in compat mode and to exit if it detects the buffer head is close to overflowing the low 32 bits: when the low 32-bit value is bigger than 0xf0000000 (leaving a 256MiB margin before the overflow), it reports an error and exits.
diff --git a/tools/perf/util/auxtrace.c b/tools/perf/util/auxtrace.c
index 1b4091a3b508..2a9965bfeab4 100644
--- a/tools/perf/util/auxtrace.c
+++ b/tools/perf/util/auxtrace.c
@@ -1693,6 +1693,14 @@ static int __auxtrace_mmap__read(struct mmap *map,
	pr_debug3("auxtrace idx %d old %#"PRIx64" head %#"PRIx64" diff %#"PRIx64"\n",
		  mm->idx, old, head, head - old);

+#if BITS_PER_LONG == 32
+	if (perf_kernel_is_64bit() && head >= 0xf0000000) {
+		pr_err("32-bit perf cannot read 64-bit value atomically;\n");
+		pr_err("exit to avoid the 4GB (32-bit) AUX buffer overflow on compat mode.\n");
+		return -ENOMEM;
+	}
+#endif
+
	if (mm->mask) {
		head_off = head & mm->mask;
		old_off = old & mm->mask;
diff --git a/tools/perf/util/env.c b/tools/perf/util/env.c
index 9130f6fad8d5..823b69895b85 100644
--- a/tools/perf/util/env.c
+++ b/tools/perf/util/env.c
@@ -405,3 +405,20 @@ int perf_env__numa_node(struct perf_env *env, int cpu)

	return cpu >= 0 && cpu < env->nr_numa_map ? env->numa_map[cpu] : -1;
 }
+
+int perf_kernel_is_64bit(void)
+{
+	struct utsname uts;
+	int ret;
+
+	ret = uname(&uts);
+	if (ret < 0)
+		return 0;
+
+	if (!strncmp(uts.machine, "x86_64", 6) ||
+	    !strncmp(uts.machine, "aarch64", 7) ||
+	    !strncmp(uts.machine, "arm64", 5))
+		return 1;
+
+	return 0;
+}
diff --git a/tools/perf/util/env.h b/tools/perf/util/env.h
index ca249bf5e984..c6c034fc08f6 100644
--- a/tools/perf/util/env.h
+++ b/tools/perf/util/env.h
@@ -147,4 +147,6 @@ void perf_env__insert_btf(struct perf_env *env, struct btf_node *btf_node);
 struct btf_node *perf_env__find_btf(struct perf_env *env, __u32 btf_id);

 int perf_env__numa_node(struct perf_env *env, int cpu);
+
+int perf_kernel_is_64bit(void);
 #endif /* __PERF_ENV_H */
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index ab7108d22428..f1d3725d599a 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -323,6 +323,14 @@ int perf_mmap__push(struct mmap *md, void *to,
	if (rc < 0)
		return (rc == -EAGAIN) ? 1 : -1;

+#if BITS_PER_LONG == 32
+	if (perf_kernel_is_64bit() && head >= 0xf0000000) {
+		pr_err("32-bit perf cannot read 64-bit value atomically;\n");
+		pr_err("exit to avoid the 4GB (32-bit) buffer overflow on compat mode.\n");
+		return -ENOMEM;
+	}
+#endif
+
	size = md->core.end - md->core.start;

	if ((md->core.start & md->core.mask) + size != (md->core.end & md->core.mask)) {
Thanks, Leo
On 7/06/21 6:09 pm, Leo Yan wrote:
On Mon, Jun 07, 2021 at 01:23:43PM +0300, Adrian Hunter wrote:
On 2/06/21 3:38 pm, Leo Yan wrote:
Hi Adrian,
On Wed, Jun 02, 2021 at 02:18:47PM +0300, Adrian Hunter wrote:
On 2/06/21 1:30 pm, Leo Yan wrote:
Since 64-bit atomicity is not guaranteed on a 32-bit perf build, directly report the error and bail out in this case.
For now this only applies to x86_64 and Arm64 platforms.
Suggested-by: Adrian Hunter <adrian.hunter@intel.com>
Maybe we can do better for the compat case.
We can assume the upper 32-bits change very seldom, and always increase. So for the 'read' case:
	u64 first, second, last;
	u64 mask = (u64)((u32)-1) << 32;

	do {
		first = READ_ONCE(pc->aux_head);
		rmb();
		second = READ_ONCE(pc->aux_head);
		rmb();
		last = READ_ONCE(pc->aux_head);
	} while ((first & mask) != (last & mask));

	return second;
For the write case, we can cause a fatal error only if the new tail has non-zero upper 32-bits. That gives up to 4GiB of data before aborting:
	if (tail & mask)
		return -1;

	smp_mb();
	WRITE_ONCE(pc->aux_tail, tail);
It seems to me it's pointless to support a 64-bit aux_head but only a 32-bit aux_tail. I understand this can be helpful for snapshot mode, which only uses aux_head, but it still fails to support the normal AUX ring buffer case, which uses a 64-bit head and tail.
I am not sure why you say it is pointless. 'perf record' would still be able to capture up to 4GiB of data. Do you mean you usually capture more than 4GiB of data?
Okay, understood. We can support 32-bit perf in compat mode when the trace data is less than 4GiB.
I was thinking we would separate out the compat case:
#if BITS_PER_LONG == 32
	if (kernel_is_64_bit)
		return compat_auxtrace_mmap__[read_head/write_tail]()
#endif
So the non-compat cases would not be affected.
Because I don't want to introduce that complexity for reading/writing the head and tail, and we also need to handle the same issue for the perf ring buffer. So how about the change below?
The main idea of the change below is to let perf run normally in compat mode and exit only when it detects that the buffer head is close to overflowing the low 32 bits: when the low 32-bit value is bigger than 0xf0000000 (leaving a 256MiB margin before the overflow), it reports an error and exits.
diff --git a/tools/perf/util/auxtrace.c b/tools/perf/util/auxtrace.c
index 1b4091a3b508..2a9965bfeab4 100644
--- a/tools/perf/util/auxtrace.c
+++ b/tools/perf/util/auxtrace.c
@@ -1693,6 +1693,14 @@ static int __auxtrace_mmap__read(struct mmap *map,
 	pr_debug3("auxtrace idx %d old %#"PRIx64" head %#"PRIx64" diff %#"PRIx64"\n",
 		  mm->idx, old, head, head - old);
 
+#if BITS_PER_LONG == 32
+	if (kernel_is_64bit() && head >= 0xf0000000) {
You are assuming the head never increases by more than 256MiB which means you should limit the buffer size to 256MiB maximum.
To me this seems a bit too far from an ideal solution.
I would have thought separating out the compat case makes things simpler to understand.
+		pr_err("32-bit perf cannot read 64-bit value atomically;\n");
+		pr_err("exit to avoid the 4GB (32-bit) AUX buffer overflow on compat mode.\n");
+		return -ENOMEM;
+	}
+#endif
+
 	if (mm->mask) {
 		head_off = head & mm->mask;
 		old_off = old & mm->mask;
diff --git a/tools/perf/util/env.c b/tools/perf/util/env.c
index 9130f6fad8d5..823b69895b85 100644
--- a/tools/perf/util/env.c
+++ b/tools/perf/util/env.c
@@ -405,3 +405,20 @@ int perf_env__numa_node(struct perf_env *env, int cpu)
 
 	return cpu >= 0 && cpu < env->nr_numa_map ? env->numa_map[cpu] : -1;
 }
+
+int perf_kernel_is_64bit(void)
+{
+	struct utsname uts;
+	int ret;
+
+	ret = uname(&uts);
+	if (ret < 0)
+		return 0;
+
+	if (!strncmp(uts.machine, "x86_64", 6) ||
+	    !strncmp(uts.machine, "aarch64", 7) ||
+	    !strncmp(uts.machine, "arm64", 5))
+		return 1;
+
+	return 0;
+}
Obviously, we don't need to keep checking uname. It could be a global variable that is always set up early.
diff --git a/tools/perf/util/env.h b/tools/perf/util/env.h
index ca249bf5e984..c6c034fc08f6 100644
--- a/tools/perf/util/env.h
+++ b/tools/perf/util/env.h
@@ -147,4 +147,6 @@ void perf_env__insert_btf(struct perf_env *env, struct btf_node *btf_node);
 struct btf_node *perf_env__find_btf(struct perf_env *env, __u32 btf_id);
 
 int perf_env__numa_node(struct perf_env *env, int cpu);
+
+int perf_kernel_is_64bit(void);
 #endif /* __PERF_ENV_H */
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index ab7108d22428..f1d3725d599a 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -323,6 +323,14 @@ int perf_mmap__push(struct mmap *md, void *to,
 	if (rc < 0)
 		return (rc == -EAGAIN) ? 1 : -1;
 
+#if BITS_PER_LONG == 32
+	if (kernel_is_64bit() && head >= 0xf0000000) {
+		pr_err("32-bit perf cannot read 64-bit value atomically;\n");
+		pr_err("exit to avoid the 4GB (32-bit) buffer overflow on compat mode.\n");
+		return -ENOMEM;
+	}
+#endif
+
 	size = md->core.end - md->core.start;
 
 	if ((md->core.start & md->core.mask) + size !=
 	    (md->core.end & md->core.mask)) {
Thanks, Leo
Hi Adrian,
On Wed, Jun 09, 2021 at 11:23:25AM +0300, Adrian Hunter wrote:
[...]
I was thinking we would separate out the compat case:
#if BITS_PER_LONG == 32
	if (kernel_is_64_bit)
		return compat_auxtrace_mmap__[read_head/write_tail]()
#endif
So the non-compat cases would not be affected.
Because I don't want to introduce that complexity for reading/writing the head and tail, and we also need to handle the same issue for the perf ring buffer. So how about the change below?
The main idea of the change below is to let perf run normally in compat mode and exit only when it detects that the buffer head is close to overflowing the low 32 bits: when the low 32-bit value is bigger than 0xf0000000 (leaving a 256MiB margin before the overflow), it reports an error and exits.
diff --git a/tools/perf/util/auxtrace.c b/tools/perf/util/auxtrace.c
index 1b4091a3b508..2a9965bfeab4 100644
--- a/tools/perf/util/auxtrace.c
+++ b/tools/perf/util/auxtrace.c
@@ -1693,6 +1693,14 @@ static int __auxtrace_mmap__read(struct mmap *map,
 	pr_debug3("auxtrace idx %d old %#"PRIx64" head %#"PRIx64" diff %#"PRIx64"\n",
 		  mm->idx, old, head, head - old);
 
+#if BITS_PER_LONG == 32
+	if (kernel_is_64bit() && head >= 0xf0000000) {
You are assuming the head never increases by more than 256MiB which means you should limit the buffer size to 256MiB maximum.
To me this seems a bit too far from an ideal solution.
I would have thought separating out the compat case makes things simpler to understand.
Agreed. I will follow the suggestions to add compat variants for accessing the AUX head and tail, and will distinguish the compat case with a global env variable indicating a 64-bit kernel.
Once it's ready, I will send it out for review. Thanks a lot for the suggestions!
Leo