On Tue, Jul 13, 2021 at 02:56:06PM +0200, Peter Zijlstra wrote:
> On Sun, Jul 11, 2021 at 06:40:57PM +0800, Leo Yan wrote:
> > The AUX ring buffer requires a memory barrier to separate the data
> > store and the aux_head store; since the function CS_LOCK() contains
> > the memory barrier mb(), which is a more conservative barrier than
> > smp_wmb() on Arm32/Arm64, there is no need to add any explicit barrier.
> >
> > Add a comment to clarify the barrier usage for ETF.
> >
> > Signed-off-by: Leo Yan <leo.yan(a)linaro.org>
> > ---
> > drivers/hwtracing/coresight/coresight-tmc-etf.c | 6 ++++++
> > 1 file changed, 6 insertions(+)
> >
> > diff --git a/drivers/hwtracing/coresight/coresight-tmc-etf.c b/drivers/hwtracing/coresight/coresight-tmc-etf.c
> > index 45b85edfc690..9a42ee689921 100644
> > --- a/drivers/hwtracing/coresight/coresight-tmc-etf.c
> > +++ b/drivers/hwtracing/coresight/coresight-tmc-etf.c
> > @@ -553,6 +553,12 @@ static unsigned long tmc_update_etf_buffer(struct coresight_device *csdev,
> > if (buf->snapshot)
> > handle->head += to_read;
> >
> > + /*
> > + * AUX ring buffer requires to use memory barrier to separate the trace
> > + * data store and aux_head store, because CS_LOCK() contains mb() which
> > + * gives more heavy barrier than smp_wmb(), it's not necessary to
> > + * explicitly invoke any barrier.
> > + */
> > CS_LOCK(drvdata->base);
>
> 'more heavy' is not a correctness argument :-)
>
> The argument to make here is that CS_LOCK() ensures completion /
> visibility of the hardware buffer.
Will correct this, thanks for the reminder!
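Something along these lines, perhaps (a sketch of the reworded comment,
not final wording):

	/*
	 * CS_LOCK() contains mb(); this ensures completion and visibility
	 * of the hardware's writes to the trace buffer before
	 * perf_aux_output_end() publishes the new aux_head, so no extra
	 * barrier is needed between the data store and the aux_head store.
	 */
	CS_LOCK(drvdata->base);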
On Tue, Jul 13, 2021 at 10:07:03AM +0300, Adrian Hunter wrote:
[...]
> > +/*
> > + * In compat mode, the kernel runs in 64-bit mode while the perf tool
> > + * runs in 32-bit mode; the 32-bit perf tool cannot access a 64-bit
> > + * value atomically, which can lead to the issue shown in the sequence
> > + * below on multiple CPUs: when the perf tool loads or stores a 64-bit
> > + * value, on some architectures the operation is split into two
> > + * instructions, one accessing the low 32-bit word and the other the
> > + * high 32-bit word; the kernel can thus access the 64-bit value between
> > + * the two user operations, which leads to unexpected load values.
> > + *
> > + * kernel (64-bit) user (32-bit)
> > + *
> > + * if (LOAD ->aux_tail) { --, LOAD ->aux_head_lo
> > + * STORE $aux_data | ,--->
> > + * FLUSH $aux_data | | LOAD ->aux_head_hi
> > + * STORE ->aux_head --|-------` smp_rmb()
> > + * } | LOAD $data
> > + * | smp_mb()
> > + * | STORE ->aux_tail_lo
> > + * `----------->
> > + * STORE ->aux_tail_hi
> > + *
> > + * For this reason, it's impossible for the perf tool to work correctly when
> > + * the AUX head or tail is bigger than 4GB (more than 32 bits length); and we
> > + * can not simply limit the AUX ring buffer to less than 4GB, the reason is
> > + * the pointers can be increased monotonically (e.g in snapshot mode), whatever
>
> At least for Intel PT, in snapshot mode, the head is always an offset
> into the buffer, so never more than 4GB for a 32-bit perf tool. So maybe
> leave out "(e.g in snapshot mode)"
Sure, will leave out "(e.g in snapshot mode)".
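For reference, patch 11's compat read follows the usual re-read pattern;
a rough sketch (illustrative name, not the exact patch code; assumes
READ_ONCE()/smp_rmb() from the tools headers and that the head advances
by less than 2^32 between the two reads):

	static u64 compat_read_head(volatile u64 *head)
	{
		u64 first, second;
		u64 mask = (u64)UINT32_MAX << 32;

		do {
			first = READ_ONCE(*head);  /* may be two 32-bit loads */
			smp_rmb();
			second = READ_ONCE(*head);
		} while ((first & mask) != (second & mask));

		/*
		 * The high word did not change between the two reads, so
		 * the low word of 'second' pairs with a value the variable
		 * actually held.
		 */
		return second;
	}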
Thanks,
Leo
Hi Russell,
On Mon, Jul 12, 2021 at 03:44:11PM +0100, Russell King (Oracle) wrote:
> On Sun, Jul 11, 2021 at 06:41:05PM +0800, Leo Yan wrote:
> > When perf runs in compat mode (kernel in 64-bit mode and the perf is in
> > 32-bit mode), the 64-bit value atomicity in the user space cannot be
> > assured, E.g. on some architectures, the 64-bit value accessing is split
> > into two instructions, one is for the low 32-bit word accessing and
> > another is for the high 32-bit word.
>
> Does this apply to 32-bit ARM code on aarch64? I would not have thought
> it would, as the structure member is a __u64 and
> compat_auxtrace_mmap__read_head() doesn't seem to be marking anything
> as packed, so the compiler _should_ be able to use a LDRD instruction
> to load the value.
I think your question essentially concerns the memory model.
For a 32-bit Arm application on aarch64, the Armv8 Architecture
Reference Manual (ARM DDI 0487F.c), section E2.2.1 "Requirements for
single-copy atomicity", states:
"LDM, LDC, LDRD, STM, STC, STRD, PUSH, POP, RFE, SRS, VLDM, VLDR, VSTM,
and VSTR instructions are executed as a sequence of word-aligned word
accesses. Each 32-bit word access is guaranteed to be single-copy
atomic. The architecture does not require subsequences of two or more
word accesses from the sequence to be single-copy atomic."
So I think the LDRD/STRD instructions cannot guarantee atomicity when
loading or storing two words on 32-bit Arm.
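As an illustration (hypothetical codegen; the actual output depends on
the compiler and flags):

	#include <stdint.h>

	/*
	 * On 32-bit Arm this may compile to a single LDRD (or two LDRs);
	 * per the manual text above, each 32-bit word access is
	 * single-copy atomic but the pair is not, so a concurrent 64-bit
	 * store from the 64-bit kernel can be observed half-old, half-new.
	 */
	uint64_t load_aux_head(const volatile uint64_t *p)
	{
		return *p;
	}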
Another consideration is that compat_auxtrace_mmap__read_head() is a
generic function, so I avoid writing it with any architecture-specific
instructions.
> Is this a problem noticed on non-ARM architectures?
No; we inferred this potential issue from analysing the weak memory
model.
Thanks,
Leo
This patch series refines the memory barriers for the AUX ring buffer.
Patches 01 ~ 04 address barrier usage in the kernel. The first patch
clarifies, in a comment, how barriers are used between the data store
and the aux_head store; it requires the driver to ensure the data is
visible. Patches 02 ~ 04 refine the drivers' barriers after the data
store.
Patch 05 uses WRITE_ONCE() for updating aux_tail.
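The shape of that change is the common pattern below (a sketch, not a
quote of the patch):

	static inline void auxtrace_mmap__write_tail(struct auxtrace_mmap *mm,
						     u64 tail)
	{
		struct perf_event_mmap_page *pc = mm->userpg;

		/* Ensure all reads are done before we write the tail out */
		smp_mb();
		WRITE_ONCE(pc->aux_tail, tail);
	}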
Patches 06 ~ 09 drop the legacy __sync functions, clean up duplicate
code, and remove the build support and feature test now that
SYNC_COMPARE_AND_SWAP is no longer used.
Patch 10 introduces a new global variable to indicate that the kernel
runs in 64-bit mode, which can be used to detect compat mode; patch 11
introduces variant functions for accessing the AUX head/tail, which
resolve the atomicity of reading the head pointer and return an error
when the tail is bigger than 4GB.
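Patch 11's compat tail write, sketched (illustrative, not the exact
patch code; note that only the low word ever changes here, and each
32-bit word store is single-copy atomic, which is why the store is safe
once the tail fits in 32 bits):

	static int compat_auxtrace_mmap__write_tail(struct auxtrace_mmap *mm,
						    u64 tail)
	{
		struct perf_event_mmap_page *pc = mm->userpg;
		u64 mask = (u64)UINT32_MAX << 32;

		/* A 32-bit store cannot publish a tail above 4GB atomically */
		if (tail & mask)
			return -1;

		/* Ensure all reads are done before we write the tail out */
		smp_mb();
		WRITE_ONCE(pc->aux_tail, tail);
		return 0;
	}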
The patches have been tested on the Arm64 Juno platform.
Changes from v3:
- Removed the inappropriate paragraph in the commit log for patch "perf
auxtrace: Drop legacy __sync functions" (Adrian);
- Added new patch to remove feature-sync-compare-and-swap test (Adrian);
- The patch "perf auxtrace: Use WRITE_ONCE() for updating aux_tail" is a
standalone and simple change, so it was moved earlier in the patch set
for better ordering;
- Minor improvement for commit logs in the last two patches.
Changes from v2:
- Removed auxtrace_mmap__read_snapshot_head(), which duplicated the code
of auxtrace_mmap__read_head();
- Cleaned up the build for HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT (Adrian);
- Added global variable "kernel_is_64_bit" (Adrian);
- Added compat variants compat_auxtrace_mmap__{read_head|write_tail}
(Adrian).
Leo Yan (11):
perf/ring_buffer: Add comment for barriers on AUX ring buffer
coresight: tmc-etr: Add barrier after updating AUX ring buffer
coresight: tmc-etf: Add comment for store ordering
perf/x86: Add barrier after updating bts
perf auxtrace: Use WRITE_ONCE() for updating aux_tail
perf auxtrace: Drop legacy __sync functions
perf auxtrace: Remove auxtrace_mmap__read_snapshot_head()
perf: Cleanup for HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT
tools: Remove feature-sync-compare-and-swap feature detection
perf env: Set flag for kernel is 64-bit mode
perf auxtrace: Add compat_auxtrace_mmap__{read_head|write_tail}
arch/x86/events/intel/bts.c | 3 +
.../hwtracing/coresight/coresight-tmc-etf.c | 6 +
.../hwtracing/coresight/coresight-tmc-etr.c | 8 ++
kernel/events/ring_buffer.c | 9 ++
tools/build/Makefile.feature | 1 -
tools/build/feature/Makefile | 4 -
tools/build/feature/test-all.c | 4 -
.../feature/test-sync-compare-and-swap.c | 15 ---
tools/perf/Makefile.config | 4 -
tools/perf/util/auxtrace.c | 19 ++-
tools/perf/util/auxtrace.h | 109 ++++++++++++++----
tools/perf/util/env.c | 17 ++-
tools/perf/util/env.h | 1 +
13 files changed, 136 insertions(+), 64 deletions(-)
delete mode 100644 tools/build/feature/test-sync-compare-and-swap.c
--
2.25.1
The AUX bounce buffer is allocated with dma_alloc_coherent(); in the
low-level architecture code, e.g. on Arm64, this maps the memory with
the attribute "Normal non-cacheable", as can be seen from the definition
of pgprot_dmacoherent() in arch/arm64/include/asm/pgtable.h.
Accessing the AUX bounce buffer through this non-cacheable mapping is
inefficient: every load instruction must go all the way out to DRAM.
This patch changes the allocation to alloc_pages_node(), so the driver
can access the memory through a cacheable mapping in the kernel linear
virtual address range; since load instructions can then fetch data from
cache lines rather than always reading from DRAM, memory copying
performance improves. With the cacheable mapping, the driver uses
dma_sync_single_for_cpu() to invalidate cache lines before reading the
bounce buffer, so it avoids reading stale trace data.
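Condensed, the new buffer lifecycle looks as follows (a sketch of the
pattern with error handling omitted; the diff below has the actual code):

	/* Allocate cacheable pages and create a streaming DMA mapping */
	pages = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO,
				 get_order(size));
	vaddr = page_address(pages);
	daddr = dma_map_page(real_dev, pages, 0, size, DMA_FROM_DEVICE);

	/* Before the CPU reads trace data: invalidate stale cache lines */
	dma_sync_single_for_cpu(real_dev, daddr + offset, len, DMA_FROM_DEVICE);
	memcpy(dst, vaddr + offset, len);

	/* Teardown */
	dma_unmap_page(real_dev, daddr, size, DMA_FROM_DEVICE);
	__free_pages(pages, get_order(size));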
Measuring the duration of tmc_update_etr_buffer() with the ftrace
function_graph tracer shows a significant performance improvement for
copying 4MiB of data from the bounce buffer:
# echo tmc_etr_get_data_flat_buf > set_graph_notrace // avoid noise
# echo tmc_update_etr_buffer > set_graph_function
# echo function_graph > current_tracer
before:
# CPU DURATION FUNCTION CALLS
# | | | | | | |
2) | tmc_update_etr_buffer() {
...
2) # 8148.320 us | }
after:
# CPU DURATION FUNCTION CALLS
# | | | | | | |
2) | tmc_update_etr_buffer() {
...
2) # 2463.980 us | }
Signed-off-by: Leo Yan <leo.yan(a)linaro.org>
---
Changes from v1:
Set "flat_buf->daddr" to 0 when fails to map DMA region; and dropped the
unexpected if condition change in tmc_etr_free_flat_buf().
.../hwtracing/coresight/coresight-tmc-etr.c | 56 ++++++++++++++++---
1 file changed, 49 insertions(+), 7 deletions(-)
diff --git a/drivers/hwtracing/coresight/coresight-tmc-etr.c b/drivers/hwtracing/coresight/coresight-tmc-etr.c
index acdb59e0e661..888b0f929d33 100644
--- a/drivers/hwtracing/coresight/coresight-tmc-etr.c
+++ b/drivers/hwtracing/coresight/coresight-tmc-etr.c
@@ -21,6 +21,7 @@
struct etr_flat_buf {
struct device *dev;
+ struct page *pages;
dma_addr_t daddr;
void *vaddr;
size_t size;
@@ -600,6 +601,7 @@ static int tmc_etr_alloc_flat_buf(struct tmc_drvdata *drvdata,
{
struct etr_flat_buf *flat_buf;
struct device *real_dev = drvdata->csdev->dev.parent;
+ ssize_t aligned_size;
/* We cannot reuse existing pages for flat buf */
if (pages)
@@ -609,11 +611,18 @@ static int tmc_etr_alloc_flat_buf(struct tmc_drvdata *drvdata,
if (!flat_buf)
return -ENOMEM;
- flat_buf->vaddr = dma_alloc_coherent(real_dev, etr_buf->size,
- &flat_buf->daddr, GFP_KERNEL);
- if (!flat_buf->vaddr) {
- kfree(flat_buf);
- return -ENOMEM;
+ aligned_size = PAGE_ALIGN(etr_buf->size);
+ flat_buf->pages = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO,
+ get_order(aligned_size));
+ if (!flat_buf->pages)
+ goto fail_alloc_pages;
+
+ flat_buf->vaddr = page_address(flat_buf->pages);
+ flat_buf->daddr = dma_map_page(real_dev, flat_buf->pages, 0,
+ aligned_size, DMA_FROM_DEVICE);
+ if (dma_mapping_error(real_dev, flat_buf->daddr)) {
+ flat_buf->daddr = 0;
+ goto fail_dma_map_page;
}
flat_buf->size = etr_buf->size;
@@ -622,6 +631,12 @@ static int tmc_etr_alloc_flat_buf(struct tmc_drvdata *drvdata,
etr_buf->mode = ETR_MODE_FLAT;
etr_buf->private = flat_buf;
return 0;
+
+fail_dma_map_page:
+ __free_pages(flat_buf->pages, get_order(aligned_size));
+fail_alloc_pages:
+ kfree(flat_buf);
+ return -ENOMEM;
}
static void tmc_etr_free_flat_buf(struct etr_buf *etr_buf)
@@ -630,15 +645,20 @@ static void tmc_etr_free_flat_buf(struct etr_buf *etr_buf)
if (flat_buf && flat_buf->daddr) {
struct device *real_dev = flat_buf->dev->parent;
+ ssize_t aligned_size = PAGE_ALIGN(etr_buf->size);
- dma_free_coherent(real_dev, flat_buf->size,
- flat_buf->vaddr, flat_buf->daddr);
+ dma_unmap_page(real_dev, flat_buf->daddr, aligned_size,
+ DMA_FROM_DEVICE);
+ __free_pages(flat_buf->pages, get_order(aligned_size));
}
kfree(flat_buf);
}
static void tmc_etr_sync_flat_buf(struct etr_buf *etr_buf, u64 rrp, u64 rwp)
{
+ struct etr_flat_buf *flat_buf = etr_buf->private;
+ struct device *real_dev = flat_buf->dev->parent;
+
/*
* Adjust the buffer to point to the beginning of the trace data
* and update the available trace data.
@@ -648,6 +668,28 @@ static void tmc_etr_sync_flat_buf(struct etr_buf *etr_buf, u64 rrp, u64 rwp)
etr_buf->len = etr_buf->size;
else
etr_buf->len = rwp - rrp;
+
+ if (etr_buf->offset + etr_buf->len > etr_buf->size) {
+ int len1, len2;
+
+ /*
+ * If trace data is wrapped around, sync AUX bounce buffer
+	 * for two chunks: "len1" is the trace data length at
+ * the tail of bounce buffer, and "len2" is the length from
+ * the start of the buffer after wrapping around.
+ */
+ len1 = etr_buf->size - etr_buf->offset;
+ len2 = etr_buf->len - len1;
+ dma_sync_single_for_cpu(real_dev,
+ flat_buf->daddr + etr_buf->offset,
+ len1, DMA_FROM_DEVICE);
+ dma_sync_single_for_cpu(real_dev, flat_buf->daddr,
+ len2, DMA_FROM_DEVICE);
+ } else {
+ dma_sync_single_for_cpu(real_dev,
+ flat_buf->daddr + etr_buf->offset,
+ etr_buf->len, DMA_FROM_DEVICE);
+ }
}
static ssize_t tmc_etr_get_data_flat_buf(struct etr_buf *etr_buf,
--
2.25.1
The current code syncs the buffer range [offset, offset+len); it does
not consider the case where the trace data wraps around, in which case
'offset+len' is bigger than 'etr_buf->size'. It thus syncs memory beyond
the buffer, and it also misses syncing the region at the start of the
buffer.
This patch corrects the sync ranges: when it detects the wrap-around
case, it splits the range into two chunks, one for the tail of the
buffer and another from the start of the buffer after wrapping around.
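The split logic in isolation (a sketch; for example, with size=8,
r_offset=6 and len=5 it syncs [6,8) and then [0,3)):

	/*
	 * Sync a possibly-wrapped range [offset, offset+len) in a circular
	 * buffer of 'size' bytes, splitting it at the wrap point.
	 */
	static void sync_range_wrapped(size_t size, size_t offset, size_t len,
				       void (*sync)(size_t start, size_t len))
	{
		if (offset + len > size) {
			size_t len1 = size - offset; /* tail: [offset, size) */
			size_t len2 = len - len1;    /* wrapped: [0, len2) */

			sync(offset, len1);
			sync(0, len2);
		} else {
			sync(offset, len);
		}
	}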
Signed-off-by: Leo Yan <leo.yan(a)linaro.org>
---
.../hwtracing/coresight/coresight-tmc-etr.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/drivers/hwtracing/coresight/coresight-tmc-etr.c b/drivers/hwtracing/coresight/coresight-tmc-etr.c
index 888b0f929d33..a1afefcbf175 100644
--- a/drivers/hwtracing/coresight/coresight-tmc-etr.c
+++ b/drivers/hwtracing/coresight/coresight-tmc-etr.c
@@ -780,7 +780,23 @@ static void tmc_etr_sync_sg_buf(struct etr_buf *etr_buf, u64 rrp, u64 rwp)
else
etr_buf->len = ((w_offset < r_offset) ? etr_buf->size : 0) +
w_offset - r_offset;
- tmc_sg_table_sync_data_range(table, r_offset, etr_buf->len);
+
+ if (r_offset + etr_buf->len > etr_buf->size) {
+ int len1, len2;
+
+ /*
+ * If trace data is wrapped around, sync AUX bounce buffer
+	 * for two chunks: "len1" is the trace data length at
+ * the tail of bounce buffer, and "len2" is the length from
+ * the start of the buffer after wrapping around.
+ */
+ len1 = etr_buf->size - r_offset;
+ len2 = etr_buf->len - len1;
+ tmc_sg_table_sync_data_range(table, r_offset, len1);
+ tmc_sg_table_sync_data_range(table, 0, len2);
+ } else {
+ tmc_sg_table_sync_data_range(table, r_offset, etr_buf->len);
+ }
}
static const struct etr_buf_operations etr_sg_buf_ops = {
--
2.25.1
Hi Adrian,
On Sat, Jul 10, 2021 at 03:36:53PM +0300, Adrian Hunter wrote:
> On 4/07/21 10:16 am, Leo Yan wrote:
> > Since the __sync functions have been dropped, this patch removes the
> > unused build support and checks for HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT
> > in the perf tool.
> >
> > Note, there is a feature test for SYNC_COMPARE_AND_SWAP, with the test
> > file located in build/feature/test-sync-compare-and-swap.c. Since
> > several components still use the __sync functions, it is deliberately
> > not removed.
>
> I don't quite follow that. If they aren't using the feature test
> macro, then why keep the feature test?
Some files still use the __sync_xxx_compare_and_swap() functions, e.g.
under tools/testing/selftests/bpf. On the other hand, after dropping the
__sync functions from perf, no Makefile checks the feature
'feature-sync-compare-and-swap' anymore, so it is safe to remove the
feature test.
Sorry for the confusion. I will drop the feature test in the new patch
set.
Thanks,
Leo
On Sat, Jul 10, 2021 at 03:34:24PM +0300, Adrian Hunter wrote:
> On 4/07/21 10:16 am, Leo Yan wrote:
> > The main purpose of using the __sync built-in functions is to support
> > compat mode for 32-bit perf with a 64-bit kernel. But using these
> > built-in functions might cause a couple of potential issues.
> >
> > Firstly, the __sync functions were originally introduced for the Intel
> > Itanium processor [1] and are not guaranteed to be supported on all
> > 32-bit arches; they have become legacy functions.
> >
> > As Peter also pointed out, there is a logic issue in the function
> > auxtrace_mmap__write_tail(): it does a cmpxchg with 0 values to load
> > old_tail, and then executes a further cmpxchg with old_tail to write
> > the new tail. Considering that aux_tail might be assigned '0' in the
> > middle of the loop, this could corrupt the AUX buffer if the kernel
> > fetches the temporary value '0'.
>
> That is not exactly true. The definition of __sync_*_compare_and_swap is
> "if the current value of *ptr is oldval, then write newval into *ptr",
> so replacing zero with zero won't make any difference, but it will return
> the old value in any case. Probably better to leave out that paragraph.
Okay, I admit the paragraph is not right; I will drop it to avoid
confusion. Thanks for the review!
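For reference, the legacy sequence being dropped looks roughly like this
(paraphrased, not a verbatim quote of the old code):

	u64 old_tail;

	/*
	 * The cmpxchg with (0, 0) only reads: it writes 0 back only when
	 * the value is already 0, and always returns the current value.
	 */
	do {
		old_tail = __sync_val_compare_and_swap(&pc->aux_tail, 0, 0);
	} while (!__sync_bool_compare_and_swap(&pc->aux_tail, old_tail, tail));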
Leo