On Tue, Jul 13, 2021 at 02:56:06PM +0200, Peter Zijlstra wrote:
> On Sun, Jul 11, 2021 at 06:40:57PM +0800, Leo Yan wrote:
> > The AUX ring buffer requires a memory barrier to separate the data
> > store and the aux_head store; since the function CS_LOCK() contains
> > the memory barrier mb(), which is a more conservative barrier than
> > smp_wmb() on Arm32/Arm64, there is no need to add any explicit barrier.
> >
> > Add a comment to clarify the barrier usage for ETF.
> >
> > Signed-off-by: Leo Yan <leo.yan(a)linaro.org>
> > ---
> > drivers/hwtracing/coresight/coresight-tmc-etf.c | 6 ++++++
> > 1 file changed, 6 insertions(+)
> >
> > diff --git a/drivers/hwtracing/coresight/coresight-tmc-etf.c b/drivers/hwtracing/coresight/coresight-tmc-etf.c
> > index 45b85edfc690..9a42ee689921 100644
> > --- a/drivers/hwtracing/coresight/coresight-tmc-etf.c
> > +++ b/drivers/hwtracing/coresight/coresight-tmc-etf.c
> > @@ -553,6 +553,12 @@ static unsigned long tmc_update_etf_buffer(struct coresight_device *csdev,
> > if (buf->snapshot)
> > handle->head += to_read;
> >
> > + /*
> > + * AUX ring buffer requires to use memory barrier to separate the trace
> > + * data store and aux_head store, because CS_LOCK() contains mb() which
> > + * gives more heavy barrier than smp_wmb(), it's not necessary to
> > + * explicitly invoke any barrier.
> > + */
> > CS_LOCK(drvdata->base);
>
> 'more heavy' is not a correctness argument :-)
>
> The argument to make here is that CS_LOCK() ensures completion /
> visibility of the hardware buffer.
Will correct this, thanks for the reminder!
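Something along these lines, perhaps (a sketch of the reworded comment,
not final wording):

	/*
	 * CS_LOCK() contains mb(); this ensures completion and visibility
	 * of the hardware's writes to the trace buffer before
	 * perf_aux_output_end() publishes the new aux_head, so no extra
	 * barrier is needed between the data store and the aux_head store.
	 */
	CS_LOCK(drvdata->base);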
On Tue, Jul 13, 2021 at 10:07:03AM +0300, Adrian Hunter wrote:
[...]
> > +/*
> > + * In compat mode, the kernel runs in 64-bit mode while the perf tool
> > + * runs in 32-bit mode; the 32-bit perf tool cannot access a 64-bit
> > + * value atomically, which can lead to the issue shown in the sequence
> > + * below on multiple CPUs: when the perf tool loads or stores a 64-bit
> > + * value, on some architectures the operation is split into two
> > + * instructions, one accessing the low 32-bit word and the other the
> > + * high 32-bit word; the kernel can thus access the 64-bit value between
> > + * the two user operations, which leads to unexpected load values.
> > + *
> > + * kernel (64-bit) user (32-bit)
> > + *
> > + * if (LOAD ->aux_tail) { --, LOAD ->aux_head_lo
> > + * STORE $aux_data | ,--->
> > + * FLUSH $aux_data | | LOAD ->aux_head_hi
> > + * STORE ->aux_head --|-------` smp_rmb()
> > + * } | LOAD $data
> > + * | smp_mb()
> > + * | STORE ->aux_tail_lo
> > + * `----------->
> > + * STORE ->aux_tail_hi
> > + *
> > + * For this reason, it's impossible for the perf tool to work correctly when
> > + * the AUX head or tail is bigger than 4GB (more than 32 bits length); and we
> > + * can not simply limit the AUX ring buffer to less than 4GB, the reason is
> > + * the pointers can be increased monotonically (e.g in snapshot mode), whatever
>
> At least for Intel PT, in snapshot mode, the head is always an offset
> into the buffer, so never more than 4GB for a 32-bit perf tool. So maybe
> leave out "(e.g in snapshot mode)"
Sure, will leave out "(e.g in snapshot mode)".
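For reference, patch 11's compat read follows the usual re-read pattern;
a rough sketch (illustrative name, not the exact patch code; assumes
READ_ONCE()/smp_rmb() from the tools headers and that the head advances
by less than 2^32 between the two reads):

	static u64 compat_read_head(volatile u64 *head)
	{
		u64 first, second;
		u64 mask = (u64)UINT32_MAX << 32;

		do {
			first = READ_ONCE(*head);  /* may be two 32-bit loads */
			smp_rmb();
			second = READ_ONCE(*head);
		} while ((first & mask) != (second & mask));

		/*
		 * The high word did not change between the two reads, so
		 * the low word of 'second' pairs with a value the variable
		 * actually held.
		 */
		return second;
	}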
Thanks,
Leo
Hi Russell,
On Mon, Jul 12, 2021 at 03:44:11PM +0100, Russell King (Oracle) wrote:
> On Sun, Jul 11, 2021 at 06:41:05PM +0800, Leo Yan wrote:
> > When perf runs in compat mode (kernel in 64-bit mode and the perf is in
> > 32-bit mode), the 64-bit value atomicity in the user space cannot be
> > assured, E.g. on some architectures, the 64-bit value accessing is split
> > into two instructions, one is for the low 32-bit word accessing and
> > another is for the high 32-bit word.
>
> Does this apply to 32-bit ARM code on aarch64? I would not have thought
> it would, as the structure member is a __u64 and
> compat_auxtrace_mmap__read_head() doesn't seem to be marking anything
> as packed, so the compiler _should_ be able to use a LDRD instruction
> to load the value.
I think your question essentially concerns the memory model.
For a 32-bit Arm application on aarch64, the Armv8 Architecture
Reference Manual (ARM DDI 0487F.c), section E2.2.1 "Requirements for
single-copy atomicity", states:
"LDM, LDC, LDRD, STM, STC, STRD, PUSH, POP, RFE, SRS, VLDM, VLDR, VSTM,
and VSTR instructions are executed as a sequence of word-aligned word
accesses. Each 32-bit word access is guaranteed to be single-copy
atomic. The architecture does not require subsequences of two or more
word accesses from the sequence to be single-copy atomic."
So I think the LDRD/STRD instructions cannot guarantee atomicity when
loading or storing two words on 32-bit Arm.
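As an illustration (hypothetical codegen; the actual output depends on
the compiler and flags):

	#include <stdint.h>

	/*
	 * On 32-bit Arm this may compile to a single LDRD (or two LDRs);
	 * per the manual text above, each 32-bit word access is
	 * single-copy atomic but the pair is not, so a concurrent 64-bit
	 * store from the 64-bit kernel can be observed half-old, half-new.
	 */
	uint64_t load_aux_head(const volatile uint64_t *p)
	{
		return *p;
	}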
Another consideration is that compat_auxtrace_mmap__read_head() is a
generic function, so I avoid writing it with any architecture-specific
instructions.
> Is this a problem noticed on non-ARM architectures?
No; we inferred this potential issue from analysing the weak memory
model.
Thanks,
Leo
This patch series refines the memory barriers for the AUX ring buffer.
Patches 01 ~ 04 address barrier usage in the kernel. The first patch
clarifies, in a comment, how barriers are used between the data store
and the aux_head store; it requires the driver to ensure the data is
visible. Patches 02 ~ 04 refine the drivers' barriers after the data
store.
Patch 05 uses WRITE_ONCE() for updating aux_tail.
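The shape of that change is the common pattern below (a sketch, not a
quote of the patch):

	static inline void auxtrace_mmap__write_tail(struct auxtrace_mmap *mm,
						     u64 tail)
	{
		struct perf_event_mmap_page *pc = mm->userpg;

		/* Ensure all reads are done before we write the tail out */
		smp_mb();
		WRITE_ONCE(pc->aux_tail, tail);
	}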
Patches 06 ~ 09 drop the legacy __sync functions, clean up duplicate
code, and remove the build support and feature test now that
SYNC_COMPARE_AND_SWAP is no longer used.
Patch 10 introduces a new global variable to indicate that the kernel
runs in 64-bit mode, which can be used to detect compat mode; patch 11
introduces variant functions for accessing the AUX head/tail, which
resolve the atomicity of reading the head pointer and return an error
when the tail is bigger than 4GB.
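Patch 11's compat tail write, sketched (illustrative, not the exact
patch code; note that only the low word ever changes here, and each
32-bit word store is single-copy atomic, which is why the store is safe
once the tail fits in 32 bits):

	static int compat_auxtrace_mmap__write_tail(struct auxtrace_mmap *mm,
						    u64 tail)
	{
		struct perf_event_mmap_page *pc = mm->userpg;
		u64 mask = (u64)UINT32_MAX << 32;

		/* A 32-bit store cannot publish a tail above 4GB atomically */
		if (tail & mask)
			return -1;

		/* Ensure all reads are done before we write the tail out */
		smp_mb();
		WRITE_ONCE(pc->aux_tail, tail);
		return 0;
	}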
The patches have been tested on the Arm64 Juno platform.
Changes from v3:
- Removed the inappropriate paragraph in the commit log for patch "perf
auxtrace: Drop legacy __sync functions" (Adrian);
- Added new patch to remove feature-sync-compare-and-swap test (Adrian);
- The patch "perf auxtrace: Use WRITE_ONCE() for updating aux_tail" is a
standalone and simple change, so it was moved earlier in the patch set
for better ordering;
- Minor improvement for commit logs in the last two patches.
Changes from v2:
- Removed auxtrace_mmap__read_snapshot_head(), which duplicated the code
of auxtrace_mmap__read_head();
- Cleaned up the build for HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT (Adrian);
- Added global variable "kernel_is_64_bit" (Adrian);
- Added compat variants compat_auxtrace_mmap__{read_head|write_tail}
(Adrian).
Leo Yan (11):
perf/ring_buffer: Add comment for barriers on AUX ring buffer
coresight: tmc-etr: Add barrier after updating AUX ring buffer
coresight: tmc-etf: Add comment for store ordering
perf/x86: Add barrier after updating bts
perf auxtrace: Use WRITE_ONCE() for updating aux_tail
perf auxtrace: Drop legacy __sync functions
perf auxtrace: Remove auxtrace_mmap__read_snapshot_head()
perf: Cleanup for HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT
tools: Remove feature-sync-compare-and-swap feature detection
perf env: Set flag for kernel is 64-bit mode
perf auxtrace: Add compat_auxtrace_mmap__{read_head|write_tail}
arch/x86/events/intel/bts.c | 3 +
.../hwtracing/coresight/coresight-tmc-etf.c | 6 +
.../hwtracing/coresight/coresight-tmc-etr.c | 8 ++
kernel/events/ring_buffer.c | 9 ++
tools/build/Makefile.feature | 1 -
tools/build/feature/Makefile | 4 -
tools/build/feature/test-all.c | 4 -
.../feature/test-sync-compare-and-swap.c | 15 ---
tools/perf/Makefile.config | 4 -
tools/perf/util/auxtrace.c | 19 ++-
tools/perf/util/auxtrace.h | 109 ++++++++++++++----
tools/perf/util/env.c | 17 ++-
tools/perf/util/env.h | 1 +
13 files changed, 136 insertions(+), 64 deletions(-)
delete mode 100644 tools/build/feature/test-sync-compare-and-swap.c
--
2.25.1
The AUX bounce buffer is allocated with dma_alloc_coherent(); in the
low-level architecture code, e.g. on Arm64, this maps the memory with
the attribute "Normal non-cacheable", as can be seen from the definition
of pgprot_dmacoherent() in arch/arm64/include/asm/pgtable.h.
Accessing the AUX bounce buffer through this non-cacheable mapping is
inefficient: every load instruction must go all the way out to DRAM.
This patch changes the allocation to alloc_pages_node(), so the driver
can access the memory through a cacheable mapping in the kernel linear
virtual address range; since load instructions can then fetch data from
cache lines rather than always reading from DRAM, memory copying
performance improves. With the cacheable mapping, the driver uses
dma_sync_single_for_cpu() to invalidate cache lines before reading the
bounce buffer, so it avoids reading stale trace data.
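Condensed, the new buffer lifecycle looks as follows (a sketch of the
pattern with error handling omitted; the diff below has the actual code):

	/* Allocate cacheable pages and create a streaming DMA mapping */
	pages = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO,
				 get_order(size));
	vaddr = page_address(pages);
	daddr = dma_map_page(real_dev, pages, 0, size, DMA_FROM_DEVICE);

	/* Before the CPU reads trace data: invalidate stale cache lines */
	dma_sync_single_for_cpu(real_dev, daddr + offset, len, DMA_FROM_DEVICE);
	memcpy(dst, vaddr + offset, len);

	/* Teardown */
	dma_unmap_page(real_dev, daddr, size, DMA_FROM_DEVICE);
	__free_pages(pages, get_order(size));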
Measuring the duration of tmc_update_etr_buffer() with the ftrace
function_graph tracer shows a significant performance improvement for
copying 4MiB of data from the bounce buffer:
# echo tmc_etr_get_data_flat_buf > set_graph_notrace // avoid noise
# echo tmc_update_etr_buffer > set_graph_function
# echo function_graph > current_tracer
before:
# CPU DURATION FUNCTION CALLS
# | | | | | | |
2) | tmc_update_etr_buffer() {
...
2) # 8148.320 us | }
after:
# CPU DURATION FUNCTION CALLS
# | | | | | | |
2) | tmc_update_etr_buffer() {
...
2) # 2463.980 us | }
Signed-off-by: Leo Yan <leo.yan(a)linaro.org>
---
Changes from v1:
Set "flat_buf->daddr" to 0 when fails to map DMA region; and dropped the
unexpected if condition change in tmc_etr_free_flat_buf().
.../hwtracing/coresight/coresight-tmc-etr.c | 56 ++++++++++++++++---
1 file changed, 49 insertions(+), 7 deletions(-)
diff --git a/drivers/hwtracing/coresight/coresight-tmc-etr.c b/drivers/hwtracing/coresight/coresight-tmc-etr.c
index acdb59e0e661..888b0f929d33 100644
--- a/drivers/hwtracing/coresight/coresight-tmc-etr.c
+++ b/drivers/hwtracing/coresight/coresight-tmc-etr.c
@@ -21,6 +21,7 @@
struct etr_flat_buf {
struct device *dev;
+ struct page *pages;
dma_addr_t daddr;
void *vaddr;
size_t size;
@@ -600,6 +601,7 @@ static int tmc_etr_alloc_flat_buf(struct tmc_drvdata *drvdata,
{
struct etr_flat_buf *flat_buf;
struct device *real_dev = drvdata->csdev->dev.parent;
+ ssize_t aligned_size;
/* We cannot reuse existing pages for flat buf */
if (pages)
@@ -609,11 +611,18 @@ static int tmc_etr_alloc_flat_buf(struct tmc_drvdata *drvdata,
if (!flat_buf)
return -ENOMEM;
- flat_buf->vaddr = dma_alloc_coherent(real_dev, etr_buf->size,
- &flat_buf->daddr, GFP_KERNEL);
- if (!flat_buf->vaddr) {
- kfree(flat_buf);
- return -ENOMEM;
+ aligned_size = PAGE_ALIGN(etr_buf->size);
+ flat_buf->pages = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO,
+ get_order(aligned_size));
+ if (!flat_buf->pages)
+ goto fail_alloc_pages;
+
+ flat_buf->vaddr = page_address(flat_buf->pages);
+ flat_buf->daddr = dma_map_page(real_dev, flat_buf->pages, 0,
+ aligned_size, DMA_FROM_DEVICE);
+ if (dma_mapping_error(real_dev, flat_buf->daddr)) {
+ flat_buf->daddr = 0;
+ goto fail_dma_map_page;
}
flat_buf->size = etr_buf->size;
@@ -622,6 +631,12 @@ static int tmc_etr_alloc_flat_buf(struct tmc_drvdata *drvdata,
etr_buf->mode = ETR_MODE_FLAT;
etr_buf->private = flat_buf;
return 0;
+
+fail_dma_map_page:
+ __free_pages(flat_buf->pages, get_order(aligned_size));
+fail_alloc_pages:
+ kfree(flat_buf);
+ return -ENOMEM;
}
static void tmc_etr_free_flat_buf(struct etr_buf *etr_buf)
@@ -630,15 +645,20 @@ static void tmc_etr_free_flat_buf(struct etr_buf *etr_buf)
if (flat_buf && flat_buf->daddr) {
struct device *real_dev = flat_buf->dev->parent;
+ ssize_t aligned_size = PAGE_ALIGN(etr_buf->size);
- dma_free_coherent(real_dev, flat_buf->size,
- flat_buf->vaddr, flat_buf->daddr);
+ dma_unmap_page(real_dev, flat_buf->daddr, aligned_size,
+ DMA_FROM_DEVICE);
+ __free_pages(flat_buf->pages, get_order(aligned_size));
}
kfree(flat_buf);
}
static void tmc_etr_sync_flat_buf(struct etr_buf *etr_buf, u64 rrp, u64 rwp)
{
+ struct etr_flat_buf *flat_buf = etr_buf->private;
+ struct device *real_dev = flat_buf->dev->parent;
+
/*
* Adjust the buffer to point to the beginning of the trace data
* and update the available trace data.
@@ -648,6 +668,28 @@ static void tmc_etr_sync_flat_buf(struct etr_buf *etr_buf, u64 rrp, u64 rwp)
etr_buf->len = etr_buf->size;
else
etr_buf->len = rwp - rrp;
+
+ if (etr_buf->offset + etr_buf->len > etr_buf->size) {
+ int len1, len2;
+
+ /*
+ * If trace data is wrapped around, sync AUX bounce buffer
+	 * for two chunks: "len1" is the trace data length at
+ * the tail of bounce buffer, and "len2" is the length from
+ * the start of the buffer after wrapping around.
+ */
+ len1 = etr_buf->size - etr_buf->offset;
+ len2 = etr_buf->len - len1;
+ dma_sync_single_for_cpu(real_dev,
+ flat_buf->daddr + etr_buf->offset,
+ len1, DMA_FROM_DEVICE);
+ dma_sync_single_for_cpu(real_dev, flat_buf->daddr,
+ len2, DMA_FROM_DEVICE);
+ } else {
+ dma_sync_single_for_cpu(real_dev,
+ flat_buf->daddr + etr_buf->offset,
+ etr_buf->len, DMA_FROM_DEVICE);
+ }
}
static ssize_t tmc_etr_get_data_flat_buf(struct etr_buf *etr_buf,
--
2.25.1
The current code syncs the buffer range [offset, offset+len); it does
not consider the case where the trace data wraps around, in which case
'offset+len' is bigger than 'etr_buf->size'. It thus syncs memory beyond
the buffer, and it also misses syncing the region at the start of the
buffer.
This patch corrects the sync ranges: when it detects the wrap-around
case, it splits the range into two chunks, one for the tail of the
buffer and another from the start of the buffer after wrapping around.
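The split logic in isolation (a sketch; for example, with size=8,
r_offset=6 and len=5 it syncs [6,8) and then [0,3)):

	/*
	 * Sync a possibly-wrapped range [offset, offset+len) in a circular
	 * buffer of 'size' bytes, splitting it at the wrap point.
	 */
	static void sync_range_wrapped(size_t size, size_t offset, size_t len,
				       void (*sync)(size_t start, size_t len))
	{
		if (offset + len > size) {
			size_t len1 = size - offset; /* tail: [offset, size) */
			size_t len2 = len - len1;    /* wrapped: [0, len2) */

			sync(offset, len1);
			sync(0, len2);
		} else {
			sync(offset, len);
		}
	}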
Signed-off-by: Leo Yan <leo.yan(a)linaro.org>
---
.../hwtracing/coresight/coresight-tmc-etr.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/drivers/hwtracing/coresight/coresight-tmc-etr.c b/drivers/hwtracing/coresight/coresight-tmc-etr.c
index 888b0f929d33..a1afefcbf175 100644
--- a/drivers/hwtracing/coresight/coresight-tmc-etr.c
+++ b/drivers/hwtracing/coresight/coresight-tmc-etr.c
@@ -780,7 +780,23 @@ static void tmc_etr_sync_sg_buf(struct etr_buf *etr_buf, u64 rrp, u64 rwp)
else
etr_buf->len = ((w_offset < r_offset) ? etr_buf->size : 0) +
w_offset - r_offset;
- tmc_sg_table_sync_data_range(table, r_offset, etr_buf->len);
+
+ if (r_offset + etr_buf->len > etr_buf->size) {
+ int len1, len2;
+
+ /*
+ * If trace data is wrapped around, sync AUX bounce buffer
+	 * for two chunks: "len1" is the trace data length at
+ * the tail of bounce buffer, and "len2" is the length from
+ * the start of the buffer after wrapping around.
+ */
+ len1 = etr_buf->size - r_offset;
+ len2 = etr_buf->len - len1;
+ tmc_sg_table_sync_data_range(table, r_offset, len1);
+ tmc_sg_table_sync_data_range(table, 0, len2);
+ } else {
+ tmc_sg_table_sync_data_range(table, r_offset, etr_buf->len);
+ }
}
static const struct etr_buf_operations etr_sg_buf_ops = {
--
2.25.1
Hi Adrian,
On Sat, Jul 10, 2021 at 03:36:53PM +0300, Adrian Hunter wrote:
> On 4/07/21 10:16 am, Leo Yan wrote:
> > Since the __sync functions have been dropped, this patch removes the
> > unused build support and checks for HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT
> > in the perf tool.
> >
> > Note, there is a feature test for SYNC_COMPARE_AND_SWAP, with the test
> > file located in build/feature/test-sync-compare-and-swap.c. Since
> > several components still use the __sync functions, it is deliberately
> > not removed.
>
> I don't quite follow that. If they aren't using the feature test
> macro, then why keep the feature test?
Some files still use the __sync_xxx_compare_and_swap() functions, e.g.
under tools/testing/selftests/bpf. On the other hand, after dropping the
__sync functions from perf, no Makefile checks the feature
'feature-sync-compare-and-swap' anymore, so it is safe to remove the
feature test.
Sorry for the confusion. I will drop the feature test in the new patch
set.
Thanks,
Leo
On Sat, Jul 10, 2021 at 03:34:24PM +0300, Adrian Hunter wrote:
> On 4/07/21 10:16 am, Leo Yan wrote:
> > The main purpose of using the __sync built-in functions is to support
> > compat mode for 32-bit perf with a 64-bit kernel. But using these
> > built-in functions might cause a couple of potential issues.
> >
> > Firstly, the __sync functions were originally introduced for the Intel
> > Itanium processor [1] and are not guaranteed to be supported on all
> > 32-bit arches; they have become legacy functions.
> >
> > As Peter also pointed out, there is a logic issue in the function
> > auxtrace_mmap__write_tail(): it does a cmpxchg with 0 values to load
> > old_tail, and then executes a further cmpxchg with old_tail to write
> > the new tail. Considering that aux_tail might be assigned '0' in the
> > middle of the loop, this could corrupt the AUX buffer if the kernel
> > fetches the temporary value '0'.
>
> That is not exactly true. The definition of __sync_*_compare_and_swap is
> "if the current value of *ptr is oldval, then write newval into *ptr",
> so replacing zero with zero won't make any difference, but it will return
> the old value in any case. Probably better to leave out that paragraph.
Okay, I admit the paragraph is not right; I will drop it to avoid
confusion. Thanks for the review!
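For reference, the legacy sequence being dropped looks roughly like this
(paraphrased, not a verbatim quote of the old code):

	u64 old_tail;

	/*
	 * The cmpxchg with (0, 0) only reads: it writes 0 back only when
	 * the value is already 0, and always returns the current value.
	 */
	do {
		old_tail = __sync_val_compare_and_swap(&pc->aux_tail, 0, 0);
	} while (!__sync_bool_compare_and_swap(&pc->aux_tail, old_tail, tail));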
Leo