In addition to my previous comment, I would like to add that I haven't implemented the full power of the "double-buffer" approach.
In the patch I uploaded, the double buffer has pretty much the same effect as a single buffer. To use its full power,
I would like the user to be able to read from the read buffer while the trace is being written to the write buffer. Currently, this is not the case
(the user will lose trace data). Anyway, like I said, this is a simple POC and sort of a request for "approval" of the idea before I continue in this direction.

Thanks,
Mike.

On Mon, Aug 20, 2018 at 12:21 PM, Mike Bazov <mike@perception-point.io> wrote:
Greetings,

I have attached a small POC (patch file) I wrote for the kernel ETM API. Basically, I created a new mode called "CS_MODE_API",
and added a new module, "coresight-etm-api.c", that exposes these APIs:

- coresight_etm_create_session(cpu): 
  Create a coresight ETM session. In implementation terms, this simply means creating a path from an ETM source to a sink,
  and enabling the whole path except the source.

- coresight_etm_destroy_session(cpu): 
  Destroy a coresight ETM session. In implementation terms, this simply means disabling a coresight path (except the source) and
  freeing it.

- coresight_etm_play(session, config):
  Start playing trace data. In implementation terms, this simply means enabling a source.

- coresight_etm_pause(session):
  Stop playing trace data. In implementation terms, this simply means disabling a source.

- coresight_etm_alloc_buffer(size):
  Allocate a trace data buffer. Only supported on TMC-ETR.

- coresight_etm_read(session):
  Read trace data from the session's sink.
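
The call order these APIs imply (create, play, pause, destroy) can be sketched with a small user-space mock. This is not code from the patch: the mock_* names, the struct fields, the return types and the error codes are all assumptions made for illustration, including the choice to refuse destroying a session whose source is still enabled.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical session state; the real patch may look nothing like this. */
struct mock_session {
	int cpu;
	int path_enabled;	/* path up to the sink, excluding the source */
	int source_enabled;	/* set by play, cleared by pause */
};

/* "create": build the path and enable everything except the source. */
static struct mock_session *mock_create_session(int cpu)
{
	struct mock_session *s = calloc(1, sizeof(*s));

	if (!s)
		return NULL;
	s->cpu = cpu;
	s->path_enabled = 1;
	return s;
}

/* "play": enable the source; assumed to fail if it is already enabled. */
static int mock_play(struct mock_session *s)
{
	assert(s->path_enabled);
	if (s->source_enabled)
		return -EBUSY;
	s->source_enabled = 1;
	return 0;
}

/* "pause": disable the source; assumed to fail if it is not enabled. */
static int mock_pause(struct mock_session *s)
{
	if (!s->source_enabled)
		return -EINVAL;
	s->source_enabled = 0;
	return 0;
}

/* "destroy": disable the path and free it; assumed to require a paused source. */
static int mock_destroy(struct mock_session *s)
{
	if (s->source_enabled)
		return -EBUSY;
	s->path_enabled = 0;
	free(s);
	return 0;
}
```

A client under these assumptions would call mock_create_session(), then alternate mock_play()/mock_pause() as needed, and call mock_destroy() only after the final pause.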

To keep this change as small as possible (hopefully), the etm-api implementation takes a sink that has already been
activated via sysfs as the sink for the path. I didn't want to add a way of activating a sink from the API just yet.

I don't deal with the "different TMC configuration" issues. If two kernel clients request the TMC with different configurations, I wouldn't know
about it, just as the sysfs/perf modes don't know about it.

To allow kernel clients to read trace data "on-the-fly", and to avoid the "prepare()->read()->unprepare()" way, I have implemented a double buffer
in CS_MODE_API: a read buffer and a write buffer. Whenever the read buffer is empty, a buffer swap occurs.
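
The swap rule can be illustrated with a minimal user-space sketch. It is not the code from the patch: struct dbuf, dbuf_write(), dbuf_read() and the tiny DBUF_SIZE are hypothetical names and values, and the writer here simply drops data that does not fit, which is an assumption about overflow handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DBUF_SIZE 8	/* hypothetical size, tiny on purpose */

struct dbuf_half {
	unsigned char data[DBUF_SIZE];
	size_t len;	/* bytes stored in this half */
	size_t pos;	/* bytes already consumed by the reader */
};

struct dbuf {
	struct dbuf_half half[2];
	int rd;		/* index of the read half; the other one is the write half */
};

static void dbuf_init(struct dbuf *db)
{
	memset(db, 0, sizeof(*db));
}

/* Stand-in for the sink: append to the write half, dropping what does not fit. */
static size_t dbuf_write(struct dbuf *db, const void *src, size_t n)
{
	struct dbuf_half *wr = &db->half[!db->rd];
	size_t room = DBUF_SIZE - wr->len;

	if (n > room)
		n = room;
	memcpy(wr->data + wr->len, src, n);
	wr->len += n;
	return n;
}

/* Client read: when the read half is drained, reset it and swap the roles. */
static size_t dbuf_read(struct dbuf *db, void *dst, size_t n)
{
	struct dbuf_half *rd = &db->half[db->rd];

	if (rd->pos == rd->len) {
		rd->len = 0;
		rd->pos = 0;
		db->rd = !db->rd;
		rd = &db->half[db->rd];
	}
	assert(rd->pos <= rd->len);
	if (n > rd->len - rd->pos)
		n = rd->len - rd->pos;
	memcpy(dst, rd->data + rd->pos, n);
	rd->pos += n;
	return n;
}
```

In this sketch the writer never touches the half the reader is draining, which is the property that would let reads and writes overlap once proper synchronization is added.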

I'd appreciate any input, review, tips, hints or improvements. Please tell me if something is lacking in my explanation.
This is highly experimental (it does work, though). Note that this patch file can be applied on top of the latest coresight "next" branch tip
(as of the writing of this message).

Thanks, Mike.

On Fri, Aug 17, 2018 at 12:15 AM, Mike Bazov <mike@perception-point.io> wrote:
Greetings,

When tracing via sysFS and keeping the default configuration,
everything that is happening on a processor is logged.  That is called
"CPU-wide".  Any process that gets scheduled out of the processor won't
be traced.  On perf one can execute:
# perf record -e cs_etm/@20070000.etr/ --per-thread my_application  (example 1)
# perf record -e cs_etm/@20070000.etr/ -C 2,3,4 my_application  (example 2)
For example 1, perf will switch on the tracer associated to the CPU
where my_application has been installed for execution.  If the process
gets scheduled on a different CPU perf will do the right thing and
follow it around.  That is called "per-thread".
In example 2 everything that is happening on CPU 2,3,4 will be traced
for as long as my_application is executing, regardless of where
my_application is executing.  That is also a CPU-wide trace scenario.

I was more under the impression that CPU-wide records everything that the CPU executes
(this is achieved by using sysfs like you described), regardless of the term "thread".
I don't really understand why CPU-wide is the right term for example 2. Both of the examples
record "per-thread"; they differ only in the CPU mask. Example 1 doesn't mask any CPUs, whereas
example 2 masks all CPUs except 2, 3 and 4. It still doesn't record "CPU-wide", only "per-thread",
but on the non-masked CPUs (if it weren't "per-thread", it wouldn't care about a thread being scheduled and disable/enable accordingly).
I'm a little confused; it really seems like sysfs == CPU-wide and perf == per-thread. Perhaps changing
the modes to "CS_MODE_PER_THREAD" and "CS_MODE_CPU_WIDE", and making the sysfs and perf implementations
use these modes, is something that would solve the puzzle for me.
 
As such you will find places like that where things aren't exactly how 
they should be - heck, I find them in my original code all the time.
You should test it but once again I think you are correct -
coresight_enable_source() should be called from etm_event_start().

After a second look, I actually think calling coresight_enable_source() will be problematic.
Using the sysfs implementation (coresight.c) from perf is problematic, since it maintains a reference count
per device. If there's a sysfs session running on a tracer, and perf uses the __same__ path to the
sink and the same source, using coresight_enable_path() will simply increase the reference count without returning any error.
So calling source_ops(source)->enable() directly will actually result in an error, which is the expected behavior(?)
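
To make the concern concrete, here is a toy user-space model of the two ways of enabling a source. It is not the kernel code: the toy_* names, the mode enum and the -EBUSY return value are assumptions chosen only to show how a reference-counted wrapper can hide a conflict that a direct enable would report.

```c
#include <assert.h>
#include <errno.h>

enum toy_mode { TOY_DISABLED, TOY_SYSFS, TOY_PERF };

/* Hypothetical stand-in for a trace source device. */
struct toy_source {
	enum toy_mode mode;	/* who currently owns the tracer */
	int refcnt;		/* users counted by the wrapper below */
};

/* Direct enable: claim the tracer for one mode, fail if it is already claimed. */
static int toy_source_ops_enable(struct toy_source *src, enum toy_mode mode)
{
	if (src->mode != TOY_DISABLED)
		return -EBUSY;
	src->mode = mode;
	return 0;
}

/*
 * Reference-counted wrapper: only the first caller reaches the device,
 * later callers just bump the count and get 0 back, whatever their mode.
 */
static int toy_enable_source(struct toy_source *src, enum toy_mode mode)
{
	if (src->refcnt == 0) {
		int ret = toy_source_ops_enable(src, mode);

		if (ret)
			return ret;
	}
	src->refcnt++;
	assert(src->refcnt > 0);
	return 0;
}
```

In this model, a perf-style caller going through toy_enable_source() on a tracer already owned by sysfs gets 0 back and never learns about the conflict, while calling toy_source_ops_enable() directly returns -EBUSY.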

Also, just out of curiosity, what happens when perf is asked to record a multithreaded process? The "per-thread" mechanism doesn't seem to fit here because of the "single tracer to single sink" rule.