Hi Mathieu,
Apologies if mailman does not see this as a reply. I'm not sure if Outlook handles In-Reply-To properly.
I'm testing CPU-wide tracing on Zynq Ultrascale+ MPSoC and I have some comments I'd like to share. Some introduction at first:
- I'm using mainline Linux from a couple of days ago (12ad143e1b80 Merge branch 'perf-urgent-for-linus'...)
- on top of it I have a couple of my changes introducing CoreSight support on US+
- on top of this I cherry-picked your two patch sets with CPU-wide tracing
I prepared a test program that's supposed to generate deterministic trace. I created a function that should, depending on the argument, create either continuous E atoms or E/N atoms alternately. In main() I spawn two threads with affinity attributes:
- the first thread is set up as atom E generator, pinned to CPU1
- the other as E/N generator, pinned to CPU2
The main thread is pinned to CPU0.
The atom generator function's body looks like below. If *atom == 'n', the branch is not taken, so an N atom should be generated; if *atom == 'e', the branch is taken and an E atom should be generated. After that, another E atom is expected, since the while loop branches back to the start. It's counter-intuitive when you look at the C code, but the if-condition actually compiles to a b.ne instruction, so the body of the if is reached when the branch is not taken.
volatile int sum = 0;

while (1) {
        // Reference by pointer, so it's not optimized out.
        if (*atom == 'n')       // compiler creates b.ne here
                sum += 0xdeadbeef * (*atom + 1);
}
Here are my observations:
1. -C option works well. I run perf with:
# perf record -e cs_etm/@fe940000.etf1/u -C1 ./atom_gen
In perf report I can see lots of E atoms in packets marked with ID:12. If I collect trace with -C2 instead, I see E/N atoms in packets with ID:14. Everything works as expected each time I trace this application.
2. -a option works unreliably. I run perf with:
# perf record -e cs_etm/@fe940000.etf1/u -a ./atom_gen
What I expect is perf.data containing similar output to what I got with -C1 plus what I got with -C2, i.e. ID:12 Atom E packets and ID:14 atom E/N packets. What actually happens is inconsistent each time I try this command. Sometimes I have no atom packets associated with IDs 12 and 14 but I have some with ID:16. Sometimes I get ID:14 atoms but no ID:12. Sometimes I get expected trace but still some noise in ID:16 packets, which I would not expect at all, since the program schedules nothing on CPU3. I wonder if I'm missing something here in my understanding of CoreSight. Is this behaviour expected?
3. I'm not able to use filters. I'd like to narrow down tracing to the while(1) loop in the trace generator, to filter out noise from other instructions. However, I find it impossible to use the --filter flag along with -C or -a:
# perf record -e cs_etm/@fe940000.etf1/u --filter 'filter atom_n_generator @./atom_gen' -a ./atom_gen
failed to set filter "filter 0x90c/0x8c@/root/atom_gen" on event cs_etm/@fe940000.etf1/u with 95 (Operation not supported)
It works fine with --per-thread. Is the behaviour expected, or is this a bug?
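For reference, the same filter applied with --per-thread, i.e. something like:

# perf record -e cs_etm/@fe940000.etf1/u --filter 'filter atom_n_generator @./atom_gen' --per-thread ./atom_gen

sets the filter without complaints.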
4. Kernel crashes if used with no -a, -C or --per-thread. If I call perf with:
# perf record -e cs_etm/@fe940000.etf1/u ./atom_gen
I can see some printfs from the program, but the kernel immediately hits a NULL pointer dereference. Please find a log below. My serial connection drops characters sometimes, sorry for that. The crash happens in tmc_enable_etf_sink+0x90, which is:
/* Get a handle on the pid of the process to monitor */
if (handle->event->owner)
        pid = task_pid_nr(handle->event->owner);
The handle->event->owner seems to be NULL.
[ 1313.650726] Unable to handle kernel NULL pointer dereference at virtual address 00000000000003b8
[ 1313.659501] Mem abort info:
[ 1313.662281] ESR = 0x96000006
[ 1313.665320] Exception class = DABT (current EL), IL = 32 bits
[ 1313.671232] SET = 0, FnV = 0
[ 1313.674277] EA = 0, S1PTW = 0
[ 1313.677401] Data abort info:
[ 1313.680266] ISV = 0, ISS = 0x00000006
[ 1313.684085] CM = 0, WnR = 0
[ 1313.687039] user pgtable: 4k pages, 39-bit VAs, pgdp = 000000003b61a770
[ 1313.693644] [00000000000003b8] pgd=000000006c6da003, pud=0000006c6da003, pmd=0000000000000000
[ 1313.702336] Internal error: Oops: 96000006 [#1] SMP
[ 1313.707201] Modules linked in:
[ 1313.710250] CPU: 1 PID: 3255 Comm: multithread-two Not tainted 5.0.0-10411-g66431e6376c4-dirty #26
[ 1313.719200] Hardware name: ZynqMP ZCU104 RevA (DT)
[ 1313.723981] pstate: 20000085 (nzCv daIf -PAN -UAO)
[ 1313.728770] pc : tmc_enable_etf_sink+0x90/0x3b0
[ 1313.733286] lr : tmc_enable_etf_sink+0x64/0x3b0
[ 1313.737806] sp : ffffff8011263b40
[ 1313.741104] x29: ffffff8011263b40 x28: 0000000000000000
[ 1313.6409] x27: 0000000000000000 x26: ffffffc06d4ce180
[ 1313.7512] x25: 0000000000000001 x24: ffffffc06faa4ce0
[ 1313.757015] x23: 0000000000000002 x22: 0000000000000080
[ 1313.7319] x21: ffffffc06faa4ce0 x20: ffffffc06cf07c00
[ 1313.7676] x19: ffffffc06d560e80 x18: 0000000000000000
[ 1313.772926] x17: 0000000000000000 x16: 0000000000000000
[ 1313.7729] x15: 0000000000000000 x14: ffffff8010879388
[ 1313.78353 x13: 0000000000000000 x12: 0000000002e8fc00
[ 1313.788836] x11: 0000000000000000 x10: 00000000000007f0
[ 1313.7940] x9 : 0000000000000000 x8 : 0000000000000000
[ 1313.799443] x7 : 0000000000000030 x6 : ffffffc06c279030
[ 1313.804747] x5 : 0000000000000030 x4 : 0000000000000002
[ 1313.8100] x3 : ffffffc06d560ee8 x2 : 0000000000000001
[ 1313.815354] x1 : 0000000000000000 x0 : 0000000000000000
[ 1313.820659] Process multithread-two (pid: 3255, stack limit = 0x00000073629f1e)
[ 1313.828133] Call trace:
[ 1313.830571] tmc_enable_etf_sink+0x90/0x3b0
[ 1313.834748] coresight_enable_path+0xe4/0x1f8
[ 1313.839096] etm_event_start+0x8c/0x120
[ 1313.842923] etm_event_add+0x38/0x58
[ 1313.846492] event_sched_in.isra.61.part.62+0x94/0x1b0
[ 1313.851620] group_sched_in+0xa0/0x1c8
[ 1313.855360] flexible_sched_in+0xac/0x1
[ 1313.859364] visit_groups_merge+0x144/0x1f8
[ 1313.86353 ctx_sched_in.isra.39+0x128/0x138
[ 1313.867887] perf_event_sched_in.isra.41+0x54/0x80
[ 1313.872669] __perf_event_task_sched_in+0x16c/0x180
[ 1313.877540] finish_task_switch+104/0x1d8
[ 1313.881715] schedule_tail+0xc/0x98
[ 1313.885195] ret_from_fork+0x4/0x18
[ 1313.888677] Code: 540016 f9001bb7 f94002a0 f9414400 (b943b817)
[ 1313.894761] ---[ end trace 99bb09dc83a83a1a ]---
Best regards, Wojciech
Good day Wojciech,
On Thu, 14 Mar 2019 at 10:00, Wojciech Żmuda wzmuda@n7space.com wrote:
Hi Mathieu,
Apologies if mailman does not see this as a reply. I'm not sure if Outlook handles In-Reply-To properly.
I'm testing CPU-wide tracing on Zynq Ultrascale+ MPSoC and I have some comments I'd like to share. Some introduction at first:
- I'm using mainline Linux from a couple of days ago (12ad143e1b80 Merge branch 'perf-urgent-for-linus'...)
- on top of it I have a couple of my changes introducing CoreSight support on US+
- on top of this I cherry-picked your two patch sets with CPU-wide tracing
I prepared a test program that's supposed to generate deterministic trace. I created a function that should, depending on the argument, create either continuous E atoms or E/N atoms alternately. In main() I spawn two threads with affinity attributes:
- the first thread is set up as atom E generator, pinned to CPU1
- the other as E/N generator, pinned to CPU2
The main thread is pinned to CPU0.
How do you pin threads to a CPU within a user space program?
The atom generator function's body looks like below. If *atom == 'n', the branch is not taken, so an N atom should be generated; if *atom == 'e', the branch is taken and an E atom should be generated. After that, another E atom is expected, since the while loop branches back to the start. It's counter-intuitive when you look at the C code, but the if-condition actually compiles to a b.ne instruction, so the body of the if is reached when the branch is not taken.
volatile int sum = 0;

while (1) {
        // Reference by pointer, so it's not optimized out.
        if (*atom == 'n')       // compiler creates b.ne here
                sum += 0xdeadbeef * (*atom + 1);
}
Here are my observations:
- -C option works well. I run perf with:
# perf record -e cs_etm/@fe940000.etf1/u -C1 ./atom_gen
Note that this command does not exclusively trace the application atom_gen on CPU1. This command traces everything that is happening on CPU1 (in user space) for as long as application atom_gen is alive. If you want to trace atom_gen use either --per-thread or the "taskset" utility.
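For example (illustrative invocations only, adjust to your setup), either of:

# perf record -e cs_etm/@fe940000.etf1/u --per-thread ./atom_gen
# perf record -e cs_etm/@fe940000.etf1/u -C1 taskset -c 1 ./atom_gen

The first traces the atom_gen task itself; the second keeps atom_gen on CPU1 so that most of what -C1 captures comes from it.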
In perf report I can see lots of E atoms in packets marked with ID:12. If I collect trace with -C2 instead, I see E/N atoms in packets with ID:14. Everything works as expected each time I trace this application.
- -a option works unreliably. I run perf with:
# perf record -e cs_etm/@fe940000.etf1/u -a ./atom_gen
What I expect is perf.data containing similar output to what I got with -C1 plus what I got with -C2, i.e. ID:12 Atom E packets and ID:14 atom E/N packets. What actually happens is inconsistent each time I try this command. Sometimes I have no atom packets associated with IDs 12 and 14 but I have some with ID:16. Sometimes I get ID:14 atoms but no ID:12. Sometimes I get expected trace but still some noise in ID:16 packets, which I would not expect at all, since the program schedules nothing on CPU3. I wonder if I'm missing something here in my understanding of CoreSight. Is this behaviour expected?
On top of the explanation I have given above note that CPU-wide scenarios currently use the same sink. Because the memory buffer of the sink is limited it is easy to clobber traces from one CPU with traces from another. This is a problem inherent to topologies where sinks are shared between CPUs and unfortunately there is nothing to be done about it.
- I'm not able to use filters.
I'd like to narrow down tracing to the while(1) loop in trace generator, to filter out noise from other instructions. However, I find it impossible to use --filter flag along with -C or -a:
# perf record -e cs_etm/@fe940000.etf1/u --filter 'filter atom_n_generator @./atom_gen' -a ./atom_gen
failed to set filter "filter 0x90c/0x8c@/root/atom_gen" on event cs_etm/@fe940000.etf1/u with 95 (Operation not supported)
I haven't tried that before - thanks for reporting this. I will try on my side.
It works fine with --per-thread. Is the behaviour expected, or is this a bug?
- Kernel crashes if used with no -a, -C or --per-thread.
If I call perf with: # perf record -e cs_etm/@fe940000.etf1/u ./atom_gen
This should be the same thing as:
# perf record -e cs_etm/@fe940000.etf1/u -a ./atom_gen
I can see some printfs from the program, but the kernel immediately hits a NULL pointer dereference. Please find a log below. My serial connection drops characters sometimes, sorry for that. The crash happens in tmc_enable_etf_sink+0x90, which is:
/* Get a handle on the pid of the process to monitor */
if (handle->event->owner)
        pid = task_pid_nr(handle->event->owner);
The handle->event->owner seems to be NULL.
Wonderful...
I will have to get back to you once I have tried this on my side.
Thanks for the input, Mathieu
Best regards, Wojciech
Hello Mathieu,
I prepared a test program that's supposed to generate deterministic trace. I created a function that should, depending on the argument, create either continuous E atoms or E/N atoms alternately. In main() I spawn two threads with affinity attributes:
- the first thread is set up as atom E generator, pinned to CPU1
- the other as E/N generator, pinned to CPU2
The main thread is pinned to CPU0.
How do you pin threads to a CPU within a user space program?
Before calling pthread_create(), I specify attributes for each pthread:
pthread_attr_t attr;
cpu_set_t cpus;

pthread_attr_init(&attr);
CPU_ZERO(&cpus);
CPU_SET(core_id, &cpus);
pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &cpus);
Then, a thread is spawned with &attr passed as the second argument to pthread_create(). I do CPU_ZERO(), CPU_SET() and pthread_attr_setaffinity_np() separately for each thread. I print the return value of sched_getcpu() in the thread worker function to confirm the threads are scheduled according to this code. It seems to work as desired.
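Roughly, the whole setup looks like the sketch below (simplified; spawn_pinned() and atom_generator() are just illustrative names, not the exact code I run):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *atom_generator(void *arg)
{
        /* Report which CPU the scheduler actually placed us on. */
        printf("worker running on CPU%d\n", sched_getcpu());
        /* ... spin here generating E or E/N atoms ... */
        return NULL;
}

static int spawn_pinned(pthread_t *thread, int core_id, void *arg)
{
        pthread_attr_t attr;
        cpu_set_t cpus;

        pthread_attr_init(&attr);
        CPU_ZERO(&cpus);
        CPU_SET(core_id, &cpus);
        pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &cpus);

        /* &attr carries the affinity mask into pthread_create(). */
        return pthread_create(thread, &attr, atom_generator, arg);
}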
- -C option works well. I run perf with:
# perf record -e cs_etm/@fe940000.etf1/u -C1 ./atom_gen
Note that this command does not exclusively trace the application atom_gen on CPU1. This command traces everything that is happening on CPU1 (in user space) for as long as application atom_gen is alive.
Thank you for pointing that out, that explains the inconsistent noise I sometimes observe. If we had filters working with -C, setting filter to some code range in ./atom_gen would result in tracing this application exclusively, correct?
If you want to trace atom_gen use either --per-thread or the "taskset" utility.
Actually I tried it at the very beginning, before you published the CPU-wide tracing feature. My observation was that with --per-thread I am not able to get any trace (besides some initialization packets) from multithreaded applications. Regardless of how long my two atom generator threads were spinning, the AUX sections of perf.data were nearly empty, with zero atom packets. Only single-threaded application tracing works well with --per-thread for me.
Do you have any suggestions on how I should trace multithreaded applications?
- -a option works unreliably. I run perf with:
# perf record -e cs_etm/@fe940000.etf1/u -a ./atom_gen
What I expect is perf.data containing similar output to what I got with -C1 plus what I got with -C2, i.e. ID:12 Atom E packets and ID:14 atom E/N packets. What actually happens is inconsistent each time I try this command. Sometimes I have no atom packets associated with IDs 12 and 14 but I have some with ID:16. Sometimes I get ID:14 atoms but no ID:12. Sometimes I get expected trace but still some noise in ID:16 packets, which I would not expect at all, since the program schedules nothing on CPU3. I wonder if I'm missing something here in my understanding of CoreSight. Is this behaviour expected?
On top of the explanation I have given above note that CPU-wide scenarios currently use the same sink. Because the memory buffer of the sink is limited it is easy to clobber traces from one CPU with traces from another. This is a problem inherent to topologies where sinks are shared between CPUs and unfortunately there is nothing to be done about it.
Can this issue be solved with TMC-ETR? I was under the impression that, unlike the ETB, which uses dedicated SRAM as a trace buffer, TMC-ETR routes trace to system RAM. Is it possible to configure TMC-ETR to target some specific RAM region of an arbitrary size?
Best regards, Wojciech
On Fri, 15 Mar 2019 at 12:19, Wojciech Żmuda wzmuda@n7space.com wrote:
Hello Mathieu,
I prepared a test program that's supposed to generate deterministic trace. I created a function that should, depending on the argument, create either continuous E atoms or E/N atoms alternately. In main() I spawn two threads with affinity attributes:
- the first thread is set up as atom E generator, pinned to CPU1
- the other as E/N generator, pinned to CPU2
The main thread is pinned to CPU0.
How do you pin threads to a CPU within a user space program?
Before calling pthread_create(), I specify attributes for each pthread:
pthread_attr_t attr;
cpu_set_t cpus;

pthread_attr_init(&attr);
CPU_ZERO(&cpus);
CPU_SET(core_id, &cpus);
pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &cpus);
Then, a thread is spawned with &attr passed as the second argument to pthread_create(). I do CPU_ZERO(), CPU_SET() and pthread_attr_setaffinity_np() separately for each thread. I print the return value of sched_getcpu() in the thread worker function to confirm the threads are scheduled according to this code. It seems to work as desired.
- -C option works well. I run perf with:
# perf record -e cs_etm/@fe940000.etf1/u -C1 ./atom_gen
Note that this command does not exclusively trace the application atom_gen on CPU1. This command traces everything that is happening on CPU1 (in user space) for as long as application atom_gen is alive.
Thank you for pointing that out, that explains the inconsistent noise I sometimes observe. If we had filters working with -C, setting filter to some code range in ./atom_gen would result in tracing this application exclusively, correct?
Right, but as I pointed out in my previous email, filtering isn't supported in CPU-wide trace scenarios.
If you want to trace atom_gen use either --per-thread or the "taskset" utility.
Actually I tried it at the very beginning, before you published the CPU-wide tracing feature. My observation was that with --per-thread I am not able to get any trace (besides some initialization packets) from multithreaded applications. Regardless of how long my two atom generator threads were spinning, the AUX sections of perf.data were nearly empty, with zero atom packets. Only single-threaded application tracing works well with --per-thread for me.
Do you have any suggestions on how I should trace multithreaded applications?
It is currently not possible to trace multi-threaded applications. In fact it would be quite complex to implement, and with topologies where CPUs use the same sink it would be even harder. On top of that, it would yield unpredictable results, since traces could be overwritten because sinks don't generate an interrupt when full.
- -a option works unreliably. I run perf with:
# perf record -e cs_etm/@fe940000.etf1/u -a ./atom_gen
What I expect is perf.data containing similar output to what I got with -C1 plus what I got with -C2, i.e. ID:12 Atom E packets and ID:14 atom E/N packets. What actually happens is inconsistent each time I try this command. Sometimes I have no atom packets associated with IDs 12 and 14 but I have some with ID:16. Sometimes I get ID:14 atoms but no ID:12. Sometimes I get expected trace but still some noise in ID:16 packets, which I would not expect at all, since the program schedules nothing on CPU3. I wonder if I'm missing something here in my understanding of CoreSight. Is this behaviour expected?
On top of the explanation I have given above note that CPU-wide scenarios currently use the same sink. Because the memory buffer of the sink is limited it is easy to clobber traces from one CPU with traces from another. This is a problem inherent to topologies where sinks are shared between CPUs and unfortunately there is nothing to be done about it.
Can this issue be solved with TMC-ETR? I was under the impression that, unlike the ETB, which uses dedicated SRAM as a trace buffer, TMC-ETR routes trace to system RAM. Is it possible to configure TMC-ETR to target some specific RAM region of an arbitrary size?
You are correct, but the ETR will still use a single area to aggregate traces from several CPUs. The size of the AUX mmap'ed area can be changed using the "-m,x" option, where x is the number of pages to use, but the end result is the same: it is highly possible that traces will get overwritten (same reason as above, i.e. no interrupt when the buffer is full).
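For instance (illustrative only; the page count is up to you and, if I recall correctly, must be a power of two), something like:

# perf record -e cs_etm/@fe940000.etf1/u -a -m,128 ./atom_gen

asks for 128 pages of AUX buffer instead of the default, which delays the problem rather than solving it.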
Mathieu
Best regards, Wojciech
On Thu, 14 Mar 2019 at 10:00, Wojciech Żmuda wzmuda@n7space.com wrote:
Hi Mathieu,
Apologies if mailman does not see this as a reply. I'm not sure if Outlook handles In-Reply-To properly.
I'm testing CPU-wide tracing on Zynq Ultrascale+ MPSoC and I have some comments I'd like to share. Some introduction at first:
- I'm using mainline Linux from a couple of days ago (12ad143e1b80 Merge branch 'perf-urgent-for-linus'...)
- on top of it I have a couple of my changes introducing CoreSight support on US+
- on top of this I cherry-picked your two patch sets with CPU-wide tracing
I prepared a test program that's supposed to generate deterministic trace. I created a function that should, depending on the argument, create either continuous E atoms or E/N atoms alternately. In main() I spawn two threads with affinity attributes:
- the first thread is set up as atom E generator, pinned to CPU1
- the other as E/N generator, pinned to CPU2
The main thread is pinned to CPU0.
The atom generator function's body looks like below. If *atom == 'n', the branch is not taken, so an N atom should be generated; if *atom == 'e', the branch is taken and an E atom should be generated. After that, another E atom is expected, since the while loop branches back to the start. It's counter-intuitive when you look at the C code, but the if-condition actually compiles to a b.ne instruction, so the body of the if is reached when the branch is not taken.
volatile int sum = 0;

while (1) {
        // Reference by pointer, so it's not optimized out.
        if (*atom == 'n')       // compiler creates b.ne here
                sum += 0xdeadbeef * (*atom + 1);
}
Here are my observations:
- -C option works well. I run perf with:
# perf record -e cs_etm/@fe940000.etf1/u -C1 ./atom_gen
In perf report I can see lots of E atoms in packets marked with ID:12. If I collect trace with -C2 instead, I see E/N atoms in packets with ID:14. Everything works as expected each time I trace this application.
- -a option works unreliably. I run perf with:
# perf record -e cs_etm/@fe940000.etf1/u -a ./atom_gen
What I expect is perf.data containing similar output to what I got with -C1 plus what I got with -C2, i.e. ID:12 Atom E packets and ID:14 atom E/N packets. What actually happens is inconsistent each time I try this command. Sometimes I have no atom packets associated with IDs 12 and 14 but I have some with ID:16. Sometimes I get ID:14 atoms but no ID:12. Sometimes I get expected trace but still some noise in ID:16 packets, which I would not expect at all, since the program schedules nothing on CPU3. I wonder if I'm missing something here in my understanding of CoreSight. Is this behaviour expected?
- I'm not able to use filters.
I'd like to narrow down tracing to the while(1) loop in trace generator, to filter out noise from other instructions. However, I find it impossible to use --filter flag along with -C or -a:
# perf record -e cs_etm/@fe940000.etf1/u --filter 'filter atom_n_generator @./atom_gen' -a ./atom_gen
failed to set filter "filter 0x90c/0x8c@/root/atom_gen" on event cs_etm/@fe940000.etf1/u with 95 (Operation not supported)
I was able to reproduce - let me investigate.
It works fine with --per-thread. Is the behaviour expected, or is this a bug?
- Kernel crashes if used with no -a, -C or --per-thread.
If I call perf with: # perf record -e cs_etm/@fe940000.etf1/u ./atom_gen
As I suspected, the perf tools assume a "-a" option when a mode isn't specified. I tried the above on both Juno and the dragonboard 410c without a problem. What kind of perf tools are you using? Is it something coming from a busybox or another kind of trimmed-down user space?
I can see some printfs from the program, but the kernel immediately hits a NULL pointer dereference. Please find a log below. My serial connection drops characters sometimes, sorry for that. The crash happens in tmc_enable_etf_sink+0x90, which is:
/* Get a handle on the pid of the process to monitor */
if (handle->event->owner)
        pid = task_pid_nr(handle->event->owner);
The handle->event->owner seems to be NULL.
Best regards, Wojciech
- Kernel crashes if used with no -a, -C or --per-thread.
If I call perf with: # perf record -e cs_etm/@fe940000.etf1/u ./atom_gen
As I suspected, the perf tools assume a "-a" option when a mode isn't specified. I tried the above on both Juno and the dragonboard 410c without a problem. What kind of perf tools are you using? Is it something coming from a busybox or another kind of trimmed-down user space?
My user space is not trimmed down. It's Debian built with multistrap. I'm building my own perf tools from the sources of the kernel I'm running on my platform. This is the mainline kernel plus the CPU-wide feature commits on top. I double-checked and it looks like I cherry-picked everything from c9795326a43a ("coresight: pmu: Adding ITRACE property to cs_etm PMU") to 66431e6376c4 ("perf tools: Add support for CPU-wide trace scenarios") from your mathieu.poirier/coresight.git/next-20190305-cpu-wide branch.
Anyway, thanks for checking this on your side. If it works for you, perhaps I introduced the bug on my side. I'll investigate it further. Good to know that I'm not missing anything - for now I can just specify "-a" explicitly.
Thank you for help and best regards, Wojciech
From: Mathieu Poirier mathieu.poirier@linaro.org
Sent: Thursday, March 14, 2019 9:46:54 PM
To: Wojciech Żmuda
Cc: coresight@lists.linaro.org; Michał Kurowski; Krzysztof Pilch; Michal Mosdorf
Subject: Re: [RFC 00/20] coresight: Add support for CPU-wide trace scenarios
On Thu, 14 Mar 2019 at 10:00, Wojciech Żmuda wzmuda@n7space.com wrote:
Hi Mathieu,
Apologies if mailman does not see this as a reply. I'm not sure if Outlook handles In-Reply-To properly.
I'm testing CPU-wide tracing on Zynq Ultrascale+ MPSoC and I have some comments I'd like to share. Some introduction at first:
- I'm using mainline Linux from a couple of days ago (12ad143e1b80 Merge branch 'perf-urgent-for-linus'...)
- on top of it I have a couple of my changes introducing CoreSight support on US+
- on top of this I cherry-picked your two patch sets with CPU-wide tracing
I prepared a test program that's supposed to generate deterministic trace. I created a function that should, depending on the argument, create either continuous E atoms or E/N atoms alternately. In main() I spawn two threads with affinity attributes:
- the first thread is set up as atom E generator, pinned to CPU1
- the other as E/N generator, pinned to CPU2
The main thread is pinned to CPU0.
The atom generator function's body looks like below. If *atom == 'n', the branch is not taken, so an N atom should be generated; if *atom == 'e', the branch is taken and an E atom should be generated. After that, another E atom is expected, since the while loop branches back to the start. It's counter-intuitive when you look at the C code, but the if-condition actually compiles to a b.ne instruction, so the body of the if is reached when the branch is not taken.
volatile int sum = 0;

while (1) {
        // Reference by pointer, so it's not optimized out.
        if (*atom == 'n')       // compiler creates b.ne here
                sum += 0xdeadbeef * (*atom + 1);
}
Here are my observations:
- -C option works well. I run perf with:
# perf record -e cs_etm/@fe940000.etf1/u -C1 ./atom_gen
In perf report I can see lots of E atoms in packets marked with ID:12. If I collect trace with -C2 instead, I see E/N atoms in packets with ID:14. Everything works as expected each time I trace this application.
- -a option works unreliably. I run perf with:
# perf record -e cs_etm/@fe940000.etf1/u -a ./atom_gen
What I expect is perf.data containing similar output to what I got with -C1 plus what I got with -C2, i.e. ID:12 Atom E packets and ID:14 atom E/N packets. What actually happens is inconsistent each time I try this command. Sometimes I have no atom packets associated with IDs 12 and 14 but I have some with ID:16. Sometimes I get ID:14 atoms but no ID:12. Sometimes I get expected trace but still some noise in ID:16 packets, which I would not expect at all, since the program schedules nothing on CPU3. I wonder if I'm missing something here in my understanding of CoreSight. Is this behaviour expected?
- I'm not able to use filters.
I'd like to narrow down tracing to the while(1) loop in trace generator, to filter out noise from other instructions. However, I find it impossible to use --filter flag along with -C or -a:
# perf record -e cs_etm/@fe940000.etf1/u --filter 'filter atom_n_generator @./atom_gen' -a ./atom_gen
failed to set filter "filter 0x90c/0x8c@/root/atom_gen" on event cs_etm/@fe940000.etf1/u with 95 (Operation not supported)
I was able to reproduce - let me investigate.
It works fine with --per-thread. Is the behaviour expected, or is this a bug?
- Kernel crashes if used with no -a, -C or --per-thread.
If I call perf with: # perf record -e cs_etm/@fe940000.etf1/u ./atom_gen
As I suspected, the perf tools assume a "-a" option when a mode isn't specified. I tried the above on both Juno and the dragonboard 410c without a problem. What kind of perf tools are you using? Is it something coming from a busybox or another kind of trimmed-down user space?
I can see some printfs from the program, but the kernel immediately hits a NULL pointer dereference. Please find a log below. My serial connection drops characters sometimes, sorry for that. The crash happens in tmc_enable_etf_sink+0x90, which is:
/* Get a handle on the pid of the process to monitor */
if (handle->event->owner)
        pid = task_pid_nr(handle->event->owner);
The handle->event->owner seems to be NULL.
Best regards, Wojciech
Good morning,
On Thu, 14 Mar 2019 at 10:00, Wojciech Żmuda wzmuda@n7space.com wrote:
Hi Mathieu,
Apologies if mailman does not see this as a reply. I'm not sure if Outlook handles In-Reply-To properly.
I'm testing CPU-wide tracing on Zynq Ultrascale+ MPSoC and I have some comments I'd like to share. Some introduction at first:
- I'm using mainline Linux from a couple of days ago (12ad143e1b80 Merge branch 'perf-urgent-for-linus'...)
- on top of it I have a couple of my changes introducing CoreSight support on US+
- on top of this I cherry-picked your two patch sets with CPU-wide tracing
I prepared a test program that's supposed to generate deterministic trace. I created a function that should, depending on the argument, create either continuous E atoms or E/N atoms alternately. In main() I spawn two threads with affinity attributes:
- the first thread is set up as atom E generator, pinned to CPU1
- the other as E/N generator, pinned to CPU2
The main thread is pinned to CPU0.
The atom generator function's body looks like below. If *atom == 'n', the branch is not taken, so an N atom should be generated; if *atom == 'e', the branch is taken and an E atom should be generated. After that, another E atom is expected, since the while loop branches back to the start. It's counter-intuitive when you look at the C code, but the if-condition actually compiles to a b.ne instruction, so the body of the if is reached when the branch is not taken.
volatile int sum = 0;

while (1) {
        // Reference by pointer, so it's not optimized out.
        if (*atom == 'n')       // compiler creates b.ne here
                sum += 0xdeadbeef * (*atom + 1);
}
Here are my observations:
- -C option works well. I run perf with:
# perf record -e cs_etm/@fe940000.etf1/u -C1 ./atom_gen
In perf report I can see lots of E atoms in packets marked with ID:12. If I collect trace with -C2 instead, I see E/N atoms in packets with ID:14. Everything works as expected each time I trace this application.
- -a option works unreliably. I run perf with:
# perf record -e cs_etm/@fe940000.etf1/u -a ./atom_gen
What I expect is perf.data containing similar output to what I got with -C1 plus what I got with -C2, i.e. ID:12 Atom E packets and ID:14 atom E/N packets. What actually happens is inconsistent each time I try this command. Sometimes I have no atom packets associated with IDs 12 and 14 but I have some with ID:16. Sometimes I get ID:14 atoms but no ID:12. Sometimes I get expected trace but still some noise in ID:16 packets, which I would not expect at all, since the program schedules nothing on CPU3. I wonder if I'm missing something here in my understanding of CoreSight. Is this behaviour expected?
- I'm not able to use filters.
I'd like to narrow down tracing to the while(1) loop in trace generator, to filter out noise from other instructions. However, I find it impossible to use --filter flag along with -C or -a:
# perf record -e cs_etm/@fe940000.etf1/u --filter 'filter atom_n_generator @./atom_gen' -a ./atom_gen
failed to set filter "filter 0x90c/0x8c@/root/atom_gen" on event cs_etm/@fe940000.etf1/u with 95 (Operation not supported)
It works fine with --per-thread. Is the behaviour expected, or is this a bug?
I had time to look into this problem and found the answer in a comment in the code:
https://elixir.bootlin.com/linux/latest/source/kernel/events/core.c#L9012
Mathieu
- Kernel crashes if used with no -a, -C or --per-thread.
If I call perf with: # perf record -e cs_etm/@fe940000.etf1/u ./atom_gen
I can see some printfs from the program, but the kernel immediately hits a NULL pointer dereference. Please find a log below. My serial connection drops characters sometimes, sorry for that. The crash happens in tmc_enable_etf_sink+0x90, which is:
/* Get a handle on the pid of the process to monitor */
if (handle->event->owner)
        pid = task_pid_nr(handle->event->owner);
The handle->event->owner seems to be NULL.
Best regards, Wojciech