Hi there,
There's considerable activity around the scheduler lately, and how to adapt it to the peculiarities of the new class of hardware coming out, like the big.LITTLE devices from a number of manufacturers.
The platforms that Linux runs on are very diverse, and run differing workloads. For example, most consumer devices will very likely run something like Android, with common use cases such as audio and/or video playback. Methods to achieve lower power consumption using a power-aware scheduler are under investigation.
Similarly for server applications, or VM hosting, the behavior of the scheduler shouldn't have adverse performance implications; any power saving on top of that would be a welcome improvement.
The current issue is that scheduler development is not easily shared between developers. Each developer has their own 'itch', be it Android use cases, server workloads, VM, etc. The risk is high of optimizing for one's own use case and causing severe degradation on most other use cases.
One way to fix this problem would be to develop a method with which one could run a given use-case workload on a host, record the activity in an interchangeable, portable trace-format file, and then play it back on another host via a playback application that generates a load approximating the one observed during recording.
The way the two hosts respond under the same load generated by the playback application can then be compared, so that the two scheduler implementations can be evaluated against various metrics (performance, power consumption, etc.).
The fidelity of this approximation is of great importance, but it is debatable whether a fully identical load can be generated, since details of the hosts might differ in such a way that this is impossible. I believe that it should at least be possible to simulate a pure CPU load, and the blocking behavior of tasks, in such a way that the resulting scheduler decisions can be compared and shared among developers.
The recording part, I believe, can be handled by the kernel's tracing infrastructure, either by using the existing tracepoints or, if need be, by adding more; possibly even creating a new tracer solely for this purpose. Since some applications can adapt their behavior to insufficient system resources (think of media players that can drop frames, for example), I think it would be beneficial to record such events in the same trace file.
The trace file should have a portable format so that it can be freely shared between developers. An ASCII format like the one we currently use should be OK, as long as it doesn't perturb execution too much during recording.
The playback application can be implemented in one of two ways.

One way, the LinSched way, would be to have the full scheduler implementation compiled as part of said application, and use application-specific methods to evaluate performance. While this would work, it won't allow comparison of the two hosts in a meaningful manner.
For both scheduler and platform evaluation, the playback application will generate the load on the running host by simulating the source host's recorded workload session. That means emulating process activity like forks, thread spawning, blocking on resources, etc. It is not yet clear to me whether that's possible without some kind of kernel-level helper module, but not requiring one is desirable.
Since one would have the full trace of scheduling activity (past, present and future), it would be possible to generate a perfect schedule (defined as best performance, or best power consumption) and use it as a yardstick against the actual scheduler. Comparing the results, you would get an estimate of the best-case improvement that could be achieved if the ideal scheduler existed.
I know this is a bit long, but I hope this can be a basis for thinking about how to go about developing this.
Regards
-- Pantelis
On Thu, Mar 08, 2012 at 03:20:53PM +0200, Pantelis Antoniou wrote:
Hi there,

[snip]
Hi,
Maybe you could have a look at the perf sched tool. It has a replay feature. I think it performs basic replay well, but it can certainly be enhanced.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Hi Frederic,
On Mar 8, 2012, at 5:45 PM, Frederic Weisbecker wrote:
On Thu, Mar 08, 2012 at 03:20:53PM +0200, Pantelis Antoniou wrote:
Hi there,
[snip]
Hi,
Maybe you could have a look at the perf sched tool. It has a replay feature. I think it performs basic replay well, but it can certainly be enhanced.
Yes I am aware of perf-sched, and I know that it can do basic record and replay of the wakeup, switch & fork events.
It looks like it can be the starting point of what we're trying to do, but I doubt this will cover all the cases.
That's why I'm trying to have a discussion about this problem: to find out what is there, what is missing, and what needs fixing or developing anew.
So what do you think?
Regards
-- Pantelis
On Thu, Mar 8, 2012 at 6:50 PM, Pantelis Antoniou panto@antoniou-consulting.com wrote:
Hi there,
[snip]
I believe many people would have had this simple idea, but I don't know why, or if, it's bad. So I am going to ask.
Why not have much coarser, but deterministic, load patterns using user-space apps (perhaps modified to log important characteristics of execution)?

We could have, say, three sets of stress patterns, one each for Server, Desktop and Mobile. Only the top-level usage pattern would be deterministic (say, by having some app-spawning script running from init, with little or no external influence).

Say the 'Mobile' profile script could spawn multimedia playback, encoding/decoding, 3D game playback, storage access and suspend/resume cycles in some parallel and serial manner. Each task at the end reports how it was treated during its lifetime (total dropped frames, average latency, overall power consumed, etc.), from which we calculate a 'GPA'. For any modification to the scheduler, we could see how it affects the current score for each profile running on the respective reference platforms.
Kind Regards Yadi
ps: I had to drop Amit Kucheria <amit.kucheria@li, otherwise my post wouldn't fire.
--------------------- Jo darr gaya, so marr gaya!
Hi Yadi,
On Mar 8, 2012, at 7:40 PM, Yadwinder Singh Brar wrote:
[snip]
The problem is defining that characteristic load pattern. Which is it? It might be one set of things today, something different tomorrow. Not only is the kernel evolving; media applications evolve too. In the end you will end up with some workloads that are treated as benchmarks, and manufacturers will start tweaking for them.
On top of that, the most common consumer Linux platform is Android. I bet most kernel developers do not run Android as their main platform, but they do care to test whether their changes affect Android performance.
There is value, however, in recording these characteristic use cases and keeping them in a repository of traces, so that anyone hacking on the scheduler can compare results.
Regards
-- Pantelis
On 03/08/2012 05:20 PM, Pantelis Antoniou wrote:
[snip]
Have you tried to investigate whether the 'perf' tool with its 'sched record' and 'sched replay' features might be useful for such a purpose?
I tried to record and replay various types of commonly used benchmarks, including CPU-, I/O- and network-intensive workloads, and have to say that the recording and (especially) replaying overhead is quite high, at least for the default Panda board configuration (where main I/O is slow due to the root file system being on an SD card). Simple things like 'perf sched record sleep 10' work in most cases (but may still cause sample loss, up to 10-20%). But when I tried to add some I/O, for example with 'find /', the total workload became too high and the system (almost) hangs with a lot of messages like:
INFO: task kjournald:512 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
INFO: rcu_preempt detected stalls on CPUs/tasks: 8055ec64 0 512 2 0x00000000
INFO: Stall ended before state dump start
INFO: task kjournald:512 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
INFO: task flush-179:0:511 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
INFO: task kjournald:512 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Now I'm checking whether it's possible to do some partial recording (by skipping some kinds of unrelated samples) and offload the kernel tracing subsystem to give more CPU time to the user-space tasks.
Do you have any thoughts about this?
Thanks, Dmitry
Dmitry,
Already seen that, and I have spent a few weeks coming up with a solution. At this point it's internal to TI and ARM, but it doesn't hurt to get it out for broader discussion.
Please note that this is a proof of concept so far.
Regards
-- Pantelis
I've managed to clean up the patch and rebase it against current mainline; I'm also including the source file.
What the patch does is add an extra perf command, perf sched spr-replay <options>, which allows you to analyze an existing perf trace, simulate the load it exhibits, or even generate and replay a human-readable load form.
Example session:
# perf sched record
< run process, etc.>
^C <- perf ends producing a perf.data file.
# perf sched spr-replay -l
<lists tasks and pids in the trace>
# perf sched spr-replay -s <pid#1> -s <name> -d -n -g
<shows the trace program for the processes with pid #1 and named <name>>
# perf sched spr-replay -s <pid#1> -g >test.spr
<generates test.spr, which contains the program of pid #1>
# perf sched spr-replay -f test.spr -d
<execute the test.spr program>
Make sure you apply the compile-fix patch first, because without it perf doesn't compile at mainline.
As you well know, perf has the capability of recording and playing back scheduling activity. This is done by issuing the perf sched record & perf sched replay commands.
How this works is as follows: perf sched record uses the tracing facilities of the kernel to record events relating to scheduling activity, and perf sched replay takes those traces and constructs an in-memory representation of the activity of a task as a list of so-called scheduling atoms. Those atoms can only be of the following 4 types:
- SCHED_EVENT_RUN - records a duration of task execution
- SCHED_EVENT_SLEEP - records a task's wait for a resource (or sleeping)
- SCHED_EVENT_WAKEUP - records one task waking up another
- SCHED_EVENT_MIGRATION - records migrations between CPUs
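To make the per-task timeline concrete, here is a rough C sketch of such an in-memory atom list. The field names and layout are illustrative only, not perf's exact definitions in tools/perf/builtin-sched.c:

```c
#include <stddef.h>
#include <stdint.h>

/* The four atom types perf sched replay reconstructs from the trace. */
enum sched_event_type {
	SCHED_EVENT_RUN,
	SCHED_EVENT_SLEEP,
	SCHED_EVENT_WAKEUP,
	SCHED_EVENT_MIGRATION,
};

/* One atom in a task's replay timeline (illustrative layout). */
struct sched_atom {
	enum sched_event_type type;
	uint64_t timestamp;		/* ns since trace start */
	uint64_t duration;		/* ns, meaningful for RUN atoms */
	int wakee_pid;			/* meaningful for WAKEUP atoms */
	struct sched_atom *next;	/* next atom in this task's timeline */
};

/* Total CPU time a task's atom list asks the replayer to burn. */
uint64_t atoms_total_runtime(const struct sched_atom *a)
{
	uint64_t total = 0;

	for (; a; a = a->next)
		if (a->type == SCHED_EVENT_RUN)
			total += a->duration;
	return total;
}
```

The replayer walks such a list per task, burning CPU for RUN atoms and blocking for SLEEP atoms until the recorded waker fires the corresponding WAKEUP.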
The problem with this method becomes apparent when dumping those atoms and examining the schedule produced for a simple task that hogs the CPU for 5 seconds. The program in question can be as simple as this:
main() { spin_secs(5); }
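For completeness, the spin_secs() helper used above could be implemented as a plain busy loop; this is a minimal sketch (spin_secs is not a library function, just a name used in this mail), generating pure RUN activity with no sleeping or blocking:

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Current monotonic time in seconds. */
static double now_sec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Busy-wait for 'secs' seconds: pure CPU load, no blocking. */
void spin_secs(double secs)
{
	double deadline = now_sec() + secs;

	while (now_sec() < deadline)
		;	/* burn cycles */
}
```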
Dumping the atoms for this program gives:

00001427795410 FORK_CHILD 1802
00001602966309 WAKEUP [1801@00001450469971]
00001790466309 WAKEUP [1801@00001630523682]
00001985778809 WAKEUP [1801@00001818725586]
00002181091309 WAKEUP [1801@00002014709473]
00002296478272 WAKEUP [1565@00001302673340]
00002296569824 RUN 00000862213134
00002296569824 SLEEP R (running) 0
00002360778809 WAKEUP [1801@00002210021973]
00002556060791 WAKEUP [1801@00002390869141]
00002751403809 WAKEUP [1801@00002584991455]
00002946716309 WAKEUP [1801@00002780395508]
00003134216309 WAKEUP [1801@00002975799561]
00003296508789 WAKEUP [1565@00002302764893]
00003296752930 RUN 00000993988037
00003296752930 SLEEP R (running) 0
00003306091309 WAKEUP [1801@00003162384033]
00003509216309 WAKEUP [1801@00003341369629]
00003704528809 WAKEUP [1801@00003530242920]
00003899871826 WAKEUP [1801@00003734405518]
00004095153809 WAKEUP [1801@00003928863525]
00004282623291 WAKEUP [1801@00004124206543]
00004296722412 WAKEUP [1565@00003302825928]
00004296875000 RUN 00000994049072
00004296875000 SLEEP R (running) 0
00004466857910 WAKEUP [1801@00004333312988]
00004657653809 WAKEUP [1801@00004516815186]
00004852935791 WAKEUP [1801@00004681182861]
00005048278809 WAKEUP [1801@00004882690430]
00005243591309 WAKEUP [1801@00005085571289]
00005296508789 WAKEUP [1565@00004303466797]
00005296752930 RUN 00000993286133
00005296752930 SLEEP R (running) 0
00005332183838 RUN 00000029357910
00005332183838 SLEEP R (running) 0
00005415466309 WAKEUP [1801@00005272033691]
00005610748291 WAKEUP [1801@00005437622070]
00005798278809 WAKEUP [1801@00005640533447]
00005993591309 WAKEUP [1801@00005826477051]
00006188873291 WAKEUP [1801@00006029937744]
00006296478272 WAKEUP [1565@00005302825928]
00006296569824 RUN 00000957977295
00006296569824 SLEEP R (running) 0
00006368591309 WAKEUP [1801@00006217071533]
00006424987793 EXIT
00006428070068 WAKEUP [1802@00001434356690]
00006434265137 RUN 00000131439209
00006434265137 SLEEP x (dead) 0

As you can see, every activity is recorded, even superfluous wakeups due to fork handling, idle-task activity, disk-flush activity, etc. The SLEEP R (running) entries at regular intervals correspond to the task being scheduled off the CPU. It is arguable that this is not what is intended when we want to capture the activity of a number of user-space processes. We do want to 'catch' cross-task sleeping and waking up, but we don't care about activity that the kernel will regenerate when we simulate the task. On top of that, something like this is impossible to write by hand for a developer who wants a simplified scheduling test case to share with others.
The solution I came up with is to transform this into a sequence of simplified task actions; so far I've defined the following (plus a few others not used for now):
- TA_BURN - Burn CPU by spinning for a given duration
- TA_SLEEP - Sleep for a given duration
- TA_WAIT_ID - Wait for a signal on the given ID
- TA_SIGNAL_ID - Signal a given ID
- TA_EXIT - process terminates
- TA_CLONE_PARENT - A clone is executed at this point; the executing task is the parent of the child task given as parameter
- TA_CLONE_CHILD - (Always first in sequence) The child executes, started by the parent task given as parameter
- TA_END - End of task actions
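The action vocabulary above might be represented in C roughly as follows; the struct fields and the helper are illustrative, not the actual tool's types:

```c
#include <stddef.h>
#include <stdint.h>

/* The simplified task actions, mirroring the list above. */
enum task_action_type {
	TA_BURN,		/* spin for duration_ns */
	TA_SLEEP,		/* sleep for duration_ns */
	TA_WAIT_ID,		/* block until 'id' is signalled */
	TA_SIGNAL_ID,		/* signal 'id' */
	TA_EXIT,		/* process terminates */
	TA_CLONE_PARENT,	/* clone; 'id' is the child's pid */
	TA_CLONE_CHILD,		/* always first; 'id' is the parent's pid */
	TA_END,			/* end of task actions */
};

struct task_action {
	enum task_action_type type;
	uint64_t duration_ns;	/* TA_BURN / TA_SLEEP */
	int id;			/* TA_WAIT_ID / TA_SIGNAL_ID / TA_CLONE_* */
};

/* Number of actions in a program, up to and including TA_END. */
size_t ta_program_len(const struct task_action *prog)
{
	size_t n = 0;

	while (prog[n++].type != TA_END)
		;
	return n;
}
```

A player would then walk such an array per task, mapping TA_BURN to a busy loop, TA_SLEEP to nanosleep(), and TA_WAIT_ID/TA_SIGNAL_ID to some cross-task synchronization primitive.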
By processing the scheduling atoms in the list above, it is possible to generate a program (when the process doing the spin loop is given as the sole interesting one) which has the form:
TASK: name :taskA5-101, pid 1803
[00000000000000] BURN 00004562927245ns
[00004562927245] EXIT
[00004562927245] END
As you can see, this is pretty readable: it is trivial for a human to understand what the program does and, more importantly, to tweak it and share it with other developers.
There are a whole bunch of details, but the gist is this: given a perf record trace, we can generate a simplified program sequence that has the same effect, scheduling-wise, as the original recorded program, removing anomalies that have to do with the kernel tasks of the host on which the program ran.
Similarly for a program like this:
main()
{
	pid = fork();
	if (pid == 0) {
		spin_for(5);
		exit(0);
	}
	wait(pid);
}
The following simplified program is generated:
Task #1: (Parent)
[00000000000000] BURN 00001074249265ns
[00001074249265] CLONE_PARENT 1803
[00001074249265] BURN 00000006561280ns
[00001080810545] SLEEP 00004997039794ns
[00006077850339] BURN 00000000579834ns
[00006078430173] EXIT
[00006078430173] END
Task #2: (Child)
[00000000000000] CLONE_CHILD 1802
[00000000000000] BURN 00004562927245ns
[00004562927245] EXIT
[00004562927245] END
The original program was this: http://pastebin.com/iarw20zH
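Parsing the .spr text back is straightforward; here is a minimal sketch of a one-line parser, with the line format inferred purely from the examples in this mail (the real tool may differ):

```c
#include <inttypes.h>
#include <stdio.h>

/* Parse one line of the generated program text, e.g.
 *   [00001074249265] BURN 00000006561280ns
 * Returns 1 on success, 0 on a malformed line. Lines without a
 * numeric argument (EXIT, END) leave *arg at 0; a trailing "ns"
 * suffix is simply not consumed by the last conversion. */
int spr_parse_line(const char *line, uint64_t *when, char verb[16],
		   uint64_t *arg)
{
	*arg = 0;
	return sscanf(line, " [%" SCNu64 "] %15s %" SCNu64,
		      when, verb, arg) >= 2;
}
```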
Hope this will get some discussion started until I get my patches done and can give you something to play with. Please note that the code is nowhere near ready to be submitted anywhere yet; it's a proof of concept.
On Apr 2, 2012, at 1:09 PM, Dmitry Antipov wrote:
[snip]
On 04/02/2012 02:15 PM, Pantelis Antoniou wrote:
[snip]
Do you have any thoughts on how to replay multithreaded workloads? We definitely want to replay per-process, not per-thread, workloads.
Dmitry
Hi Dmitry,
There is no big difference between threads and processes as far as the kernel is concerned; both are created by the same clone syscall, and share (or not) the address space, fds, etc.
If you want to replay per-process workloads, you can just record only the activity of the pids of the threads of the process.
In the future I will try to record thread creation, but I'm not sure if the current tracepoints allow differentiating between a fork and a thread creation; a new tracepoint might be needed, which I explicitly wanted to avoid.
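For what it's worth, if some trace event did carry the clone_flags of the new task (no current sched tracepoint does, hence the problem above), telling a thread apart from a forked process would be a single flag test, since pthread_create() passes CLONE_THREAD to clone() while fork() does not. A sketch of that hypothetical check:

```c
#define _GNU_SOURCE
#include <sched.h>

/* Hypothetical helper: classify a task-creation event from its
 * clone flags. CLONE_THREAD is set for thread creation only. */
int is_thread_creation(unsigned long clone_flags)
{
	return (clone_flags & CLONE_THREAD) != 0;
}
```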
Regards
-- Pantelis
On Apr 4, 2012, at 12:57 PM, Dmitry Antipov wrote:
[snip]
Dmitry,
Ah, about the load: it's because perf sched record adds too many events to the recording (and configures small buffers for perf). Using a smaller set of events works much better.
One thing I did was to record to /tmp - you have enough memory for this to work. I'll try to come up with the optimal record set later today and share it.
Regards
-- Pantelis
On Apr 2, 2012, at 1:09 PM, Dmitry Antipov wrote:
[snip]
On 04/02/2012 02:18 PM, Pantelis Antoniou wrote:
Ah, about the load: it's because perf sched record adds too many events to the recording (and configures small buffers for perf). Using a smaller set of events works much better.
I tried different subsets of 'sched:*' events, but it didn't help much - shell interactivity drops to almost zero for everything beyond 'perf sched record sleep 10'.
One thing I did was to record to /tmp; you have enough memory for this to work.
Even with this, I see periodic notices about lost samples. It looks like the perf subsystem is quite CPU-intensive even when the workload itself is just something like 'sleep 10'.
Dmitry
On Apr 4, 2012, at 1:13 PM, Dmitry Antipov wrote:
On 04/02/2012 02:18 PM, Pantelis Antoniou wrote:
Ah, about the load: it's because perf sched record adds too many events to the recording (and configures small buffers for perf). Using a smaller set of events works much better.
I tried different subsets of 'sched:*' events, but it didn't help much - shell interactivity drops to almost zero for everything beyond 'perf sched record sleep 10'.
One thing I did was to record to /tmp; you have enough memory for this to work.
Even with this, I see periodic notices about lost samples. It looks like the perf subsystem is quite CPU-intensive even when the workload itself is just something like 'sleep 10'.
Dmitry
That is quite weird; I've done all my tests with a Pandaboard as well. Are you using an NFS root? What kernel version are you running?
But yes, perf is a hog; ASCII trace output definitely stresses the system. Gimme a few hours to see if I can come up with a record configuration that's lighter.
Regards
-- Pantelis
Dmitry,
The reason for the slowdown is that the perf sched record default settings are tuned pretty much for x86, and a huge amount of data is being generated.
perf sched record is just a wrapper for perf record, so try using this script for recording:
#!/bin/sh
perf record \
	-a \
	-R \
	-f \
	-m 8192 \
	-c 1 \
	-e sched:sched_switch \
	-e sched:sched_process_exit \
	-e sched:sched_process_fork \
	-e sched:sched_wakeup \
	-e sched:sched_migrate_task
You can verify that it works by looking at the number of times perf got woken up; typically it's something like this:

root@omap4430-panda:~# time ./perf-sched-record.sh
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.031 MB perf.data (~1357 samples) ]

real	0m8.226s
user	0m0.016s
sys	0m0.641s

While running vanilla perf sched record you get this:

root@omap4430-panda:~# time ./perf sched record
^C[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 11.678 MB perf.data (~510240 samples) ]
Processed 120671 events and lost 1 chunks!
Check IO/CPU overload!

real	0m11.039s
user	0m0.141s
sys	0m1.266s
That works for spr-replay just fine; it might work for you as well.
Regards
-- Pantelis
On Apr 4, 2012, at 1:13 PM, Dmitry Antipov wrote:
On 04/02/2012 02:18 PM, Pantelis Antoniou wrote:
Ah, about the load: it's because perf sched record adds too many events to the recording (and configures small buffers for perf). Using a smaller set of events works much better.
I tried different subsets of 'sched:*' events, but it didn't help much - shell interactivity drops to almost zero for everything beyond 'perf sched record sleep 10'.
One thing I did was to record to /tmp; you have enough memory for this to work.
Even with this, I see periodic notices about lost samples. It looks like the perf subsystem is quite CPU-intensive even when the workload itself is just something like 'sleep 10'.
Dmitry
On 04/04/2012 05:10 PM, Pantelis Antoniou wrote:
The reason for the slowdown is that the perf sched record default settings are tuned pretty much for x86, and a huge amount of data is being generated.
perf sched record is just a wrapper for perf record, so try using this script for recording:
#!/bin/sh
perf record \
	-a \
	-R \
	-f \
	-m 8192 \
	-c 1 \
	-e sched:sched_switch \
	-e sched:sched_process_exit \
	-e sched:sched_process_fork \
	-e sched:sched_wakeup \
	-e sched:sched_migrate_task
That's OK, but I got the following assertion failure while processing with 'perf sched spr-replay -l':
perf: builtin-sched.c:2629: execute_wait_id: Assertion `ret == 0 || ret == 11' failed.
The program continues, but may report tens of these assertions.
And one more question, about threads. I'm trying to record/replay a test application based on the WebKit framework, which tends to spawn service threads from time to time. So the 'perf sched spr-replay -l' output may look like the following:
[perf/1503] [kworker/0:2/1439] [ksoftirqd/0/3] [testbrowser/1504] [swapper/1/0] [kworker/1:2/1437] [sshd/1232] [testbrowser/1505] [testbrowser/1506] [ksoftirqd/1/9] [testbrowser/1507] [testbrowser/1508] [sync_supers/179] [testbrowser/1509] [testbrowser/1510] [testbrowser/1511] [testbrowser/1512] [testbrowser/1513] [testbrowser/1514] [testbrowser/1515] [testbrowser/1516] [testbrowser/1517] [testbrowser/1518] [testbrowser/1519] [testbrowser/1520] [testbrowser/1521] [testbrowser/1522] [testbrowser/1523] [testbrowser/1524] [testbrowser/1525] [testbrowser/1526] [testbrowser/1527] [testbrowser/1528] [testbrowser/1529] [testbrowser/1530] [testbrowser/1531] [testbrowser/1532] [testbrowser/1533] [testbrowser/1534] [testbrowser/1535] [testbrowser/1536] [testbrowser/1537] [testbrowser/1538] [testbrowser/1539] [testbrowser/1540] [testbrowser/1541] [testbrowser/1542] [testbrowser/1543] [testbrowser/1544] [testbrowser/1545] [testbrowser/1546] [testbrowser/1547] [testbrowser/1548] [testbrowser/1549] [testbrowser/1550] [testbrowser/1551] [testbrowser/1552] [testbrowser/1553] [testbrowser/1554] [testbrowser/1555] [flush-179:0/1480] [testbrowser/1556] [testbrowser/1557] [kjournald/510] [mmcqd/0/499] [khungtaskd/347] [testbrowser/1558] [testbrowser/1559] [testbrowser/1560] [testbrowser/1561] [testbrowser/1562] [testbrowser/1563] [rsyslogd/573] [testbrowser/1564] [testbrowser/1565] [testbrowser/1566] [testbrowser/1567] [testbrowser/1568] [testbrowser/1569] [testbrowser/1570] [testbrowser/1571] [testbrowser/1572] [testbrowser/1573] [testbrowser/1574] [testbrowser/1575] [testbrowser/1576] [testbrowser/1577] [testbrowser/1578] [testbrowser/1579] [testbrowser/1580] [testbrowser/1581] [testbrowser/1582] [testbrowser/1583] [testbrowser/1584] [testbrowser/1585] [testbrowser/1586] [testbrowser/1587] [testbrowser/1588] [testbrowser/1589] [testbrowser/1590] [testbrowser/1591] [testbrowser/1592] [testbrowser/1593] [testbrowser/1594] [cron/1109] [testbrowser/1595] [testbrowser/1596] 
[testbrowser/1597] [testbrowser/1598] [testbrowser/1599] [testbrowser/1600] [testbrowser/1601] [testbrowser/1602] [testbrowser/1603] [testbrowser/1604] [testbrowser/1605] [testbrowser/1606] [testbrowser/1607] [testbrowser/1608] [testbrowser/1609] [testbrowser/1610] [testbrowser/1611] [testbrowser/1612] [testbrowser/1613] [testbrowser/1614] [testbrowser/1615] [testbrowser/1616] [testbrowser/1617] [testbrowser/1618] [testbrowser/1619] [testbrowser/1620] [testbrowser/1621] [testbrowser/1622] [testbrowser/1623] [testbrowser/1624] [testbrowser/1625] [testbrowser/1626] [testbrowser/1627] [testbrowser/1628] [testbrowser/1629]
There are a lot of threads created by the main thread [testbrowser/1504]. Is there a way to find the biggest CPU heaters among them?
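For what it's worth, the kind of per-thread accounting being asked for can be sketched from sched_switch events alone; the event tuple layout below is illustrative, not perf's actual record format:

```python
from collections import defaultdict

def cpu_time_per_task(switch_events):
    # Accumulate on-CPU time per task from sched_switch records.
    # Each event is (timestamp_ns, cpu, prev_task, next_task); a task
    # is charged from the switch that put it on a CPU until the switch
    # that took it off again.
    runtime = defaultdict(int)
    on_cpu = {}  # cpu -> (task, on_cpu_since_ns)
    for ts, cpu, prev, nxt in sorted(switch_events):
        if cpu in on_cpu:
            task, since = on_cpu[cpu]
            runtime[task] += ts - since
        on_cpu[cpu] = (nxt, ts)
    return dict(runtime)

events = [
    (100, 0, "swapper/0", "testbrowser/1504"),
    (400, 0, "testbrowser/1504", "testbrowser/1505"),
    (450, 0, "testbrowser/1505", "swapper/0"),
]
totals = cpu_time_per_task(events)
# testbrowser/1504 ran 100..400 -> 300 ns; testbrowser/1505 ran 400..450 -> 50 ns
```

Sorting `totals` by value would then surface the heaviest CPU consumers among the testbrowser threads.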
Dmitry
Hi Dmitry,
On Apr 9, 2012, at 1:07 PM, Dmitry Antipov wrote:
On 04/04/2012 05:10 PM, Pantelis Antoniou wrote:
The reason for the slowdown is that the perf sched record default settings are tuned pretty much for x86, and a huge amount of data is being generated.
perf sched record is just a wrapper for perf record, so try using this script for recording:
#!/bin/sh
perf record \
	-a \
	-R \
	-f \
	-m 8192 \
	-c 1 \
	-e sched:sched_switch \
	-e sched:sched_process_exit \
	-e sched:sched_process_fork \
	-e sched:sched_wakeup \
	-e sched:sched_migrate_task
That's OK, but I got the following assertion failure while processing with 'perf sched spr-replay -l':
OK
perf: builtin-sched.c:2629: execute_wait_id: Assertion `ret == 0 || ret == 11' failed.
Hmm, a futex op fails? Mind sending me the trace file so I can test it locally?
The program continues, but may report tens of these assertions.
And one more question, about threads. I'm trying to record/replay a test application based on the WebKit framework, which tends to spawn service threads from time to time. So the 'perf sched spr-replay -l' output may look like the following:
[...]
There are a lot of threads created by the main thread [testbrowser/1504]. Is there a way to find the biggest CPU heaters among them?
I see your problem; I'll whip something up to report the accumulated CPU run time per thread.
Dmitry
Regards
-- Pantelis
Hi Dmitry,
Here's an updated patch for builtin-sched.c that should fix your issues.
Now when you issue the list command, a field will show the number of nsecs the task spent burning cycles.
It should also fix the crash you've encountered.
Regards
-- Pantelis
On Apr 9, 2012, at 1:07 PM, Dmitry Antipov wrote:
On 04/04/2012 05:10 PM, Pantelis Antoniou wrote:
The reason for the slowdown is that the perf sched record default settings are tuned pretty much for x86, and a huge amount of data is being generated.
perf sched record is just a wrapper for perf record, so try using this script for recording:
#!/bin/sh
perf record \
	-a \
	-R \
	-f \
	-m 8192 \
	-c 1 \
	-e sched:sched_switch \
	-e sched:sched_process_exit \
	-e sched:sched_process_fork \
	-e sched:sched_wakeup \
	-e sched:sched_migrate_task
That's OK, but I got the following assertion failure while processing with 'perf sched spr-replay -l':
perf: builtin-sched.c:2629: execute_wait_id: Assertion `ret == 0 || ret == 11' failed.
The program continues, but may report tens of these assertions.
And one more question, about threads. I'm trying to record/replay a test application based on the WebKit framework, which tends to spawn service threads from time to time. So the 'perf sched spr-replay -l' output may look like the following:
[...]
There are a lot of threads created by the main thread [testbrowser/1504]. Is there a way to find the biggest CPU heaters among them?
Dmitry
On 04/09/2012 09:24 PM, Pantelis Antoniou wrote:
Here's an updated patch for builtin-sched.c that should fix your issues.
Now when you issue the list command, a field will show the number of nsecs the task spent burning cycles.
It should also fix the crash you've encountered.
Thanks, I'm trying it now.
BTW, what's your compiler? I'm constantly seeing annoying but easy-to-fix warnings:
gcc -o builtin-sched.o -c -fno-omit-frame-pointer -ggdb3 -Wall -Wextra -std=gnu99 -Werror -O6 -D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -Wformat-y2k -Wshadow -Winit-self -Wpacked -Wredundant-decls -Wstrict-aliasing=3 -Wswitch-default -Wswitch-enum -Wno-system-headers -Wundef -Wwrite-strings -Wbad-function-cast -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wold-style-definition -Wstrict-prototypes -Wdeclaration-after-statement -fstack-protector-all -Wstack-protector -Wvolatile-register-var -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -Iutil/include -Iarch/arm/include -I/util -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -DLIBELF_NO_MMAP -DNO_NEWT_SUPPORT -pthread -I/usr/include/gtk-2.0 -I/usr/lib/arm-linux-gnueabi/gtk-2.0/include -I/usr/include/atk-1.0 -I/usr/include/cairo -I/usr/include/gdk-pixbuf-2.0 -I/usr/include/pango-1.0 -I/usr/include/gio-unix-2.0/ -I/usr/include/glib-2.0 -I/usr/lib/arm-linux-gnueabi/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng12 -DNO_LIBPERL -DNO_LIBPYTHON -DNO_STRLCPY builtin-sched.c
builtin-sched.c: In function 'calculate_bogoloops_value':
builtin-sched.c:342:34: error: variable 'delta_diff' set but not used [-Werror=unused-but-set-variable]
builtin-sched.c: In function 'generate_spr_program':
builtin-sched.c:3293:28: error: variable 'atom_last' set but not used [-Werror=unused-but-set-variable]
cc1: all warnings being treated as errors
gcc -v ==>
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/arm-linux-gnueabi/4.6.1/lto-wrapper
Target: arm-linux-gnueabi
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.6.1-9ubuntu3' --with-bugurl=file:///usr/share/doc/gcc-4.6/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.6 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.6 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-plugin --enable-objc-gc --enable-multilib --disable-sjlj-exceptions --with-arch=armv7-a --with-float=softfp --with-fpu=vfpv3-d16 --with-mode=thumb --disable-werror --enable-checking=release --build=arm-linux-gnueabi --host=arm-linux-gnueabi --target=arm-linux-gnueabi
Thread model: posix
gcc version 4.6.1 (Ubuntu/Linaro 4.6.1-9ubuntu3)
Dmitry
Hi Dmitry,
I use a somewhat older compiler that doesn't whine so much, so I don't see those warnings (don't get me started on how annoying gcc has become lately).
Send me a compile log and I'll fix them.
Regards
-- Pantelis
panto@orpheus:~/ti$ ${CROSS_COMPILE}gcc --version
arm-angstrom-linux-gnueabi-gcc (GCC) 4.5.4 20111126 (prerelease)
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
On Apr 10, 2012, at 12:17 PM, Dmitry Antipov wrote:
On 04/09/2012 09:24 PM, Pantelis Antoniou wrote:
Here's an updated patch for builtin-sched.c that should fix your issues.
Now when you issue the list command, a field will show the number of nsecs the task spent burning cycles.
It should also fix the crash you've encountered.
Thanks, I'm trying it now.
BTW, what's your compiler? I'm constantly seeing annoying but easy-to-fix warnings:
[...]
Dmitry
Pantelis,
Why would you use anything other than the kick-ass ARM toolchain that Linaro has been providing for over a year now? ;-)
/Amit
On Tue, Apr 10, 2012 at 12:46 PM, Pantelis Antoniou <panto@antoniou-consulting.com> wrote:
Hi Dmitry,
I use a somewhat older compiler that doesn't whine so much, so I don't see those warnings (don't get me started on how annoying gcc has become lately).
Send me a compile log and I'll fix them.
Regards
-- Pantelis
panto@orpheus:~/ti$ ${CROSS_COMPILE}gcc --version
arm-angstrom-linux-gnueabi-gcc (GCC) 4.5.4 20111126 (prerelease)
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
On Apr 10, 2012, at 12:17 PM, Dmitry Antipov wrote:
On 04/09/2012 09:24 PM, Pantelis Antoniou wrote:
Here's an updated patch for builtin-sched.c that should fix your issues.
Now when you issue the list command, a field will show the number of nsecs the task spent burning cycles.
It should also fix the crash you've encountered.
Thanks, I'm trying it now.
BTW, what's your compiler? I'm constantly seeing annoying but easy-to-fix warnings:
[...]
Dmitry
Amit,
I haven't the faintest clue how many ARM toolchains I have on this box right now. I just grab the latest one I installed.
Regards
-- Pantelis
On Apr 10, 2012, at 1:12 PM, Amit Kucheria wrote:
Pantelis,
Why would you use anything other than the kick-ass ARM toolchain that Linaro has been providing for over a year now? ;-)
/Amit
On Tue, Apr 10, 2012 at 12:46 PM, Pantelis Antoniou <panto@antoniou-consulting.com> wrote:
Hi Dmitry,
I use a somewhat older compiler that doesn't whine so much, so I don't see those warnings (don't get me started on how annoying gcc has become lately).
Send me a compile log and I'll fix them.
Regards
-- Pantelis
panto@orpheus:~/ti$ ${CROSS_COMPILE}gcc --version
arm-angstrom-linux-gnueabi-gcc (GCC) 4.5.4 20111126 (prerelease)
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
On Apr 10, 2012, at 12:17 PM, Dmitry Antipov wrote:
On 04/09/2012 09:24 PM, Pantelis Antoniou wrote:
Here's an updated patch for builtin-sched.c that should fix your issues.
Now when you issue the list command, a field will show the number of nsecs the task spent burning cycles.
It should also fix the crash you've encountered.
Thanks, I'm trying it now.
BTW, what's your compiler? I'm constantly seeing annoying but easy-to-fix warnings:
[...]
Dmitry
On 04/09/2012 09:24 PM, Pantelis Antoniou wrote:
Here's an updated patch for builtin-sched.c that should fix your issues.
Now when you issue the list command, a field will show the number of nsecs the task spent burning cycles.
It should also fix the crash you've encountered.
1) IIUC, 'perf sched spr-replay -l' should list the tasks only, but it looks like it tries to execute something just after displaying the list. While this command is running, I can see a lot of perf processes, like:
root      2846 28.0  1.8  30552 19328 pts/0  S+  07:51  0:01 perf sched spr-replay -l
root      2847 36.6  1.4  30552 14572 pts/0  R+  07:51  0:01 perf sched spr-replay -l
root      2848 11.6  1.4  30552 14580 pts/0  S+  07:51  0:00 perf sched spr-replay -l
root      2849 36.3  1.4  30552 14548 pts/0  R+  07:51  0:01 perf sched spr-replay -l
root      2850 13.0  1.4  30552 14576 pts/0  S+  07:51  0:00 perf sched spr-replay -l
root      2851  7.3  1.4  30552 14584 pts/0  S+  07:51  0:00 perf sched spr-replay -l
root      2852  0.0  1.4  30552 14584 pts/0  S+  07:51  0:00 perf sched spr-replay -l
root      2853  5.0  1.4  30552 14588 pts/0  S+  07:51  0:00 perf sched spr-replay -l
root      2854  0.0  1.4  30552 14576 pts/0  S+  07:51  0:00 perf sched spr-replay -l
root      2855  0.0  1.4  30552 14576 pts/0  S+  07:51  0:00 perf sched spr-replay -l
root      2856  0.0  1.4  30552 14576 pts/0  S+  07:51  0:00 perf sched spr-replay -l
root      2857  0.0  1.4  30552 14576 pts/0  S+  07:51  0:00 perf sched spr-replay -l
root      2858 37.3  1.4  30552 14548 pts/0  R+  07:51  0:01 perf sched spr-replay -l
root      2859 37.3  1.4  30552 14548 pts/0  R+  07:51  0:01 perf sched spr-replay -l
root      2860 35.3  1.4  30552 14548 pts/0  R+  07:51  0:01 perf sched spr-replay -l
root      2861 37.3  1.4  30552 14548 pts/0  R+  07:51  0:01 perf sched spr-replay -l
And:
perf: builtin-sched.c:2621: execute_wait_id: Assertion `ret == 0 || ret == 11' failed.
from one of them.
2) When I tried to generate an SPR for an interesting process with 'perf sched spr-replay -s 2580 -g', I got:
[testbrowser/2580] burn 36007422086 exit 0 end
I suppose this is wrong, because it means that testbrowser/2580 never yields the CPU until exit.
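For reference, the generated statements above can be read as a tiny program; a minimal interpreter sketch, with the opcode set guessed purely from the printed output (burn/exit/end) rather than taken from the real SPR format, could be:

```python
import time

def run_spr(program):
    # Execute a minimal SPR-like program: 'burn <ns>' busy-loops for
    # that many nanoseconds, 'exit <code>' records the exit code, and
    # 'end' stops execution. Returns the recorded exit code (or None).
    exit_code = None
    for line in program.strip().splitlines():
        op, *args = line.split()
        if op == "burn":
            deadline = time.monotonic_ns() + int(args[0])
            while time.monotonic_ns() < deadline:
                pass
        elif op == "exit":
            exit_code = int(args[0])
        elif op == "end":
            break
    return exit_code

rc = run_spr("burn 1000\nexit 0\nend")
```

Seen this way, a single 36-second burn with no intervening sleep or wait statements would indeed mean the task never blocks, which is what looks suspicious about the output above.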
Could you try to look at my perf.data and replay testbrowser/2580? A gzipped copy is here: http://78.153.153.8/tmp/perf.data.gz
Thanks, Dmitry
Hi Dmitry,
On Apr 11, 2012, at 11:02 AM, Dmitry Antipov wrote:
On 04/09/2012 09:24 PM, Pantelis Antoniou wrote:
Here's an updated patch for builtin-sched.c that should fix your issues.
Now when you issue the list command, a field will show the number of nsecs the task spent burning cycles.
It should also fix the crash you've encountered.
- IIUC, 'perf sched spr-replay -l' should list the tasks only, but it looks like it tries to execute something just after displaying the list. While this command is running, I can see a lot of perf processes, like:
[...]
And:
perf: builtin-sched.c:2621: execute_wait_id: Assertion `ret == 0 || ret == 11' failed.
from one of them.
If you don't want to run anything, use the -n switch (dry run). Arguably -l could imply -n, but passing -n works for now.
I'm not that happy about the crash. I'll try to check it out.
- When I tried to generate an SPR for an interesting process with 'perf sched spr-replay -s 2580 -g', I got:
[testbrowser/2580] burn 36007422086 exit 0 end
I suppose this is wrong, because it means that testbrowser/2580 never yields the CPU until exit.
OK, a bit of expansion on that is required. By default, spr-replay will not preserve the recorded timing. To preserve timing you can supply the -p switch, which will cause the timing of the recorded process to be observed.
Let's take for example a process that is burning CPU and doesn't yield explicitly. At regular intervals it will be switched out of the CPU so that another process can execute, but its state will still be running.
That means that on an ideal system with infinite CPUs it would execute continuously. That is what we want to record, since the actual intention of the process is to continue running; it is interrupted only because of limited CPU resources.
Try running this with the -p switch, or include more processes with the -s switch.
Also, since you are not including any other processes that might signal this one, you effectively remove them from the recording; so if testbrowser was waiting for a signal from a process that is not part of the schedule, that wait will be removed.
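The difference between free-running replay and timing-preserving replay (the -p behaviour described above) can be modelled roughly like this; the atom layout and all names here are made up for the sketch:

```python
import time

def replay(atoms, preserve_timing=False):
    # Replay burn atoms recorded as (start_offset_ns, burn_ns) pairs.
    # Without timing preservation the burns run back-to-back: the task
    # "wants" to run, so recorded idle gaps are collapsed. With it,
    # each burn first waits for its recorded start offset, reproducing
    # the recorded wall-clock shape. Returns total elapsed ns.
    t0 = time.monotonic_ns()
    for start, burn_ns in atoms:
        if preserve_timing:
            while time.monotonic_ns() - t0 < start:
                time.sleep(0.0001)
        deadline = time.monotonic_ns() + burn_ns
        while time.monotonic_ns() < deadline:
            pass
    return time.monotonic_ns() - t0

atoms = [(0, 200_000), (5_000_000, 200_000)]    # 5 ms recorded gap
fast = replay(atoms)                            # gap collapsed
faithful = replay(atoms, preserve_timing=True)  # gap reproduced
```

With timing preserved the replay takes at least the recorded 5 ms; without it, only the burn time itself is reproduced, which matches the single large burn seen for testbrowser/2580.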
Could you try to look at my perf.data and replay testbrowser/2580? A gzipped copy is here: http://78.153.153.8/tmp/perf.data.gz
Thanks, Dmitry
OK, will check.
Regards
-- Pantelis