Dmitry,
Already seen that, and have spent a few weeks coming up with a solution. At this point it's internal to TI and ARM, but it doesn't hurt to get it out for broader discussion.
Please note that this is a proof of concept so far.
Regards
-- Pantelis
I've managed to clean up the patch and rebase it against the current mainline; I'm also including the source file.
What the patch does is add an extra perf command, perf sched spr-replay <options>, which allows you to analyze an existing perf trace, simulate the load it exhibits, or even generate and replay a human-readable load description.
Example session:
# perf sched record
< run process, etc.>
^C <- perf ends producing a perf.data file.
# perf sched spr-replay -l
<lists tasks and pids in the trace>
# perf sched spr-replay -s <pid#1> -s <name> -d -n -g
<see the trace program for the processes with pid#1 and named <name>>
# perf sched spr-replay -s <pid#1> -g >test.spr
<generates test.spr, which contains the program of pid#1>
# perf sched spr-replay -f test.spr -d
<execute the test.spr program>
Make sure you apply the compile-fix patch first, because without it perf doesn't compile at mainline.
As you well know, perf has the capability of recording and playing back scheduling activity; this is done by issuing the perf sched record and perf sched replay commands.
How this works is as follows: perf sched record uses the kernel's tracing facilities to record events relating to scheduling activity, and perf sched replay takes those traces and constructs an in-memory representation of a task's activity out of so-called scheduling atoms. Those atoms can only be of the following 4 types:
- SCHED_EVENT_RUN - for recording the duration of task execution
- SCHED_EVENT_SLEEP - for recording a task's wait for a resource (or sleeping)
- SCHED_EVENT_WAKEUP - for recording one task waking up another
- SCHED_EVENT_MIGRATION - for recording task migrations between CPUs
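For reference, the in-memory representation can be sketched roughly like this (a self-contained illustration; field names are mine, not necessarily the ones perf's builtin-sched.c uses):

```c
#include <stdint.h>

/* The four atom types described above. */
enum sched_event_type {
	SCHED_EVENT_RUN,	/* duration of task execution */
	SCHED_EVENT_SLEEP,	/* task waits for a resource (or sleeps) */
	SCHED_EVENT_WAKEUP,	/* one task wakes up another */
	SCHED_EVENT_MIGRATION,	/* task migrates between CPUs */
};

/* One scheduling atom in a task's per-task event list. */
struct sched_atom {
	enum sched_event_type type;
	uint64_t timestamp;	/* ns since trace start */
	uint64_t duration;	/* for RUN: how long the task ran */
	int waker_pid;		/* for WAKEUP: pid of the woken task */
};
```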
The problem with this method becomes apparent when dumping those atoms and examining the schedule produced for a simple task that hogs the CPU for 5 seconds. The program in question can be as simple as this:
main() { spin_secs(5); }
As you can see in the dump below, every activity is recorded, even superfluous wakeups due to fork handling, idle-task activity, disk-flush activity, etc. The SLEEP R (running) entries at regular intervals are the task being scheduled off the CPU. It is arguable that this is not what is intended when we want to capture the activity of a number of user-space processes. We do want to 'catch' cross-task sleeps and wakeups, but we don't care about activity that the kernel will regenerate anyway when we simulate the task. On top of that, something like this is impossible to write by hand for a developer who wants a simplified scheduling test case to share with others.
00001427795410 FORK_CHILD 1802
00001602966309 WAKEUP [1801@00001450469971]
00001790466309 WAKEUP [1801@00001630523682]
00001985778809 WAKEUP [1801@00001818725586]
00002181091309 WAKEUP [1801@00002014709473]
00002296478272 WAKEUP [1565@00001302673340]
00002296569824 RUN 00000862213134
00002296569824 SLEEP R (running) 0
00002360778809 WAKEUP [1801@00002210021973]
00002556060791 WAKEUP [1801@00002390869141]
00002751403809 WAKEUP [1801@00002584991455]
00002946716309 WAKEUP [1801@00002780395508]
00003134216309 WAKEUP [1801@00002975799561]
00003296508789 WAKEUP [1565@00002302764893]
00003296752930 RUN 00000993988037
00003296752930 SLEEP R (running) 0
00003306091309 WAKEUP [1801@00003162384033]
00003509216309 WAKEUP [1801@00003341369629]
00003704528809 WAKEUP [1801@00003530242920]
00003899871826 WAKEUP [1801@00003734405518]
00004095153809 WAKEUP [1801@00003928863525]
00004282623291 WAKEUP [1801@00004124206543]
00004296722412 WAKEUP [1565@00003302825928]
00004296875000 RUN 00000994049072
00004296875000 SLEEP R (running) 0
00004466857910 WAKEUP [1801@00004333312988]
00004657653809 WAKEUP [1801@00004516815186]
00004852935791 WAKEUP [1801@00004681182861]
00005048278809 WAKEUP [1801@00004882690430]
00005243591309 WAKEUP [1801@00005085571289]
00005296508789 WAKEUP [1565@00004303466797]
00005296752930 RUN 00000993286133
00005296752930 SLEEP R (running) 0
00005332183838 RUN 00000029357910
00005332183838 SLEEP R (running) 0
00005415466309 WAKEUP [1801@00005272033691]
00005610748291 WAKEUP [1801@00005437622070]
00005798278809 WAKEUP [1801@00005640533447]
00005993591309 WAKEUP [1801@00005826477051]
00006188873291 WAKEUP [1801@00006029937744]
00006296478272 WAKEUP [1565@00005302825928]
00006296569824 RUN 00000957977295
00006296569824 SLEEP R (running) 0
00006368591309 WAKEUP [1801@00006217071533]
00006424987793 EXIT
00006428070068 WAKEUP [1802@00001434356690]
00006434265137 RUN 00000131439209
00006434265137 SLEEP x (dead) 0
The solution I came up with is to transform this into a sequence of simplified task actions. So far I've defined the following (plus a few others that are not used for now):
- TA_BURN - Burn CPU by spinning for a given duration
- TA_SLEEP - Sleep for a given duration
- TA_WAIT_ID - Wait for a signal on the given ID
- TA_SIGNAL_ID - Signal a given ID
- TA_EXIT - process terminates
- TA_CLONE_PARENT - A clone is executed at this point; the executing task is the parent of the child task given as a parameter
- TA_CLONE_CHILD - (Always first in the sequence) The child executes, started by the parent task given as a parameter
- TA_END - End of task actions
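An .spr program could then be represented in memory along these lines (again my own illustrative sketch; the proof-of-concept's actual encoding may differ):

```c
#include <stdint.h>

/* The simplified task actions described above. */
enum task_action_type {
	TA_BURN,		/* spin the CPU for 'duration' ns */
	TA_SLEEP,		/* sleep for 'duration' ns */
	TA_WAIT_ID,		/* block until 'id' is signalled */
	TA_SIGNAL_ID,		/* signal 'id' */
	TA_EXIT,		/* process terminates */
	TA_CLONE_PARENT,	/* fork; 'id' names the child task */
	TA_CLONE_CHILD,		/* first action of a child; 'id' names the parent */
	TA_END,			/* end of the action list */
};

/* One entry in a task's action list. */
struct task_action {
	enum task_action_type type;
	uint64_t duration;	/* ns, for TA_BURN / TA_SLEEP */
	int id;			/* wait/signal ID, or peer task for clones */
};
```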
By processing the scheduling atoms listed above, and selecting the spin-loop process as the sole interesting one, it is possible to generate a program of the following form:
TASK: name :taskA5-101, pid 1803
[00000000000000] BURN 00004562927245ns
[00004562927245] EXIT
[00004562927245] END
As you can see, this is pretty obvious: it is trivial for a human to understand what the program does and, more importantly, to tweak it and share it with other developers.
There are a whole bunch of details, but the gist is this: given a perf record trace, we can generate a simplified program sequence that has the same scheduling effect as the originally recorded program, with the anomalies caused by the kernel tasks of the host it ran on removed.
Similarly for a program like this:
main() { pid = fork(); if (pid == 0) { spin_for(5); exit(0); } waitpid(pid, NULL, 0); }
The following simplified program is generated:
Task #1: (Parent)
[00000000000000] BURN 00001074249265ns
[00001074249265] CLONE_PARENT 1803
[00001074249265] BURN 00000006561280ns
[00001080810545] SLEEP 00004997039794ns
[00006077850339] BURN 00000000579834ns
[00006078430173] EXIT
[00006078430173] END
Task #2: (Child)
[00000000000000] CLONE_CHILD 1802
[00000000000000] BURN 00004562927245ns
[00004562927245] EXIT
[00004562927245] END
The original program was this: http://pastebin.com/iarw20zH
I hope this will get some discussion started until I get my patches done and can give you something to play with. Please note that the code is nowhere near ready to be submitted anywhere yet; it's a proof of concept.
On Apr 2, 2012, at 1:09 PM, Dmitry Antipov wrote:
On 03/08/2012 05:20 PM, Pantelis Antoniou wrote:
The current issue is that scheduler development is not easily shared between developers. Each developer has their own 'itch', be it Android use cases, server workloads, VM, etc. The risk is high of optimizing for one's own use case and causing severe degradation on most other use cases.
One way to fix this problem would be the development of a method with which one could perform a given use-case workload on a host, record the activity in an interchangeable, portable trace-format file, and then play it back on another host via a playback application that generates an approximately similar load to the one observed during recording.
Have you tried to investigate whether 'perf' tool with 'sched record' and 'sched replay' features might be useful for such a purpose?
I tried to record and replay various types of commonly used benchmarks, including CPU-, I/O- and network-intensive workloads, and have to say that the recording and (especially) replaying overhead is quite high, at least for the default Panda board configuration (where main I/O is slow due to the root file system being on an SD card). Simple things like 'perf sched record sleep 10' work in most cases (but may still cause sample loss of up to 10-20%). But when I tried to add some I/O, for example with 'find /', the total workload became too high and the system (almost) hung with a lot of messages like:
INFO: task kjournald:512 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
INFO: rcu_preempt detected stalls on CPUs/tasks: 8055ec64 0 512 2 0x00000000
INFO: Stall ended before state dump start
INFO: task kjournald:512 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
INFO: task flush-179:0:511 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
INFO: task kjournald:512 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Now I'm checking whether it's possible to do some partial recording (by skipping some kinds of unrelated samples) and to offload the kernel tracing subsystem to free up more CPU time for the user-space tasks.
Do you have any thoughts about this?
Thanks, Dmitry