Dmitry,
Already seen that, and have spent a few weeks coming up with a solution. At this point it's internal to TI and ARM, but it doesn't hurt to get it out for broader discussion.
Please note that this is a proof of concept so far.
Regards
-- Pantelis
I've managed to clean up the patch and rebase it against the current mainline; I'm also including the source file.
What the patch does is add an extra perf command, perf sched spr-replay <options>, which allows you to analyze an existing perf trace, simulate the load it exhibits, or even generate and replay a human-readable load description.
Example session:
# perf sched record
< run process, etc.>
^C <- perf ends producing a perf.data file.
# perf sched spr-replay -l
<lists tasks and pids in the trace>
# perf sched spr-replay -s <pid#1> -s <name> -d -n -g
<see the trace program for the processes with pid#1 and named <name>>
# perf sched spr-replay -s <pid#1> -g >test.spr
<generates test.spr, which contains the program of pid#1>
# perf sched spr-replay -f test.spr -d
<execute the test.spr program>
Make sure you apply the compile-fix patch first, because without it perf doesn't compile at mainline.
As you well know, perf has the capability of recording and playing back scheduling activity; this is done by issuing the perf sched record and perf sched replay commands.
How this works is as follows: perf sched record uses the kernel's tracing facilities to record events relating to scheduling activity, and perf sched replay takes those traces and constructs an in-memory representation of a task's activity out of so-called scheduling atoms. Those atoms can only be of the following 4 types:
- SCHED_EVENT_RUN - for recording the duration of task execution
- SCHED_EVENT_SLEEP - for recording a task's wait for a resource (or sleeping)
- SCHED_EVENT_WAKEUP - for recording one task waking up another
- SCHED_EVENT_MIGRATION - for recording task migrations between CPUs
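For reference, the in-memory representation can be sketched roughly like this (a self-contained illustration; field names are mine, not necessarily the ones perf's builtin-sched.c uses):

```c
#include <stdint.h>

/* The four atom types described above. */
enum sched_event_type {
	SCHED_EVENT_RUN,	/* duration of task execution */
	SCHED_EVENT_SLEEP,	/* task waits for a resource (or sleeps) */
	SCHED_EVENT_WAKEUP,	/* one task wakes up another */
	SCHED_EVENT_MIGRATION,	/* task migrates between CPUs */
};

/* One scheduling atom in a task's per-task event list. */
struct sched_atom {
	enum sched_event_type type;
	uint64_t timestamp;	/* ns since trace start */
	uint64_t duration;	/* for RUN: how long the task ran */
	int waker_pid;		/* for WAKEUP: pid of the woken task */
};
```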
The problem with this method becomes apparent when dumping those atoms and examining the schedule produced for a simple task that hogs the CPU for 5 seconds. The program in question can be as simple as this:
main() { spin_secs(5); }
As you can see in the dump below, every activity is recorded, even superfluous wakeups due to fork handling, idle-task activity, disk-flush activity, etc. The SLEEP R (running) entries at regular intervals are the task being scheduled off the CPU. It is arguable that this is not what is intended when we want to capture the activity of a number of user-space processes. We do want to 'catch' cross-task sleeps and wakeups, but we don't care about activity that the kernel will regenerate anyway when we simulate the task. On top of that, something like this is impossible to write by hand for a developer who wants a simplified scheduling test case to share with others.
00001427795410 FORK_CHILD 1802
00001602966309 WAKEUP [1801@00001450469971]
00001790466309 WAKEUP [1801@00001630523682]
00001985778809 WAKEUP [1801@00001818725586]
00002181091309 WAKEUP [1801@00002014709473]
00002296478272 WAKEUP [1565@00001302673340]
00002296569824 RUN 00000862213134
00002296569824 SLEEP R (running) 0
00002360778809 WAKEUP [1801@00002210021973]
00002556060791 WAKEUP [1801@00002390869141]
00002751403809 WAKEUP [1801@00002584991455]
00002946716309 WAKEUP [1801@00002780395508]
00003134216309 WAKEUP [1801@00002975799561]
00003296508789 WAKEUP [1565@00002302764893]
00003296752930 RUN 00000993988037
00003296752930 SLEEP R (running) 0
00003306091309 WAKEUP [1801@00003162384033]
00003509216309 WAKEUP [1801@00003341369629]
00003704528809 WAKEUP [1801@00003530242920]
00003899871826 WAKEUP [1801@00003734405518]
00004095153809 WAKEUP [1801@00003928863525]
00004282623291 WAKEUP [1801@00004124206543]
00004296722412 WAKEUP [1565@00003302825928]
00004296875000 RUN 00000994049072
00004296875000 SLEEP R (running) 0
00004466857910 WAKEUP [1801@00004333312988]
00004657653809 WAKEUP [1801@00004516815186]
00004852935791 WAKEUP [1801@00004681182861]
00005048278809 WAKEUP [1801@00004882690430]
00005243591309 WAKEUP [1801@00005085571289]
00005296508789 WAKEUP [1565@00004303466797]
00005296752930 RUN 00000993286133
00005296752930 SLEEP R (running) 0
00005332183838 RUN 00000029357910
00005332183838 SLEEP R (running) 0
00005415466309 WAKEUP [1801@00005272033691]
00005610748291 WAKEUP [1801@00005437622070]
00005798278809 WAKEUP [1801@00005640533447]
00005993591309 WAKEUP [1801@00005826477051]
00006188873291 WAKEUP [1801@00006029937744]
00006296478272 WAKEUP [1565@00005302825928]
00006296569824 RUN 00000957977295
00006296569824 SLEEP R (running) 0
00006368591309 WAKEUP [1801@00006217071533]
00006424987793 EXIT
00006428070068 WAKEUP [1802@00001434356690]
00006434265137 RUN 00000131439209
00006434265137 SLEEP x (dead) 0
The solution I came up with is to transform this into a sequence of simplified task actions. So far I've defined the following (plus a few others that are not used for now):
- TA_BURN - Burn CPU by spinning for a given duration
- TA_SLEEP - Sleep for a given duration
- TA_WAIT_ID - Wait for a signal on the given ID
- TA_SIGNAL_ID - Signal a given ID
- TA_EXIT - process terminates
- TA_CLONE_PARENT - A clone is executed at this point; the executing task is the parent of the child task given as a parameter
- TA_CLONE_CHILD - (Always first in the sequence) The child executes, started by the parent task given as a parameter
- TA_END - End of task actions
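An .spr program could then be represented in memory along these lines (again my own illustrative sketch; the proof-of-concept's actual encoding may differ):

```c
#include <stdint.h>

/* The simplified task actions described above. */
enum task_action_type {
	TA_BURN,		/* spin the CPU for 'duration' ns */
	TA_SLEEP,		/* sleep for 'duration' ns */
	TA_WAIT_ID,		/* block until 'id' is signalled */
	TA_SIGNAL_ID,		/* signal 'id' */
	TA_EXIT,		/* process terminates */
	TA_CLONE_PARENT,	/* fork; 'id' names the child task */
	TA_CLONE_CHILD,		/* first action of a child; 'id' names the parent */
	TA_END,			/* end of the action list */
};

/* One entry in a task's action list. */
struct task_action {
	enum task_action_type type;
	uint64_t duration;	/* ns, for TA_BURN / TA_SLEEP */
	int id;			/* wait/signal ID, or peer task for clones */
};
```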
By processing the scheduling atoms listed above, and selecting the spin-loop process as the sole interesting one, it is possible to generate a program of the following form:
TASK: name :taskA5-101, pid 1803
[00000000000000] BURN 00004562927245ns
[00004562927245] EXIT
[00004562927245] END
As you can see, this is pretty obvious: it is trivial for a human to understand what the program does and, more importantly, to tweak it and share it with other developers.
There are a whole bunch of details, but the gist is this: given a perf record trace, we can generate a simplified program sequence that has the same scheduling effect as the originally recorded program, with the anomalies caused by the kernel tasks of the host it ran on removed.
Similarly for a program like this:
main() { pid = fork(); if (pid == 0) { spin_for(5); exit(0); } waitpid(pid, NULL, 0); }
The following simplified program is generated:
Task #1: (Parent)
[00000000000000] BURN 00001074249265ns
[00001074249265] CLONE_PARENT 1803
[00001074249265] BURN 00000006561280ns
[00001080810545] SLEEP 00004997039794ns
[00006077850339] BURN 00000000579834ns
[00006078430173] EXIT
[00006078430173] END
Task #2: (Child)
[00000000000000] CLONE_CHILD 1802
[00000000000000] BURN 00004562927245ns
[00004562927245] EXIT
[00004562927245] END
The original program was this: http://pastebin.com/iarw20zH
I hope this will get some discussion started until I get my patches done and can give you something to play with. Please note that the code is nowhere near ready to be submitted anywhere yet; it's a proof of concept.
On Apr 2, 2012, at 1:09 PM, Dmitry Antipov wrote:
On 03/08/2012 05:20 PM, Pantelis Antoniou wrote:
The current issue is that scheduler development is not easily shared between developers. Each developer has their own 'itch', be it Android use cases, server workloads, VM, etc. The risk is high of optimizing for one's own use case and causing severe degradation on most other use cases.
One way to fix this problem would be the development of a method with which one could perform a given use-case workload on a host, record the activity in an interchangeable, portable trace-format file, and then play it back on another host via a playback application that generates an approximately similar load to the one observed during recording.
Have you tried to investigate whether 'perf' tool with 'sched record' and 'sched replay' features might be useful for such a purpose?
I tried to record and replay various types of commonly used benchmarks, including CPU-, I/O- and network-intensive workloads, and have to say that the recording and (especially) replaying overhead is quite high, at least for the default Panda board configuration (where main I/O is slow due to the root file system being on an SD card). Simple things like 'perf sched record sleep 10' work in most cases (but may still cause sample loss of up to 10-20%). But when I tried to add some I/O, for example with 'find /', the total workload became too high and the system (almost) hung with a lot of messages like:
INFO: task kjournald:512 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
INFO: rcu_preempt detected stalls on CPUs/tasks: 8055ec64 0 512 2 0x00000000
INFO: Stall ended before state dump start
INFO: task kjournald:512 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
INFO: task flush-179:0:511 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
INFO: task kjournald:512 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Now I'm checking whether it's possible to do some partial recording (by skipping some kinds of unrelated samples) and to offload the kernel tracing subsystem to free up more CPU time for the user-space tasks.
Do you have any thoughts about this?
Thanks, Dmitry