Hi there,
There's considerable activity around the scheduler lately, and around how to adapt it to the peculiarities of the new class of hardware coming out, like the big.LITTLE devices from a number of manufacturers.
The platforms that Linux runs on are very diverse, and they run differing workloads. For example, most consumer devices will very likely run something like Android, with common use cases such as audio and/or video playback. Methods to achieve lower power consumption using a power-aware scheduler are under investigation.
Similarly for server applications, or VM hosting, the behavior of the scheduler shouldn't have adverse performance implications; any power saving on top of that would be a welcome improvement.
The current issue is that scheduler development is not easily shared between developers. Each developer has their own 'itch', be it Android use cases, server workloads, VM, etc. The risk is high of optimizing for one's own use case and causing severe degradation on most other use cases.
One way to fix this problem would be to develop a method with which one could run a given use-case workload on a host, record the activity in an interchangeable, portable trace file format, and then play it back on another host via a playback application that generates approximately the same load that was observed during recording.
The way the two hosts respond under the same load generated by the playback application can then be compared, so that the two scheduler implementations can be evaluated against various metrics (performance, power consumption, etc.).
The fidelity of this approximation is of great importance, but it is debatable whether a fully identical load can be generated, since details of the hosts might differ in ways that make that impossible. I believe it should at least be possible to simulate a pure CPU load, and the blocking behavior of tasks, in a way that results in scheduler decisions that can be compared and shared among developers.
The recording part, I believe, can be handled by the kernel's tracing infrastructure, either by using the existing tracepoints or, if need be, by adding more; possibly even by creating a new tracer solely for this purpose. Since some applications can adapt their behavior when system resources are insufficient (think of media players that can drop frames, for example), I think it would be beneficial to record such events to the same trace file.
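As a rough sketch of the recording side, the existing sched tracepoints (sched_switch, sched_wakeup, sched_process_fork, sched_process_exit) already capture much of this; with trace-cmd, a recording session could look something like the following, where ./my-use-case is just a placeholder workload:

  trace-cmd record -e sched:sched_switch -e sched:sched_wakeup \
                   -e sched:sched_process_fork -e sched:sched_process_exit \
                   ./my-use-case
  trace-cmd report > my-use-case.trace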
The trace file should have a portable format so that it can be freely shared between developers. An ASCII format like the one we currently use should be OK, as long as writing it doesn't perturb the workload too much during recording.
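For reference, a sched_switch line in the current ftrace ASCII output looks roughly like this (the task name and all numbers here are invented):

  mplayer-2134  [000]  5821.402345: sched_switch: prev_comm=mplayer prev_pid=2134 prev_prio=120 prev_state=S ==> next_comm=swapper next_pid=0 next_prio=120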
The playback application can be implemented in two ways.
One way, the LinSched way, would be to have the full scheduler implementation compiled as part of said application, and to use application-specific methods to evaluate performance. While this would work, it wouldn't allow comparing the two hosts in a meaningful manner.
The other way, which allows both scheduler and platform evaluation, is for the playback application to generate the load on the running host by simulating the workload session recorded on the source host. That means emulating process activity such as forks, thread spawning, blocking on resources, etc. It is not yet clear to me if that's possible without some kind of kernel-level helper module, but not requiring one is desirable.
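To make the idea concrete, below is a minimal user-space sketch of such a playback loop. The segment format and every name in it are invented for illustration; a real tool would parse the trace and drive one replayer process or thread per recorded task.

#include <time.h>
#include <unistd.h>

struct segment {
	enum { SEG_RUN, SEG_BLOCK } type;
	long usecs;		/* duration observed on the source host */
};

/* Spin until this thread has consumed 'usecs' of CPU time. */
static void burn_cpu(long usecs)
{
	struct timespec start, now;

	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
	do {
		clock_gettime(CLOCK_THREAD_CPUTIME_ID, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000000L +
		 (now.tv_nsec - start.tv_nsec) / 1000L < usecs);
}

/* Replay one task's recorded history: alternate CPU bursts and blocking. */
static void replay_task(const struct segment *segs, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		if (segs[i].type == SEG_RUN)
			burn_cpu(segs[i].usecs);
		else
			usleep(segs[i].usecs);	/* stand-in for blocking on a resource */
	}
}

int main(void)
{
	/* An invented run/block/run history for a single task. */
	struct segment demo[] = {
		{ SEG_RUN, 12000 }, { SEG_BLOCK, 7000 }, { SEG_RUN, 18000 },
	};

	replay_task(demo, sizeof(demo) / sizeof(demo[0]));
	return 0;
}

Burning CPU against CLOCK_THREAD_CPUTIME_ID rather than wall time is deliberate: the replayer consumes the same amount of CPU work even when it is itself preempted during playback.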
Since one would have the full trace of scheduling activity (past, present and future), there would be the possibility of generating a perfect schedule (defined by best performance, or best power consumption) and using it as a yardstick to evaluate the actual scheduler against. Comparing the results, you would get an estimate of the best-case improvement that could be achieved if an ideal scheduler existed.
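As a deliberately crude first example of what whole-trace knowledge buys you, assuming per-task total runtimes have been extracted from the trace: no scheduler can finish a set of independent tasks on m identical CPUs faster than max(total_work / m, longest_single_task), so even this trivial bound gives a yardstick. A real perfect-schedule generator would be far more involved, and asymmetric cores break the "identical CPUs" assumption.

/* Hypothetical oracle bound over per-task runtimes pulled from a trace. */
static long makespan_lower_bound(const long *runtime, int ntasks, int ncpus)
{
	long sum = 0, max = 0;
	int i;

	for (i = 0; i < ntasks; i++) {
		sum += runtime[i];
		if (runtime[i] > max)
			max = runtime[i];
	}
	/* Best case: work spreads perfectly, or one long task dominates. */
	return (sum + ncpus - 1) / ncpus > max ? (sum + ncpus - 1) / ncpus : max;
}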
I know this is a bit long, but I hope it can be a basis for thinking about how to go about developing this.
Regards
-- Pantelis
Hi Pantelis,
what is your primary goal?
Saving power? Improving interactivity? Be more cache-optimal?
I think recording the scheduling (without drastically slowing it down) is hard to do, since the scheduler ticks at least HZ times per second.
Replaying such a recording (on a different platform) seems even harder to me, since different CPUs (and different numbers of them), different speeds and a very likely non-reproducible process setup may completely distort the replay.
Do you also want to record/replay the behaviour of the (more important) load balancer?
Did you think of comparing different platforms by simply using synthetically generated loads? (E.g. see interbench -> http://users.on.net/~ckolivas/interbench/)
If you are interested in examining the scheduling behaviour as a function of the tunables (and even of HZ), and if you are interested in getting better latency, maybe you are interested in the nitro patch for the scheduler? (I currently don't have an external patch-file, but you can get it integrated from https://github.com/baerwolf/linux-stephan/commits/v3.2.9-stephan-20120303000... )
Nitro enables you to do some things, like:
* tune the scheduler at configuration point
* increase the ticker-frequency way above 1000Hz
* tune the ticker-freq. from userspace during runtime
* change the scheduling-algo for idle-tasks
regards Stephan
Hi Stephan,
On Mar 8, 2012, at 5:20 PM, Stephan Bärwolf wrote:
Hi Pantelis,
what is your primary goal?
Saving power? Improving interactivity? Be more cache-optimal?
My primary goal is improving the scheduler implementation for the new class of devices using asymmetric cores, like the big.LITTLE ARM devices.
The scheduler, at this moment, is really not capable of handling the case of an arbitrary mix of fast and slow cores which need to be shut down, when not in use, within milliseconds.
The target is keeping the device within its proper power operating envelope, while delivering acceptable performance to the interactive user. However, I cannot go about wildly hacking the scheduler without drastically impacting the performance of other use cases.
I think recording the scheduling (without drastically slowing it down) is hard to do, since the scheduler ticks at least HZ times per second.
Replaying such a recording (on a different platform) seems even harder to me, since different CPUs (and different numbers of them), different speeds and a very likely non-reproducible process setup may completely distort the replay.
Do you also want to record/replay the behaviour of the (more important) load balancer?
Did you think of comparing different platforms by simply using synthetically generated loads? (E.g. see interbench -> http://users.on.net/~ckolivas/interbench/)
If you are interested in examining the scheduling behaviour as a function of the tunables (and even of HZ), and if you are interested in getting better latency, maybe you are interested in the nitro patch for the scheduler? (I currently don't have an external patch-file, but you can get it integrated from https://github.com/baerwolf/linux-stephan/commits/v3.2.9-stephan-20120303000... )
Nitro enables you to do some things, like:
* tune the scheduler at configuration point
* increase the ticker-frequency way above 1000Hz
* tune the ticker-freq. from userspace during runtime
* change the scheduling-algo for idle-tasks
regards Stephan
Scheduling frequency and tuning the scheduler are immaterial. I don't expect to record and play back the actions of a scheduling tick.
Let's take the simple example of two CPU-bound tasks running on a single-CPU system, without performing any I/O.
Assume a normal, non-NOHZ kernel, with T being the HZ tick.
A possible schedule would be something like this.
Time -> --------------------T-----------T------------T------ ...
                |---- A ----|-- B --|--- A ----------|-- B - ...
                            (1)     (2)
That is, A runs for 12 time units; at the scheduling tick (1) it is replaced on the CPU by B, which runs for 7 time units and blocks at (2). A is then scheduled again for 18 time units, and so on.
I don't care to recreate the minutiae of the preemption of A by B at point (1) or vice versa at (2).
What I want is to record that, during a given time period, A had the CPU for 12 time units, was preempted by B for 7, and then, due to B blocking at (2), was scheduled again.
These time units could be converted to some kind of MIPS rating, and I could state:
A: Uses the CPU for 12 + 18 time units...
B: Uses the CPU for 7 time units and then blocks for ...
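In a portable trace file, that statement might be encoded as something like this (entirely hypothetical syntax):

  task A: run 12; preempted
  task B: run  7; blocked
  task A: run 18; ...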
This is intended to be a tool for these reasons:
a) For developers to collaborate in the creation of a scheduler that satisfies requirements for modern classes of hardware.
b) For developers to have a way to pass around files that describe their own use-case models, and make sure that there are no serious regressions.
c) For users to be able to report possible scheduler problems, by including the workload that caused the problem when reporting bugs.
Regards
-- Pantelis
On Thu, Mar 08, 2012 at 03:20:53PM +0200, Pantelis Antoniou wrote:
Hi there,
[snip]
Hi,
Maybe you could have a look at the perf sched tool. It has a replay feature. I think it performs basic replay well, but it can certainly be enhanced.
Hi Frederic,
On Mar 8, 2012, at 5:45 PM, Frederic Weisbecker wrote:
On Thu, Mar 08, 2012 at 03:20:53PM +0200, Pantelis Antoniou wrote:
Hi there,
[snip]
Hi,
Maybe you could have a look at the perf sched tool. It has a replay feature. I think it performs basic replay well, but it can certainly be enhanced.
Yes, I am aware of perf sched, and I know that it can do basic record and replay of the wakeup, switch and fork events.
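For anyone following along, the basic usage looks like:

  perf sched record <workload>   # record scheduling events into perf.data
  perf sched replay              # spawn tasks mimicking the recorded activity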
It looks like it can be the starting point of what we're trying to do, but I doubt this will cover all the cases.
That's why I'm trying to have a discussion about this problem: to find out what is there, what is missing, and what needs to be fixed or developed anew.
So what do you think?
Regards
-- Pantelis
On Thu, Mar 8, 2012 at 6:50 PM, Pantelis Antoniou panto@antoniou-consulting.com wrote:
Hi there,
[snip]
One way to fix this problem would be to develop a method with which one could run a given use-case workload on a host, record the activity in an interchangeable, portable trace file format, and then play it back on another host via a playback application that generates approximately the same load that was observed during recording.
I believe many people would have had this simple idea, but I don't know why, or if, it's bad. So I am going to ask.
Why not have much coarser, but deterministic, load patterns using user-space apps (perhaps modified to log important characteristics of execution)?
We could have, say, three sets of stress patterns, one each for Server, Desktop and Mobile. Only the top-level usage pattern would be deterministic (say, by having some app-spawning script running from init, with little or no external influence).
Say the 'Mobile' profile script could spawn multimedia playback, encoding/decoding, 3D game playback, storage access and suspend/resume cycles in some parallel and serial manner. Each task at the end tells how it was treated during its lifetime (total dropped frames, average latency, overall power consumed, etc.), from which we calculate a 'GPA'. For any modification in the scheduler, we could see how it affects the current score for each profile running on its respective reference platform.
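A minimal sketch of the scoring step, with the metrics, the weights and the 4.0 scale all made up for illustration:

/* Hypothetical 'GPA': fold per-task reports into one score.
 * Every metric is assumed pre-normalized to 0..1, lower = better. */
struct task_report {
	double dropped_frames;
	double avg_latency;
	double energy;
};

static double profile_gpa(const struct task_report *r, int n)
{
	double penalty = 0.0;
	int i;

	for (i = 0; i < n; i++)
		penalty += 0.5 * r[i].dropped_frames +
			   0.3 * r[i].avg_latency +
			   0.2 * r[i].energy;
	return 4.0 * (1.0 - penalty / n);	/* 4.0 = a perfect run */
}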
Kind Regards Yadi
ps: I had to drop Amit Kucheria <amit.kucheria@li, otherwise my post wouldn't fire.
--------------------- Jo darr gaya, so marr gaya!
Hi Yadi,
On Mar 8, 2012, at 7:40 PM, Yadwinder Singh Brar wrote:
On Thu, Mar 8, 2012 at 6:50 PM, Pantelis Antoniou panto@antoniou-consulting.com wrote:
Hi there,
[snip]
I believe many people would have had this simple idea, but I don't know why, or if, it's bad. So I am going to ask.
Why not have much coarser, but deterministic, load patterns using user-space apps (perhaps modified to log important characteristics of execution)?
We could have, say, three sets of stress patterns, one each for Server, Desktop and Mobile. Only the top-level usage pattern would be deterministic (say, by having some app-spawning script running from init, with little or no external influence).
Say the 'Mobile' profile script could spawn multimedia playback, encoding/decoding, 3D game playback, storage access and suspend/resume cycles in some parallel and serial manner. Each task at the end tells how it was treated during its lifetime (total dropped frames, average latency, overall power consumed, etc.), from which we calculate a 'GPA'. For any modification in the scheduler, we could see how it affects the current score for each profile running on its respective reference platform.
The problem is defining that characteristic load pattern. Which is it? It might be one set of things today, something different tomorrow. It's not only the kernel that is evolving; media applications evolve too. In the end you end up with some workloads that are treated as benchmarks, and manufacturers start tweaking for them.
On top of that, the most common consumer Linux platform is Android. I bet that most kernel developers do not run Android as their main platform, but they do care to test whether their changes affect Android performance.
There is value, however, in recording these characteristic use cases and keeping them in a repository of traces, so that when one hacks on the scheduler one can compare results.
Regards
-- Pantelis