On 10/03/2014 03:50 AM, Peter Zijlstra wrote:
> On Fri, Oct 03, 2014 at 08:23:04AM +0200, Mike Galbraith wrote:
>> On Thu, 2014-10-02 at 13:15 -0400, Rik van Riel wrote:
>>> Subject: sched,idle: teach select_idle_sibling about idle states
>>>
>>> Change select_idle_sibling to take cpu idle exit latency into account. First preference is to select the cpu with the lowest exit latency from a completely idle sched_group inside the CPU; if that is not available, we pick the CPU with the lowest exit latency in any sched_group.
>>>
>>> This increases the total search time of select_idle_sibling; we may want to look into propagating load info up the sched_group tree in some way. That information would also be useful to prevent the wake_affine logic from causing a load imbalance between sched_groups.
>> A generic boo hiss aimed in the general direction of all of this "let's go look at every possibility on every wakeup" stuff. Less is more.
> I hear you; can you see an actual slowdown with the patch? While the worst case doesn't change, it does make the average case equal to the worst-case iteration -- where we previously would average out at inspecting half the CPUs before finding an idle one, we'd now always inspect all of them in order to compare all idle ones on their properties.
>
> Also, with the latest generation of Haswell Xeons having 18 cores (36 threads), this is one massively painful loop for sure.
We have 3 different goals when selecting a runqueue for a task:
1) locality: get the task running close to where it has stuff cached
2) work preserving: get the task running ASAP, and preferably on a fully idle core
3) idle state latency: place the task on a CPU that can start running it ASAP

We may also consider the interplay of the above 3 to have an impact on:
4) power use: pack tasks on some CPUs so other CPUs can go into deeper idle states
The current implementation is a "compromise" between (1) and (2), with a strong preference for (2), falling back to (1) if no fully idle core is found.
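To make the trade-off concrete, here is a toy user space sketch of that compromise; it is not the real select_idle_sibling() code, and core_fully_idle[] is a made-up stand-in for the kernel's per-core idle tracking:

/*
 * Toy sketch only, not kernel code: prefer any fully idle core in the
 * LLC domain (goal 2), otherwise fall back to the locality hint
 * (goal 1).
 */
static int current_policy(const int *core_fully_idle, int nr_cores, int target_core)
{
	int core;

	for (core = 0; core < nr_cores; core++) {
		if (core_fully_idle[core])
			return core;		/* work preserving wins */
	}

	return target_core;			/* no idle core, keep locality */
}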
My ugly hack isn't any better, trading off (1) in order to be better at (2) and (3). Whether it even affects (4) remains to be seen.
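Roughly, the idea looks like the toy sketch below; struct cpu_model and its fields are made up for illustration, and a real implementation would have to read the exit latency of the idle state each CPU actually sits in:

#include <limits.h>

/* Made-up per-CPU state, for illustration only. */
struct cpu_model {
	int idle;			/* is the CPU idle at all?       */
	unsigned int exit_latency;	/* usec to wake from its C-state */
};

/*
 * Among the idle CPUs, pick the one that can start running the task
 * soonest (goal 3); if nothing is idle, fall back to the locality
 * hint (goal 1).
 */
static int lowest_exit_latency_policy(const struct cpu_model *cpus, int nr_cpus, int target)
{
	unsigned int best = UINT_MAX;
	int cpu, best_cpu = target;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		if (!cpus[cpu].idle)
			continue;
		if (cpus[cpu].exit_latency < best) {
			best = cpus[cpu].exit_latency;
			best_cpu = cpu;
		}
	}

	return best_cpu;
}

Note that this version always walks every CPU, which is exactly the average-case-becomes-worst-case cost Peter points out above.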
I know my patch is probably unacceptable, but I do think it is important that we talk about the problem, and hopefully agree on exactly what the problem is that we want to solve.
One big question in my mind is, when is locality more important, and when is work preserving more important? Do we have an answer to that question?
The current code has the potential to be quite painful on systems with a large number of cores per chip, so we will have to change things anyway...
--
All rights reversed