On Thu, 2013-01-03 at 16:08 +0530, Preeti U Murthy wrote:
Subject: [PATCH] sched: Merge select_idle_sibling with the behaviour of SD_BALANCE_WAKE
The function of select_idle_sibling() is to place the woken task in the vicinity of the waking cpu or on the previous cpu, depending on what wake_affine() says, this placement being only in an idle group. If no idle group is found, the fallback cpu is the waking cpu or the previous cpu accordingly.
This results in the runqueue of the waking cpu or the previous cpu getting overloaded when the system is fully committed, which is a latency hit to these tasks.
What is required is that newly woken tasks be placed close to the waking cpu or the previous cpu, whichever is best, to avoid a latency hit and cache coldness respectively. This is achieved by having wake_affine() decide which cache domain the task should be placed in.
Once this is decided, instead of searching for a completely idle group, let us search for the idlest group. This will return a completely idle group anyway if one exists, in which case it falls back to what select_idle_sibling() was doing. But if that fails, find_idlest_group() continues the search for a relatively more idle group.
The argument could be that we wish to avoid migrating the newly woken task to any other group unless it is completely idle. But in this case we begin by choosing a sched domain within which a migration would be less harmful. We enable the SD_BALANCE_WAKE flag on the SMT and MC domains to cooperate with this.
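To make the intended behaviour concrete, here's a compilable userspace toy (not kernel code; the struct and function names are invented for illustration) contrasting the old idle-only group search with an idlest-group search. A completely idle group has load zero, so the idlest search subsumes the old behaviour and only differs when no fully idle group exists:

/* Toy model, not kernel code: hypothetical group selection showing how
 * an "idlest group" search subsumes a "completely idle group" search. */
#include <stdio.h>

#define NR_GROUPS 4
#define CPUS_PER_GROUP 2

struct sched_group_toy {
	int id;
	int nr_running[CPUS_PER_GROUP];	/* tasks queued per cpu */
};

/* Sum of queued tasks in the group: 0 means completely idle. */
static int group_load(const struct sched_group_toy *sg)
{
	int load = 0;
	for (int i = 0; i < CPUS_PER_GROUP; i++)
		load += sg->nr_running[i];
	return load;
}

/* Old behaviour (select_idle_sibling-like): only accept a fully idle
 * group, else fall back to the affine cpu's group. */
static int find_idle_group(struct sched_group_toy *groups, int fallback)
{
	for (int i = 0; i < NR_GROUPS; i++)
		if (group_load(&groups[i]) == 0)
			return i;
	return fallback;
}

/* Proposed behaviour (find_idlest_group-like): take the least loaded
 * group; a fully idle one naturally wins with load 0. */
static int find_idlest_group_toy(struct sched_group_toy *groups)
{
	int best = 0, best_load = group_load(&groups[0]);
	for (int i = 1; i < NR_GROUPS; i++) {
		int load = group_load(&groups[i]);
		if (load < best_load) {
			best = i;
			best_load = load;
		}
	}
	return best;
}

int main(void)
{
	struct sched_group_toy groups[NR_GROUPS] = {
		{ .id = 0, .nr_running = { 3, 2 } },	/* waking cpu's group, busy */
		{ .id = 1, .nr_running = { 2, 2 } },
		{ .id = 2, .nr_running = { 1, 0 } },	/* lightly loaded, not idle */
		{ .id = 3, .nr_running = { 4, 3 } },
	};

	/* No group is fully idle: the idle-only search overloads group 0 ... */
	printf("idle-only search picks group %d\n", find_idle_group(groups, 0));
	/* ... while the idlest search spreads the task to group 2. */
	printf("idlest search picks group %d\n", find_idlest_group_toy(groups));
	return 0;
}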
What if..
Fast movers currently suffer from traversing the large package, mostly due to the traversal order walking 1:1 buddies hand in hand across the whole package endlessly. With only one buddy pair running, it's horrific.
Even if you change the order to be friendlier, perturbation induces bouncing. More spots to bounce to equals more bouncing. Ergo, I cross-coupled cpu pairs to eliminate that. If buddies are perturbed, having one and only one buddy cpu pulls them back together, so it can't induce a bounce fest, only correct one. That worked well, but had the downside that some loads really REALLY want maximum spread, so suffer when you remove migration options as I did. There's an in_interrupt() consideration I'm not so sure of too; in that case, going the extra mile to find an idle hole to plug _may_ be worth some extra cost too.. dunno.
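A minimal sketch of the cross-coupling idea (userspace toy; the N^1 even/odd pairing and the helper names are my assumptions for illustration, not the actual patch):

/* Toy sketch: each cpu has exactly one buddy, so a perturbed pair can
 * only be pulled back together, never bounced around the package. */
#include <stdio.h>
#include <stdbool.h>

#define NR_CPUS 8

static bool cpu_idle[NR_CPUS];

/* One and only one buddy per cpu: cross-couple even/odd pairs. */
static int buddy_of(int cpu)
{
	return cpu ^ 1;
}

static int select_buddy_cpu(int prev_cpu)
{
	int buddy = buddy_of(prev_cpu);

	if (cpu_idle[buddy])
		return buddy;
	return prev_cpu;	/* no idle buddy: stay put, stay cache-warm */
}

int main(void)
{
	cpu_idle[3] = true;
	printf("buddy idle, pair up on cpu %d\n", select_buddy_cpu(2));
	cpu_idle[3] = false;
	printf("buddy busy, stay on cpu %d\n", select_buddy_cpu(2));
	return 0;
}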
So wrt integration, what if a buddy cpu were made a FIRST choice of generic wake balancing vs the ONLY choice of select_idle_sibling() as I did? If the buddy cpu is available, cool, perturbed pairs find each other and pair back up; if not, and you were here too recently, you stay with prev_cpu, avoiding the bounce and the high frequency traversal. All tasks can try the cheap buddy cpu first, and all can try the full domain as well, just not at insane rates.

The heavier the short term load average (or such, with instant decay on longish idle a la the idle balance throttle, so you ramp well), the longer the 'forget eating a full balance' interval becomes, with the cutoff at some point also affecting the cheap but not free cross-coupled buddy cpu. Looking for an idle cpu at hefty load is a waste of cycles at best, and plugging micro-holes does nothing good even if you find one, so forget wake balance entirely at some cutoff and let periodic balancing do its thing in peace.
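Roughly, in toy form (all names, the load cutoff, and the interval scaling are invented; this only sketches the buddy-first plus throttled-full-search shape, not real kernel code):

/* Toy sketch of buddy-first, load-throttled wake balance. */
#include <stdio.h>
#include <stdbool.h>

#define NR_CPUS 8

static bool cpu_idle[NR_CPUS];
static long long now_ns;
static long long last_full_balance_ns;

static int buddy_of(int cpu) { return cpu ^ 1; }

/* Hypothetical throttle: the heavier the short-term load, the longer
 * we skip the expensive full-domain search. */
static long long balance_interval_ns(int short_term_load)
{
	return 100000LL * short_term_load;	/* made-up scaling */
}

static int full_domain_search(int prev_cpu)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_idle[cpu])
			return cpu;
	return prev_cpu;
}

static int select_wake_cpu(int prev_cpu, int short_term_load)
{
	int buddy = buddy_of(prev_cpu);

	/* First choice: the cheap cross-coupled buddy. */
	if (cpu_idle[buddy])
		return buddy;

	/* At hefty load, hole-plugging is a waste: let periodic
	 * balancing do its thing in peace (made-up cutoff). */
	if (short_term_load > 16)
		return prev_cpu;

	/* Full-domain search, but not at insane rates. */
	if (now_ns - last_full_balance_ns >= balance_interval_ns(short_term_load)) {
		last_full_balance_ns = now_ns;
		return full_domain_search(prev_cpu);
	}
	return prev_cpu;
}

int main(void)
{
	cpu_idle[6] = true;
	now_ns = 1000000;
	printf("light load -> cpu %d\n", select_wake_cpu(2, 1));
	printf("too soon again -> cpu %d\n", select_wake_cpu(2, 1));
	printf("hefty load -> cpu %d\n", select_wake_cpu(2, 32));
	return 0;
}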
Hrmph, that's too many words, but basically, I think whacking select_idle_sibling() integration into wake balance makes loads of sense, but needs a bit more to not end up just moving the problems to a different spot.
I still have a 2.6-rt problem I need to find time to squabble with, but maybe I'll soonish see if what you did plus what I did combined works out on that 4x10 core box where current is _so_ unbelievably horrible. Heck, it can't get any worse, and the restricted wake balance alone kinda sorta worked.
-Mike