On Mon, 2013-01-07 at 10:59 +0530, Preeti U Murthy wrote:
Hi Mike, Thank you very much for your inputs. Just a few thoughts so that we are clear about the problems so far in scheduler scalability, and about the direction we ought to move in to correct them.
1. During fork or exec, the scheduler goes through find_idlest_group()
and find_idlest_cpu() in select_task_rq_fair() by iterating through all domains. Why then was a similar approach not followed for wakeup balancing? What was so different about wakeups (except that the woken-up task had to remain close to the prev/waking cpu) that we had to introduce select_idle_sibling() in the first place?
Cost. select_idle_sibling() was introduced specifically to recoup overlap by placing the wakee in a shared L2, where there is no pain really, it's all gain. L3 ain't the same deal, but it still does a good job, even for fast movers (on Intel), if you're not bouncing them all over the place. Breakeven depends on architecture, but critical for fast movers on all is that no bounce is allowed. Wakeup is a whole different thing than fork/exec balancing; there, you just want to spread to seed your future, and it doesn't matter much where you send wakees. With wakeup of fast movers, you lose on every migration, so buddies really really need to find each other fast, and stay put, so you really do recoup overlap instead of just paying again and again. My old Q6600 kicks the snot out of that 10 core Westmere CPU in a common desktop box cross-core scheduling scenario. That's sick.
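To make the idea concrete, here is a minimal user-space sketch of the select_idle_sibling() intent: prefer an idle CPU sharing a cache with the target, and stay put rather than bounce when nothing nearby is idle. The `NR_CPUS`/`LLC_SIZE` topology, the `cpu_idle` array, and the function name are all illustrative assumptions; the real kernel code walks sched domains and per-cpu runqueue state.

```c
#include <stdbool.h>

#define NR_CPUS  8
#define LLC_SIZE 4   /* CPUs per shared last-level cache: an assumption */

/* Hypothetical stand-in for per-cpu idle state (kernel uses runqueues). */
static bool cpu_idle[NR_CPUS];

/*
 * Sketch of the select_idle_sibling() idea: look for an idle CPU that
 * shares a cache with the target, otherwise keep the wakee on target.
 * This models exactly one LLC group; the kernel walks domain levels.
 */
static int select_idle_sibling_sketch(int target)
{
    int llc_first = (target / LLC_SIZE) * LLC_SIZE;

    if (cpu_idle[target])
        return target;          /* target itself is idle: no migration */

    for (int cpu = llc_first; cpu < llc_first + LLC_SIZE; cpu++)
        if (cpu_idle[cpu])
            return cpu;         /* idle CPU behind the shared cache: cheap win */

    return target;              /* nothing idle nearby: don't bounce */
}
```

Note how an idle CPU outside the shared cache is deliberately ignored: migrating there would forfeit the cache-warmth the optimization exists to exploit.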
2. To the best of my knowledge, the concept of a buddy cpu was introduced in select_idle_sibling() so as to avoid the entire package traversal and restrict it to the buddy cpus alone. But even during fork or exec, we iterate through all the sched domains, as I have mentioned above. Why did the buddy cpu solution not come to the rescue here as well?
I'm looking at (well, pondering) fork/exec. It may be a common case win too, but I kinda doubt it'll matter. It's the 1:1 waker/wakee relation that suffers mightily. The buddy was introduced specifically to bind 1:1 sw buddies (common) in hw as well. They are one, make it so. It was the simplest solution I could think of to both kill the evil and cut cost, restoring select_idle_sibling() to the optimization it was intended to be vs the pessimization it has turned into on large-L3 CPUs. I took it back to its functional roots: weld two modern cores together, create a core2duo on steroids, and the optimization works again and is cheap again.
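The buddy scheme above can be sketched in a few lines: pair CPUs up front so the wakeup path checks exactly one candidate instead of scanning a big L3 domain. The `cpu ^ 1` pairing (adjacent CPU numbers "welded together") and all names here are illustrative assumptions, not the actual patch.

```c
#include <stdbool.h>

#define NR_CPUS 8

/* Hypothetical per-cpu idle state for the sketch. */
static bool cpu_idle[NR_CPUS];

/*
 * Buddy sketch: each CPU has one fixed partner, so the wakeup-time
 * search is O(1). Pairing via cpu ^ 1 is an assumption made for
 * illustration; the real buddy was chosen when domains were built.
 */
static int select_idle_buddy_sketch(int target)
{
    int buddy = target ^ 1;

    if (cpu_idle[target])
        return target;          /* no migration needed */
    if (cpu_idle[buddy])
        return buddy;           /* exactly one candidate checked */
    return target;              /* buddy busy too: stay put */
}
```

The design trade-off is visible immediately: cost drops to a single check, at the price of never finding idle CPUs outside the pair.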
That the current behavior has its upsides too is.. most unfortunate :)
3. So the problem stands as: either avoid iterating through the entire package at the cost of less aggression in finding an idle cpu, or iterate through the package with the intention of finding the idlest cpu. To the best of my understanding the former is your approach, or commit 37407ea7; the latter is what I tried to do. But as you have rightly pointed out, my approach will have scaling issues. In this light, what does your best_combined patch (below) look like?
It doesn't exist anymore; it morphed into something ever less wonderful as the day went on. Integration needs to fiddle with find_* to cut some wasted cycles, and methinks reordered groups _may_ make it usable with no idle-buddy hint. Using the idle-buddy first looked promising, but the bottom line is that at higher load you use it for a modest fee, the buddy was busy, so you then do the full scan, paying a bigger fee on top of it, which of course crops the high end. Reorder should kill the low-load horror, but you've still got the problem that the closer you approach saturation, the less value any search has. That, and the logic is not necessarily as light as it must be when targeting wakeups, where every scheduler cycle detracts substantially from fast/light task performance.
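The double-fee failure mode described above can be made concrete with a sketch that counts candidates examined: when the buddy hint misses (the common case under load), the modest buddy fee is paid first and the full-scan fee lands on top of it. The counter, topology, and names are illustrative assumptions.

```c
#include <stdbool.h>

#define NR_CPUS 8

static bool cpu_idle[NR_CPUS];
static int  candidates_checked;  /* crude stand-in for search cost */

/*
 * Sketch of the combined approach's cost problem: check the idle-buddy
 * hint first, fall back to a full scan when it misses. Under load the
 * buddy is usually busy, so both fees get paid on most wakeups.
 */
static int combined_search_sketch(int target)
{
    int buddy = target ^ 1;

    candidates_checked = 1;
    if (cpu_idle[buddy])
        return buddy;                    /* modest fee, done */

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        candidates_checked++;
        if (cpu_idle[cpu])
            return cpu;                  /* big fee paid on top of the small one */
    }
    return target;
}
```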
Do you introduce a cut-off value on the load to decide which approach to take?
I fiddled with that, but then whacked it, because the result needs to be nice to 1:1 and 1:N which may need to spread at high frequency, so the cutoff would be harmful. Seems likely either a cutoff or 1:1 vs 1:N detection or _something_ will be needed to cut cost when it's flat not affordable. With 1:N pgbench (and ilk), there's no such thing as too expensive a search, but for fast/light network stuff there is.
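For illustration, a cutoff of the kind fiddled with (and then whacked) above might look like the sketch below: the O(1) buddy check always runs, but the wide scan is skipped past a load threshold. The 75% threshold and the `domain_load_pct` parameter are purely hypothetical; as noted above, a fixed cutoff punishes 1:N loads like pgbench, for which no search is too expensive.

```c
#include <stdbool.h>

#define NR_CPUS 8

static bool cpu_idle[NR_CPUS];

/*
 * Cutoff sketch: cheap buddy check unconditionally, full scan only
 * below a load threshold. Threshold and "domain load" argument are
 * illustrative assumptions, not anything from the real patch.
 */
static int select_cpu_cutoff_sketch(int target, int domain_load_pct)
{
    int buddy = target ^ 1;

    if (cpu_idle[buddy])
        return buddy;                   /* O(1) check is always affordable */

    if (domain_load_pct > 75)
        return target;                  /* near saturation: searching has little value */

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (cpu_idle[cpu])
            return cpu;                 /* wide scan only at lower load */

    return target;
}
```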
-Mike