On Mon, 2013-01-07 at 10:59 +0530, Preeti U Murthy wrote:
Hi Mike, Thank you very much for your inputs. Just a few thoughts so that we are clear about the problems so far in scheduler scalability, and about the direction we ought to move in to correct them.
1. During fork or exec, the scheduler goes through find_idlest_group()
and find_idlest_cpu() in select_task_rq_fair() by iterating through all domains. Why then was a similar approach not followed for wakeup balancing? What was so different about wakeups (except that the woken-up task had to remain close to the prev/waking cpu) that we had to introduce select_idle_sibling() in the first place?
Cost. select_idle_sibling() was introduced specifically to recoup overlap by placing the wakee in a shared L2, where there is no pain really, it's all gain. L3 ain't the same deal, but it still does a good job, even for fast movers (on Intel), if you're not bouncing them all over the place. Breakeven depends on architecture, but critical for fast movers on all is that no bounce is allowed. Wakeup is a whole different thing than fork/exec balancing; there, you just want to spread to seed your future, and it doesn't matter much where you send wakees. With wakeup of fast movers, you lose on every migration, so buddies really really need to find each other fast, and stay put, so you really do recoup overlap instead of just paying again and again. My old Q6600 kicks the snot out of that 10 core Westmere CPU in a common desktop box cross-core scheduling scenario. That's sick.
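To make the idea concrete, here is a minimal user-space sketch of the select_idle_sibling() intent: prefer an idle CPU sharing a cache with the target, and stay put rather than bounce when nothing nearby is idle. The `NR_CPUS`/`LLC_SIZE` topology, the `cpu_idle` array, and the function name are all illustrative assumptions; the real kernel code walks sched domains and per-cpu runqueue state.

```c
#include <stdbool.h>

#define NR_CPUS  8
#define LLC_SIZE 4   /* CPUs per shared last-level cache: an assumption */

/* Hypothetical stand-in for per-cpu idle state (kernel uses runqueues). */
static bool cpu_idle[NR_CPUS];

/*
 * Sketch of the select_idle_sibling() idea: look for an idle CPU that
 * shares a cache with the target, otherwise keep the wakee on target.
 * This models exactly one LLC group; the kernel walks domain levels.
 */
static int select_idle_sibling_sketch(int target)
{
    int llc_first = (target / LLC_SIZE) * LLC_SIZE;

    if (cpu_idle[target])
        return target;          /* target itself is idle: no migration */

    for (int cpu = llc_first; cpu < llc_first + LLC_SIZE; cpu++)
        if (cpu_idle[cpu])
            return cpu;         /* idle CPU behind the shared cache: cheap win */

    return target;              /* nothing idle nearby: don't bounce */
}
```

Note how an idle CPU outside the shared cache is deliberately ignored: migrating there would forfeit the cache-warmth the optimization exists to exploit.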
2. To the best of my knowledge, the concept of a buddy cpu was introduced in select_idle_sibling() so as to avoid the entire package traversal and restrict it to the buddy cpus alone. But even during fork or exec, we iterate through all the sched domains, as I have mentioned above. Why did the buddy cpu solution not come to the rescue here as well?
I'm looking at (well, pondering) fork/exec. It may be a common case win too, but I kinda doubt it'll matter. It's the 1:1 waker/wakee relation that suffers mightily. The buddy was introduced specifically to bind 1:1 sw buddies (common) in hw as well. They are one, make it so. It was the simplest solution I could think of to both kill the evil and cut cost, restoring select_idle_sibling() to the optimization it was intended to be vs the pessimization it has turned into on large-L3 CPUs. I took it back to its functional roots: weld two modern cores together, create a core2duo on steroids, and the optimization works again and is cheap again.
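The buddy scheme above can be sketched in a few lines: pair CPUs up front so the wakeup path checks exactly one candidate instead of scanning a big L3 domain. The `cpu ^ 1` pairing (adjacent CPU numbers "welded together") and all names here are illustrative assumptions, not the actual patch.

```c
#include <stdbool.h>

#define NR_CPUS 8

/* Hypothetical per-cpu idle state for the sketch. */
static bool cpu_idle[NR_CPUS];

/*
 * Buddy sketch: each CPU has one fixed partner, so the wakeup-time
 * search is O(1). Pairing via cpu ^ 1 is an assumption made for
 * illustration; the real buddy was chosen when domains were built.
 */
static int select_idle_buddy_sketch(int target)
{
    int buddy = target ^ 1;

    if (cpu_idle[target])
        return target;          /* no migration needed */
    if (cpu_idle[buddy])
        return buddy;           /* exactly one candidate checked */
    return target;              /* buddy busy too: stay put */
}
```

The design trade-off is visible immediately: cost drops to a single check, at the price of never finding idle CPUs outside the pair.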
That the current behavior has its upsides too is.. most unfortunate :)
3. So the problem stands as: either avoid iterating through the entire package at the cost of less aggression in finding an idle cpu, or iterate through the package with the intention of finding the idlest cpu. To the best of my understanding the former is your approach, or commit 37407ea7; the latter is what I tried to do. But as you have rightly pointed out, my approach will have scaling issues. In this light, what does your best_combined patch (below) look like?
It doesn't exist anymore; it morphed into something ever less wonderful as the day went on. Integration needs to fiddle with find_* to cut some wasted cycles, and methinks reordered groups _may_ make it usable with no idle-buddy hint. Using the idle-buddy first looked promising, but the bottom line is that at higher load you use it for a modest fee, the buddy was busy, so you then do the full scan, paying a bigger fee on top of it, which of course crops the high end. Reorder should kill the low-load horror, but you've still got the problem that the closer you approach saturation, the less value any search has. That, and the logic is not necessarily as light as it must be when targeting wakeups, where every scheduler cycle detracts substantially from fast/light task performance.
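The double-fee failure mode described above can be made concrete with a sketch that counts candidates examined: when the buddy hint misses (the common case under load), the modest buddy fee is paid first and the full-scan fee lands on top of it. The counter, topology, and names are illustrative assumptions.

```c
#include <stdbool.h>

#define NR_CPUS 8

static bool cpu_idle[NR_CPUS];
static int  candidates_checked;  /* crude stand-in for search cost */

/*
 * Sketch of the combined approach's cost problem: check the idle-buddy
 * hint first, fall back to a full scan when it misses. Under load the
 * buddy is usually busy, so both fees get paid on most wakeups.
 */
static int combined_search_sketch(int target)
{
    int buddy = target ^ 1;

    candidates_checked = 1;
    if (cpu_idle[buddy])
        return buddy;                    /* modest fee, done */

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        candidates_checked++;
        if (cpu_idle[cpu])
            return cpu;                  /* big fee paid on top of the small one */
    }
    return target;
}
```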
Do you introduce a cut-off value on the load to decide which approach to take?
I fiddled with that, but then whacked it, because the result needs to be nice to 1:1 and 1:N which may need to spread at high frequency, so the cutoff would be harmful. Seems likely either a cutoff or 1:1 vs 1:N detection or _something_ will be needed to cut cost when it's flat not affordable. With 1:N pgbench (and ilk), there's no such thing as too expensive a search, but for fast/light network stuff there is.
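For illustration, a cutoff of the kind fiddled with (and then whacked) above might look like the sketch below: the O(1) buddy check always runs, but the wide scan is skipped past a load threshold. The 75% threshold and the `domain_load_pct` parameter are purely hypothetical; as noted above, a fixed cutoff punishes 1:N loads like pgbench, for which no search is too expensive.

```c
#include <stdbool.h>

#define NR_CPUS 8

static bool cpu_idle[NR_CPUS];

/*
 * Cutoff sketch: cheap buddy check unconditionally, full scan only
 * below a load threshold. Threshold and "domain load" argument are
 * illustrative assumptions, not anything from the real patch.
 */
static int select_cpu_cutoff_sketch(int target, int domain_load_pct)
{
    int buddy = target ^ 1;

    if (cpu_idle[buddy])
        return buddy;                   /* O(1) check is always affordable */

    if (domain_load_pct > 75)
        return target;                  /* near saturation: searching has little value */

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (cpu_idle[cpu])
            return cpu;                 /* wide scan only at lower load */

    return target;
}
```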
-Mike