Hi Mike,

Thank you very much for your inputs. Just a few thoughts, so that we are clear about the problems seen so far in scheduler scalability and the direction we ought to take to correct them.
1. During fork or exec, the scheduler goes through find_idlest_group() and find_idlest_cpu() in select_task_rq_fair() by iterating through all the sched domains. Why then was a similar approach not followed for wake-up balancing? What was so different about wake-ups (except that the woken task had to remain close to the prev/waking cpu) that we had to introduce select_idle_sibling() in the first place?
2. To the best of my knowledge, the concept of a buddy cpu was introduced in select_idle_sibling() so as to avoid traversing the entire package and to restrict the search to the buddy cpus alone. But even during fork or exec, we iterate through all the sched domains, as I have mentioned above. Why did the buddy-cpu solution not come to the rescue here as well?
3. So the real trade-off stands as: avoid iterating through the entire package at the cost of being less aggressive in finding an idle cpu, or iterate through the package with the intention of finding the idlest cpu. To the best of my understanding, the former is your approach (commit 37407ea7), while the latter is what I tried to do. But as you have rightly pointed out, my approach will have scaling issues. In this light, what does your best_combined patch (below) look like? Do you introduce a cut-off value on the loads to decide which approach to take?
Meanwhile, I will also run tbench and a few other benchmarks to find out why the results look like the ones below. I will update you very soon on this.
Thank you
Regards Preeti U Murthy
On 01/06/2013 10:02 PM, Mike Galbraith wrote:
On Sat, 2013-01-05 at 09:13 +0100, Mike Galbraith wrote:
I still have a 2.6-rt problem I need to find time to squabble with, but maybe I'll soonish see if what you did plus what I did combined works out on that 4x10 core box where current is _so_ unbelievably horrible. Heck, it can't get any worse, and the restricted wake balance alone kinda sorta worked.
Actually, I flunked copy/paste 101. Below (preeti) shows the real deal.
tbench, 3 runs, 30 secs/run        revert = 37407ea7 reverted

clients                   1        5       10       20       40       80
3.6.0.virgin          27.83   139.50  1488.76  4172.93  6983.71  8301.73
                      29.23   139.98  1500.22  4162.92  6907.16  8231.13
                      30.00   141.43  1500.09  3975.50  6847.24  7983.98
3.6.0+revert         281.08  1404.76  2802.44  5019.49  7080.97  8592.80
                     282.38  1375.70  2747.23  4823.95  7052.15  8508.45
                     270.69  1375.53  2736.29  5243.05  7058.75  8806.72
3.6.0+preeti          26.43   126.62  1027.23  3350.06  7004.22  7561.83
                      26.67   128.66   922.57  3341.73  7045.05  7662.18
                      25.54   129.20  1015.02  3337.60  6591.32  7634.33
3.6.0+best_combined  280.48  1382.07  2730.27  4786.20  6477.28  7980.07
                     276.88  1392.50  2708.23  4741.25  6590.99  7992.11
                     278.92  1368.55  2735.49  4614.99  6573.38  7921.75
3.0.51-0.7.9-default 286.44  1415.37  2794.41  5284.39  7282.57 13670.80
Something is either wrong with 3.6 itself, or the config I'm using, as max throughput is nowhere near where it should be (see default). On the bright side, integrating the two does show some promise.
-Mike