- eas-dev - lists.linaro.org

Question about change in EAS 1.4 (from ACK 4.4)

by Zachariah Kennedy

Works better with a subject! ;)

Hey guys,

This is a question for Brendan Jackman, but feel free to chime in if you know the answer.

I am having an issue when pulling in the new EAS 1.4 changes from ACK 4.4. Mainly, I am getting a warning from:
https://android.googlesource.com/kernel/common.git/+/a21299785a502ca4b3592a…

You can see the warning below:

c0 1865 [20171029_10:29:47.834626]@0 PC is at build_sched_domains+0xc00/0xcc8
c0 1865 [20171029_10:29:47.834632]@0 LR is at build_sched_domains+0xc00/0xcc8
c0 1865 [20171029_10:29:47.834637]@0 pc : [<ffffff84000d2758>] lr : [<ffffff84000d2758>] pstate: 60000145
c0 1865 [20171029_10:29:47.834641]@0 sp : ffffffcac19b3800
c0 1865 [20171029_10:29:47.834645]@0 x29: ffffffcac19b3800 x28: ffffff8401df7ee4
c0 1865 [20171029_10:29:47.834652]@0 x27: ffffffcae6626480 x26: ffffff8401e08858
c0 1865 [20171029_10:29:47.834658]@0 x25: ffffffcaf35fc780 x24: ffffff8400f77238
c0 1865 [20171029_10:29:47.834665]@0 x23: ffffff8401df85a0 x22: ffffff8401777400
c0 1865 [20171029_10:29:47.834672]@0 x21: 0000000000000008 x20: ffffff8401777400
c0 1865 [20171029_10:29:47.834678]@0 x19: ffffff8401df7ee4 x18: 00000000ffffffe8
c0 1865 [20171029_10:29:47.834684]@0 x17: 0000000000000000 x16: 0000000000000000
c0 1865 [20171029_10:29:47.834691]@0 x15: ffffff8401e16850 x14: 6465686373206572
c0 1865 [20171029_10:29:47.834697]@0 x13: 6177612079677265 x12: 6e6520726f662061
c0 1865 [20171029_10:29:47.834703]@0 x11: 74616420676e6973 x10: 73694d2030405d35
c0 1865 [20171029_10:29:47.834709]@0 x9 : 37353433382e3734 x8 : ffffffcaf46402ab
c0 1865 [20171029_10:29:47.834715]@0 x7 : 0000000000000000 x6 : 000002257b061a96
c0 1865 [20171029_10:29:47.834721]@0 x5 : 00ffffffffffffff x4 : 0000000000000000
c0 1865 [20171029_10:29:47.834727]@0 x3 : 0000000000000140 x2 : a2032cf00b50bf18
c0 1865 [20171029_10:29:47.834734]@0 x1 : 0000000000000000 x0 : 0000000000000045
c0 1865 [20171029_10:29:47.834740]@0
c0 1865 PC: 0xffffff84000d2718:
c0 1865 [20171029_10:29:47.834744]@0 2718 9120bc21 39402424 35ffec84 d4210000 52800024 39002424 17ffff60 d503201f
c0 1865 [20171029_10:29:47.834756]@0 2738 9400e4a6 d503201f 97ffdcfc 72001c1f 54ffe501 b0009ac0 911a8000 9402a9cb
c0 1865 [20171029_10:29:47.834767]@0 2758 d4210000 17ffff23 aa1403e0 9403d705 12800160 f9006fbf b90067a0 17ffff20
c0 1865 [20171029_10:29:47.834778]@0 2778 97ff3ab2 b9401005 b9401321 6b0100bf 54fffa81 34fff6a5 f9400c02 f9400f21
c0 1865 [20171029_10:29:47.834790]@0
c0 1865 LR: 0xffffff84000d2718:
c0 1865 [20171029_10:29:47.834794]@0 2718 9120bc21 39402424 35ffec84 d4210000 52800024 39002424 17ffff60 d503201f
c0 1865 [20171029_10:29:47.834806]@0 2738 9400e4a6 d503201f 97ffdcfc 72001c1f 54ffe501 b0009ac0 911a8000 9402a9cb
c0 1865 [20171029_10:29:47.834817]@0 2758 d4210000 17ffff23 aa1403e0 9403d705 12800160 f9006fbf b90067a0 17ffff20
c0 1865 [20171029_10:29:47.834828]@0 2778 97ff3ab2 b9401005 b9401321 6b0100bf 54fffa81 34fff6a5 f9400c02 f9400f21
c0 1865 [20171029_10:29:47.834840]@0
c0 1865 SP: 0xffffffcac19b37c0:
c0 1865 [20171029_10:29:47.834844]@0 37c0 000d2758 ffffff84 c19b3800 ffffffca 000d2758 ffffff84 60000145 00000000
c0 1865 [20171029_10:29:47.834855]@0 37e0 00000008 00000000 000000ff 00000000 00000000 00000080 f3405d50 ffffffca
c0 1865 [20171029_10:29:47.834867]@0 3800 c19b38f0 ffffffca 000d2bbc ffffff84 00000000 00000000 01feacd0 ffffff84
c0 1865 [20171029_10:29:47.834878]@0 3820 01feab00 ffffff84 00000000 00000000 01feab00 ffffff84 00000004 00000000
c0 1865 [20171029_10:29:47.834890]@0
c0 1865 [20171029_10:29:47.834894]@0 ---[ end trace f7934377fe8659bc ]---
c0 1865 [20171029_10:29:47.834899]@0 Call trace:
c0 1865 [20171029_10:29:47.834904]@0 Exception stack(0xffffffcac19b3610 to 0xffffffcac19b3740)
c0 1865 [20171029_10:29:47.834910]@0 3600: ffffff8401df7ee4 0000008000000000
c0 1865 [20171029_10:29:47.834917]@0 3620: ffffffcac19b3800 ffffff84000d2758 0000000060000145 ffffff8401777400
c0 1865 [20171029_10:29:47.834923]@0 3640: ffffff8401df85a0 ffffff8400f77238 ffffffcaf35fc780 ffffff8401e08858
c0 1865 [20171029_10:29:47.834930]@0 3660: ffffffcae6626480 ffffff8401df7ee4 ffffffcac19b36c0 ffffff8401fecb90
c0 1865 [20171029_10:29:47.834937]@0 3680: 0000000000000000 00004d1712d78a33 ffffff8401fed000 00000000fcbeb400
c0 1865 [20171029_10:29:47.834943]@0 36a0: ffffff8401fed550 0000000000000140 ffffffcac19b3800 ffffffcac19b3800
c0 1865 [20171029_10:29:47.834950]@0 36c0: ffffffcac19b37c0 a2032cf00b50bf18 0000000000000045 0000000000000000
c0 1865 [20171029_10:29:47.834957]@0 36e0: a2032cf00b50bf18 0000000000000140 0000000000000000 00ffffffffffffff
c0 1865 [20171029_10:29:47.834964]@0 3700: 000002257b061a96 0000000000000000 ffffffcaf46402ab 37353433382e3734
c0 1865 [20171029_10:29:47.834970]@0 3720: 73694d2030405d35 74616420676e6973 6e6520726f662061 6177612079677265
c0 1865 [20171029_10:29:47.834977]@0 [<ffffff84000d2758>] build_sched_domains+0xc00/0xcc8
c0 1865 [20171029_10:29:47.834983]@0 [<ffffff84000d2bbc>] partition_sched_domains+0x35c/0x410
c0 1865 [20171029_10:29:47.834990]@0 [<ffffff84000d2cb0>] cpuset_cpu_active+0x40/0x78
c0 1865 [20171029_10:29:47.834997]@0 [<ffffff84000c0a80>] notifier_call_chain+0x50/0x90
c0 1865 [20171029_10:29:47.835005]@0 [<ffffff84000c0be4>] __raw_notifier_call_chain+0xc/0x18
c0 1865 [20171029_10:29:47.835013]@0 [<ffffff84000a16e8>] cpu_notify+0x28/0x48
c0 1865 [20171029_10:29:47.835019]@0 [<ffffff84000a200c>] _cpu_up+0x23c/0x250
c0 1865 [20171029_10:29:47.835026]@0 [<ffffff84000a25cc>] enable_nonboot_cpus+0xc4/0x258
c0 1865 [20171029_10:29:47.835032]@0 [<ffffff84000fcb84>] suspend_enter+0x304/0x5f8
c0 1865 [20171029_10:29:47.835038]@0 [<ffffff84000fcf4c>] suspend_devices_and_enter+0xd4/0x310
c0 1865 [20171029_10:29:47.835045]@0 [<ffffff84000fd630>] pm_suspend+0x4a8/0x640
c0 1865 [20171029_10:29:47.835051]@0 [<ffffff84000fba84>] state_store+0x94/0xa8
c0 1865 [20171029_10:29:47.835058]@0 [<ffffff84003b642c>] kobj_attr_store+0x14/0x28
c0 1865 [20171029_10:29:47.835066]@0 [<ffffff8400243178>] sysfs_kf_write+0x48/0x58
c0 1865 [20171029_10:29:47.835073]@0 [<ffffff840024258c>] kernfs_fop_write+0xbc/0x190
c0 1865 [20171029_10:29:47.835080]@0 [<ffffff84001d2bb4>] __vfs_write+0x34/0xf8
c0 1865 [20171029_10:29:47.835086]@0 [<ffffff84001d34cc>] vfs_write+0x8c/0x178
c0 1865 [20171029_10:29:47.835093]@0 [<ffffff84001d3f64>] SyS_write+0x5c/0xc8
c0 1865 [20171029_10:29:47.835100]@0 [<ffffff8400084630>] el0_svc_naked+0x24/0x28
c0 1865 [20171029_10:29:47.836943]@0 Missing data for energy aware scheduling
c0 1865 [20171029_10:29:47.836950]@0 ------------[ cut here ]------------

The warning only occurs during suspend, or if I manually take one or more cores offline. It just floods the log during suspend.

Is this the expected behavior?

Kind Regards,
Zachariah Kennedy

8 years, 6 months



[RFC eas-dev] sched: Consider RT/IRQ pressure in capacity_spare_wake

by Joel Fernandes

capacity_spare_wake in the slow path influences the choice of idlest groups, as we search for groups with maximum spare capacity. In scenarios where RT pressure is high, a sub-optimal group can be chosen and hurt the performance of the task being woken up.

Several tests with results are included below to show improvements with this change.

1) Hackbench on Pixel 2 Android device (4x4 ARM64 Octa core)
------------------------------------------------------------
Here we have RT activity running on the big CPU cluster, induced with rt-app, and hackbench running in parallel. The RT tasks are bound to the 4 CPUs of the big cluster (cpu 4,5,6,7) and have 100ms periodicity with runtime=20ms sleep=80ms.

Hackbench shows a big improvement when the number of tasks is 8 (30.7%) and 32 (11.6%):

Note: data is completion time in seconds (lower is better).
Number of loops for 8 and 16 tasks is 50000, and for 32 tasks it's 20000.

+--------+-----+-------+-------------------+---------------------------+
| groups | fds | tasks |   Without Patch   |        With Patch         |
+--------+-----+-------+---------+---------+-----------------+---------+
|        |     |       | Mean    | Stdev   | Mean            | Stdev   |
+--------+-----+-------+---------+---------+-----------------+---------+
| 1      | 8   | 8     | 1.0534  | 0.13722 | 0.7293 (+30.7%) | 0.02653 |
| 2      | 8   | 16    | 1.6219  | 0.16631 | 1.6391 (-1%)    | 0.24001 |
| 4      | 8   | 32    | 1.2538  | 0.13086 | 1.1080 (+11.6%) | 0.16201 |
+--------+-----+-------+---------+---------+-----------------+---------+

2) Rohit ran barrier.c test (details below) with following improvements:
------------------------------------------------------------------------
This was Rohit's original use case for a patch he posted at [1]. However, his recent tests showed that my patch can replace his slow path changes [1], and that there's no need to selectively scan/skip CPUs in find_idlest_group_cpu in the slow path to get the improvement he sees.

barrier.c (OpenMP code) is used as a micro-benchmark. It does a number of iterations, with a barrier sync at the end of each for loop.

Here barrier.c is running along with ping on CPU 0 and 1 as:
'ping -l 10000 -q -s 10 -f hostX'

barrier.c can be found at:
http://www.spinics.net/lists/kernel/msg2506955.html

Following are the results for the iterations per second with this micro-benchmark (higher is better), on a 44 core, 2 socket, 88 thread Intel x86 machine:

+--------+------------------+---------------------------+
|Threads |  Without patch   |        With patch         |
+--------+--------+---------+-----------------+---------+
|        | Mean   | Std Dev | Mean            | Std Dev |
+--------+--------+---------+-----------------+---------+
|1       | 539.36 | 60.16   | 572.54 (+6.15%) | 40.95   |
|2       | 481.01 | 19.32   | 530.64 (+10.32%)| 56.16   |
|4       | 474.78 | 22.28   | 479.46 (+0.99%) | 18.89   |
|8       | 450.06 | 24.91   | 447.82 (-0.50%) | 12.36   |
|16      | 436.99 | 22.57   | 441.88 (+1.12%) | 7.39    |
|32      | 388.28 | 55.59   | 429.4  (+10.59%)| 31.14   |
|64      | 314.62 | 6.33    | 311.81 (-0.89%) | 11.99   |
+--------+--------+---------+-----------------+---------+

3) ping+hackbench test on bare-metal server (Rohit ran this test)
-----------------------------------------------------------------
Here hackbench is running in threaded mode, along with ping running on CPU 0 and 1 as:
'ping -l 10000 -q -s 10 -f hostX'

This test is running on a 2 socket, 20 core, 40 thread Intel x86 machine:
Number of loops is 10000 and runtime is in seconds (lower is better).

+--------------+-----------------+--------------------------+
|Task Groups   |  Without patch  |        With patch        |
|              +-------+---------+----------------+---------+
|(Groups of 40)| Mean  | Std Dev | Mean           | Std Dev |
+--------------+-------+---------+----------------+---------+
|1             | 0.851 | 0.007   | 0.828 (+2.77%) | 0.032   |
|2             | 1.083 | 0.203   | 1.087 (-0.37%) | 0.246   |
|4             | 1.601 | 0.051   | 1.611 (-0.62%) | 0.055   |
|8             | 2.837 | 0.060   | 2.827 (+0.35%) | 0.031   |
|16            | 5.139 | 0.133   | 5.107 (+0.63%) | 0.085   |
|25            | 7.569 | 0.142   | 7.503 (+0.88%) | 0.143   |
+--------------+-------+---------+----------------+---------+

[1] https://patchwork.kernel.org/patch/9991635/

Cc: Dietmar Eggemann <dietmar.eggemann(a)arm.com>
Cc: Vincent Guittot <vincent.guittot(a)linaro.org>
Cc: Morten Rasmussen <morten.rasmussen(a)arm.com>
Cc: Brendan Jackman <brendan.jackman(a)arm.com>
Cc: Matt Fleming <matt(a)codeblueprint.co.uk>
Tested-by: Rohit Jain <rohit.k.jain(a)oracle.com>
Signed-off-by: Joel Fernandes <joelaf(a)google.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 740602ce799f..487e485b3560 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5742,7 +5742,7 @@ static int cpu_util_wake(int cpu, struct task_struct *p);

 static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
 {
-	return capacity_orig_of(cpu) - cpu_util_wake(cpu, p);
+	return max_t(long, capacity_of(cpu) - cpu_util_wake(cpu, p), 0);
 }

 /*
--
2.15.0.rc2.357.g7e34df9404-goog
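To make the effect of the one-line change above concrete, here is a minimal standalone sketch. It is not kernel code: the capacity and utilization values that the kernel reads per CPU are passed in as plain numbers, and the function names spare_capacity_old and spare_capacity_new are made up for this illustration.

```c
/*
 * Standalone model of the capacity_spare_wake() change. In the kernel the
 * inputs come from capacity_orig_of(cpu), capacity_of(cpu) and
 * cpu_util_wake(cpu, p); here they are plain parameters (an assumption
 * made for illustration only).
 */

/* Old behaviour: spare capacity measured against the original capacity,
 * so capacity consumed by RT/IRQ activity is not taken into account.
 * Assumes util <= capacity_orig. */
static unsigned long spare_capacity_old(unsigned long capacity_orig,
					unsigned long util)
{
	return capacity_orig - util;
}

/* New behaviour: spare capacity measured against the capacity left for
 * fair tasks, clamped at 0 because util can exceed that capacity. */
static unsigned long spare_capacity_new(unsigned long capacity,
					unsigned long util)
{
	/* Signed subtraction so a negative result can be detected. */
	long spare = (long)capacity - (long)util;

	return spare > 0 ? (unsigned long)spare : 0;
}
```

For example, on a CPU with original capacity 1024 where RT pressure leaves 600 for fair tasks, a utilization of 300 gives 724 with the old calculation and 300 with the new one, so a group made of such CPUs looks much less attractive in the slow path.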

8 years, 6 months


Improvement by replacing capacity_orig_of with capacity_of in wakeup

by Joel Fernandes

Hi,

I tried an experiment this weekend - basically I have RT threads bound to big CPUs running a fixed period load, with hackbench running with all CPUs allowed. The system is a Pixel 2 ARM big.LITTLE 8-core (4x4).

Basically, I changed capacity_orig_of to capacity_of in capacity_spare_wake and wake_cap, and I see a good performance improvement. That makes sense because wake_cap would send the task wakeup to the slow path if RT activity was eating into the CFS capacity on the prev/current CPU, and capacity_spare_wake would find a better group, since spare capacity is now reduced by the RT pressure.

One of the concerns I had about such a change to wake_cap was that it might affect upstream cases that may still want to do a select_idle_sibling even if the capacity on the previous/waker's CPU was not enough after deducting RT pressure. In that case, I think the wake_cap change to use capacity_of would cause those cases to enter the slow path.

Could you let me know your thoughts about such a change? I heard that capacity_of was attempted before and there might be some cases to consider. Anything from your previous experiences with this change that you could share?

At least for capacity_spare_wake, the improvements seem to be worthwhile, and dramatic in some cases.

I also have some more changes I am thinking of making to find_idlest_group, but I wanted to start a discussion on the spare capacity idea first.

This is related to Rohit's work on RT capacity awareness. I was talking to him and we were discussing ideas on the implementation.

thanks,

- Joel
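The wake_cap concern above can be illustrated with a small standalone sketch. It is loosely modeled on how upstream wake_cap looked around this time, but the 1280/1024 margin, the 1/8 threshold and the function name wake_cap_sketch are assumptions for illustration, not a copy of the kernel function.

```c
#include <stdbool.h>

/* Assumed margin: a task fits a CPU if its utilization stays below
 * roughly 80% of the capacity (1024/1280). */
#define CAPACITY_MARGIN 1280UL

/*
 * Returns true when the wakeup should be diverted to the slow path.
 * cap_prev and cap_cur are whichever capacity signal is compared: the
 * original capacity (current behaviour) or the capacity left for fair
 * tasks after RT/IRQ pressure (the experiment described above).
 * Assumes max_cap >= both capacities.
 */
static bool wake_cap_sketch(unsigned long cap_prev, unsigned long cap_cur,
			    unsigned long max_cap, unsigned long task_util)
{
	unsigned long min_cap = cap_prev < cap_cur ? cap_prev : cap_cur;

	/* Smallest capacity is within 1/8 of the maximum: stay on the
	 * fast path. */
	if (max_cap - min_cap < (max_cap >> 3))
		return false;

	/* Slow path if the task does not fit min_cap with some headroom. */
	return min_cap * 1024 < task_util * CAPACITY_MARGIN;
}
```

With two CPUs of original capacity 1024, the check never fires. If RT pressure cuts the previous CPU's usable capacity to 500 and that value is fed in instead, a task with utilization 500 is sent to the slow path while a task with utilization 300 is not, which is the behavioural change the question is about.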

8 years, 6 months


[PATCH 0/3] sched/fair: Remote load updates for idle CPUs

by Brendan Jackman

The blocked load and shares of root cfs_rqs are currently only updated by the CPU owning the rq. That means if a CPU goes suddenly from being busy to totally idle, its load and shares are not updated. Schedutil works around this problem by ignoring the util of CPUs that were last updated more than a tick ago. However, the stale load does impact task placement: elements that look at load and util (in particular the slow path of select_task_rq_fair) can leave the idle CPUs unused while other CPUs become unnecessarily overloaded. Furthermore, the stale shares can impact CPU time allotment.

Two complementary solutions are proposed here:

1. When a task wakes up, if necessary an idle CPU is woken as if to perform a NOHZ idle balance, which is then aborted once the load of NOHZ idle CPUs has been updated. This solves the problem but brings with it extra CPU wakeups, which have an energy cost.

2. During newly-idle load balancing, the load of remote nohz-idle CPUs in the sched_domain is updated. When all of the idle CPUs were updated in that step, the nohz.next_update field is pushed further into the future. This field is used to determine the need for triggering the newly-added NOHZ kick. So if such newly-idle balances are happening often enough, no additional CPU wakeups are required to keep all the CPUs' loads updated.

[eas-dev] Patch 2/3 here is to highlight a change I made from Vincent's original patch, so that it can be reviewed more easily - if the modification is accepted then I'll squash it before posting this to LKML proper.

Brendan Jackman (2):
  sched/fair: Refactor nohz blocked load updates
  sched/fair: Update blocked load from newly idle balance

Vincent Guittot (1):
  sched: force update of blocked load of idle cpus

 kernel/sched/core.c  |   1 +
 kernel/sched/fair.c  | 106 ++++++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h |   2 +
 3 files changed, 96 insertions(+), 13 deletions(-)

--
2.14.1
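How the two solutions interact through nohz.next_update can be sketched as follows. This is a minimal standalone model, not the code from the series: struct nohz_sketch, the helper names and the period parameter are all hypothetical, and only the next_update idea is taken from the cover letter.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the nohz bookkeeping described above. */
struct nohz_sketch {
	unsigned long next_update;	/* time after which a kick is needed */
};

/* Wraparound-safe "a is after b", in the style of the kernel's
 * time_after() helper. */
static bool time_after_sketch(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Solution 1: on task wakeup, decide whether an idle CPU must be kicked
 * to refresh the blocked load of the nohz-idle CPUs. */
static bool need_nohz_kick(const struct nohz_sketch *nohz, unsigned long now)
{
	return time_after_sketch(now, nohz->next_update);
}

/*
 * Solution 2: a newly-idle balance refreshed some remote idle CPUs. The
 * next kick is only postponed if every idle CPU was covered.
 */
static void newidle_updated(struct nohz_sketch *nohz, unsigned long now,
			    int idle_cpus, int updated_cpus,
			    unsigned long period)
{
	if (updated_cpus == idle_cpus)
		nohz->next_update = now + period;
}
```

If newly-idle balances keep covering all idle CPUs, next_update keeps moving ahead of the current time and need_nohz_kick keeps returning false, so no extra wakeups are generated.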

8 years, 6 months


cpu_util() after use cumulative_runnable_avg always hit 0

by Ke Wang

Hi Joonwoo,

Recently, I backported the EAS 1.3 related patches on the Google android-4.4 branch (the latest commit is ec888d46d8993b2bf205ed375e538a3819c23659) to a SPREADTRUM platform (kernel 3.18, 4 A53 LITTLE + 4 A53 big), and enabled the WALT signal, tracing util_avg_pelt (avg.util_avg), util_avg_walt (cumulative_runnable_avg) and util_avg_freq (prev_runnable_sum) at the same time.

The event ftrace (a game scenario: Subway Surf) is as below:

<idle>-0 [002] dn.3 53.899765: sched_load_avg_cpu: cpu=2 load_avg=306 util_avg=27 util_avg_pelt=57 util_avg_walt=27 util_avg_freq=69
<idle>-0 [000] d.s5 53.899766: sched_load_avg_cpu: cpu=0 load_avg=385 util_avg=0 util_avg_pelt=115 util_avg_walt=0 util_avg_freq=121
UnityMain-4608 [006] d..3 53.899773: sched_load_avg_cpu: cpu=6 load_avg=964 util_avg=0 util_avg_pelt=923 util_avg_walt=0 util_avg_freq=678
UnityMain-4608 [006] d..3 53.899774: sched_load_avg_cpu: cpu=6 load_avg=964 util_avg=0 util_avg_pelt=923 util_avg_walt=0 util_avg_freq=678
<idle>-0 [000] dn.3 53.899813: sched_load_avg_cpu: cpu=0 load_avg=385 util_avg=9 util_avg_pelt=115 util_avg_walt=9 util_avg_freq=121
kworker/u17:2-4204 [001] d..3 53.899830: sched_load_avg_cpu: cpu=1 load_avg=5445 util_avg=223 util_avg_pelt=139 util_avg_walt=223 util_avg_freq=175
kworker/u17:2-4204 [001] d..3 53.899836: sched_load_avg_cpu: cpu=1 load_avg=5445 util_avg=144 util_avg_pelt=139 util_avg_walt=144 util_avg_freq=175
kworker/u17:1-2763 [001] d..3 53.899853: sched_load_avg_cpu: cpu=1 load_avg=5445 util_avg=144 util_avg_pelt=139 util_avg_walt=144 util_avg_freq=175
kworker/u17:1-2763 [001] d..3 53.899858: sched_load_avg_cpu: cpu=1 load_avg=5445 util_avg=100 util_avg_pelt=139 util_avg_walt=100 util_avg_freq=175
adbd-2915 [000] d..3 53.899900: sched_load_avg_cpu: cpu=0 load_avg=385 util_avg=9 util_avg_pelt=115 util_avg_walt=9 util_avg_freq=121
adbd-2915 [000] d..3 53.899907: sched_load_avg_cpu: cpu=0 load_avg=385 util_avg=0 util_avg_pelt=115 util_avg_walt=0 util_avg_freq=121
adbd-2915 [000] d..3 53.899909: sched_load_avg_cpu: cpu=0 load_avg=385 util_avg=0 util_avg_pelt=115 util_avg_walt=0 util_avg_freq=121
mali-event-hnd-2919 [001] d..4 53.899910: sched_load_avg_cpu: cpu=3 load_avg=934 util_avg=0 util_avg_pelt=190 util_avg_walt=0 util_avg_freq=155

From the ftrace, we found that util_avg_walt keeps hitting 0 while util_avg_pelt and util_avg_freq stay at relatively large values.

Could you give some suggestions on this?

Thanks in advance.

8 years, 7 months


[PATCH v5 0/3] sched/fair: Introduce scaled capacity awareness in enqueue

by Rohit Jain

Changelog:
---------------------------------------------------------------------------
v1->v2:
* Changed the dynamic threshold calculation, as having global state can be avoided.

v2->v3:
* Split up the patch for find_idlest_cpu and select_idle_sibling code paths.

v3->v4:
* Rebased it to peterz's tree (apologies for wrong tree for v3)

v4->v5:
* Changed the threshold to 768 from 819 for easier shifts
* Changed the find_idlest_cpu code path to be simpler
* Changed the select_idle_core code path to search for idlest+full_capacity core
* Added scaled capacity awareness to wake_affine_idle code path
---------------------------------------------------------------------------

During OLTP workload runs, threads can end up on CPUs with a lot of softIRQ activity, thus delaying progress. For more reliable and faster runs, if the system can spare it, these threads should be scheduled on CPUs with lower IRQ/RT activity.

Currently, the scheduler takes into account the original capacity of CPUs when providing 'hints' for the select_idle_sibling code path to return an idle CPU. However, the rest of the select_idle_* code paths remain capacity agnostic. Further, these code paths are only aware of the original capacity and not the capacity stolen by IRQ/RT activity.

This patch introduces capacity awareness in scheduler (CAS), which avoids CPUs which might have their capacities reduced (due to IRQ/RT activity) when trying to schedule threads (on the push side) in the system. This awareness has been added into the fair scheduling class.

It does so by using the following algorithm:
1) As in rt_avg, the scaled capacities are already calculated.
2) Any CPU which is running below 75% of its original capacity (the 768/1024 threshold) is considered to be running low on capacity.
3) During idle CPU search, if a CPU is found running low on capacity, it is skipped if better CPUs are available.
4) If none of the CPUs are better in terms of idleness and capacity, then the low-capacity CPU is considered to be the best available CPU.

The performance numbers:
---------------------------------------------------------------------------
CAS shows up to 1.5% improvement on x86 when running a 'SELECT' database workload.

For microbenchmark results, I used hackbench running in process mode, along with ping running on CPU 0, 1 and 2 as:
'ping -l 10000 -q -s 10 -f hostX'

The results below should be read as:
* 'Baseline without ping' is how the workload would've behaved if there was no IRQ activity.
* Compare 'Baseline with ping' and 'Baseline without ping' to see the effect of ping
* Compare 'Baseline with ping' and 'CAS with ping' to see the improvement CAS can give over baseline

Following are the runtime(s) with hackbench and ping activity as described above (lower is better), on a 44 core 2 socket x86 machine:

+---------------+------+--------+--------+
|Num.           |CAS   |Baseline|Baseline|
|Tasks          |with  |with    |without |
|(groups of 40) |ping  |ping    |ping    |
+---------------+------+--------+--------+
|               |Mean  |Mean    |Mean    |
+---------------+------+--------+--------+
|1              | 0.55 | 0.59   | 0.53   |
|2              | 0.66 | 0.81   | 0.51   |
|4              | 0.99 | 1.16   | 0.95   |
|8              | 1.92 | 1.93   | 1.88   |
|16             | 3.24 | 3.26   | 3.15   |
|32             | 5.93 | 5.98   | 5.68   |
|64             | 11.55| 11.94  | 10.89  |
+---------------+------+--------+--------+

Rohit Jain (3):
  sched/fair: Introduce scaled capacity awareness in find_idlest_cpu code path
  sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path
  sched/fair: Introduce scaled capacity awareness in wake_affine_idle code path

 kernel/sched/fair.c | 66 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 53 insertions(+), 13 deletions(-)

--
2.7.4
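Steps 2 to 4 of the algorithm in the cover letter above can be sketched in a few lines of standalone C. This is a simplified illustration rather than the patch itself: struct cpu_sketch and both function names are hypothetical, idle-state exit latency is ignored, and only the 768/1024 threshold is taken from the v5 changelog.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU snapshot; field names are for illustration only. */
struct cpu_sketch {
	bool idle;
	unsigned long capacity;		/* left after IRQ/RT activity */
	unsigned long capacity_orig;	/* full capacity of the CPU */
};

/* Step 2: a CPU below 768/1024 of its original capacity counts as
 * running low on capacity. */
static bool full_capacity_sketch(const struct cpu_sketch *c)
{
	return c->capacity >= ((c->capacity_orig * 768) >> 10);
}

/*
 * Steps 3 and 4: prefer the first idle CPU with full capacity; otherwise
 * fall back to the idle CPU with the most remaining capacity.
 * Returns the index of the chosen CPU, or -1 if no CPU qualifies.
 */
static int pick_idle_cpu_sketch(const struct cpu_sketch *cpus, size_t n)
{
	int backup = -1;
	unsigned long backup_cap = 0;

	for (size_t i = 0; i < n; i++) {
		if (!cpus[i].idle)
			continue;
		if (full_capacity_sketch(&cpus[i]))
			return (int)i;
		if (cpus[i].capacity > backup_cap) {
			backup = (int)i;
			backup_cap = cpus[i].capacity;
		}
	}
	return backup;
}
```

With original capacity 1024 the threshold works out to 768, so an idle CPU left with 900 is preferred over one left with 500, and if only CPUs with 500 and 700 are idle, the one with 700 is used as the fallback.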

8 years, 7 months


Re: [Eas-dev] [PATCH 1/3] sched/fair: Introduce scaled capacity awareness in find_idlest_cpu code path

by Atish Patra

Minor nit: Patch version missing in the subject line.

Other than that:
Reviewed-by: Atish Patra <atish.patra(a)oracle.com>

Regards,
Atish

----- Original Message -----
From: rohit.k.jain(a)oracle.com
To: linux-kernel(a)vger.kernel.org, eas-dev(a)lists.linaro.org
Cc: peterz(a)infradead.org, mingo(a)redhat.com, joelaf(a)google.com, atish.patra(a)oracle.com, vincent.guittot(a)linaro.org, dietmar.eggemann(a)arm.com, morten.rasmussen(a)arm.com
Sent: Saturday, October 7, 2017 6:44:47 PM GMT -06:00 US/Canada Central
Subject: [PATCH 1/3] sched/fair: Introduce scaled capacity awareness in find_idlest_cpu code path

While looking for idle CPUs for a waking task, we should also account for the delays caused due to the bandwidth reduction by RT/IRQ tasks.

This patch does that by trying to find a higher capacity CPU with minimum wake up latency.

Signed-off-by: Rohit Jain <rohit.k.jain(a)oracle.com>
---
 kernel/sched/fair.c | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0107280..eaede50 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5579,6 +5579,11 @@ static unsigned long capacity_orig_of(int cpu)
 	return cpu_rq(cpu)->cpu_capacity_orig;
 }

+static inline bool full_capacity(int cpu)
+{
+	return (capacity_of(cpu) >= (capacity_orig_of(cpu)*768 >> 10));
+}
+
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -5865,8 +5870,10 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	unsigned long load, min_load = ULONG_MAX;
 	unsigned int min_exit_latency = UINT_MAX;
 	u64 latest_idle_timestamp = 0;
+	unsigned int backup_cap = 0;
 	int least_loaded_cpu = this_cpu;
 	int shallowest_idle_cpu = -1;
+	int shallowest_idle_cpu_backup = -1;
 	int i;

 	/* Check if we have any choice: */
@@ -5876,6 +5883,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_span(group), &p->cpus_allowed) {
 		if (idle_cpu(i)) {
+			int idle_candidate = -1;
 			struct rq *rq = cpu_rq(i);
 			struct cpuidle_state *idle = idle_get_state(rq);
 			if (idle && idle->exit_latency < min_exit_latency) {
@@ -5886,7 +5894,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 				 */
 				min_exit_latency = idle->exit_latency;
 				latest_idle_timestamp = rq->idle_stamp;
-				shallowest_idle_cpu = i;
+				idle_candidate = i;
 			} else if ((!idle || idle->exit_latency == min_exit_latency) &&
 				   rq->idle_stamp > latest_idle_timestamp) {
 				/*
@@ -5895,7 +5903,16 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 				 * a warmer cache.
 				 */
 				latest_idle_timestamp = rq->idle_stamp;
-				shallowest_idle_cpu = i;
+				idle_candidate = i;
+			}
+
+			if (idle_candidate != -1) {
+				if (full_capacity(idle_candidate)) {
+					shallowest_idle_cpu = idle_candidate;
+				} else if (capacity_of(idle_candidate) > backup_cap) {
+					shallowest_idle_cpu_backup = idle_candidate;
+					backup_cap = capacity_of(idle_candidate);
+				}
 			}
 		} else if (shallowest_idle_cpu == -1) {
 			load = weighted_cpuload(cpu_rq(i));
@@ -5906,7 +5923,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 		}
 	}

-	return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
+	if (shallowest_idle_cpu != -1)
+		return shallowest_idle_cpu;
+
+	return (shallowest_idle_cpu_backup != -1 ?
+			shallowest_idle_cpu_backup : least_loaded_cpu);
 }

 #ifdef CONFIG_SCHED_SMT
--
2.7.4

8 years, 7 months


[PATCH V3] Per Sched domain over utilization

by Thara Gopinath

The current implementation of overutilization aborts energy aware scheduling if any cpu in the system is over-utilized. This patch introduces an over-utilization flag per sched domain level instead of a single system-wide flag. Load balancing is done at the sched domain where any of the cpus is over-utilized. If energy aware scheduling is enabled and no cpu in a sched domain is overutilized, load balancing is skipped for that sched domain and energy aware scheduling continues at that level.

The implementation takes advantage of the shared sched_domain structure that is common across all the sched domains at a level. The new flag is placed in this structure so that all the sched domains at the same level share the flag. In case of an overutilized cpu, the flag gets set at the level1 sched_domain. The flag at the parent sched_domain level gets set in either of the two following scenarios.
1. There is a misfit task in one of the cpus in this sched_domain.
2. The total utilization of the domain is greater than the domain capacity.

The flag is cleared if no cpu in a sched domain is overutilized.

This implementation can still have corner scenarios with respect to misfit tasks. For example, consider a sched group with n cpus and n+1 70%-utilized tasks. Ideally this is a case for load balance to happen in a parent sched domain. But neither is the total group utilization high enough for the load balance to be triggered in the parent domain, nor is there a cpu with a single overutilized task so that a load balance is triggered in a parent domain. But again this could be a purely academic scenario, as during task wake up these tasks will be placed more appropriately.

Signed-off-by: Thara Gopinath <thara.gopinath(a)linaro.org>
---
V2->V3:
- Rebased on latest kernel.
- The previous check for misfit task is replaced with the newly introduced rq->misfit_task flag.

V1->V2:
- Removed overutilized flag from sched_group structure.
- In case of misfit task, it is ensured that a load balance is triggered in a parent sched domain with asymmetric cpu capacities.

 include/linux/sched/topology.h |   1 +
 kernel/sched/fair.c            | 137 +++++++++++++++++++++++++++++++++--------
 kernel/sched/sched.h           |   3 -
 kernel/sched/topology.c        |   8 +--
 4 files changed, 117 insertions(+), 32 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 3137750..ae44044 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -88,6 +88,7 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+	bool		overutilized;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a9ac67c..34bdfeb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4791,6 +4791,29 @@ static inline void hrtick_update(struct rq *rq)
 
 static bool cpu_overutilized(int cpu);
 
+static bool
+is_sd_overutilized(struct sched_domain *sd)
+{
+	if (sd)
+		return sd->shared->overutilized;
+	else
+		return false;
+}
+
+static void
+set_sd_overutilized(struct sched_domain *sd)
+{
+	if (sd)
+		sd->shared->overutilized = true;
+}
+
+static void
+clear_sd_overutilized(struct sched_domain *sd)
+{
+	if (sd)
+		sd->shared->overutilized = false;
+}
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -4800,6 +4823,7 @@ static void
 enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
+	struct sched_domain *sd;
 	struct sched_entity *se = &p->se;
 	int task_new = !(flags & ENQUEUE_WAKEUP);
 
@@ -4843,9 +4867,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 	if (!se) {
 		add_nr_running(rq, 1);
-		if (!task_new && !rq->rd->overutilized &&
-		    cpu_overutilized(rq->cpu))
-			rq->rd->overutilized = true;
+		rcu_read_lock();
+		sd = rcu_dereference(rq->sd);
+		if (!task_new && !is_sd_overutilized(sd) &&
+		    cpu_overutilized(rq->cpu))
+			set_sd_overutilized(sd);
+		rcu_read_unlock();
 	}
 	hrtick_update(rq);
 }
@@ -6276,8 +6303,7 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
 	unsigned long max_spare = 0;
 	struct sched_domain *sd;
 
-	rcu_read_lock();
-
+	/* The rcu lock is/should be held in the caller function */
 	sd = rcu_dereference(per_cpu(sd_ea, prev_cpu));
 
 	if (!sd)
@@ -6315,8 +6341,6 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
 	}
 
 unlock:
-	rcu_read_unlock();
-
 	if (energy_cpu == prev_cpu && !cpu_overutilized(prev_cpu))
 		return prev_cpu;
 
@@ -6350,10 +6374,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
 	}
 
-	if (energy_aware() && !(cpu_rq(prev_cpu)->rd->overutilized))
-		return select_energy_cpu_brute(p, prev_cpu);
-
 	rcu_read_lock();
+	sd = rcu_dereference(cpu_rq(prev_cpu)->sd);
+	if (energy_aware() &&
+	    !is_sd_overutilized(sd)) {
+		new_cpu = select_energy_cpu_brute(p, prev_cpu);
+		goto unlock;
+	}
+
+	sd = NULL;
+
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			break;
@@ -6418,6 +6448,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		}
 		/* while loop will break here if sd == NULL */
 	}
+
+unlock:
 	rcu_read_unlock();
 
 	return new_cpu;
@@ -7478,6 +7510,7 @@ struct sd_lb_stats {
 	struct sched_group *local;	/* Local group in this sd */
 	unsigned long total_load;	/* Total load of all groups in sd */
 	unsigned long total_capacity;	/* Total capacity of all groups in sd */
+	unsigned long total_util;	/* Total util of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */
 
 	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
@@ -7497,6 +7530,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 		.local = NULL,
 		.total_load = 0UL,
 		.total_capacity = 0UL,
+		.total_util = 0UL,
 		.busiest_stat = {
 			.avg_load = 0UL,
 			.sum_nr_running = 0,
@@ -7792,7 +7826,7 @@ group_type group_classify(struct sched_group *group,
 static inline void update_sg_lb_stats(struct lb_env *env,
 			struct sched_group *group, int load_idx,
 			int local_group, struct sg_lb_stats *sgs,
-			bool *overload, bool *overutilized)
+			bool *overload, bool *overutilized, bool *misfit_task)
 {
 	unsigned long load;
 	int i, nr_running;
@@ -7831,8 +7865,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		    !sgs->group_misfit_task && rq->misfit_task)
 			sgs->group_misfit_task = capacity_of(i);
 
-		if (cpu_overutilized(i))
+		if (cpu_overutilized(i)) {
 			*overutilized = true;
+			/*
+			 * If the cpu is overutilized and if there is only one
+			 * current task in cfs runqueue, it is potentially a misfit
+			 * task.
+			 */
+			if (rq->misfit_task)
+				*misfit_task = true;
+		}
 	}
 
 	/* Adjust by relative CPU capacity of the group */
@@ -7974,12 +8016,12 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
  */
 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	struct sched_domain *child = env->sd->child;
+	struct sched_domain *child = env->sd->child, *sd;
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
 	int load_idx, prefer_sibling = 0;
-	bool overload = false, overutilized = false;
+	bool overload = false, overutilized = false, misfit_task = false;
 
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
@@ -8001,7 +8043,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		}
 
 		update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
-						&overload, &overutilized);
+						&overload, &overutilized,
+						&misfit_task);
 
 		if (local_group)
 			goto next_group;
@@ -8032,6 +8075,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		/* Now, start updating sd_lb_stats */
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
+		sds->total_util += sgs->group_util;
 
 		sg = sg->next;
 	} while (sg != env->sd->groups);
@@ -8045,14 +8089,45 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		/* update overload indicator if we are at root domain */
 		if (env->dst_rq->rd->overload != overload)
 			env->dst_rq->rd->overload = overload;
+	}
 
-		/* Update over-utilization (tipping point, U >= 0) indicator */
-		if (env->dst_rq->rd->overutilized != overutilized)
-			env->dst_rq->rd->overutilized = overutilized;
-	} else {
-		if (!env->dst_rq->rd->overutilized && overutilized)
-			env->dst_rq->rd->overutilized = true;
+	if (overutilized)
+		set_sd_overutilized(env->sd);
+	else
+		clear_sd_overutilized(env->sd);
+
+	/*
+	 * If there is a misfit task in one cpu in this sched_domain
+	 * it is likely that the imbalance cannot be sorted out among
+	 * the cpu's in this sched_domain. In this case set the
+	 * overutilized flag at the parent sched_domain.
+	 */
+	if (misfit_task) {
+
+		sd = env->sd->parent;
+
+		/*
+		 * In case of a misfit task, load balance at the parent
+		 * sched domain level will make sense only if the the cpus
+		 * have a different capacity. If cpus at a domain level have
+		 * the same capacity, the misfit task cannot be well
+		 * accomodated in any of the cpus and there in no point in
+		 * trying a load balance at this level
+		 */
+		while (sd) {
+			if (sd->flags & SD_ASYM_CPUCAPACITY) {
+				set_sd_overutilized(sd);
+				break;
+			}
+			sd = sd->parent;
+		}
 	}
+
+	/* If the domain util is greater that domain capacity, load balancing
+	 * needs to be done at the next sched domain level as well
+	 */
+	if (sds->total_capacity * 1024 < sds->total_util * capacity_margin)
+		set_sd_overutilized(env->sd->parent);
 }
 
 /**
@@ -8279,8 +8354,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 */
 	update_sd_lb_stats(env, &sds);
 
-	if (energy_aware() && !env->dst_rq->rd->overutilized)
-		goto out_balanced;
+	if (energy_aware()) {
+		if (!is_sd_overutilized(env->sd))
+			goto out_balanced;
+	}
 
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
@@ -9164,6 +9241,11 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
+		if (energy_aware()) {
+			if (!is_sd_overutilized(sd))
+				continue;
+		}
+
 		/*
 		 * Decay the newidle max times here because this is a regular
 		 * visit to all the domains. Decay ~1% per second.
@@ -9466,6 +9548,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &curr->se;
+	struct sched_domain *sd;
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
@@ -9477,8 +9560,12 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	rq->misfit_task = !task_fits_capacity(curr, capacity_of(rq->cpu));
 
-	if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr)))
-		rq->rd->overutilized = true;
+	rcu_read_lock();
+	sd = rcu_dereference(rq->sd);
+	if (!is_sd_overutilized(sd) &&
+	    cpu_overutilized(task_cpu(curr)))
+		set_sd_overutilized(sd);
+	rcu_read_unlock();
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8d27d5b..1604ef2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -585,9 +585,6 @@ struct root_domain {
 	/* Indicate more than one runnable task for any CPU */
 	bool overload;
 
-	/* Indicate one or more cpus over-utilized (tipping point) */
-	bool overutilized;
-
 	/*
 	 * The bit corresponding to a CPU gets set here if such CPU has more
 	 * than one runnable -deadline task (as it is below for RT tasks).
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 263e549..e5ba6fc 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1040,11 +1040,11 @@ sd_init(struct sched_domain_topology_level *tl,
 	 * For all levels sharing cache; connect a sched_domain_shared
 	 * instance.
 	 */
-	if (sd->flags & SD_SHARE_PKG_RESOURCES) {
-		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
-		atomic_inc(&sd->shared->ref);
+	sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
+	atomic_inc(&sd->shared->ref);
+
+	if (sd->flags & SD_SHARE_PKG_RESOURCES)
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
-	}
 
 	sd->private = sdd;
 
--
2.1.4
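The flag propagation rules at the end of update_sd_lb_stats() in this patch can be tried out in user space. The sketch below is a reduced model, not the kernel implementation: struct domain and update_domain() are hypothetical names, the asym_cpucapacity field stands in for the SD_ASYM_CPUCAPACITY flag, and CAPACITY_MARGIN = 1280 (roughly 20% headroom) is an assumed value for capacity_margin, which this patch does not define.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed value for capacity_margin: 1280/1024, i.e. about 20% headroom. */
#define CAPACITY_MARGIN 1280UL

/* Hypothetical, much-reduced stand-in for one sched domain level. */
struct domain {
	struct domain *parent;
	bool asym_cpucapacity;	/* models the SD_ASYM_CPUCAPACITY flag */
	bool overutilized;	/* models sd->shared->overutilized */
};

/*
 * Model of the three rules applied after the group statistics are known:
 * 1. the domain's own flag follows whether any of its cpus is overutilized;
 * 2. a misfit task marks the nearest ancestor with asymmetric capacities;
 * 3. if total utilization exceeds capacity (with margin), the direct
 *    parent is marked as well.
 */
static void update_domain(struct domain *sd, bool cpu_overutilized,
			  bool misfit_task, unsigned long total_util,
			  unsigned long total_capacity)
{
	assert(sd != NULL);

	sd->overutilized = cpu_overutilized;

	if (misfit_task) {
		for (struct domain *p = sd->parent; p; p = p->parent) {
			if (p->asym_cpucapacity) {
				p->overutilized = true;
				break;
			}
		}
	}

	if (sd->parent &&
	    total_capacity * 1024 < total_util * CAPACITY_MARGIN)
		sd->parent->overutilized = true;
}
```

Under the assumed margin, a domain with total capacity 2048 marks its parent once total utilization exceeds about 1638 (2048 * 1024 / 1280).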

8 years, 7 months


[PATCH v4 0/3] sched/fair: Introduce scaled capacity awareness in enqueue

by Rohit Jain

During OLTP workload runs, threads can end up on CPUs with a lot of softIRQ activity, thus delaying progress. For more reliable and faster runs, if the system can spare it, these threads should be scheduled on CPUs with lower IRQ/RT activity.

Currently, the scheduler takes into account the original capacity of CPUs when providing 'hints' for select_idle_sibling code path to return an idle CPU. However, the rest of the select_idle_* code paths remain capacity agnostic. Further, these code paths are only aware of the original capacity and not the capacity stolen by IRQ/RT activity.

This patch introduces capacity awareness in scheduler (CAS) which avoids CPUs which might have their capacities reduced (due to IRQ/RT activity) when trying to schedule threads (on the push side) in the system. This awareness has been added into the fair scheduling class.

It does so by using the following algorithm:
1) As in rt_avg the scaled capacities are already calculated.
2) Any CPU which is running below 80% capacity is considered running low on capacity.
3) During idle CPU search if a CPU is found running low on capacity, it is skipped if better CPUs are available.
4) If none of the CPUs are better in terms of idleness and capacity, then the low-capacity CPU is considered to be the best available CPU.

The performance numbers:
---------------------------------------------------------------------------
CAS shows up to 1.5% improvement on x86 when running 'SELECT' database workload.

I also used barrier.c (open_mp code) as a micro-benchmark. It does a number of iterations and barrier sync at the end of each for loop.

I was also running ping on CPU 0 as:
'ping -l 10000 -q -s 10 -f host2'

The results below should be read as:
* 'Baseline without ping' is how the workload would've behaved if there was no IRQ activity.
* Compare 'Baseline with ping' and 'Baseline without ping' to see the effect of ping
* Compare 'Baseline with ping' and 'CAS with ping' to see the improvement CAS can give over baseline

The program (barrier.c) can be found at:
http://www.spinics.net/lists/kernel/msg2506955.html

Following are the results for the iterations per second with this micro-benchmark (higher is better), on a 20 core x86 machine:

+-------+----------------+----------------+------------------+
|Num.   |CAS             |Baseline        |Baseline without  |
|Threads|with ping       |with ping       |ping              |
+-------+-------+--------+-------+--------+-------+----------+
|       |Mean   |Std. Dev|Mean   |Std. Dev|Mean   |Std. Dev  |
+-------+-------+--------+-------+--------+-------+----------+
|1      | 511.7 | 6.9    | 508.3 | 17.3   | 514.6 | 4.7      |
|2      | 486.8 | 16.3   | 463.9 | 17.4   | 510.8 | 3.9      |
|4      | 466.1 | 11.7   | 451.4 | 12.5   | 489.3 | 4.1      |
|8      | 433.6 | 3.7    | 427.5 | 2.2    | 447.6 | 5.0      |
|16     | 391.9 | 7.9    | 385.5 | 16.4   | 396.2 | 0.3      |
|32     | 269.3 | 5.3    | 266.0 | 6.6    | 276.8 | 0.2      |
+-------+-------+--------+-------+--------+-------+----------+

Following are the runtime(s) with hackbench and ping activity as described above (lower is better), on a 20 core x86 machine:

+---------------+------+--------+--------+
|Num.           |CAS   |Baseline|Baseline|
|Tasks          |with  |with    |without |
|(groups of 40) |ping  |ping    |ping    |
+---------------+------+--------+--------+
|               |Mean  |Mean    |Mean    |
+---------------+------+--------+--------+
|1              | 0.97 | 0.97   | 0.68   |
|2              | 1.36 | 1.36   | 1.30   |
|4              | 2.57 | 2.57   | 1.84   |
|8              | 3.31 | 3.34   | 2.86   |
|16             | 5.63 | 5.71   | 4.61   |
|25             | 7.99 | 8.23   | 6.78   |
+---------------+------+--------+--------+

Changelog:
---------------------------------------------------------------------------
v1->v2:
* Changed the dynamic threshold calculation as having global state can be avoided.

v2->v3:
* Split up the patch for find_idlest_cpu and select_idle_sibling code paths.

v3->v4:
* Rebased it to peterz's tree (apologies for wrong tree for v3)

Previous discussion can be found at:
---------------------------------------------------------------------------
https://patchwork.kernel.org/patch/9741351/
https://lists.linaro.org/pipermail/eas-dev/2017-August/000933.html

Rohit Jain (3):
  sched/fair: Introduce scaled capacity awareness in find_idlest_cpu code path
  sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path
  ignore_this_patch: Fixing compilation error on Peter's tree

 kernel/sched/fair.c      | 81 +++++++++++++++++++++++++++++++++++++++---------
 kernel/time/tick-sched.c |  1 +
 2 files changed, 68 insertions(+), 14 deletions(-)

--
2.7.4

8 years, 7 months


WALT panic on Hikey960

by Leo Yan

Hi Vikram, Joonwoo,

[ + EAS mailing list ]

On Hikey960 with EASv1.3, I have hit a WALT panic many times. The bug is reported from the two functions below; the log follows as well.

Before I dig into this, could you give some suggestions? Or is there an existing fix for this? Thanks in advance.

void walt_dec_cumulative_runnable_avg(struct rq *rq,
				      struct task_struct *p)
{
	rq->cumulative_runnable_avg -= p->ravg.demand;
	BUG_ON((s64)rq->cumulative_runnable_avg < 0);
}

static void
fixup_cumulative_runnable_avg(struct rq *rq,
			      struct task_struct *p, u64 new_task_load)
{
	s64 task_load_delta = (s64)new_task_load - task_load(p);

	rq->cumulative_runnable_avg += task_load_delta;
	if ((s64)rq->cumulative_runnable_avg < 0)
		panic("cra less than zero: tld: %lld, task_load(p) = %u\n",
		      task_load_delta, task_load(p));
}

--- Panic Log ---
[ 1108.441865] init: Untracked pid 15425 exited with status 0
[ 1108.657107] ------------[ cut here ]------------
[ 1108.661746] kernel BUG at kernel/sched/walt.c:109!
[ 1108.666538] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP [ 1108.672026] CPU: 1 PID: 1248 Comm: kschedfreq:0 Not tainted 4.4.78-07635-g0255026 #45 [ 1108.679851] Hardware name: HiKey960 (DT) [ 1108.683770] task: ffffffc0b166c080 ti: ffffffc0b0e64000 task.ti: ffffffc0b0e64000 [ 1108.691261] PC is at walt_dec_cumulative_runnable_avg+0x40/0x44 [ 1108.697179] LR is at dequeue_task_rt+0x40/0x8c [ 1108.701617] pc : [<ffffff8008112428>] lr : [<ffffff800810c82c>] pstate: 60000185 [ 1108.709007] sp : ffffffc0b0e67b90 [ 1108.712315] x29: ffffffc0b0e67b90 x28: 0000000000000001 [ 1108.717633] x27: ffffff8008bc4fc4 x26: ffffffc0bff13400 [ 1108.722948] x25: ffffffc0b166c6c8 x24: 0000000000000000 [ 1108.728263] x23: ffffff8009095000 x22: ffffffc0b166c080 [ 1108.733579] x21: ffffffc0bff13be8 x20: ffffffc0b166c080 [ 1108.738895] x19: ffffffc0bff13400 x18: 0000000000000000 [ 1108.744209] x17: 0000000000000000 x16: 0000000000000000 [ 1108.749524] x15: 0000000000000000 x14: 0000000000000000 [ 1108.754839] x13: 0000000000000000 x12: 0000000034d5d91d [ 1108.760156] x11: ffffff8008be13cc x10: 00000000000009d0 [ 1108.765471] x9 : ffffffc0b0e64000 x8 : ffffffc0b0e67ce0 [ 1108.770786] x7 : ffffffc0ae6cfe30 x6 : ffffff8009095000 [ 1108.776101] x5 : 0000000000000001 x4 : 00000040b6ea8000 [ 1108.781415] x3 : 0000000000000002 x2 : 0000000000000000 [ 1108.786730] x1 : fffffffffffedce2 x0 : 00000000000bc75e [ 1108.792047] [ 1108.792047] SP: 0xffffffc0b0e67b10: [ 1108.797006] 7b10 b166c080 ffffffc0 09095000 ffffff80 00000000 00000000 b166c6c8 ffffffc0 [ 1108.805223] 7b30 bff13400 ffffffc0 08bc4fc4 ffffff80 00000001 00000000 b0e67b90 ffffffc0 [ 1108.813440] 7b50 0810c82c ffffff80 b0e67b90 ffffffc0 08112428 ffffff80 60000185 00000000 [ 1108.821656] 7b70 b0e67ba0 ffffffc0 0810c538 ffffff80 ffffffff ffffffff 0810c574 ffffff80 [ 1108.829872] 7b90 b0e67bc0 ffffffc0 0810c82c ffffff80 bff13400 ffffffc0 0810c820 ffffff80 [ 1108.838090] 7bb0 bff13400 ffffffc0 b166c080 ffffffc0 b0e67bf0 ffffffc0 080eead8 
ffffff80 [ 1108.846306] 7bd0 bff13400 ffffffc0 0906b000 ffffff80 bff13400 ffffffc0 0906b000 ffffff80 [ 1108.854522] 7bf0 b0e67c20 ffffffc0 08bc4b80 ffffff80 bff13400 ffffffc0 08bc47e4 ffffff80 [ 1108.862741] [ 1108.862741] X1: 0xfffffffffffedc62: [ 1108.867700] dc60 ******** ******** ******** ******** ******** ******** ******** ******** [ 1108.875924] dc80 ******** ******** ******** ******** ******** ******** ******** ******** [ 1108.884140] dca0 ******** ******** ******** ******** ******** ******** ******** ******** [ 1108.892358] dcc0 ******** ******** ******** ******** ******** ******** ******** ******** [ 1108.900576] dce0 ******** ******** ******** ******** ******** ******** ******** ******** [ 1108.908793] dd00 ******** ******** ******** ******** ******** ******** ******** ******** [ 1108.917009] dd20 ******** ******** ******** ******** ******** ******** ******** ******** [ 1108.925227] dd40 ******** ******** ******** ******** ******** ******** ******** ******** [ 1108.933446] dd60 ******** ******** ******** ******** ******** ******** ******** ******** [ 1108.941666] [ 1108.941666] X7: 0xffffffc0ae6cfdb0: [ 1108.946625] fdb0 0000c350 00000000 00000001 00000000 00000000 00000000 ae6cfeb0 ffffffc0 [ 1108.954840] fdd0 00000000 00000000 00000001 00000000 00000000 00000000 0000c350 00000001 [ 1108.963055] fdf0 ae6cfe90 ffffffc0 0813dd98 ffffff80 b55c0418 0000007f 00000000 00000000 [ 1108.971271] fe10 ffffffff ffffffff b76b299c 0000007f ae6cfe60 ffffffc0 080efae0 00000001 [ 1108.979489] fe30 b0e67ce0 ffffffc0 00000000 00000000 00000000 00000000 0c0b8de9 00000102 [ 1108.987707] fe50 0c0aca99 00000102 0813c598 ffffff80 bff0ed40 ffffffc0 00000001 00000825 [ 1108.995923] fe70 08bc80e8 ffffff80 696c616d 6d656d2d 7275702d 00006567 b11f3100 ffffffc0 [ 1109.004140] fe90 00000000 00000000 08085f30 ffffff80 00000000 00000000 b6ee1020 0000007f [ 1109.012359] [ 1109.012359] X8: 0xffffffc0b0e67c60: [ 1109.017318] 7c60 b1092800 ffffffc0 091a8000 ffffff80 00000000 00000000 
00009f4c 00000000 [ 1109.025536] 7c80 b0e67ca0 ffffffc0 08bc82a8 ffffff80 b0e67d98 ffffffc0 00000100 00000000 [ 1109.033752] 7ca0 b0e67d40 ffffffc0 08bc8348 ffffff80 b0e67d98 ffffffc0 00000064 00000000 [ 1109.041970] 7cc0 b11bb580 ffffffc0 b0e67d30 ffffffc0 0808e7e4 ffffff80 b166c080 00000001 [ 1109.050186] 7ce0 bff0f2d1 ffffffc0 ae6cfe30 ffffffc0 bff0f170 ffffffc0 09828528 00000102 [ 1109.058405] 7d00 0980fe88 00000102 0813c598 ffffff80 bff0ed40 ffffffc0 00000001 000004e0 [ 1109.066621] 7d20 08bc829c ffffff80 6863736b 72666465 303a7165 00000000 b166c080 ffffffc0 [ 1109.074839] 7d40 b0e67d70 ffffffc0 08bc8068 ffffff80 026e40e0 00000000 001a13c8 00000000 [ 1109.083058] [ 1109.083058] X9: 0xffffffc0b0e63f80: [ 1109.088015] 3f80 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 1109.096232] 3fa0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 1109.104448] 3fc0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 1109.112665] 3fe0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 1109.120883] 4000 00000000 00000000 ffffffff ffffffff b166c080 ffffffc0 00000003 00000001 [ 1109.129100] 4020 57ac6e9d 00000000 32273028 0d0f1c33 10233816 201d111a 3b013e3c 532f2f26 [ 1109.137317] 4040 ae7a3648 ffffffc0 ae354db8 ffffffc0 ae7a36c0 ffffffc0 ae7a36c0 ffffffc0 [ 1109.145535] 4060 00000001 00000000 aca585e0 ffffffc0 aca584e0 ffffffc0 07fb7c71 00000000 [ 1109.153753] [ 1109.153753] X19: 0xffffffc0bff13380: [ 1109.158799] 3380 00000000[ 1109.160774] mali e82c0000.mali: Reset interrupt didn't reach CPU. Check interrupt assignments. 
[ 1109.169934] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 1109.177103] 33a0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 1109.185319] 33c0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 1109.193537] 33e0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 1109.201755] 3400 fab6faaf 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 1109.209971] 3420 00000000 00000000 00000000 00000000 00031548 00000001 00000000 00000000 [ 1109.218187] 3440 00000000 00000000 00000001 00000000 00000000 00000000 00000000 00000000 [ 1109.226402] 3460 00017d53 00000000 0002db81 00000000 00000000 00000000 00000000 00000000 [ 1109.234619] [ 1109.234619] X20: 0xffffffc0b166c000: [ 1109.239664] c000 0b030d00 08010211 100f0d05 07091214 040e0a0c ffffff00 00000001 00000000 [ 1109.247881] c020 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 1109.256097] c040 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 1109.264313] c060 00000000 00000010 00000000 00000000 ffffffff 0000003f ffffffff 0000003f [ 1109.272528] c080 00000002 00000000 b0e64000 ffffffc0 00000003 04208040 00000000 00000000 [ 1109.280745] c0a0 00000000 00000000 00000001 00000000 00031435 00000001 84b47180 ffffffc0 [ 1109.288960] c0c0 00000001 00000001 00000031 00000078 00000031 00000032 08be15c8 ffffff80 [ 1109.297178] c0e0 00000400 00000000 00400000 00000000 00000001 00000000 00000000 00000000 [ 1109.305395] [ 1109.305395] X21: 0xffffffc0bff13b68: [ 1109.310440] 3b68 bff13b60 ffffffc0 bff13b70 ffffffc0 bff13b70 ffffffc0 bff13b80 ffffffc0 [ 1109.318658] 3b88 bff13b80 ffffffc0 bff13b90 ffffffc0 bff13b90 ffffffc0 bff13ba0 ffffffc0 [ 1109.326875] 3ba8 bff13ba0 ffffffc0 bff13bb0 ffffffc0 bff13bb0 ffffffc0 00000000 00000064 [ 1109.335092] 3bc8 00000064 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 1109.343309] 3be8 bff13be8 ffffffc0 bff13be8 ffffffc0 
00000000 00000000 00000000 00000000 [ 1109.351526] 3c08 00000000 00000000 0810cb88 ffffff80 00020002 00000000 00000000 00000000 [ 1109.359742] 3c28 006303e4 00000000 389fd980 00000000 f80df80d 00000000 00000000 00000000 [ 1109.367959] 3c48 bff13400 ffffffc0 091c42a0 ffffff80 00000000 00000000 00000000 00000000 [ 1109.376176] [ 1109.376176] X22: 0xffffffc0b166c000: [ 1109.381221] c000 0b030d00 08010211 100f0d05 07091214 040e0a0c ffffff00 00000001 00000000 [ 1109.389439] c020 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 1109.397656] c040 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 1109.405873] c060 00000000 00000010 00000000 00000000 ffffffff 0000003f ffffffff 0000003f [ 1109.414090] c080 00000002 00000000 b0e64000 ffffffc0 00000003 04208040 00000000 00000000 [ 1109.422307] c0a0 00000000 00000000 00000001 00000000 00031435 00000001 84b47180 ffffffc0 [ 1109.430524] c0c0 00000001 00000001 00000031 00000078 00000031 00000032 08be15c8 ffffff80 [ 1109.438741] c0e0 00000400 00000000 00400000 00000000 00000001 00000000 00000000 00000000 [ 1109.446959] [ 1109.446959] X25: 0xffffffc0b166c648: [ 1109.452004] c648 b166c648 ffffffc0 b166c648 ffffffc0 b10c8910 ffffffc0 b10c8910 ffffffc0 [ 1109.460220] c668 b0e67ea0 ffffffc0 00000000 00000000 00000000 00000000 00000000 00000000 [ 1109.468436] c688 00000073 00000000 00000000 00000000 00000073 00000000 00000000 00000000 [ 1109.476653] c6a8 00000000 00000000 00000000 00000000 00000000 00000000 00009814 00000000 [ 1109.484869] c6c8 00000004 00000000 09ec2856 00000001 09ec2856 00000001 00000000 00000000 [ 1109.493087] c6e8 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 1109.501305] c708 b166c708 ffffffc0 b166c708 ffffffc0 b166c718 ffffffc0 b166c718 ffffffc0 [ 1109.509521] c728 b166c728 ffffffc0 b166c728 ffffffc0 00000000 00000000 b0e0b880 ffffffc0 [ 1109.517740] [ 1109.517740] X26: 0xffffffc0bff13380: [ 1109.522785] 3380 00000000 00000000 00000000 
00000000 00000000 00000000 00000000 00000000 [ 1109.531001] 33a0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 1109.539217] 33c0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 1109.547434] 33e0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 1109.555651] 3400 fab6faaf 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 1109.563868] 3420 00000000 00000000 00000000 00000000 00031548 00000001 00000000 00000000 [ 1109.572085] 3440 00000000 00000000 00000001 00000000 00000000 00000000 00000000 00000000 [ 1109.580301] 3460 00017d53 00000000 0002db81 00000000 00000000 00000000 00000000 00000000 [ 1109.588521] [ 1109.588521] X29: 0xffffffc0b0e67b10: [ 1109.593567] 7b10 b166c080 ffffffc0 09095000 ffffff80 00000000 00000000 b166c6c8 ffffffc0 [ 1109.601784] 7b30 bff13400 ffffffc0 08bc4fc4 ffffff80 00000001 00000000 b0e67b90 ffffffc0 [ 1109.610001] 7b50 0810c82c ffffff80 b0e67b90 ffffffc0 08112428 ffffff80 60000185 00000000 [ 1109.618218] 7b70 b0e67ba0 ffffffc0 0810c538 ffffff80 ffffffff ffffffff 0810c574 ffffff80 [ 1109.626436] 7b90 b0e67bc0 ffffffc0 0810c82c ffffff80 bff13400 ffffffc0 0810c820 ffffff80 [ 1109.634653] 7bb0 bff13400 ffffffc0 b166c080 ffffffc0 b0e67bf0 ffffffc0 080eead8 ffffff80 [ 1109.642869] 7bd0 bff13400 ffffffc0 0906b000 ffffff80 bff13400 ffffffc0 0906b000 ffffff80 [ 1109.651086] 7bf0 b0e67c20 ffffffc0 08bc4b80 ffffff80 bff13400 ffffffc0 08bc47e4 ffffff80 [ 1109.659303] [ 1109.660788] Process kschedfreq:0 (pid: 1248, stack limit = 0xffffffc0b0e64020) [ 1109.668006] Stack: (0xffffffc0b0e67b90 to 0xffffffc0b0e68000) [ 1109.673748] 7b80: ffffffc0b0e67bc0 ffffff800810c82c [ 1109.681574] 7ba0: ffffffc0bff13400 ffffff800810c820 ffffffc0bff13400 ffffffc0b166c080 [ 1109.689399] 7bc0: ffffffc0b0e67bf0 ffffff80080eead8 ffffffc0bff13400 ffffff800906b000 [ 1109.697225] 7be0: ffffffc0bff13400 ffffff800906b000 ffffffc0b0e67c20 ffffff8008bc4b80 [ 1109.705052] 7c00: 
ffffffc0bff13400 ffffff8008bc47e4 ffffffc000000001 ffffffc0b166c080 [ 1109.712877] 7c20: ffffffc0b0e67c80 ffffff8008bc4fc4 ffffffc0b0e64000 0000000000000001 [ 1109.720703] 7c40: 00000000000186a0 ffffffc0b0e64000 ffffff8008be0000 00000000001a13c8 [ 1109.728529] 7c60: ffffffc0b1092800 ffffff80091a8000 0000000000000000 0000000000009f4c [ 1109.736355] 7c80: ffffffc0b0e67ca0 ffffff8008bc82a8 ffffffc0b0e67d98 0000000000000100 [ 1109.744181] 7ca0: ffffffc0b0e67d40 ffffff8008bc8348 ffffffc0b0e67d98 0000000000000064 [ 1109.752006] 7cc0: ffffffc0b11bb580 ffffffc0b0e67d30 ffffff800808e7e4 00000001b166c080 [ 1109.759832] 7ce0: ffffffc0bff0f2d1 ffffffc0ae6cfe30 ffffffc0bff0f170 0000010209828528 [ 1109.767658] 7d00: 000001020980fe88 ffffff800813c598 ffffffc0bff0ed40 000004e000000001 [ 1109.775483] 7d20: ffffff8008bc829c 726664656863736b 00000000303a7165 ffffffc0b166c080 [ 1109.783309] 7d40: ffffffc0b0e67d70 ffffff8008bc8068 00000000026e40e0 00000000001a13c8 [ 1109.791135] 7d60: 00000000000186a0 0000000108142a54 ffffffc0b0e67da0 ffffff800811a770 [ 1109.798961] 7d80: 000001020980f324 ffffffc0b0e64000 ffffffc0b11bb580 00000000026e40e0 [ 1109.806787] 7da0: ffffffc0b0e67e20 ffffff80080e1a14 ffffffc0b11bb400 ffffffc0b0e64000 [ 1109.814613] 7dc0: ffffff80091c41c8 ffffffc0b1092800 ffffff800811a6b0 0000000000000000 [ 1109.822438] 7de0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 1109.830263] 7e00: ffffffc0b11bb400 ffffffc0b0e64000 ffffff80091c41c8 ffffffc000000032 [ 1109.838089] 7e20: 0000000000000000 ffffff8008085ed0 ffffff80080e192c ffffffc0b11bb400 [ 1109.845914] 7e40: 0000000000000000 0000000000000000 0000000000000000 ffffff80080efe18 [ 1109.853740] 7e60: 0000000000000000 0000000000000000 0000000000000000 ffffffc0b1092800 [ 1109.861565] 7e80: ffffffc000000000 ffffff8000000000 ffffffc0b0e67e90 ffffffc0b0e67e90 [ 1109.869392] 7ea0: 0000000000000000 ffffff8000000000 ffffffc0b0e67eb0 ffffffc0b0e67eb0 [ 1109.877217] 7ec0: 0000000000000000 0000000000000000 
0000000000000000 0000000000000000 [ 1109.885042] 7ee0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 1109.892867] 7f00: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 1109.900692] 7f20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 1109.908518] 7f40: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 1109.916344] 7f60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 1109.924169] 7f80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 1109.931994] 7fa0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 1109.939820] 7fc0: 0000000000000000 0000000000000005 0000000000000000 0000000000000000 [ 1109.947646] 7fe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 1109.955470] Call trace: [ 1109.957911] Exception stack(0xffffffc0b0e679c0 to 0xffffffc0b0e67af0) [ 1109.964348] 79c0: ffffffc0bff13400 0000008000000000 ffffffc0b0e67b90 ffffff8008112428 [ 1109.972173] 79e0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 1109.979999] 7a00: 0000000000000000 0000000000000000 0000000000000000 0000000000000009 [ 1109.987824] 7a20: 0000000000000010 0000000000000010 0000000000000000 000000000000068a [ 1109.995650] 7a40: ffffffc0b0e67a90 ffffff8008bc8bac 0000000000000180 ffffff800928c688 [ 1110.003476] 7a60: 00000000000bc75e fffffffffffedce2 0000000000000000 0000000000000002 [ 1110.011301] 7a80: 00000040b6ea8000 0000000000000001 ffffff8009095000 ffffffc0ae6cfe30 [ 1110.019127] 7aa0: ffffffc0b0e67ce0 ffffffc0b0e64000 00000000000009d0 ffffff8008be13cc [ 1110.026952] 7ac0: 0000000034d5d91d 0000000000000000 0000000000000000 0000000000000000 [ 1110.034777] 7ae0: 0000000000000000 0000000000000000 [ 1110.039651] [<ffffff8008112428>] walt_dec_cumulative_runnable_avg+0x40/0x44 [ 1110.046609] [<ffffff800810c82c>] dequeue_task_rt+0x40/0x8c [ 1110.052093] [<ffffff80080eead8>] deactivate_task+0x98/0xbc [ 
1110.057580] [<ffffff8008bc4b80>] __schedule+0x44c/0x7c0 [ 1110.062800] [<ffffff8008bc4fc4>] schedule+0x40/0xa0 [ 1110.067674] [<ffffff8008bc82a8>] schedule_hrtimeout_range_clock+0x94/0x100 [ 1110.074544] [<ffffff8008bc8348>] schedule_hrtimeout_range+0x34/0x40 [ 1110.080806] [<ffffff8008bc8068>] usleep_range+0x4c/0x58 [ 1110.086028] [<ffffff800811a770>] cpufreq_sched_thread+0xc0/0x1e4 [ 1110.092032] [<ffffff80080e1a14>] kthread+0xe8/0xfc [ 1110.096821] [<ffffff8008085ed0>] ret_from_fork+0x10/0x40 [ 1110.102129] Code: b7f80081 f9400bf3 a8c37bfd d65f03c0 (d4210000)

8 years, 7 months

5
15

[RFC PATCH 0/2] sched: Introduce CPU soft affinity for processes

by Rohit Jain

For multi-tenancy there are currently mechanisms to share the system's CPUs by time-sharing (e.g. CFS) and by dividing the system into 'rigid' containers using system calls like sched_setaffinity. There is no way in the Linux kernel today to serve flexible workloads that need access to the whole system while still maintaining a notion of preferred CPUs.

This patch introduces a new CPU mask, 'cpus_preferred', within the task_struct structure and gives applications a way to specify a set of CPUs on which they would like to run. The scheduler will try to honor the application's request as best it can; however, if the scheduler finds that there are no idle CPUs within the preferred list, it shall run the application anywhere within the system.

This can be used to design soft containers, which allow a tenant to use more capacity than it is entitled to when others aren't fully using theirs. The advantage of space-sharing the system as opposed to time-sharing is that you maintain more cache locality when the soft containers are being utilized. Since this behavior is observed on every scheduling decision, the application gets to run on its preferred CPUs as long as it does not overuse its specified resources.

The design of soft containers still needs more user-space code; however, this is what is needed from the kernel.

FAQs:

Q) What if I set "hard" affinity after I set a preference by using soft affinity?
A: Hard affinity will override any previous soft affinity.

Q) What if my application had already specified a "hard" affinity? Can I still provide a set of CPUs for soft affinity?
A: Yes, it will work as long as the new soft affinity is a subset of the "hard" affinity.

Q) Can I have mutually exclusive hard and soft affinities?
A: No, soft affinity is always a subset of hard affinity.

Note: Ignore the kernel/sched/tick-sched.c change. It is just fixing a build error on Peter's tree.

Rohit Jain (2):
  sched: Introduce new flags to sched_setaffinity to support soft affinity.
  sched: Actual changes after adding SCHED_SOFT_AFFINITY to make it work with the scheduler

 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/init_task.h              |   1 +
 include/linux/sched.h                  |   4 +-
 include/linux/syscalls.h               |   3 +
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/sched.h             |   3 +
 kernel/compat.c                        |   2 +-
 kernel/sched/core.c                    | 167 ++++++++++++++++++++++++++++-----
 kernel/sched/cpudeadline.c             |   4 +-
 kernel/sched/cpupri.c                  |   4 +-
 kernel/sched/fair.c                    | 116 +++++++++++++++++------
 kernel/time/tick-sched.c               |   1 +
 12 files changed, 250 insertions(+), 60 deletions(-)

--
2.7.4
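The placement rule described in this cover letter (prefer an idle CPU from the soft-affinity set, otherwise run anywhere the hard affinity allows) can be illustrated with a small user-space model. This is only a sketch and not the code from the patches: the 64-bit mask type, the function name pick_cpu() and the last-resort choice of the first allowed CPU when nothing is idle are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpu_mask_t;

static bool mask_has(cpu_mask_t mask, int cpu)
{
	return (mask >> cpu) & 1u;
}

/*
 * Pick a CPU for a task.
 *   allowed   - hard affinity mask
 *   preferred - soft affinity mask, must be a subset of 'allowed'
 *   idle      - CPUs that are currently idle
 * Returns the chosen CPU, or -1 if 'allowed' has no CPU below nr_cpus.
 */
static int pick_cpu(cpu_mask_t allowed, cpu_mask_t preferred,
		    cpu_mask_t idle, int nr_cpus)
{
	int cpu, fallback = -1;

	assert(nr_cpus > 0 && nr_cpus <= 64);
	/* Soft affinity is always a subset of hard affinity. */
	assert((preferred & ~allowed) == 0);

	/* First pass: an idle CPU inside the preferred set wins. */
	for (cpu = 0; cpu < nr_cpus; cpu++)
		if (mask_has(preferred, cpu) && mask_has(idle, cpu))
			return cpu;

	/*
	 * Second pass: no idle preferred CPU, so take any idle allowed CPU.
	 * Assumption: if nothing is idle, fall back to the first allowed CPU.
	 */
	for (cpu = 0; cpu < nr_cpus; cpu++) {
		if (!mask_has(allowed, cpu))
			continue;
		if (mask_has(idle, cpu))
			return cpu;
		if (fallback < 0)
			fallback = cpu;
	}

	return fallback;
}
```

With allowed = 0xF and preferred = 0x3, the task stays on CPU 0 or 1 while either is idle and spills onto CPU 2 or 3 only when both preferred CPUs are busy, which is the "soft container" behavior the cover letter describes.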

8 years, 7 months

3
6

Energy Model Question

by Zachariah Kennedy

Good day!

Been really enjoying watching development of EAS on Android Gerrit and in other places, so first let me say thanks again for all the great work. With that, I had a couple of questions about idle states in the energy model.

I am mainly curious about the number of tuples used for the idle states. On the Pixel they used "2 2 0" for the CPU idle states. I would think that would be for:

  wfi
  fpc-def
  fpc

But for the cluster idle states they used "0 0", whereas I would think there should be 3 tuples since the cluster idle states are:

  l2-wfi
  l2-gdhs
  l2-fpc

So why only the two "0 0" for the Pixel EM?

Now this leads up to my ultimate question, which is about the SD835 (from the OnePlus 5). For CPU idle states we have:

  wfi
  ret
  pc

But we disable idle_enable for "ret". So would this mean that in my own EM for the OP5 I should only have 2 tuples?

And for cluster idle states we have:

  l2-wfi
  l2-dynret
  l2-ret
  l2-pc

But on the OP5 we disable l2-dynret and l2-ret. So once again, should I only have 2 tuples for the number of idle states used, or a tuple for each physical idle state possible?

Kind Regards,
Zachariah Kennedy

8 years, 8 months

5
5

[RFC PATCH v3 0/2] sched: Introduce scaled capacity awareness in enqueue

by Rohit Jain

During OLTP workload runs, threads can end up on CPUs with a lot of softIRQ activity, thus delaying progress. For more reliable and faster runs, if the system can spare it, these threads should be scheduled on CPUs with lower IRQ/RT activity.

Currently, the scheduler takes into account the original capacity of CPUs when providing 'hints' for the select_idle_sibling code path to return an idle CPU. However, the rest of the select_idle_* code paths remain capacity agnostic. Further, these code paths are only aware of the original capacity and not the capacity stolen by IRQ/RT activity.

This patch introduces capacity awareness in the scheduler (CAS), which avoids CPUs that might have their capacities reduced (due to IRQ/RT activity) when trying to schedule threads (on the push side) in the system. This awareness has been added to the fair scheduling class.

It does so using the following algorithm:
1) As in rt_avg, the scaled capacities are already calculated.
2) Any CPU which is running below 80% capacity is considered to be running low on capacity.
3) During the idle CPU search, if a CPU is found running low on capacity, it is skipped if better CPUs are available.
4) If none of the CPUs are better in terms of idleness and capacity, then the low-capacity CPU is considered to be the best available CPU.

The performance numbers*:
---------------------------------------------------------------------------
CAS shows up to 1.5% improvement on x86 when running a 'SELECT' database workload.

I also used barrier.c (open_mp code) as a micro-benchmark. It does a number of iterations and a barrier sync at the end of each for loop.

I was also running ping on CPU 0 as:
'ping -l 10000 -q -s 10 -f host2'

The results below should be read as:

* 'Baseline without ping' is how the workload would've behaved if there was no IRQ activity.
* Compare 'Baseline with ping' and 'Baseline without ping' to see the effect of ping
* Compare 'Baseline with ping' and 'CAS with ping' to see the improvement CAS can give over baseline

The program (barrier.c) can be found at:
http://www.spinics.net/lists/kernel/msg2506955.html

Following are the results for the iterations per second with this micro-benchmark (higher is better), on a 20 core x86 machine:

+-------+----------------+----------------+------------------+
|Num.   |CAS             |Baseline        |Baseline without  |
|Threads|with ping       |with ping       |ping              |
+-------+-------+--------+-------+--------+-------+----------+
|       |Mean   |Std. Dev|Mean   |Std. Dev|Mean   |Std. Dev  |
+-------+-------+--------+-------+--------+-------+----------+
|1      | 511.7 | 6.9    | 508.3 | 17.3   | 514.6 | 4.7      |
|2      | 486.8 | 16.3   | 463.9 | 17.4   | 510.8 | 3.9      |
|4      | 466.1 | 11.7   | 451.4 | 12.5   | 489.3 | 4.1      |
|8      | 433.6 | 3.7    | 427.5 | 2.2    | 447.6 | 5.0      |
|16     | 391.9 | 7.9    | 385.5 | 16.4   | 396.2 | 0.3      |
|32     | 269.3 | 5.3    | 266.0 | 6.6    | 276.8 | 0.2      |
+-------+-------+--------+-------+--------+-------+----------+

Following are the runtime(s) with hackbench and ping activity as described above (lower is better), on a 20 core x86 machine:

+---------------+------+--------+--------+
|Num.           |CAS   |Baseline|Baseline|
|Tasks          |with  |with    |without |
|(groups of 40) |ping  |ping    |ping    |
+---------------+------+--------+--------+
|               |Mean  |Mean    |Mean    |
+---------------+------+--------+--------+
|1              | 0.97 | 0.97   | 0.68   |
|2              | 1.36 | 1.36   | 1.30   |
|4              | 2.57 | 2.57   | 1.84   |
|8              | 3.31 | 3.34   | 2.86   |
|16             | 5.63 | 5.71   | 4.61   |
|25             | 7.99 | 8.23   | 6.78   |
+---------------+------+--------+--------+

*Performance numbers for ARM:
---------------------------------------------------------------------------
I was asked to show the efficacy on ARM in the v2 review; however, I am having some difficulty getting access to an ARM machine. Would it be possible for someone to try this out on ARM?

Changelog:
---------------------------------------------------------------------------
v1->v2:
* Changed the dynamic threshold calculation, since having global state can be avoided.

v2->v3:
* Split up the patch for find_idlest_cpu and select_idle_sibling code paths.

Previous discussion can be found at:
---------------------------------------------------------------------------
https://patchwork.kernel.org/patch/9741351/

Rohit Jain (2):
  sched: Introduce scaled capacity awareness in find_idlest_cpu code path
  sched: Introduce scaled capacity awareness in select_idle_sibling code path

 kernel/sched/fair.c | 80 +++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 66 insertions(+), 14 deletions(-)

--
2.7.4
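The four-step algorithm in this cover letter can be sketched in user-space C as below. This is an illustrative model under stated assumptions, not the code from the series: struct cpu_state, low_on_capacity() and pick_idle_cpu() are hypothetical names, and the 80% test is written as a plain integer comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU snapshot used only for this illustration. */
struct cpu_state {
	bool idle;
	unsigned long capacity_orig;	/* capacity with no IRQ/RT pressure */
	unsigned long capacity;		/* capacity left after IRQ/RT activity */
};

/* Step 2: a CPU below 80% of its original capacity is "low on capacity". */
static bool low_on_capacity(const struct cpu_state *c)
{
	return c->capacity * 5 < c->capacity_orig * 4;
}

/*
 * Steps 3 and 4: return the first idle CPU that is not low on capacity.
 * If every idle CPU is low on capacity, return the idle CPU with the most
 * remaining capacity. Returns -1 when no CPU is idle.
 */
static int pick_idle_cpu(const struct cpu_state *cpus, size_t n)
{
	unsigned long backup_cap = 0;
	int backup = -1;
	size_t i;

	assert(cpus != NULL);

	for (i = 0; i < n; i++) {
		if (!cpus[i].idle)
			continue;
		if (!low_on_capacity(&cpus[i]))
			return (int)i;
		if (cpus[i].capacity > backup_cap) {
			backup_cap = cpus[i].capacity;
			backup = (int)i;
		}
	}

	return backup;
}
```

For example, with three CPUs where CPU 0 is idle at capacity 700/1024, CPU 1 is busy and CPU 2 is idle at 900/1024, the sketch skips CPU 0 (below 80%) and picks CPU 2.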

8 years, 8 months

1
2

Linux Users List

by alyssa.healy＠forematica.com

Hi,

We would like to learn your interest in acquiring our recently updated Linux Users List which helps you to improve your business campaign.

We have a verified list of MSPs with complete contact information like Company name, Website, Contact name (First, Middle, Last), Title, Direct email address, Phone number, Postal address, Industry, Employee size, Revenue size, Fax etc.

We have other Innovation information also like: Ubuntu, CentOS, Fedora, macOS Sierra, Chromium OS, Oracle Linux, Tizen, and many more.

Specialties: Ubuntu, CentOS, Fedora, macOS Sierra, Chromium OS, Oracle Linux, Tizen.

Please let me know if this is something of interest to you? I would love to share further details for your review.

Best Regards,
Alyssa Healy
Database Consultant- Global IT Growth

If you don’t wish to receive further emails, please reply with Remove.

8 years, 8 months

1
0

[RFC PATCH v2] sched: Introduce scaled capacity awareness in enqueue

by Rohit Jain

During OLTP workload runs, threads can end up on CPUs with a lot of softIRQ activity, thus delaying progress. For more reliable and faster runs, if the system can spare it, these threads should be scheduled on CPUs with lower IRQ/RT activity.

Currently, the scheduler takes into account the original capacity of CPUs when providing 'hints' for the select_idle_sibling code path to return an idle CPU. However, the rest of the select_idle_* code paths remain capacity agnostic. Further, these code paths are only aware of the original capacity and not the capacity stolen by IRQ/RT activity.

This patch introduces capacity awareness in the scheduler (CAS), which avoids CPUs that might have their capacities reduced (due to IRQ/RT activity) when trying to schedule threads (on the push side) in the system. This awareness has been added to the fair scheduling class.

It does so using the following algorithm:
1) As in rt_avg, the scaled capacities are already calculated.
2) Any CPU which is running below 80% capacity is considered to be running low on capacity[*].
3) During the idle CPU search, if a CPU is found running low on capacity, it is skipped if better CPUs are available.
4) If none of the CPUs are better in terms of idleness and capacity, then the low-capacity CPU is considered to be the best available CPU.

The performance numbers:
---------------------------------------------------------------------------
CAS shows up to 1.5% improvement on x86 when running a 'SELECT' database workload.

I also used barrier.c (open_mp code) as a micro-benchmark. It does a number of iterations and a barrier sync at the end of each for loop.

I was also running ping on CPU 0 as:
'ping -l 10000 -q -s 10 -f host2'

The results below should be read as:

* 'Baseline without ping' is how the workload would've behaved if there was no IRQ activity.
* Compare 'Baseline with ping' and 'Baseline without ping' to see the effect of ping
* Compare 'Baseline with ping' and 'CAS with ping' to see the improvement CAS can give over baseline

The program (barrier.c) can be found at:
http://www.spinics.net/lists/kernel/msg2506955.html

Following are the results for the iterations per second with this micro-benchmark (higher is better), on a 20 core x86 machine:

+-------+----------------+----------------+------------------+
|Num.   |CAS             |Baseline        |Baseline without  |
|Threads|with ping       |with ping       |ping              |
+-------+-------+--------+-------+--------+-------+----------+
|       |Mean   |Std. Dev|Mean   |Std. Dev|Mean   |Std. Dev  |
+-------+-------+--------+-------+--------+-------+----------+
|1      | 511.7 | 6.9    | 508.3 | 17.3   | 514.6 | 4.7      |
|2      | 486.8 | 16.3   | 463.9 | 17.4   | 510.8 | 3.9      |
|4      | 466.1 | 11.7   | 451.4 | 12.5   | 489.3 | 4.1      |
|8      | 433.6 | 3.7    | 427.5 | 2.2    | 447.6 | 5.0      |
|16     | 391.9 | 7.9    | 385.5 | 16.4   | 396.2 | 0.3      |
|32     | 269.3 | 5.3    | 266.0 | 6.6    | 276.8 | 0.2      |
+-------+-------+--------+-------+--------+-------+----------+

Following are the runtime(s) with hackbench and ping activity as described above (lower is better), on a 20 core x86 machine:

+---------------+------+--------+--------+
|Num.           |CAS   |Baseline|Baseline|
|Tasks          |with  |with    |without |
|(groups of 40) |ping  |ping    |ping    |
+---------------+------+--------+--------+
|               |Mean  |Mean    |Mean    |
+---------------+------+--------+--------+
|1              | 0.97 | 0.97   | 0.68   |
|2              | 1.36 | 1.36   | 1.30   |
|4              | 2.57 | 2.57   | 1.84   |
|8              | 3.31 | 3.34   | 2.86   |
|16             | 5.63 | 5.71   | 4.61   |
|25             | 7.99 | 8.23   | 6.78   |
+---------------+------+--------+--------+

[*] Question (RFC part):
---------------------------------------------------------------------------
In the previous discussion of this patch, the threshold used to decide whether a CPU is running low on capacity was calculated dynamically. In the tests I have done, 80% seems to be a good threshold. Would it be OK to choose a fixed cutoff?
Changelog:
---------------------------------------------------------------------------
v1->v2:
* Changed the dynamic threshold calculation, since having global state can be avoided.

Previous discussion can be found at:
---------------------------------------------------------------------------
https://patchwork.kernel.org/patch/9741351/

Signed-off-by: Rohit Jain <rohit.k.jain(a)oracle.com>
---
 kernel/sched/fair.c | 80 +++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 66 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c95880e..3c26c13 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5298,6 +5298,11 @@ static unsigned long cpu_avg_load_per_task(int cpu)
 	return 0;
 }
 
+static inline bool full_capacity(int cpu)
+{
+	return (capacity_of(cpu) >= (capacity_orig_of(cpu)*819 >> 10));
+}
+
 static void record_wakee(struct task_struct *p)
 {
 	/*
@@ -5516,9 +5521,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 {
 	unsigned long load, min_load = ULONG_MAX;
 	unsigned int min_exit_latency = UINT_MAX;
+	unsigned int backup_cap = 0;
 	u64 latest_idle_timestamp = 0;
 	int least_loaded_cpu = this_cpu;
 	int shallowest_idle_cpu = -1;
+	int shallowest_idle_cpu_backup = -1;
 	int i;
 
 	/* Check if we have any choice: */
@@ -5538,7 +5545,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 				 */
 				min_exit_latency = idle->exit_latency;
 				latest_idle_timestamp = rq->idle_stamp;
-				shallowest_idle_cpu = i;
+				if (full_capacity(i)) {
+					shallowest_idle_cpu = i;
+				} else if (capacity_of(i) > backup_cap) {
+					shallowest_idle_cpu_backup = i;
+					backup_cap = capacity_of(i);
+				}
 			} else if ((!idle || idle->exit_latency == min_exit_latency) &&
 				   rq->idle_stamp > latest_idle_timestamp) {
 				/*
@@ -5547,7 +5559,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 				 * a warmer cache.
 				 */
 				latest_idle_timestamp = rq->idle_stamp;
-				shallowest_idle_cpu = i;
+				if (full_capacity(i)) {
+					shallowest_idle_cpu = i;
+				} else if (capacity_of(i) > backup_cap) {
+					shallowest_idle_cpu_backup = i;
+					backup_cap = capacity_of(i);
+				}
 			}
 		} else if (shallowest_idle_cpu == -1) {
 			load = weighted_cpuload(i);
@@ -5558,7 +5575,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 		}
 	}
 
-	return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
+	if (shallowest_idle_cpu != -1)
+		return shallowest_idle_cpu;
+
+	return (shallowest_idle_cpu_backup != -1 ?
+		shallowest_idle_cpu_backup : least_loaded_cpu);
 }
 
 #ifdef CONFIG_SCHED_SMT
@@ -5620,7 +5641,9 @@ void __update_idle_core(struct rq *rq)
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int core, cpu;
+	int core, cpu, rcpu, rcpu_backup;
+	unsigned int backup_cap = 0;
+	rcpu = rcpu_backup = -1;
 
 	if (!static_branch_likely(&sched_smt_present))
 		return -1;
@@ -5637,10 +5660,20 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 			cpumask_clear_cpu(cpu, cpus);
 			if (!idle_cpu(cpu))
 				idle = false;
+
+			if (full_capacity(cpu)) {
+				rcpu = cpu;
+			} else if ((rcpu == -1) && (capacity_of(cpu) > backup_cap)) {
+				backup_cap = capacity_of(cpu);
+				rcpu_backup = cpu;
+			}
 		}
 
-		if (idle)
-			return core;
+		if (idle) {
+			if (rcpu == -1)
+				return (rcpu_backup != -1 ? rcpu_backup : core);
+			return rcpu;
+		}
 	}
 
 	/*
@@ -5656,7 +5689,8 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
  */
 static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
 {
-	int cpu;
+	int cpu, backup_cpu = -1;
+	unsigned int backup_cap = 0;
 
 	if (!static_branch_likely(&sched_smt_present))
 		return -1;
@@ -5664,11 +5698,17 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
 	for_each_cpu(cpu, cpu_smt_mask(target)) {
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
-		if (idle_cpu(cpu))
-			return cpu;
+		if (idle_cpu(cpu)) {
+			if (full_capacity(cpu))
+				return cpu;
+			if (capacity_of(cpu) > backup_cap) {
+				backup_cap = capacity_of(cpu);
+				backup_cpu = cpu;
+			}
+		}
 	}
 
-	return -1;
+	return backup_cpu;
 }
 
 #else /* CONFIG_SCHED_SMT */
@@ -5697,6 +5737,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 time, cost;
 	s64 delta;
 	int cpu, nr = INT_MAX;
+	int backup_cpu = -1;
+	unsigned int backup_cap = 0;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -5727,10 +5769,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 			return -1;
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
-		if (idle_cpu(cpu))
-			break;
+		if (idle_cpu(cpu)) {
+			if (full_capacity(cpu)) {
+				backup_cpu = -1;
+				break;
+			} else if (capacity_of(cpu) > backup_cap) {
+				backup_cap = capacity_of(cpu);
+				backup_cpu = cpu;
+			}
+		}
 	}
 
+	if (backup_cpu >= 0)
+		cpu = backup_cpu;
 	time = local_clock() - time;
 	cost = this_sd->avg_scan_cost;
 	delta = (s64)(time - cost) / 8;
@@ -5747,13 +5798,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	struct sched_domain *sd;
 	int i;
 
-	if (idle_cpu(target))
+	if (idle_cpu(target) && full_capacity(target))
 		return target;
 
 	/*
 	 * If the previous cpu is cache affine and idle, don't be stupid.
 	 */
-	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
+	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev)
+	    && full_capacity(prev))
 		return prev;
 
 	sd = rcu_dereference(per_cpu(sd_llc, target));
--
2.7.4
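The full_capacity() helper in this patch expresses the fixed 80% cutoff as a multiply by 819 followed by a right shift of 10, i.e. 819/1024 (about 0.7998), which avoids a division. The user-space sketch below reproduces just that arithmetic so the rounding can be checked by hand. The names capacity_threshold() and full_capacity_model() are hypothetical, and the capacities are passed in directly rather than read with capacity_of()/capacity_orig_of().

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* 819 / 1024 is roughly 0.7998, so this returns about 80% of the input. */
static unsigned long capacity_threshold(unsigned long capacity_orig)
{
	/* Guard against overflow in the multiplication below. */
	assert(capacity_orig <= ULONG_MAX / 819UL);

	return (capacity_orig * 819UL) >> 10;
}

/* Same comparison as the patch's full_capacity(), with explicit inputs. */
static bool full_capacity_model(unsigned long capacity,
				unsigned long capacity_orig)
{
	return capacity >= capacity_threshold(capacity_orig);
}
```

For an original capacity of 1024 the threshold is exactly 819, so a CPU whose remaining capacity is 819 still counts as full capacity while one at 818 does not. Because the shift truncates, the threshold for an original capacity of 1000 is 799 rather than 800.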

8 years, 8 months

3
6

[PATCH V5 0/2] sched: cpufreq: Allow remote callbacks

by Viresh Kumar

With Android UI and benchmarks, the latency of the cpufreq response to certain scheduling events can become very critical. Currently, callbacks into cpufreq governors are only made from the scheduler if the target CPU of the event is the same as the current CPU. This means there are certain situations where a target CPU may not run the cpufreq governor for some time.

One testcase [1] to show this behavior is where a task starts running on CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the system is configured such that new tasks should receive maximum demand initially, this should result in CPU0 increasing frequency immediately. Because of the above-mentioned limitation, though, this does not occur.

This series updates the scheduler core to call the cpufreq callbacks for remote CPUs as well and updates the registered hooks to handle that.

This is tested with a couple of use cases (Android: hackbench, recentfling, galleryfling, vellamo, Ubuntu: hackbench) on an ARM hikey board (64 bit octa-core, single policy). Only galleryfling showed minor improvements, while the others didn't show much deviation.

The reason is that this patch only targets a corner case, where all of the following must be true to improve performance, and that doesn't happen too often with these tests:

- Task is migrated to another CPU.
- The task has high demand, and should take the target CPU to higher OPPs.
- And the target CPU doesn't call into the cpufreq governor until the next tick.

Rebased over: pm/linux-next

V4->V5:
- Drop cpu field from "struct update_util_data" and add it in "struct sugov_cpu" instead.
- Can't have separate patches now because of the above change and so merged all the patches from V4 into a single patch.
- Add a comment suggested by PeterZ.
- Commit log of 1/2 is improved to contain more details.
- A new patch (which was posted during V1) is also added to take care of platforms where any CPU can do DVFS on behalf of any other CPU, even if they are part of different cpufreq policies. This has been requested by Saravana several times already and, as the series is quite straightforward now, I decided to include it.

V3->V4:
- Respect iowait boost flag and util updates for all the remote callbacks.
- Minor updates in commit log of 2/3.

V2->V3:
- Rearranged/merged patches as suggested by Rafael (looks much better now)
- Also handle new hook added to intel-pstate driver.
- The final code remains the same as V2, except for the above hook.

V1->V2:
- Don't support remote callbacks for unshared cpufreq policies.
- Don't support remote callbacks where local CPU isn't part of the target CPU's cpufreq policy.
- Dropped dvfs_possible_from_any_cpu flag.

--
viresh

[1] http://pastebin.com/7LkMSRxE

Viresh Kumar (2):
  sched: cpufreq: Allow remote cpufreq callbacks
  cpufreq: Process remote callbacks from any CPU if the platform permits

 drivers/cpufreq/cpufreq-dt.c       |  1 +
 drivers/cpufreq/cpufreq_governor.c |  3 +++
 drivers/cpufreq/intel_pstate.c     |  8 ++++++++
 include/linux/cpufreq.h            | 23 +++++++++++++++++++++++
 kernel/sched/cpufreq_schedutil.c   | 31 ++++++++++++++++++++++++++-----
 kernel/sched/deadline.c            |  2 +-
 kernel/sched/fair.c                |  8 +++++---
 kernel/sched/rt.c                  |  2 +-
 kernel/sched/sched.h               | 10 ++--------
 9 files changed, 70 insertions(+), 18 deletions(-)

--
2.13.0.71.gd7076ec9c9cb
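The rule added by the second patch in V5 (a remote CPU may process the callback if it shares the cpufreq policy with the target, or if the platform lets any CPU do DVFS for any other) can be modeled as below. This is a hedged sketch, not the code from the series: struct remote_policy and can_do_remote_dvfs() are hypothetical names for this illustration, the policy's CPUs are modeled as a 64-bit mask, and the flag name is borrowed from the V1->V2 changelog entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the parts of a cpufreq policy used here. */
struct remote_policy {
	uint64_t cpus;				/* one bit per CPU in the policy */
	bool dvfs_possible_from_any_cpu;	/* platform capability flag */
};

/*
 * May 'this_cpu' change the frequency on behalf of a CPU in 'policy'?
 * Allowed when this_cpu belongs to the same policy, or when the platform
 * permits DVFS requests from any CPU.
 */
static bool can_do_remote_dvfs(const struct remote_policy *policy, int this_cpu)
{
	assert(policy != NULL);
	assert(this_cpu >= 0 && this_cpu < 64);

	if ((policy->cpus >> this_cpu) & 1u)
		return true;

	return policy->dvfs_possible_from_any_cpu;
}
```

On a two-cluster system with CPUs 0-3 in one policy, a callback for CPU 0 issued from CPU 5 would be dropped unless the flag is set, while one issued from CPU 2 is always processed.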

8 years, 9 months

6
10

EAS r1.3 for AOSP Common Kernel 4.4 and 4.9

by Chris Redpath

Hello EAS-dev!

ARM is pleased to announce the EAS r1.3 release. This is the next tick in our regular updates to EAS in AOSP, including documentation and testing updates. In particular this release is the first major update to EAS in Android Common Kernel 4.9.

Changes in EAS 1.3
* Validation on real devices and additional development boards (Hikey960)
* Increased test coverage
* Upstream schedutil backporting
* Schedutil is now the recommended CPUFreq governor
* General EAS refactoring improvements (find_best_target changes)
* android common kernel-4.9 brought to EAS equivalence with 4.4

Android Common Kernel 4.4:
https://android.googlesource.com/kernel/common/+/android-4.4

Android Common Kernel 4.9:
https://android-review.googlesource.com/#/c/444387/

Once merged into android-4.9, the gerrit web interface will tell you that the patches have been merged; however, the changeset link should stay active.

Documentation:
https://developer.arm.com/-/media/developer/developers/open-source/energy-a…

Specifically about schedutil:

We have backported schedutil patches up until 38d4ea229d which was included in v4.12. (https://github.com/torvalds/linux/commit/38d4ea229d25d30be6bf41bcd6cd663a58…) "cpufreq: schedutil: Trace frequency only if it has changed". The version included in android-4.9 includes backported patches to the same level. This brings schedutil in both versions of Android up to v4.12.

We have satisfied ourselves in testing that this version of schedutil works well enough to be used in place of schedfreq both for performance and energy usage.

EAS Updates:

We have done a large refactoring of find_best_target as it was beginning to become difficult to make further improvements without impacting other behaviors. The refactored version has exactly the same behaviour in the refactor commit, and it has allowed us to further refine the task of selecting a CPU during wakeup.

We added the ability to return a second target CPU from find_best_target, which is chosen using a different strategy. When the first target is not allowable due to the energy/performance trade-off not being good enough, we now check the alternative strategy as well (but only if the primary strategy fails).

A new tracepoint was added to help in understanding EAS task placement decisions - sched_find_best_target - which traces the task, schedtune flags and CPUs which were selected by find_best_target for energy evaluation.

More patches were added to improve system behavior with idle CPUs. We now prevent an idle CPU from holding the system in overutilized mode (if it was overutilized just before going into nohz mode), allowing EAS to handle task placements again sooner. In addition, when misfit tasks are present, we bypass some of the normal nohz balance rate-limiting to reduce the time needed for those tasks to be redistributed.

Finally, we added the ability for EAS to forecast the idle state which could potentially be selected under the utilization conditions when calculating the energy for a particular sched group. The forecast is intentionally simple as it is done during wakeup - we reserve the deepest idle state for completely idle groups and otherwise linearly map the group utilization to idle states. In previous versions, EAS used the current idle state when estimating energy. This change allows EAS to see the potential impact of moving the last task from one group to another and move tasks if appropriate.

Android-4.9:

android-4.9 has not yet had the same level of testing that android-4.4 has due to us having a limited set of platforms which can run a 4.9 kernel version. For most of this dev cycle we have only had access to Juno, and we have confirmed that our tests behave the same on 4.9 as they do on 4.4.

In the last week or two, the Hikey960 board has gained a usable BSP for running android-4.9, so we have also been testing that but it is too early to share those results.

We continue to develop EAS on AOSP in public. Please feel free to participate in testing patches, reviewing code and generally being a good open-source citizen.

Best Regards,
Chris Redpath
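The idle-state forecast mentioned in the EAS Updates part of this announcement (deepest state for a completely idle group, otherwise a linear mapping from group utilization to idle states) might look roughly like the sketch below. The announcement does not give the exact mapping, so this is a guess for illustration only: the function name forecast_idle_state(), the direction of the mapping (higher utilization gives a shallower state), the rounding and the clamping are all assumptions, and states are numbered from 0 (shallowest) to nr_idle_states - 1 (deepest).

```c
#include <assert.h>

/*
 * Forecast an idle-state index for a sched group.
 *   group_util     - summed utilization of the group
 *   group_cap      - summed capacity of the group (must be > 0)
 *   nr_idle_states - number of idle states in the energy model (must be > 0)
 */
static int forecast_idle_state(unsigned long group_util,
			       unsigned long group_cap, int nr_idle_states)
{
	unsigned long idx;

	assert(group_cap > 0 && nr_idle_states > 0);

	/* The deepest state is reserved for completely idle groups. */
	if (group_util == 0)
		return nr_idle_states - 1;

	if (nr_idle_states == 1)
		return 0;

	/* Utilization above capacity is treated as fully busy. */
	if (group_util > group_cap)
		group_util = group_cap;

	/* Assumed linear map: the more idle time, the deeper the state. */
	idx = (group_cap - group_util) * (unsigned long)(nr_idle_states - 1) /
	      group_cap;

	/* A group with any utilization never gets the deepest state. */
	if (idx > (unsigned long)(nr_idle_states - 2))
		idx = (unsigned long)(nr_idle_states - 2);

	return (int)idx;
}
```

With three idle states and a group capacity of 1024, a fully utilized group maps to state 0, a lightly loaded group maps to state 1, and only a group with zero utilization is forecast to reach state 2.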

8 years, 9 months

1
0

[PATCH V3 0/3] sched: cpufreq: Allow remote callbacks

by Viresh Kumar

Hi,

With Android UI and benchmarks, the latency of the cpufreq response to certain scheduling events can become very critical. Currently, callbacks into schedutil are only made from the scheduler if the target CPU of the event is the same as the current CPU. This means there are certain situations where a target CPU may not run schedutil for some time.

One testcase to show this behavior is where a task starts running on CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the system is configured such that new tasks should receive maximum demand initially, this should result in CPU0 increasing frequency immediately. Because of the above-mentioned limitation, though, this does not occur. This is verified using ftrace with the sample [1] application.

Maybe the ideal solution is to always allow remote callbacks, but that has its own challenges:

o There is no protection required for the single CPU per policy case today, and adding any kind of locking there, to supply remote callbacks, isn't really a good idea.

o If the local CPU isn't part of the same cpufreq policy as the target CPU, then we wouldn't be able to do fast switching at all and would have to use some kind of bottom half to schedule work on the target CPU to do the real switching. That may be overkill as well.

And so this series only allows remote callbacks for target CPUs that share the cpufreq policy with the local CPU.

This series is tested with a couple of use cases (Android: hackbench, recentfling, galleryfling, vellamo, Ubuntu: hackbench) on an ARM hikey board (64 bit octa-core, single policy). Only galleryfling showed minor improvements, while the others didn't show much deviation.

The reason is that this patchset only targets a corner case, where all of the following must be true to improve performance, and that doesn't happen too often with these tests:

- Task is migrated to another CPU.
- The task has maximum demand initially, and should take the CPU to higher OPPs.
- And the target CPU doesn't call into schedutil until the next tick.

V2->V3:
- Rearranged/merged patches as suggested by Rafael (looks much better now)
- Also handle new hook added to intel-pstate driver.
- The final code remains the same as V2, except for the above hook.

V1->V2:
- Don't support remote callbacks for unshared cpufreq policies.
- Don't support remote callbacks where local CPU isn't part of the target CPU's cpufreq policy.
- Dropped dvfs_possible_from_any_cpu flag.

--
viresh

[1] http://pastebin.com/7LkMSRxE

Viresh Kumar (3):
  sched: cpufreq: Allow remote cpufreq callbacks
  cpufreq: schedutil: Process remote callback for shared policies
  cpufreq: governor: Process remote callback for shared policies

 drivers/cpufreq/cpufreq_governor.c |  4 ++++
 drivers/cpufreq/intel_pstate.c     |  8 ++++++++
 include/linux/sched/cpufreq.h      |  1 +
 kernel/sched/cpufreq.c             |  1 +
 kernel/sched/cpufreq_schedutil.c   | 19 ++++++++++++++-----
 kernel/sched/deadline.c            |  2 +-
 kernel/sched/fair.c                |  8 +++++---
 kernel/sched/rt.c                  |  2 +-
 kernel/sched/sched.h               | 10 ++--------
 9 files changed, 37 insertions(+), 18 deletions(-)

--
2.13.0.71.gd7076ec9c9cb

8 years, 9 months

7
23

[PATCH V4 0/3] sched: cpufreq: Allow remote callbacks

by Viresh Kumar

Hi,

I had some IRC discussions with Peter and V4 is based on his feedback. Here is the diff between V3 and V4:

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index d64754fb912e..df9aa1ee53ff 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -79,6 +79,10 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
 	s64 delta_ns;
 	bool update;
 
+	/* Allow remote callbacks only on the CPUs sharing cpufreq policy */
+	if (!cpumask_test_cpu(smp_processor_id(), sg_policy->policy->cpus))
+		return false;
+
 	if (sg_policy->work_in_progress)
 		return false;
 
@@ -225,10 +229,6 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 	unsigned int next_f;
 	bool busy;
 
-	/* Remote callbacks aren't allowed for policies which aren't shared */
-	if (smp_processor_id() != hook->cpu)
-		return;
-
 	sugov_set_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
 
@@ -298,14 +298,9 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
 {
 	struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
 	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
-	struct cpufreq_policy *policy = sg_policy->policy;
 	unsigned long util, max;
 	unsigned int next_f;
 
-	/* Allow remote callbacks only on the CPUs sharing cpufreq policy */
-	if (!cpumask_test_cpu(smp_processor_id(), policy->cpus))
-		return;
-
 	sugov_get_util(&util, &max, hook->cpu);
 
 	raw_spin_lock(&sg_policy->update_lock);

-------------------------8<-------------------------

With Android UI and benchmarks the latency of cpufreq response to certain scheduling events can become very critical. Currently, callbacks into schedutil are only made from the scheduler if the target CPU of the event is the same as the current CPU. This means there are certain situations where a target CPU may not run schedutil for some time.

One testcase to show this behavior is where a task starts running on CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the system is configured such that new tasks should receive maximum demand initially, this should result in CPU0 increasing frequency immediately. Because of the above mentioned limitation, though, this does not occur. This is verified using ftrace with the sample [1] application.

Maybe the ideal solution is to always allow remote callbacks, but that has its own challenges:

o There is no protection required for the single CPU per policy case today, and adding any kind of locking there, to support remote callbacks, isn't really a good idea.

o If the local CPU isn't part of the same cpufreq policy as the target CPU, then we wouldn't be able to do fast switching at all and would have to use some kind of bottom half to schedule work on the target CPU to do the real switching. That may be overkill as well.

And so this series only allows remote callbacks for target CPUs that share the cpufreq policy with the local CPU.

This series is tested with a couple of usecases (Android: hackbench, recentfling, galleryfling, vellamo; Ubuntu: hackbench) on an ARM hikey board (64 bit octa-core, single policy). Only galleryfling showed minor improvements, while the others didn't show much deviation.

The reason is that this patchset only targets a corner case, where all of the following must be true to improve performance, and that doesn't happen too often with these tests:

- Task is migrated to another CPU.
- The task has maximum demand initially, and should take the CPU to higher OPPs.
- And the target CPU doesn't call into schedutil until the next tick.

V3->V4:
- Respect iowait boost flag and util updates for all remote callbacks.
- Minor updates in commit log of 2/3.

V2->V3:
- Rearranged/merged patches as suggested by Rafael (looks much better now)
- Also handle new hook added to intel-pstate driver.
- The final code remains the same as V2, except for the above hook.

V1->V2:
- Don't support remote callbacks for unshared cpufreq policies.
- Don't support remote callbacks where local CPU isn't part of the target CPU's cpufreq policy.
- Dropped dvfs_possible_from_any_cpu flag.

--
viresh

Viresh Kumar (3):
  sched: cpufreq: Allow remote cpufreq callbacks
  cpufreq: schedutil: Process remote callback for shared policies
  cpufreq: governor: Process remote callback for shared policies

 drivers/cpufreq/cpufreq_governor.c |  4 ++++
 drivers/cpufreq/intel_pstate.c     |  8 ++++++++
 include/linux/sched/cpufreq.h      |  1 +
 kernel/sched/cpufreq.c             |  1 +
 kernel/sched/cpufreq_schedutil.c   | 14 +++++++++-----
 kernel/sched/deadline.c            |  2 +-
 kernel/sched/fair.c                |  8 +++++---
 kernel/sched/rt.c                  |  2 +-
 kernel/sched/sched.h               | 10 ++--------
 9 files changed, 32 insertions(+), 18 deletions(-)

--
2.13.0.71.gd7076ec9c9cb
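For readers who want to play with the V4 check outside the kernel, here is a minimal Python sketch of the order of checks that the diff above gives sugov_should_update_freq(). It is only a toy model: the Policy class and the should_update_freq() name are hypothetical stand-ins, and the rate-limit logic that follows these checks in the real function is left out.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    """Hypothetical stand-in for a cpufreq policy: CPUs sharing one clock."""
    cpus: frozenset
    work_in_progress: bool = False


def should_update_freq(policy: Policy, local_cpu: int) -> bool:
    """Toy model of the first two checks after the V4 change.

    The rate-limit test that follows in the kernel function is omitted.
    """
    # Allow remote callbacks only on the CPUs sharing the cpufreq policy.
    if local_cpu not in policy.cpus:
        return False
    # A frequency change is already pending, so do not start another one.
    if policy.work_in_progress:
        return False
    return True


# Example: a single policy covering CPUs 0-3.
shared = Policy(cpus=frozenset({0, 1, 2, 3}))
print(should_update_freq(shared, local_cpu=1))
print(should_update_freq(shared, local_cpu=4))
```

Because the check sits in the common helper, it covers both the single and the shared update paths, which is why the per-path checks could be dropped in the diff.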

8 years, 9 months

5
19

Re: [Eas-dev] A few EAS questions

by Zachariah Kennedy

Thanks guys for all the great info! I will take another look and see what I can do now that I have a better idea of how to go about it. Once again, it's appreciated that you guys are working out in the open. I know many others that are also keeping up with this mailing list. It has been a great learning experience. Kind Regards, Zachariah Kennedy

8 years, 10 months

1
0

A few EAS questions

by Zachariah Kennedy

Good day!

I have been following EAS development for some time now. Currently, I have implemented EAS in my own personal kernel for the Oneplus 3. It was largely based on the work done for the Pixel, and I am happy to say that I have gotten better performance and battery life compared to stock CAF with HMP. These questions are based on the ACK android-4.4 branch.

My first question is regarding tunings for EAS. I have seen many different values thrown around for a while, but I was curious what everyone close to the project is using for the schedutil up/down_rate_limit. Currently the stock values are 1000 (for up and down). Is this still the case for those testing the newest EAS changes using schedutil?

Also, what about stune? I know the stock Pixel is using 50 for top-app\schedtune.boost for interactions, but that turns out to be overkill with schedutil.

Lastly, I purchased the Oneplus 5 with the SD835 just so I can port EAS to it as well. I am looking forward to testing how EAS scales with the extra cores when compared to the SD820/821. One main question regarding the SD835: has anyone on the EAS-DEV list developed an energy model for the SD835 (MSM8998)? Even if it is just preliminary, I would appreciate any help with this. I do not have a proper energy meter yet.

This is something I am truly interested in. I love the openness of all the devs close to this project. I have become a better developer from participating and watching from the sidelines. Thanks guys for your hard work.

Kind Regards,
Zachariah Kennedy

8 years, 10 months

4
5

[PATCH V2 0/4] sched: cpufreq: Allow remote callbacks

by Viresh Kumar

Hi,

Here is the second version of this series. The first version [1] was sent several months back.

With Android UI and benchmarks the latency of cpufreq response to certain scheduling events can become very critical. Currently, callbacks into schedutil are only made from the scheduler if the target CPU of the event is the same as the current CPU. This means there are certain situations where a target CPU may not run schedutil for some time.

One testcase to show this behavior is where a task starts running on CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the system is configured such that new tasks should receive maximum demand initially, this should result in CPU0 increasing frequency immediately. Because of the above mentioned limitation, though, this does not occur. This is verified using ftrace with the sample [2] application.

Maybe the ideal solution is to always allow remote callbacks, but that has its own challenges:

o There is no protection required for the single CPU per policy case today, and adding any kind of locking there, to support remote callbacks, isn't really a good idea.

o If the local CPU isn't part of the same cpufreq policy as the target CPU, then we wouldn't be able to do fast switching at all and would have to use some kind of bottom half to schedule work on the target CPU to do the real switching. That may be overkill as well.

Taking the above challenges into consideration, this version proposes a much simpler diff compared to the first version. This series only allows remote callbacks for target CPUs that share the cpufreq policy with the local CPU. Locking is mostly in place everywhere and we wouldn't be required to change a lot of things.

This series is tested with a couple of usecases (Android: hackbench, recentfling, galleryfling, vellamo; Ubuntu: hackbench) on an ARM hikey board (64 bit octa-core, single policy). Only galleryfling showed minor improvements, while the others didn't show much deviation.

The reason is that this patchset only targets a corner case, where all of the following must be true to improve performance, and that doesn't happen too often with these tests:

- Task is migrated to another CPU.
- The task has maximum demand initially, and should take the CPU to higher OPPs.
- And the target CPU doesn't call into schedutil until the next tick.

V1->V2:
- Don't support remote callbacks for unshared cpufreq policies.
- Don't support remote callbacks where local CPU isn't part of the target CPU's cpufreq policy.
- Dropped dvfs_possible_from_any_cpu flag.

--
viresh

[1] https://marc.info/?l=linux-pm&m=148906015927796&w=2
[2] http://pastebin.com/7LkMSRxE

Steve Muckle (1):
  intel_pstate: Ignore scheduler cpufreq callbacks on remote CPUs

Viresh Kumar (3):
  cpufreq: schedutil: Process remote callback for shared policies
  cpufreq: governor: Process remote callback for shared policies
  sched: cpufreq: Enable remote sched cpufreq callbacks

 drivers/cpufreq/cpufreq_governor.c |  4 ++++
 drivers/cpufreq/intel_pstate.c     |  3 +++
 include/linux/sched/cpufreq.h      |  1 +
 kernel/sched/cpufreq.c             |  1 +
 kernel/sched/cpufreq_schedutil.c   | 19 ++++++++++++++-----
 kernel/sched/deadline.c            |  2 +-
 kernel/sched/fair.c                |  8 +++++---
 kernel/sched/rt.c                  |  2 +-
 kernel/sched/sched.h               | 10 ++--------
 9 files changed, 32 insertions(+), 18 deletions(-)

--
2.13.0.71.gd7076ec9c9cb
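The two V1->V2 restrictions can be written down as a small decision function. The Python below is only an illustrative sketch, not kernel code: remote_callback_allowed() and its arguments are hypothetical names, and it assumes the target CPU is always a member of the policy passed in.

```python
def remote_callback_allowed(local_cpu, target_cpu, policy_cpus):
    """Toy model of the V2 rules for one scheduler callback.

    local_cpu   -- CPU on which the scheduler event is being processed
    target_cpu  -- CPU whose utilization changed
    policy_cpus -- set of CPUs in the target CPU's cpufreq policy
                   (assumed to contain target_cpu)
    """
    # A local callback is the pre-existing case and is always processed.
    if local_cpu == target_cpu:
        return True
    # Unshared policy (one CPU per policy): no remote callbacks.
    if len(policy_cpus) == 1:
        return False
    # Shared policy: the local CPU must belong to the same policy.
    return local_cpu in policy_cpus


# The testcase from the cover letter: a task on CPU1 spawns a task on CPU0.
# On the hikey board all eight CPUs share a single policy.
hikey_policy = set(range(8))
print(remote_callback_allowed(local_cpu=1, target_cpu=0, policy_cpus=hikey_policy))
```

With one policy per cluster instead, the same wakeup coming from a CPU in the other cluster would be rejected, which is the case the cover letter describes as needing a bottom half.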

8 years, 10 months

4
10

[PATCH 00/13] EXPERIMENT: Power Optimization On Hikey960

by Leo Yan

### Basic Ideas ###

This patch set is rebased on EASv1.2 for power optimization on Hikey960.

ARM big.LITTLE systems come in many variants. Some platforms use the same CPU architecture for all clusters, but each cluster uses a different manufacturing process (or clock design), so the clusters can have different OPP settings. On this kind of system the 'LITTLE' and 'big' cores share the same architecture, but we can still get a power benefit from the 'LITTLE' core because it has better power efficiency than the 'big' core at the same OPP.

On the other hand, on this kind of system the power efficiency of the 'LITTLE' core usually doesn't differ hugely from that of the 'big' core. Furthermore, the final CPU power saving percentage is discounted twice, so when optimizing power for some scenarios the gain may not be as significant as expected; in other words, power optimization is not a priority issue on these platforms.

Regarding the discounting of CPU power at whole-system level: the first discount is related to the CPU duty cycle, and the second discount is related to the SoC/board baseline power. We can estimate the system-level CPU power saving percentage with the formula below:

  CPU power saving percentage: CPU_PS%
  CPU duty cycle: CPU_DC%
  Percentage of CPU power in whole-system power: CPU_SYS%

So the estimated power saving percentage is:

  CPU_PS% * CPU_DC% * CPU_SYS%

Let's look at one example. We have two CA53 clusters, and the 'LITTLE' cluster has 30% better power efficiency than the 'big' cluster, so CPU_PS% = 30%; video playback (1080p) has a CPU duty cycle of CPU_DC% = 30% (1 core); the ratio between CPU power and system power is CPU_SYS% = 15%. So the power we can save by using a 'LITTLE' core instead of a 'big' core is:

  CPU_PS% * CPU_DC% * CPU_SYS% = 30% * 30% * 15% = 1.35%

Naturally, 1.35% does not look like a significant improvement; but for some cases there is the concept of delta power, and if we compare the saving against the delta power we can see the importance of the power saving.
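As a quick way to try other inputs, the estimate above can be written as a few lines of Python. This is only a sketch of the arithmetic; the function name estimated_system_saving() is made up for illustration, and the inputs are fractions rather than percentages.

```python
def estimated_system_saving(cpu_ps, cpu_dc, cpu_sys):
    """System-level saving estimate: CPU_PS% * CPU_DC% * CPU_SYS%.

    All three arguments are fractions in the range 0..1:
      cpu_ps  -- CPU power saving percentage
      cpu_dc  -- CPU duty cycle
      cpu_sys -- share of CPU power in whole-system power
    """
    return cpu_ps * cpu_dc * cpu_sys


# The two-cluster CA53 example: 30% * 30% * 15%.
saving = estimated_system_saving(0.30, 0.30, 0.15)
print(f"{saving * 100:.2f}%")
```

Each factor is at most 1, so each one can only shrink the result; that is the "discounted twice" effect described above.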
Let's use video playback as an example. The delta power percentage (DP%) is:

  DP% = (video_playback_power - home_screen_power) / video_playback_power

So DP% is an important criterion for phone models when checking a scenario against the Android idle system. If DP% = 15% and the power saving percentage is 1.35%, then a 1.35% saving is meaningful when we compare 1.35% against 15%.

Another kind of big.LITTLE system has a big difference in power efficiency. If we review the power efficiency on Hikey960 using the coefficient (mW/MHz), in the worst case the CA73 is 6.2 times the CA53; if we select the median OPPs as reference, the CA73 is 2.42 times the CA53. So the highest CPU_PS%(max) = 86%, and the median CPU_PS%(median) = 70%. Let's check how much we can save on Hikey960 in theory for the case above:

  CPU_PS%(max) * CPU_DC% * CPU_SYS% = 86% * 30% * 15% = 3.87%
  CPU_PS%(median) * CPU_DC% * CPU_SYS% = 70% * 30% * 15% = 3.15%

We can see that a power saving percentage of 3.87%/3.15% is significant relative to DP% (15%). So on Hikey960, if there are scenarios with a high CPU duty cycle and sustained power consumption, power optimization is important for them.

### Implementations ###

Below are the detailed implementations of the optimizations:

a) Add back CPU selection based on power efficiency

EASv1.2 has the function find_best_target(). This function mainly focuses on the idlest CPU in order to reduce scheduling latency, but in some cases it fails to select the most power efficient CPU. So patches 0001/0002 mainly add back CPU selection based on power efficiency; we still keep find_best_target(), but it is only used for the "boost" and "prefer_idle" cases, and the power efficiency path is used for normal cases.

b) EAS core algorithm optimization

The EAS core algorithm should resolve the following problems:

1. Support more than two clusters;
2. Keep CPUs at the lowest possible OPP, and pack small tasks when the system is idle;
3. Directly migrate a woken task to the best CPU to meet the performance requirement; this means the task could be migrated to a higher capacity CPU and vice versa;
4. Consistent results for energy calculation and a simple implementation.

Patches 0003/0004 change CPU selection to a per-cluster basis: the scheduler first selects candidate CPUs within every cluster, so every cluster either has one candidate CPU or has no CPU that can meet the requirement. Finally, all energy difference calculation happens among these candidate CPUs. This gives us several benefits. The first is that the scheduler is no longer coupled to the previous CPU; the old code always compares energy between the previous CPU and a new possible CPU, but in some cases the previous CPU is completely the wrong CPU for the task, so the comparison is actually pointless.

Applying patches 0003/0004 introduces one side effect: a task can be migrated directly from a lower capacity CPU to a higher capacity CPU (LITTLE -> big). This usually doesn't happen in the old code, because in the energy comparison the lower capacity CPU can beat the higher capacity CPU, so the task misses the chance to migrate to the higher capacity CPU.

Patches 0005/0006 select the best CPU within a cluster. In the task wakeup path, the EAS core algorithm is responsible for task selection; it should achieve two targets: keep CPUs at the lowest possible OPP, and spread tasks if we can predict that the OPP is likely to increase after placing the woken task on one specific CPU. So patches 0005/0006 find the CPU with the lowest OPP and the highest utilization compared with other CPUs at the same OPP, so we can rely on the EAS core algorithm to spread tasks when the CPU OPP is increasing and to pack tasks after the CPU has dropped to the lowest OPP.

After applying patches 0003/0004, there are many more energy comparisons between one big CPU and one little CPU. As a result, the EAS core algorithm was observed to be fragile in some corner cases. So patches 0007/0008/0009 make the energy calculation more robust; in particular, patches 0008/0009 introduce an extra signal, "util_waken_avg". Using this signal we can remove the woken task's value from the CPU's utilization, so that all CPU signals are cleaned of the woken task's stale utilization value.

Patch 0010 is a significant change to the EAS core algorithm; the main idea is to change the energy calculation from CPU oriented to task oriented. Based on the energy model we can easily answer the question: if the task is placed on one specific CPU, how much power is consumed by this task? So essentially we can calculate the energy the task consumes on a specific CPU, learn the power consumption for every possible CPU, and finally pick out the CPU that saves the most power. After changing to task oriented energy calculation, it is also smoother to generate the perf idx and energy idx on a task oriented rather than CPU oriented basis, so hopefully this can benefit the schedTune PE filter as well.

c) Tipping point optimization

Power saving optimization mainly focuses on how to defer the system tipping point so that the energy aware path stays enabled in most cases. But deferring the tipping point also means hurting performance if the system cannot get over the tipping point in overloaded scenarios (like benchmarks). So the target is: optimize power without performance regression.

Patch 0011 is Thara's patch v1 "Per Sched domain over utilization"; the patch gives a good method for storing the per sched domain flag. I tweaked it with the criteria below for overutilization:

1. If a single CPU is at more than 80% util, set the lowest level sched domain as 'overutilized'; this is the tipping point for the 'inner overutilized' flag.
2. If any CPU has a 'misfit' task, or the cluster's overall util > 80% of the cluster's overall capacity, set the parent level sched domain as 'overutilized'; this is the tipping point for the 'outer overutilized' flag.
3. If the overall util > 50% of the overall capacity of all CPUs, set the root domain's 'overutilized' flag. 50% is actually quite a high bar; e.g. with two clusters it means the overall util is above the middle capacity of the two clusters, which also means the overall util is well beyond one cluster's capacity, so we kick the 'global' tipping point and spread tasks across the two clusters.

So with the 'per sched domain flag', we can defer the 'global' tipping point and rely on it as a switch for the energy aware path.

Patch 0011 moves the energy aware function to the beginning of the wakeup path, which gives energy_aware_wake_cpu() more chances to execute while the system is under the tipping point; only when the system is over the tipping point does it fall back to the traditional wakeup balance to select the idlest CPU.

### Testing result ###

On Hikey960, the testing environment is:

- Android AOSP kernel 4.4
  https://android.googlesource.com/kernel/hikey-linaro
  branch: android-hikey-linaro-4.4
- CPUFreq governor: sched-freq
- Fixed DDR: 400MHz
- Fixed GPU: 533MHz
- HDMI: unplugged
- WIFI: disabled

Please note that video playback (1080p) uses a software codec with the VLC player on Android, and camera recording uses the synthesized workload camera-long.json to simulate the camera scenario.

Test_Case         Referenced_Phone  PELT_Optimized  PELT_Optimized  WALT_Optimized  WALT_Optimized
                  (mW) [*]          (mW)            (Percentage)    (mW)            (Percentage)
homescreen        800               -5.05 [**]      -0.63%          -10.46          -1.31%
Audio(MP3)        200 (LCD OFF)     5.33            2.66%           60.62           30.31%
Video(1080p)      1000              133.09          13.31%          26.10           2.61%
Camera Recording  2000              163.94          8.20%           -79.57 [***]    -3.98%

[*] The reference phone is not any specific phone model; here I give some very rough power data for well optimized commercial phones. These are only figures based on old experience and they are not very precise. The power data is measured at the battery measurement point with 4.2V.
[**] Positive value: power reduced by this patch set
     Negative value: power increased by this patch set

[***] Camera Recording + WALT power data is much worse with this patch set; this is explained in the "Conclusion" section.

Testing raw data: http://people.linaro.org/~leo.yan/eas_upstream/hikey960_result/

### Conclusion ###

Firstly, Hikey960 is a good candidate platform to verify power saving optimization :)

This patch set with the PELT signal has good results on Hikey960, especially for video playback (saving 133.09mW) and camera recording (saving 163.94mW). For audio playback it saves 5.33mW; for homescreen there is a slight regression (an increase of 5.05mW), which I suspect is related to task packing on the LITTLE cores, but this needs investigation.

This patch set with the WALT signal has good results for audio playback and video playback, but it is broken for the camera recording case. After reviewing the trace log, the main issue is that many tasks' WALT signal can reach the range 100~200, so there are many comparisons between a LITTLE CPU at 1844MHz and a big CPU at 903MHz. According to the power modeling parameters, the big CPU at 903MHz has lower power efficiency than the LITTLE CPU at 1844MHz, so tasks are migrated onto the big cores frequently.

Compared to the WALT signal, the PELT signal works well with the power modeling parameters, so the energy aware algorithm can avoid migrating tasks to the big cores too easily. (It seems to me this comes down to the question: which signal is a good match for the EAS core algorithm?)

Some known issues:

- The CPUFreq governor has a large impact on power consumption; sched-freq easily reaches 1844MHz, so we need to check whether there is a mechanism to optimize the policy and reduce the chance of setting 1844MHz. Another test is to use other governors: schedutil, interactive.
- RT threads are currently not energy aware, so they are migrated to the big cores;
- The load balancing flow has no energy aware optimization;
- The DDR frequency is currently fixed; if DDR frequency changes are enabled, the power modeling will change significantly: we need a devfreq driver for DDR, and need to tune the power modeling for this.
- Though these patches have been verified on Juno to do no harm to performance, a performance comparison still needs to be done on Hikey960.

Leo Yan (12):
  sched/fair: add function find_nrg_efficient_target()
  sched/fair: enable energy efficiency selection
  sched/fair: use new function to select CPU from sched group
  sched/fair: select candidate CPUs by cluster basis
  sched/fair: refine find_new_capacity()
  sched/fair: optimize CPU selection with lowest OPP
  sched/fair: increase resolution for energy calculation
  sched/fair: introduce signal util_waken_avg for CPU
  sched/fair: select idle CPU as backup for waken up path
  sched/fair: task oriented energy calculation
  sched/fair: update idle CPU blocked load in update_sg_lb_stats()
  sched/fair: add trace event for sched group energy

Thara Gopinath (1):
  Per Sched domain over utilization

 include/linux/sched.h        |   2 +-
 include/trace/events/sched.h |  45 ++++
 kernel/sched/fair.c          | 593 +++++++++++++++++++++++++++++++------------
 kernel/sched/sched.h         |   1 +
 4 files changed, 484 insertions(+), 157 deletions(-)
 mode change 100644 => 100755 include/trace/events/sched.h
 mode change 100644 => 100755 kernel/sched/fair.c

--
1.9.1
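The three overutilization criteria from section c) can be sketched outside the kernel as follows. This Python model is only an illustration under stated assumptions: it assumes a two-level topology (one lowest-level domain per cluster, one parent domain above them), it reads "more than 80% util" for a single CPU as relative to that CPU's capacity, and the Cpu class, the overutilized_flags() function and the capacity numbers in the usage example are all hypothetical.

```python
from dataclasses import dataclass

INNER_THRESHOLD = 0.80  # single CPU util vs. its own capacity (assumption)
OUTER_THRESHOLD = 0.80  # cluster util vs. cluster capacity
ROOT_THRESHOLD = 0.50   # overall util vs. capacity of all CPUs


@dataclass
class Cpu:
    util: float
    capacity: float
    misfit: bool = False  # True if a 'misfit' task is running on this CPU


def overutilized_flags(clusters):
    """Return the inner (per cluster), outer and root 'overutilized' flags.

    clusters -- list of clusters, each a list of Cpu objects
    """
    inner = []
    outer = False
    for cpus in clusters:
        # Criterion 1: any single CPU above 80% marks its lowest-level domain.
        inner.append(any(c.util > INNER_THRESHOLD * c.capacity for c in cpus))
        # Criterion 2: a misfit task, or cluster util above 80% of the
        # cluster capacity, marks the parent-level domain.
        cluster_util = sum(c.util for c in cpus)
        cluster_cap = sum(c.capacity for c in cpus)
        if any(c.misfit for c in cpus) or cluster_util > OUTER_THRESHOLD * cluster_cap:
            outer = True
    # Criterion 3: overall util above 50% of the total capacity marks the root.
    total_util = sum(c.util for cpus in clusters for c in cpus)
    total_cap = sum(c.capacity for cpus in clusters for c in cpus)
    root = total_util > ROOT_THRESHOLD * total_cap
    return {"inner": inner, "outer": outer, "root": root}


# Made-up capacities: 4 LITTLE CPUs at 400 and 4 big CPUs at 1000.
little = [Cpu(util=350, capacity=400)] + [Cpu(util=0, capacity=400) for _ in range(3)]
big = [Cpu(util=0, capacity=1000) for _ in range(4)]
print(overutilized_flags([little, big]))
```

In this usage example only the LITTLE cluster's inner flag is raised; the root flag stays clear, so in the scheme described above the energy aware path would remain enabled.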

8 years, 10 months

1
14

EAS Android Product Codeline Release/Development Information

by Chris Redpath

Hello eas-dev,

I wanted to give you all an update on EAS product codeline development for Android.

As you may have noticed, the Android Common Kernel branch android-4.4 (https://android.googlesource.com/kernel/common.git/+/android-4.4) now has the EAS product codeline merged. All the patches in there have been validated on a big.LITTLE device. Some of the more experimental patches which were part of the EASr1.2 stack did not make the cut yet, but we will continue to develop them.

All development for the Android Common Kernel will be done in the open so that interested people can see the code, pull the patches and participate in code reviews on the AOSP Gerrit. Development is expected to be continuous - we plan a number of enhancements and upstream backports over the coming months.

The main focus for EAS development on the product codeline is to have a kernel which has good performance and efficiency on big.LITTLE devices running Android. As part of this we aim to reduce the delta to mainline as much as possible while of course maintaining any differences necessary for mobile devices. EAS is intended to be an upstream technology eventually and as we upstream various components we will be backporting them to the product codeline where suitable.

We intend not only to push patches which we consider ready (for code review) but also more ‘experimental’ patches for things that we are working on, an example is here: https://android-review.googlesource.com/#/c/411501/

This is a stack consisting of 11 patches. The first 6 patches make some changes to the performance index filtering in schedtune, and have a topic of 'fix_performance_index'. On top of those are 2 patches with a topic of 'small_optimizations' which are some optimizations which can be done to existing code. Finally there are 3 patches with a topic of 'experimental_utilest' which is a backport of the current mainline-focussed util_est solution.

Util_est is a filtered version of the existing PELT signals intended to address some of the generally acknowledged responsiveness issues PELT has when compared to WALT. There are multiple mainline ideas about making PELT more suitable for the uses we have; util_est is not the only one and may not be attractive to maintainers, but it is included here to help evaluate the signal characteristics.

Where we have multiple patches which make sense to review together, we will send them as a single set and set the topics as appropriate. The review patch stacks will be structured as less controversial changes at the start, and more experimental/risky patches at the end - it's perfectly possible that some will be merged and not others. As we push patches for review, we intend to post here announcing them.

Since we will now be doing continuous development in the open, the EAS releases (EASr1.3 etc.) become documentation points where we describe the features which are merged and are expected to be merged soon. The development model for Android is to have frequent merges of the common kernel, and we feel this fits.

For EAS product codeline testing, we are starting to use Hikey960 as it has big.LITTLE (4xA73+4xA53). Note that we find it essential to add a heatsink and a fan, so firmware thermal capping doesn’t affect EAS performance. We’ve been using a 14mm x 14mm heatsink on the SoC: http://uk.farnell.com/abl-heatsinks/bga-std-015/heat-sink-bga-standard-26-5… with a 5v fan blowing on it. We are using Baylibre ACME for power measurement, which is supported in LISA (https://github.com/ARM-software/lisa) which we're using for development testing and trace analysis.

The current (21-June-2017) Hikey960 status is that cpufreq/cpuidle is working using the ARM Trusted Firmware. There is an open issue relating to OPP switch time on the little cluster which is being investigated. We are also engaged with Leo in validating the energy model.

For Android Common Kernel 4.9 (https://android.googlesource.com/kernel/common.git/+/android-4.9) we have a first pass of a patch stack bringing it up to parity with android-4.4 which is being tested on Juno and original hikey. This will be posted for review against the android-4.9 branch and is a change to the existing EAS code that’s there. The tests are passing on Juno, but please note there is significantly less testing than for kernel 4.4.

We very much welcome broad participation in the future direction of EAS for products - to participate in EAS product codeline development, please ensure you are registered for a googlesource Gerrit userid which you can get at https://android.googlesource.com/new-password

Thanks,
Chris

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

8 years, 10 months

1
0
