eas-dev

eas-dev@lists.linaro.org

423 discussions

[PATCH] sched/fair: consider RT/IRQ pressure in select_idle_sibling

by Rohit Jain

Currently fast path in the scheduler looks for an idle CPU to schedule threads on. Capacity is taken into account in the function 'select_task_rq_fair' when it calls 'wake_cap', however it ignores the instantaneous capacity and looks at the original capacity. Furthermore select_idle_sibling path of the code, ignores the RT/IRQ threads which are also running on the CPUs it is looking to schedule fair threads on. We don't necessarily have to force the code to go to slow path (by modifying wake_cap), instead we could do a better selection of the CPU in the current domain itself (i.e. in the fast path). This patch makes the fast path aware of capacity, resulting in overall performance improvements as shown in the test results. 1) KVM Test: ----------------------------------------------------------------------- In this test KVM is configured with a ubuntu VM (unchanged kernel, used Ubuntu server 16.04) which is running ping workload in a taskset along with hackbench in a separate taskset. The VM is a virtio VM (which means the host is taking and processing the interrupts). In this case, we want to avoid scheduling vcpus on CPUs which are very busy processing interrupts if there is a better choice available. This machine is a 2 socket 40 CPU 20 core Intel x86 machine. lscpu output: CPU(s): 40 On-line CPU(s) list: 0-39 Thread(s) per core: 2 Core(s) per socket: 10 Socket(s): 2 NUMA node(s): 2 NUMA node0 CPUs: 0-9,20-29 NUMA node1 CPUs: 10-19,30-39 The setup is realistic enough to mirror realistic use cases. KVM is bound to a full NUMA node with CPUs bound to whole cores, i.e. CPU 0 and 1 are bound to core 0, CPU 2 and 3 to core 1 and so on. virsh vcpupin output: VCPU: CPU Affinity ---------------------------------- 0: 0,20 1: 0,20 2: 1,21 3: 1,21 4: 2,22 5: 2,22 6: 3,23 7: 3,23 8: 4,24 9: 4,24 10: 5,25 11: 5,25 12: 6,26 13: 6,26 14: 7,27 15: 7,27 16: 8,28 17: 8,28 18: 9,29 19: 9,29 Here are the results seen with ping and hackbench running inside the KVM. Note: hackbench was run for 10000 loops (lower is better) (Improvement is show in brackets +ve is good, -ve is bad) +-------------------+-----------------+---------------------------+ | | Without patch | With patch | +---+-------+-------+-------+---------+-----------------+---------+ |FD | Groups| Tasks | Mean | Std Dev | Mean | Std Dev | +---+-------+-------+-------+---------+-----------------+---------+ |4 | 1 | 4 | 0.059 | 0.021 | 0.034 (+42.37%) | 0.008 | |4 | 2 | 8 | 0.087 | 0.031 | 0.075 (+13.79%) | 0.021 | |4 | 4 | 16 | 0.124 | 0.022 | 0.089 (+28.23%) | 0.013 | |4 | 8 | 32 | 0.149 | 0.031 | 0.126 (+15.43%) | 0.022 | +---+-------+-------+-------+---------+-----------------+---------+ |8 | 1 | 8 | 0.212 | 0.025 | 0.211 (+0.47%) | 0.023 | |8 | 2 | 16 | 0.246 | 0.055 | 0.225 (+8.54%) | 0.024 | |8 | 4 | 32 | 0.298 | 0.047 | 0.294 (+1.34%) | 0.022 | |8 | 8 | 64 | 0.407 | 0.03 | 0.378 (+7.13%) | 0.032 | +---+-------+-------+-------+---------+-----------------+---------+ |40 | 1 | 40 | 1.703 | 0.133 | 1.451 (+14.80%) | 0.072 | |40 | 2 | 80 | 2.912 | 0.204 | 2.431 (+16.52%) | 0.075 | +---+-------+-------+-------+---------+-----------------+---------+ 2) ping + hackbench test on x86 machine: ----------------------------------------------------------------------- Here hackbench is running in threaded mode along with, running ping on CPU 0 and 1 as: 'ping -l 10000 -q -s 10 -f hostX' This test is running on 2 socket, 20 core and 40 threads Intel x86 machine: runtime is in seconds (Lower is better) +-----------------------------+----------------+---------------------------+ | | Without patch | With patch | +----------+----+------+------+-------+--------+----------------+----------+ |Loops | FD |Groups|Tasks | Mean | Std Dev|Mean | Std Dev | +----------+----+------+------+-------+--------+----------------+----------+ |1,000,000 | 4 |1 |4 | 2.375 | 0.818 |1.785 (+24.84%) | 0.21 | |1,000,000 | 4 |2 |8 | 2.748 | 0.694 |2.102 (+23.51%) | 0.239 | |1,000,000 | 4 |4 |16 | 3.237 | 0.256 |2.922 (+9.73%) | 0.169 | |1,000,000 | 4 |8 |32 | 3.786 | 0.185 |3.637 (+3.94%) | 0.471 | +----------+----+------+------+-------+--------+----------------+----------+ |1,000,000 | 8 |1 |8 | 7.287 | 1.03 |5.364 (+26.39%) | 1.085 | |1,000,000 | 8 |2 |16 | 7.963 | 0.951 |6.474 (+18.70%) | 0.397 | |1,000,000 | 8 |4 |32 | 8.991 | 0.618 |8.32 (+7.46%) | 0.69 | |1,000,000 | 8 |8 |64 | 13.868| 1.195 |12.881 (+7.12%) | 0.722 | +----------+----+------+------+-------+--------+----------------+----------+ |10,000 | 40 |1 |40 | 0.828 | 0.032 |0.784 (+5.31%) | 0.010 | |10,000 | 40 |2 |80 | 1.087 | 0.246 |0.980 (+9.84%) | 0.037 | |10,000 | 40 |4 |160 | 1.611 | 0.055 |1.591 (+1.24%) | 0.039 | |10,000 | 40 |8 |320 | 2.827 | 0.031 |2.817 (+0.35%) | 0.025 | |10,000 | 40 |16 |640 | 5.107 | 0.085 |5.087 (+0.39%) | 0.045 | |10,000 | 40 |25 |1000 | 7.503 | 0.143 |7.468 (+0.46%) | 0.045 | +----------+----+------+------+-------+--------+----------------+----------+ Sanity tests: ----------------------------------------------------------------------- schbench results on 2 socket, 44 core and 88 threads Intel x86 machine, with 2 message threads (lower is better): +----------+-------------+----------+------------------+ |Threads | Latency | Without | With Patch | | | percentiles | Patch | | | | | (usec) | (usec) | +----------+-------------+----------+------------------+ |16 | 50.0000th | 60 | 62 (-3.33%) | |16 | 75.0000th | 72 | 68 (+5.56%) | |16 | 90.0000th | 81 | 80 (+2.46%) | |16 | 95.0000th | 88 | 83 (+5.68%) | |16 | *99.0000th | 100 | 92 (+8.00%) | |16 | 99.5000th | 105 | 97 (+7.62%) | |16 | 99.9000th | 110 | 100 (+9.09%) | +----------+-------------+----------+------------------+ |32 | 50.0000th | 62 | 62 (0%) | |32 | 75.0000th | 80 | 81 (0%) | |32 | 90.0000th | 93 | 94 (-1.07%) | |32 | 95.0000th | 103 | 105 (-1.94%) | |32 | *99.0000th | 121 | 121 (0%) | |32 | 99.5000th | 127 | 125 (+1.57%) | |32 | 99.9000th | 143 | 135 (+5.59%) | +----------+-------------+----------+------------------+ |44 | 50.0000th | 79 | 79 (0%) | |44 | 75.0000th | 104 | 104 (0%) | |44 | 90.0000th | 126 | 125 (+0.79%) | |44 | 95.0000th | 138 | 137 (+0.72%) | |44 | *99.0000th | 163 | 163 (0%) | |44 | 99.5000th | 174 | 171 (+1.72%) | |44 | 99.9000th | 10832 | 11248 (-3.84%) | +----------+-------------+----------+------------------+ I also ran uperf and sysbench MySQL workloads but I see no statistically significant change. Signed-off-by: Rohit Jain <rohit.k.jain(a)oracle.com> --- kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++---------- 1 file changed, 28 insertions(+), 10 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2fe3aa8..371c23c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5625,6 +5625,11 @@ static unsigned long capacity_orig_of(int cpu) return cpu_rq(cpu)->cpu_capacity_orig; } +static inline bool full_capacity(int cpu) +{ + return capacity_of(cpu) >= (capacity_orig_of(cpu)*3)/4; +} + static unsigned long cpu_avg_load_per_task(int cpu) { struct rq *rq = cpu_rq(cpu); @@ -6081,7 +6086,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int for_each_cpu(cpu, cpu_smt_mask(core)) { cpumask_clear_cpu(cpu, cpus); - if (!idle_cpu(cpu)) + if (!idle_cpu(cpu) || !full_capacity(cpu)) idle = false; } @@ -6102,7 +6107,8 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int */ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target) { - int cpu; + int cpu, rcpu = -1; + unsigned long max_cap = 0; if (!static_branch_likely(&sched_smt_present)) return -1; @@ -6110,11 +6116,13 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t for_each_cpu(cpu, cpu_smt_mask(target)) { if (!cpumask_test_cpu(cpu, &p->cpus_allowed)) continue; - if (idle_cpu(cpu)) - return cpu; + if (idle_cpu(cpu) && (capacity_of(cpu) > max_cap)) { + max_cap = capacity_of(cpu); + rcpu = cpu; + } } - return -1; + return rcpu; } #else /* CONFIG_SCHED_SMT */ @@ -6143,6 +6151,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t u64 time, cost; s64 delta; int cpu, nr = INT_MAX; + int best_cpu = -1; + unsigned int best_cap = 0; this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc)); if (!this_sd) @@ -6173,8 +6183,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t return -1; if (!cpumask_test_cpu(cpu, &p->cpus_allowed)) continue; - if (idle_cpu(cpu)) - break; + if (idle_cpu(cpu)) { + if (full_capacity(cpu)) { + best_cpu = cpu; + break; + } else if (capacity_of(cpu) > best_cap) { + best_cap = capacity_of(cpu); + best_cpu = cpu; + } + } } time = local_clock() - time; @@ -6182,7 +6199,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t delta = (s64)(time - cost) / 8; this_sd->avg_scan_cost += delta; - return cpu; + return best_cpu; } /* @@ -6193,13 +6210,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) struct sched_domain *sd; int i; - if (idle_cpu(target)) + if (idle_cpu(target) && full_capacity(target)) return target; /* * If the previous cpu is cache affine and idle, don't be stupid. */ - if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev)) + if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev) && + full_capacity(prev)) return prev; sd = rcu_dereference(per_cpu(sd_llc, target)); -- 2.7.4

8 years, 4 months

Re: [Eas-dev] [ARM-software/lisa] What is the best way to evaluate EAS versus non-EAS on Linux? (#537)

by Patrick Bellasi

[ +eas_dev ] Hi Leonard, first of all I would like to inform you that, for EAS specific question, you should better post your requests on the eas-dev ML: https://lists.linaro.org/mailman/listinfo/eas-dev This is where you can reach most of the people working on EAS. Hereafter are some comments related to your question. On 05-Jan 06:41, Leonard Crestez wrote: > After porting the EAS patches and I'd like to do a top-level > comparison between EAS and non-EAS. Looking through lisa notebooks I > didn't find anything that obviously fits; many of the tests refuse > to even run without EAS. Most of the tests we have on LISA_HOME/tests/eas are actually to test that EAS is working as expected. > If there is no good top-level synthetic evaluation for linux maybe I > could use one for Android? > > I'm looking for something that can say something like "EAS consumes > x% less power". We actually have a workflow to run a complete set of Android workloads on a custom target and compare power/performance results corresponding to different kernels. That suite is named wltest and it's part of LISA. Here are the instructions to run wltest: https://github.com/ARM-software/lisa/tree/master/tools/wltests lemme know if you have any question/doubts about running that suite of tests. Do please consider that the Google kernel team has a great interest on checking wltests results to assess proposed scheduler changes for the AOSP common kernel. Here is an example of the report generated by wltest when comparing WALT vs PELT kernels: https://gist.github.com/derkling/3a8c3568676a29e608d6dcb15af06241 As a final remark, please do notice that wltest currently supports out-of-the-box only hikey960 boards with an ACME energy meter. However, it's relatively easy to integrate the support for different targets and energy meters. Unfortunately we do not have documentation available, but... everything needed should be just what you find under one of platform folders: https://github.com/ARM-software/lisa/tree/master/tools/wltests/platforms You can just copy the content of: https://github.com/ARM-software/lisa/tree/master/tools/wltests/platforms/hi… and modify the contained files to match your specific target. Internally we have an integration for Google Pixel 2 devices... but still did not find time to push/merge it. Lemme know if you are interested ;-) Cheers Patrick -- #include <best/regards.h> Patrick Bellasi

8 years, 4 months

[Integration Branch] Update 05-Jan-2018

by Michele DiGiorgio

Hi, a new EAS integration branch (tag: 20180105_1000) is available on: http://linux-arm.org/git?p=linux-power.git News: Energy model data is now stored in dt rather than in arch/*/topology.c. Support for getting the energy model from dt into the scheduler has been added. This is to align energy model presentation with what is done in android.googlesource.com/kernel/common experimental/android-4.14. For further information about main features, test coverage and work items for next integration please have a look at: https://developer.arm.com/open-source/energy-aware-scheduling/eas-mainline-… Best Regards, Michele

8 years, 4 months

[Integration Branch] Update 22-Dec-2017

by Dietmar Eggemann

Hi, a new EAS integration branch (tag: 20171222_1000) is available on: http://linux-arm.org/git?p=linux-power.git News: Frequency and CPU Invariance (FIE/CIE) now in base Bugfixes from android.googlesource.com/kernel/common experimental/android-4.14 (1) 'sched: Per-Sched-domain over utilization' now compiles w/ !CONFIG_SMP (2) 'drivers base/arch_topology: Detect SD_SHARE_CAP_STATES flag' now compiles w/ !CONFIG_CPU_FREQ For further information about main features, test coverage and work items for next integration please have a look at: https://developer.arm.com/open-source/energy-aware-scheduling/eas-mainline-… Best Regards, -- Dietmar

8 years, 4 months

[PATCH RESEND] cpufreq: schedutil: Use idle_calls counter of the remote CPU

by Joel Fernandes

Since the recent remote cpufreq callback work, its possible that a cpufreq update is triggered from a remote CPU. For single policies however, the current code uses the local CPU when trying to determine if the remote sg_cpu entered idle or is busy. This is incorrect. To remedy this, compare with the nohz tick idle_calls counter of the remote CPU. Acked-by: Viresh Kumar <viresh.kumar(a)linaro.org> Signed-off-by: Joel Fernandes <joelaf(a)google.com> --- Just resending this which is cpufreq-related as requested by Rafael rebased on linus/master. The other 2 patches in my last series which can go in independent of this one are: https://patchwork.kernel.org/patch/10115395/ https://patchwork.kernel.org/patch/10115401/ I'm still waiting on scheduler maintainers to comment on those. Unfortunately, I haven't heard back anything yet since the last repost of those. include/linux/tick.h | 1 + kernel/sched/cpufreq_schedutil.c | 2 +- kernel/time/tick-sched.c | 13 +++++++++++++ 3 files changed, 15 insertions(+), 1 deletion(-) diff --git a/include/linux/tick.h b/include/linux/tick.h index f442d1a42025..7cc35921218e 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -119,6 +119,7 @@ extern void tick_nohz_idle_exit(void); extern void tick_nohz_irq_exit(void); extern ktime_t tick_nohz_get_sleep_length(void); extern unsigned long tick_nohz_get_idle_calls(void); +extern unsigned long tick_nohz_get_idle_calls_cpu(int cpu); extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time); extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time); #else /* !CONFIG_NO_HZ_COMMON */ diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 2f52ec0f1539..d6717a3331a1 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -244,7 +244,7 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, unsigned long *util, #ifdef CONFIG_NO_HZ_COMMON static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { - unsigned long idle_calls = tick_nohz_get_idle_calls(); + unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu); bool ret = idle_calls == sg_cpu->saved_idle_calls; sg_cpu->saved_idle_calls = idle_calls; diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 99578f06c8d4..77555faf6fbc 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -985,6 +985,19 @@ ktime_t tick_nohz_get_sleep_length(void) return ts->sleep_length; } +/** + * tick_nohz_get_idle_calls_cpu - return the current idle calls counter value + * for a particular CPU. + * + * Called from the schedutil frequency scaling governor in scheduler context. + */ +unsigned long tick_nohz_get_idle_calls_cpu(int cpu) +{ + struct tick_sched *ts = tick_get_tick_sched(cpu); + + return ts->idle_calls; +} + /** * tick_nohz_get_idle_calls - return the current idle calls counter value * -- 2.15.1.504.g5279b80103-goog

8 years, 5 months

[PATCH v2] sched/fair: Consider RT/IRQ pressure in capacity_spare_wake

by Joel Fernandes

capacity_spare_wake in the slow path influences choice of idlest groups, as we search for groups with maximum spare capacity. In scenarios where RT pressure is high, a sub optimal group can be chosen and hurt performance of the task being woken up. This is fixed in this patch by using capacity_of instead of capacity_orig_of in capacity_spare_wake. Only change since v1 is change in commit message. Tests results from improvements with this change are below. More tests were also done by myself and Matt Fleming to ensure no degradation in different benchmarks. 1) Rohit ran barrier.c test (details below) with following improvements: ------------------------------------------------------------------------ This was Rohit's original use case for a patch he posted at [1] however from his recent tests he showed my patch can replace his slow path changes [1] and there's no need to selectively scan/skip CPUs in find_idlest_group_cpu in the slow path to get the improvement he sees. barrier.c (open_mp code) as a micro-benchmark. It does a number of iterations and barrier sync at the end of each for loop. Here barrier,c is running in along with ping on CPU 0 and 1 as: 'ping -l 10000 -q -s 10 -f hostX' barrier.c can be found at: http://www.spinics.net/lists/kernel/msg2506955.html Following are the results for the iterations per second with this micro-benchmark (higher is better), on a 44 core, 2 socket 88 Threads Intel x86 machine: +--------+------------------+---------------------------+ |Threads | Without patch | With patch | | | | | +--------+--------+---------+-----------------+---------+ | | Mean | Std Dev | Mean | Std Dev | +--------+--------+---------+-----------------+---------+ |1 | 539.36 | 60.16 | 572.54 (+6.15%) | 40.95 | |2 | 481.01 | 19.32 | 530.64 (+10.32%)| 56.16 | |4 | 474.78 | 22.28 | 479.46 (+0.99%) | 18.89 | |8 | 450.06 | 24.91 | 447.82 (-0.50%) | 12.36 | |16 | 436.99 | 22.57 | 441.88 (+1.12%) | 7.39 | |32 | 388.28 | 55.59 | 429.4 (+10.59%)| 31.14 | |64 | 314.62 | 6.33 | 311.81 (-0.89%) | 11.99 | +--------+--------+---------+-----------------+---------+ 2) ping+hackbench test on bare-metal sever (by Rohit) ----------------------------------------------------- Here hackbench is running in threaded mode along with, running ping on CPU 0 and 1 as: 'ping -l 10000 -q -s 10 -f hostX' This test is running on 2 socket, 20 core and 40 threads Intel x86 machine: Number of loops is 10000 and runtime is in seconds (Lower is better). +--------------+-----------------+--------------------------+ |Task Groups | Without patch | With patch | | +-------+---------+----------------+---------+ |(Groups of 40)| Mean | Std Dev | Mean | Std Dev | +--------------+-------+---------+----------------+---------+ |1 | 0.851 | 0.007 | 0.828 (+2.77%)| 0.032 | |2 | 1.083 | 0.203 | 1.087 (-0.37%)| 0.246 | |4 | 1.601 | 0.051 | 1.611 (-0.62%)| 0.055 | |8 | 2.837 | 0.060 | 2.827 (+0.35%)| 0.031 | |16 | 5.139 | 0.133 | 5.107 (+0.63%)| 0.085 | |25 | 7.569 | 0.142 | 7.503 (+0.88%)| 0.143 | +--------------+-------+---------+----------------+---------+ [1] https://patchwork.kernel.org/patch/9991635/ Matt Fleming also ran several different hackbench tests and cyclic test to santiy-check that the patch doesn't harm other usecases. Reviewed-by: Vincent Guittot <vincent.guittot(a)linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann(a)arm.com> Tested-by: Matt Fleming <matt(a)codeblueprint.co.uk> Tested-by: Rohit Jain <rohit.k.jain(a)oracle.com> Signed-off-by: Joel Fernandes <joelaf(a)google.com> --- kernel/sched/fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0989676c50e9..832f2ea069ef 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5726,7 +5726,7 @@ static int cpu_util_wake(int cpu, struct task_struct *p); static unsigned long capacity_spare_wake(int cpu, struct task_struct *p) { - return capacity_orig_of(cpu) - cpu_util_wake(cpu, p); + return max_t(long, capacity_of(cpu) - cpu_util_wake(cpu, p), 0); } /* -- 2.15.1.504.g5279b80103-goog

8 years, 5 months

[PATCH] sched/fair: Consider RT/IRQ pressure in capacity_spare_wake

by Joel Fernandes

capacity_spare_wake in the slow path influences choice of idlest groups, as we search for groups with maximum spare capacity. In scenarios where RT pressure is high, a sub optimal group can be chosen and hurt performance of the task being woken up. Several tests with results are included below to show improvements with this change. 1) Hackbench on Pixel 2 Android device (4x4 ARM64 Octa core) ------------------------------------------------------------ Here we have RT activity running on big CPU cluster induced with rt-app, and running hackbench in parallel. The RT tasks are bound to 4 CPUs on the big cluster (cpu 4,5,6,7) and have 100ms periodicity with runtime=20ms sleep=80ms. Hackbench shows big benefit (30%) improvement when number of tasks is 8 and 32: Note: data is completion time in seconds (lower is better). Number of loops for 8 and 16 tasks is 50000, and for 32 tasks its 20000. +--------+-----+-------+-------------------+---------------------------+ | groups | fds | tasks | Without Patch | With Patch | +--------+-----+-------+---------+---------+-----------------+---------+ | | | | Mean | Stdev | Mean | Stdev | | | | +-------------------+-----------------+---------+ | 1 | 8 | 8 | 1.0534 | 0.13722 | 0.7293 (+30.7%) | 0.02653 | | 2 | 8 | 16 | 1.6219 | 0.16631 | 1.6391 (-1%) | 0.24001 | | 4 | 8 | 32 | 1.2538 | 0.13086 | 1.1080 (+11.6%) | 0.16201 | +--------+-----+-------+---------+---------+-----------------+---------+ 2) Rohit ran barrier.c test (details below) with following improvements: ------------------------------------------------------------------------ This was Rohit's original use case for a patch he posted at [1] however from his recent tests he showed my patch can replace his slow path changes [1] and there's no need to selectively scan/skip CPUs in find_idlest_group_cpu in the slow path to get the improvement he sees. barrier.c (open_mp code) as a micro-benchmark. It does a number of iterations and barrier sync at the end of each for loop. Here barrier,c is running in along with ping on CPU 0 and 1 as: 'ping -l 10000 -q -s 10 -f hostX' barrier.c can be found at: http://www.spinics.net/lists/kernel/msg2506955.html Following are the results for the iterations per second with this micro-benchmark (higher is better), on a 44 core, 2 socket 88 Threads Intel x86 machine: +--------+------------------+---------------------------+ |Threads | Without patch | With patch | | | | | +--------+--------+---------+-----------------+---------+ | | Mean | Std Dev | Mean | Std Dev | +--------+--------+---------+-----------------+---------+ |1 | 539.36 | 60.16 | 572.54 (+6.15%) | 40.95 | |2 | 481.01 | 19.32 | 530.64 (+10.32%)| 56.16 | |4 | 474.78 | 22.28 | 479.46 (+0.99%) | 18.89 | |8 | 450.06 | 24.91 | 447.82 (-0.50%) | 12.36 | |16 | 436.99 | 22.57 | 441.88 (+1.12%) | 7.39 | |32 | 388.28 | 55.59 | 429.4 (+10.59%)| 31.14 | |64 | 314.62 | 6.33 | 311.81 (-0.89%) | 11.99 | +--------+--------+---------+-----------------+---------+ 3) ping+hackbench test on bare-metal sever (Rohit ran this test) ---------------------------------------------------------------- Here hackbench is running in threaded mode along with, running ping on CPU 0 and 1 as: 'ping -l 10000 -q -s 10 -f hostX' This test is running on 2 socket, 20 core and 40 threads Intel x86 machine: Number of loops is 10000 and runtime is in seconds (Lower is better). +--------------+-----------------+--------------------------+ |Task Groups | Without patch | With patch | | +-------+---------+----------------+---------+ |(Groups of 40)| Mean | Std Dev | Mean | Std Dev | +--------------+-------+---------+----------------+---------+ |1 | 0.851 | 0.007 | 0.828 (+2.77%)| 0.032 | |2 | 1.083 | 0.203 | 1.087 (-0.37%)| 0.246 | |4 | 1.601 | 0.051 | 1.611 (-0.62%)| 0.055 | |8 | 2.837 | 0.060 | 2.827 (+0.35%)| 0.031 | |16 | 5.139 | 0.133 | 5.107 (+0.63%)| 0.085 | |25 | 7.569 | 0.142 | 7.503 (+0.88%)| 0.143 | +--------------+-------+---------+----------------+---------+ [1] https://patchwork.kernel.org/patch/9991635/ Matt Fleming also ran cyclictest and several different hackbench tests on his test machines to santiy-check that the patch doesn't harm any of his usecases. Cc: Dietmar Eggemann <dietmar.eggemann(a)arm.com> Cc: Vincent Guittot <vincent.guittot(a)linaro.org> Cc: Morten Ramussen <morten.rasmussen(a)arm.com> Cc: Brendan Jackman <brendan.jackman(a)arm.com> Tested-by: Rohit Jain <rohit.k.jain(a)oracle.com> Tested-by: Matt Fleming <matt(a)codeblueprint.co.uk> Signed-off-by: Joel Fernandes <joelaf(a)google.com> --- kernel/sched/fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 56f343b8e749..ba9609407cb9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5724,7 +5724,7 @@ static int cpu_util_wake(int cpu, struct task_struct *p); static unsigned long capacity_spare_wake(int cpu, struct task_struct *p) { - return capacity_orig_of(cpu) - cpu_util_wake(cpu, p); + return max_t(long, capacity_of(cpu) - cpu_util_wake(cpu, p), 0); } /* -- 2.15.0.448.gf294e3d99a-goog

8 years, 5 months

[util_est] Backport to Hikey960 v4.9 kernel

by Patrick Bellasi

Hi guys, I've just pushed here: git://linux-arm.org/linux-pb.git eas/v1.5/util_est/hikey960 a backport of the util_est patches [1] recently posted on LKML. Apart from the util_est specific patches, this branch is based on top of some patches (suggested by Linaro) to improve power/performance testing on Hikey960. Moreover, at the top there are some additional patches to test with different PELT half-life values. In attachment you can also find a "series file" which can be used with LISA's wltest. Unfortunately, so fare there have not been much review feedbacks on LKML. Would be nice if someone interested can give it a go and report on the list. Cheers Patrick [1] https://lkml.org/lkml/2017/12/5/634 -- #include <best/regards.h> Patrick Bellasi

8 years, 5 months

Energy Model Question regarding the Pixel 2

by Zachariah Kennedy

Good day! I have noticed since release that EM for the Pixel 2 doesnt cover each frequency step. 22 steps for small cores, 31 steps for big cores. There are 22 tuples for the small cores but only 27 tuples for big cores. I have checked and the Pixel 2 is using all frequency steps for both small and big cores, so why doesnt the EM account for the last 4 freq steps for big cores? Thanks as always for taking the time to answer my questions. Kind Regards, Zachariah Kennedy

8 years, 5 months

[Integration Branch] Update 24-Nov-2017

by Michele DiGiorgio

Hello EAS developers, This email is to inform you about the latest EAS integration branch that was published last Friday. All the information on where to get the branch from are available at: https://developer.arm.com/open-source/energy-aware-scheduling/EAS%20Mainlin… The integration branch was conceived to keep the latest EAS patches on track with tip/sched/core. Hence, on top of that the integration branch puts: - some new scheduler features, i.e. patches that relate to scheduler but are not main components of EAS - EAS-core patches - debug patches, i.e. trace events, procfs interfaces, etc. Integration will happen every two weeks. The above website covers the main additions to each integration and the next work items for the ones that will follow. Kind regards, Michele

8 years, 5 months

← Newer
1
...
24
25
26
27
28
29
30
...
43
Older →

Jump to page:

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

eas-dev