This patch series is a follow-up to the EASv5 power profiling on Hikey.
From the profiling results, rt-app-31/38/44 are inconsistent; I finally
found that this issue can be fixed by these 4 patches. After applying
these patches, we get a good improvement for these cases (mW):
Energy      BestComb  Mainline(ndm)  noEAS(ndm)  EAS(ndm)  EAS(sched)  EAS(Applied Patches)
mp3              412         604.41      551.79    528.99      530.20               491.10
rt-app-6         676         864.18      846.72    792.88      840.33               759.96
rt-app-13        968        1222.47     1210.35   1673.04     1332.13              1253.99
rt-app-19       1348        1412.08     1474.86   1612.12     1421.28              1355.49
rt-app-25       1619        1718.67     1710.73   2104.41     2028.25              1584.25
rt-app-31       1878        1968.08     1965.87   2318.11     2976.59              1903.69
rt-app-38       2283        2580.23     2540.45   2576.46     2724.32              2241.29
rt-app-44       2578        3092.66     3056.92   2913.91     2669.91              2406.45
rt-app-50       2848        3492.36     3423.26   3489.14     3429.41              3290.25
This patch series is ONLY for EXPERIMENTAL purposes.
Leo Yan (4):
sched/fair: EASv5: Fix CPU shared capacity issue
sched/fair: EASv5: snapshot CPU's utilization
sched/fair: EASv5: Add CPU's total utilization
sched/fair: EASv5: update new capacity index
kernel/sched/fair.c | 88 +++++++++++++++++++++++++++++++++++++++++++---------
kernel/sched/sched.h | 1 +
2 files changed, 74 insertions(+), 15 deletions(-)
--
1.9.1
Hi all,
At Connect, Steve also raised a related question: should we pack tasks
at a higher OPP or spread tasks at a lower OPP? So I'd like to summarize
this and combine it with recent profiling results:
- When task_A is woken up, the scheduler needs to decide whether to pack
  task_A onto a busy CPU or spread task_A to an idle CPU.
  Packing task_A onto a busy CPU may introduce a power penalty from the
  higher OPP; on the other hand, spreading task_A to an idle CPU (whose
  cluster may also be in an idle state) may introduce a power penalty
  from powering up an extra power domain.
  So I think we can enhance the energy calculation for task wakeup in
  energy_aware_wake_cpu(). For example, we can select two candidate CPUs
  for the woken task: one in the same sched group as the task's original
  CPU, and another in a different sched group (the group with the best,
  or equal, power efficiency in the system). Then we can finally decide
  whether to spread tasks to a different cluster, spread to a different
  CPU within the same cluster, or just stay on the original CPU.
- I also observed another possible scenario. For example, if tasks have
  already been packed onto a few CPUs, then even though each task's
  workload is not very high (as in rt-app-13), their load accumulates on
  one CPU, so that CPU ends up running at a high OPP.
  If EAS picks only one of these tasks and tries to migrate it to
  another CPU, it usually will not migrate. The reason is that even
  though the target CPU is running at a high OPP and usually still has
  capacity to take on more work at the highest OPP, energy_diff() will
  report a worse power result after the OPP increase, so the task stays
  on its original CPU. [1][2]
  Even picking an idle CPU from another cluster cannot resolve this
  issue: spreading a task to another cluster does not lower the original
  cluster's and CPU's OPP, but it does introduce extra power for the new
  cluster and CPU.
  So in this case, we should take a global view and define some
  criteria:
  * CPUs are not at the lowest OPP, but the system has idle CPUs;
  * A CPU's lower OPP can meet the capacity requirement for the tasks'
    average load;
  * A CPU's lower OPP can meet the capacity requirement for the
    highest-load task in the system.
  If these criteria are met, EAS can select an idle CPU from the sched
  group with the best power efficiency.
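The three criteria above can be sketched as a simple predicate. This is
an illustrative userspace model, not kernel code; all struct and field
names here are made up for the example:

```c
/*
 * Illustrative model of the three spreading criteria described above.
 * The struct, field names, and units are hypothetical.
 */
struct sys_state {
	int all_cpus_at_min_opp; /* 1 if every busy CPU is at its lowest OPP */
	int nr_idle_cpus;        /* number of idle CPUs in the system */
	int lower_opp_capacity;  /* capacity offered by a lower OPP */
	int avg_task_load;       /* average load across all tasks */
	int max_task_load;       /* load of the heaviest task in the system */
};

static int should_spread_to_idle_cpu(const struct sys_state *s)
{
	/* CPUs don't stay at the lowest OPP, but the system has idle CPUs. */
	if (s->all_cpus_at_min_opp || s->nr_idle_cpus == 0)
		return 0;

	/* The lower OPP must cover the tasks' average load... */
	if (s->lower_opp_capacity < s->avg_task_load)
		return 0;

	/* ...and the heaviest single task in the system. */
	if (s->lower_opp_capacity < s->max_task_load)
		return 0;

	return 1;
}
```

Only when all three checks pass would EAS pick an idle CPU from the most
power-efficient sched group; otherwise the existing placement stands.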
I think you may have discussed this topic already, so before I start
trying out these ideas, I want to check whether I have missed any
discussion; any suggestions are welcome.
[1] http://people.linaro.org/~leo.yan/eas_profiling/eas_tasks_in_one_cluster_hi…
[2] http://people.linaro.org/~leo.yan/eas_profiling/eas_tasks_in_one_cluster_en…
Thanks,
Leo Yan
Dear All,
I am going through the EAS project work and trying to port it to my
ARM-based SMP system (Linux 3.10).
Could you please help me clarify whether EAS will be helpful in terms of
power/performance for SMP systems as well?
Thanks & Regards
Nitish Ambastha
To summarize the current problem with idle CPU capacity votes:
- When the last task on a CPU (say CPU X) sleeps and the CPU goes idle,
we currently drop its capacity vote to zero. We do not immediately
update the cluster frequency based on this information however.
- It depends on when other CPUs in the frequency domain have an event
which forces re-evaluation of the capacity votes and corresponding
frequency. It could occur right away, lowering the frequency only to
require raising it again immediately if CPU X is idle a very short time.
Or it could be a very long time before such an event occurs which will
leave the cluster at an unnecessarily high OPP and waste energy.
I have a draft of a change which modifies the nohz idle balance path a
bit to ensure that update_blocked_averages() is called for tickless idle
CPUs at least every X ms. This alone won't solve the above problems
though. You need to force re-evaluation of the capacity votes somewhere
to update the cluster frequency. I was originally going to call into
cpufreq_sched as idle CPU loads are decayed to update the frequency
there but folks didn't seem to like this during Thursday's call.
We could get rid of the clearing of the capacity vote when entering idle
and use a passive update when decaying idle CPU utilizations (setting
the capacity vote but not triggering a re-evaluation of cluster
frequency). That would solve the problem of risking the cluster
frequency dropping to fmin during a very short idle and having to be
immediately ramped up again. It will not solve the issue of the cluster
potentially getting stuck idle at fmax/high frequency for long periods
of time and wasting energy though.
There's been some discussion on this issue in the context of integration
of cpuidle with cpufreq and the scheduler (see attached). Rather than
force regular load decay updates via the load balancer and figure out
when to force frequency re-evaluation I'm inclined to just remove the
clearing of the capacity vote in dequeue_task_fair when going idle and
tackle this problem within cpuidle as part of an energy aware/platform
aware decision (see #2 in the attachment). A possible policy in cpuidle
might look like:
- If it's a short idle, don't bother removing capacity vote.
- If it's a long idle and the system doesn't burn extra power in idle at
elevated frequency, passively remove the capacity vote. Frequency gets
adjusted if another CPU has a freq-evaluating event, like today.
- If it's a long idle and the system burns extra power in idle, actively
remove the capacity vote, immediately adjusting frequency if needed.
A slack timer mechanism may still be desirable in cpuidle to guard
against the prediction being wrong (you think it's a short idle and
leave a high capacity vote in, but it ends up being a long idle).
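The three-way policy above could be sketched roughly as follows. The
threshold, the burns_power_in_idle flag, and the function name are all
hypothetical inputs for illustration, not an existing kernel interface:

```c
/*
 * Sketch of the cpuidle capacity-vote policy outlined above. The
 * short-idle threshold and the burns_power_in_idle flag would come
 * from platform/energy-model data; both are hypothetical here.
 */
enum vote_action {
	VOTE_KEEP,          /* short idle: leave the capacity vote in place */
	VOTE_PASSIVE_CLEAR, /* clear the vote, no immediate freq change */
	VOTE_ACTIVE_CLEAR,  /* clear the vote and adjust frequency now */
};

static enum vote_action idle_capacity_policy(unsigned int expected_idle_us,
					     unsigned int short_idle_threshold_us,
					     int burns_power_in_idle)
{
	if (expected_idle_us <= short_idle_threshold_us)
		return VOTE_KEEP;
	if (!burns_power_in_idle)
		return VOTE_PASSIVE_CLEAR;
	return VOTE_ACTIVE_CLEAR;
}
```

A slack timer would then guard the VOTE_KEEP case, demoting it to one of
the clear actions if the predicted short idle turns out to be long.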
Thanks if you've read this far! Also, I hope to migrate these
discussions to lkml+linux-pm. Perhaps after the next sched-freq RFC
posting which will surely spawn discussions there anyway and get
everyone up to speed on our current status and issues, making it a good
cutover point.
Hi Juri, Steve,
It looks like if cpuX sets its own OPP level in task_tick_fair (to
capacity_orig), another cpuY can override this to any value (at least
via enqueue_task_fair) before cpuX's request can take effect (i.e.
before the throttling timestamp is updated via the kcpufreq thread). The
request from cpuX at the next tick may be throttled or the task may go
to sleep and its load is decayed enough that the next request after
wakeup no longer crosses the threshold and hence we lose the opportunity
to go to FMAX. It seems like we need a mechanism whereby a CPU's
current, higher request for its own capacity overrides any other CPU's
lower request?
Thanks,
Vikram
Hello,
I'm using the EASv5 3.18 tree with cpufreq_sched. With the sched
governor enabled I've noticed that after a migration or after a switch
from a non-fair task to the idle task, the source CPU goes idle and its
(possibly max) capacity request stays in place, preventing other
requests from going through until that source CPU decides to wake up and
take up some work. I know that there are some ongoing discussions about
how to actually enforce a frequency reduction when a CPU enters idle to
save power, but this seems to be a more immediate problem since the
other CPU(s)' requests are also basically ignored. How about a
reset_capacity call in pick_next_task_idle? Throttling is a concern I
suppose, but I think the check in dequeue_task_fair is doing the same
thing already, so the following would just repeat for
non_fair_class->idle_task.
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index c65dac8..555c21d 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -28,6 +28,8 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev)
 {
 	put_prev_task(rq, prev);
+	cpufreq_sched_reset_cap(cpu_of(rq));
+
 	schedstat_inc(rq, sched_goidle);
 	return rq->idle;
 }
Thanks,
Vikram
In cpufreq_sched_set_cap we currently have this:
/*
* We only change frequency if this cpu's capacity request represents a
* new max. If another cpu has requested a capacity greater than the
* previous max then we rely on that cpu to hit this code path and make
* the change. IOW, the cpu with the new max capacity is responsible
* for setting the new capacity/frequency.
*
* If this cpu is not the new maximum then bail
*/
	if (capacity_max > capacity)
		goto out;
But this can lead to situations like (2 CPU cluster, CPUs start with cap
request of 0):
1. CPU0 gets heavily loaded, requests cap = 1024 (fmax)
2. CPU1 gets lightly loaded, requests cap = 10
3. CPU0's load goes away, requests cap = 0
4. CPU1's load of 10 persists for a long time
In step #3 we could've set the cluster capacity to 10/1024 but did not
because the CPU we were working with at the time (CPU0) was not the CPU
driving the new cluster maximum capacity request. As a result we run
unnecessarily at fmax for a long time.
Any reason to not set the OPP associated with the new max capacity
request immediately, regardless of what CPU is driving it?
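One way to avoid the stale maximum would be to recompute the domain
maximum over all CPUs' requests on every update and act on that,
regardless of which CPU made the update. A toy model of that policy
(the per-CPU array and function names are made up, not the actual
cpufreq_sched code):

```c
/*
 * Toy model: on any capacity update, recompute the frequency domain's
 * required capacity as the max over all per-CPU requests, instead of
 * bailing when the updating CPU is not the new maximum. Names and the
 * 2-CPU domain are illustrative only.
 */
#define NR_DOMAIN_CPUS 2

static int cap_request[NR_DOMAIN_CPUS];

static int domain_capacity(void)
{
	int i, max = 0;

	for (i = 0; i < NR_DOMAIN_CPUS; i++)
		if (cap_request[i] > max)
			max = cap_request[i];
	return max;
}

/* Returns the capacity that should drive the OPP after this update. */
static int set_cap(int cpu, int capacity)
{
	cap_request[cpu] = capacity;
	return domain_capacity();
}
```

In the 4-step scenario above, step #3 (CPU0 dropping to 0) would then
immediately yield a domain capacity of 10 rather than leaving the
cluster pinned at fmax.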
thanks,
Steve
Hi Dietmar, Juri,
I'm evaluating EAS RFCv5 on Qualcomm SoCs. I have a question about the
CPU invariant utilization tracking in __update_entity_load_avg.
I have Juri's arch-support patches for arm64 such that
arch_scale_cpu_capacity(cpu)
- returns the maximum capacity of the CPU from the energy model
Consider a dual-cluster system that has equally capable CPUs in both
clusters with each cluster on its own clock source, but with different
max OPP levels for each cluster. If both the clusters are at the same
OPP level, the lower-max-opp cluster would accumulate utilization_avg at
a slower rate. This doesn't seem right; at the same frequency, a task
should exhibit equal performance on either cluster since the CPUs are
otherwise equally capable. On such a system scale_cpu should be 1024 for
both clusters at least in the context of utilization tracking. What do
you think?
Also, please confirm my understanding with respect to traditional
big.LITTLE, where utilization accumulates more slowly on a little CPU
due to the CPU invariance factor. I can understand the frequency scaling
factor: a task consumes more absolute CPU cycles at a higher frequency.
Now, given the same frequency scaling invariance factor (say the little
CPU's is 500MHz/1000MHz and the big CPU's is 1000MHz/2000MHz), we still
want the little CPU to accumulate utilization more slowly because the
amount of work done (in IPC terms, perhaps) is less on the little CPU?
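For reference, the scaling in question has roughly this shape. This is a
standalone model of the two invariance factors, not the actual
__update_entity_load_avg code:

```c
/*
 * Model of how a PELT time delta is scaled by the frequency invariance
 * factor (cur_freq / max_freq) and the CPU invariance factor
 * (arch_scale_cpu_capacity), both expressed out of 1024. Standalone
 * illustration only.
 */
#define SCHED_CAPACITY_SCALE 1024

static unsigned long scale_delta(unsigned long delta,
				 unsigned long scale_freq,
				 unsigned long scale_cpu)
{
	delta = (delta * scale_freq) >> 10; /* frequency invariance */
	delta = (delta * scale_cpu) >> 10;  /* CPU invariance */
	return delta;
}
```

With both clusters at the same frequency-invariance factor of 512/1024,
scale_delta(1024, 512, 1024) gives 512 while scale_delta(1024, 512, 512)
gives 256, i.e. the lower-max-OPP cluster accumulates utilization at
half the rate for the same task, which is the asymmetry described above.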
Thanks,
Vikram
--
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the
Code Aurora Forum, hosted by The Linux Foundation.
Hi Daniel,
I've reviewed the draft doc you sent on cpuidle/cpufreq integration. Things have progressed a bit since the initial round of comments in the doc in April, also I thought it'd be good to open the discussion on eas-dev, so I figured I'd try and summarize the main points of the doc here and comment on them. Apologies if I mis-state anything, please correct me if necessary.
Some high level thoughts:
- I have yet to see a platform where race to idle is a win for power. Because of the exponential shape of the power/perf curves and other issues such as random wakeups interrupting deep sleep, it's always been better in my experience to run at as low an OPP as possible within the performance requirements of the workload. As a general policy at least.
- The validation tests mentioned in the doc seemed to be focused uniformly on performance but I think power measurements must be given equal consideration and mention in validating any of these changes.
My comments on each of the individual proposals:
1. Managing frequency during idle when blocked load is high.
The proposal in this section was to keep the frequency unchanged when entering idle and there is significant blocked load. Blocked load has since been included in the utilization metric which determines frequency. But the policy in sched-freq on what to do when the last task on a runqueue blocks (i.e. the CPU goes idle) is still evolving.
Currently the CPU's CFS capacity vote in the frequency domain is passively dropped when that CPU goes idle. It zeros out its capacity vote but does not trigger a recalculation of the frequency domain's new overall required capacity or set a new frequency. This means that the frequency will remain as-is until a different event occurs which forces re-evaluation of the CPU capacity votes in the frequency domain. I won't go into enumerating those events here, suffice it to say I don't think the current policy will work. It's possible for the CPU to stay at an elevated frequency for far too long which would have an unacceptable power impact.
I'm also concerned about the way blocked load is included in the utilization metric and potentially keying off that for frequency during periods of idle. It's certainly a more power-hungry policy than what is in place today, plus there's really no way to tune it since it's part of the per-entity load tracking scheme. The interactive governor had a tunable (the slack timer) which controlled how long you could sit idle at a frequency greater than fmin.
Schedtune aims to provide a per-task tunable which can boost/scale up the load value for that task calculated by PELT. So that would provide some mechanism to tweak this, although it affects the task's contribution at all times rather than only when it is blocked. It is also currently only built to inflate/scale up a task's demand rather than decrease it.
2. Make the idle task in charge of changing the frequency to minimum
Proposal is to have idle main loop set frequency to minimum, do idle, and then restore frequency coming out of idle.
My thinking is that when we enter idle, somewhere there has to be a decision made as to whether it is worth it to change frequency. The input for that decision could include
- the current frequency
- the energy data of the target (power consumption at each frequency at the C-state we will be entering)
- the expected duration of the idle period
- the latency of changing between the two frequencies in question, as well as the latency of C-state entry and exit
- performance and latency requirements for the currently blocked tasks
The idle task seems to me like a reasonable place for the decision logic, which could then call some not-yet-existent API into sched-freq to ask for the frequency change. Sched class capacity votes would be retained and reinstated on idle exit. The temporary idle frequency would respect any min or max freq constraints that had been previously registered in cpufreq.
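The energy side of that decision could be sketched as a simple
break-even test. All inputs and numbers below are hypothetical platform
data, not an existing API; latency/QoS constraints would be checked
separately before this:

```c
/*
 * Sketch: is it worth dropping to fmin for this idle period? Compares
 * the energy saved by idling at the lower OPP against the cost of two
 * frequency transitions (down on entry, back up on exit). All inputs
 * are hypothetical platform/energy-model data.
 */
static int worth_dropping_freq(unsigned long expected_idle_us,
			       unsigned long freq_transition_us,
			       unsigned long idle_power_cur_uw,
			       unsigned long idle_power_min_uw,
			       unsigned long transition_cost_uj)
{
	unsigned long long saved_uj;

	/* The idle period must fit both transitions at all. */
	if (expected_idle_us < 2 * freq_transition_us)
		return 0;

	/* Energy saved by sitting at the lower OPP for the idle period. */
	saved_uj = (unsigned long long)(idle_power_cur_uw - idle_power_min_uw)
		   * expected_idle_us / 1000000ULL;

	return saved_uj > 2ULL * transition_cost_uj;
}
```

The same predicate naturally covers point #3 below, since an expected
sleep shorter than the transition latency fails the first check.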
3. The expected sleep duration is less than the frequency transition latency
Agreed this needs to be considered, a full implementation of the logic I mentioned in #2 would cover this one.
4. Align frequency with task priority
I'd agree with MikeT's comment in the doc that changing frequency according to the niceness of the running tasks would be a pretty big change in semantics that we should stay away from. Schedtune may offer a way to negatively bias the performance of a task, although currently it only can be used to inflate a task's performance demands.
5. Consider CPU frequency when evaluating wake-up latency
Agreed this should be taken into account. Again would be part of the logic I mentioned in #2 I think, deciding whether we can even change frequencies during idle at all (possibly ruled out due to QoS constraints), and then if it is possible, whether it is worth it from an energy standpoint.
6. Multiple freq governors as input to cpufreq core
I'd agree this seems like it'd have value but barring a major shift in priorities, it's going to be a while before this would get focus given the effort required just to get the basic sched-freq feature merged.
7. Increase freq of little cluster when freq of big cluster increases, to make migrations back to little faster.
This wouldn't be for everyone IMO due to the power impact. I'm not sure where exactly this policy would go, especially given the current crusade in the community against plugin governors and tunables. Perhaps in the platform-level cpufreq driver.
thanks,
Steve
It was pointed out in today's technical syncup that sched-dvfs should rely on its own static key, and that static key should not indicate a dependency on EAS.
The static key was named sched_energy_freq. I've renamed/shortened it to sched_freq. I also think we should try and use this name for the feature in general since I believe fewer people are familiar with the term/acronym DVFS. A git grep -i dvfs suggests it's almost exclusively an ARM term, at least as far as the kernel source goes, with a little usage in sound codecs and GPUs.
If anyone has a better name or concerns please share...
thanks,
Steve