Hi Leo,
On 11/11/2015 07:15 AM, Leo Yan wrote:
> If we pack task_A onto an already-busy CPU, we may pay a power penalty from running at a higher OPP; on the other hand, if we spread task_A to an idle CPU (whose cluster may itself be in an idle state), we pay a power penalty for powering up the extra power domain.
> So I think we can enhance the energy calculation algorithm when waking up a task in energy_aware_wake_cpu(). For example, we could select two candidate CPUs for the waking task: one in the same sched group as the task's original CPU, and one in another sched group (the group with the best, or equal, power efficiency in the system). Then we can finally decide whether we need to spread tasks to a different cluster, spread to a different CPU in the same cluster, or just stay on the original CPU.
When placing a waking task I'd think you only need to evaluate one CPU in each cluster:
- If there are utilized CPUs in the cluster with enough free capacity at the current OPP to fit the waking task, then one of these CPUs should be evaluated as the candidate (it's debatable which one IMO; perhaps the one with the most spare capacity, but perhaps the least, to better pack CPUs).
- If no busy CPU has enough free capacity at the current OPP to contain the waking task, then evaluate the least-utilized CPU in the cluster.
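As a rough illustration, the per-cluster candidate selection could look something like the sketch below. This is not kernel code: the struct and the function name are invented for the example, and it arbitrarily picks the most-spare-capacity variant of the first rule (the debatable choice above).

```c
/* Hypothetical, simplified per-CPU state; these are illustrative
 * fields, not the kernel's struct rq / sched_group data. */
struct cpu_stat {
	unsigned long util;	/* current utilization of the CPU */
	unsigned long cap_curr;	/* capacity at the current OPP */
};

/*
 * Pick one candidate CPU in a cluster for a waking task:
 *  - prefer a utilized CPU with enough spare capacity at its current
 *    OPP (here, the one with the most spare capacity);
 *  - otherwise fall back to the least-utilized CPU.
 * Returns an index into @cpus, or -1 if the cluster is empty.
 */
int pick_cluster_candidate(const struct cpu_stat *cpus, int nr_cpus,
			   unsigned long task_util)
{
	int best_fit = -1, least_util = -1;
	unsigned long best_spare = 0;
	int i;

	for (i = 0; i < nr_cpus; i++) {
		unsigned long spare = cpus[i].cap_curr > cpus[i].util ?
				      cpus[i].cap_curr - cpus[i].util : 0;

		/* busy CPU that can absorb the task at its current OPP */
		if (cpus[i].util > 0 && spare >= task_util &&
		    (best_fit < 0 || spare > best_spare)) {
			best_fit = i;
			best_spare = spare;
		}

		if (least_util < 0 || cpus[i].util < cpus[least_util].util)
			least_util = i;
	}

	return best_fit >= 0 ? best_fit : least_util;
}
```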
Looking briefly at patch 32 of EASv5 (energy-aware task placement), this seems to be what is done, but we're only evaluating the smallest cluster that can fit the task, and we are also evaluating the task's prev CPU.
Perhaps we could stop evaluating the prev CPU (when it isn't the best CPU in its cluster by the criteria above). Instead, we could evaluate placing the task in at least one other cluster in the system, and choose between those options.
> I also observed another possible scenario here. For example, if tasks have already been packed onto several CPUs, then even though each task's load is not very high (as in rt-app-13), together they accumulate load on one CPU, so that CPU ends up running at a high OPP.
> So if EAS picks one of these tasks and tries to migrate it to another CPU, the migration usually will not happen. The reason is that even though the target CPU is running at a high OPP, it usually still has capacity to run more work at the highest OPP; so energy_diff() sees a worse power result after the OPP increase, and the task stays on its original CPU. [1][2]
Hopefully applying a policy like the one above would prevent us from getting into a situation like this.
> Even picking an idle CPU from another cluster cannot resolve this issue, because if we spread the task to another cluster, the OPP of the original cluster and CPU will not decrease, and we introduce extra power for the new cluster and CPU.
I would've expected the current algorithm to deal with this one properly. Although the original cluster's OPP doesn't decrease, its utilization does, so the power it consumes goes down. The power should decrease by more than the power increase in the new cluster, assuming the new cluster is indeed operating more efficiently.
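To make that concrete with made-up numbers: in the toy model below, energy is just the busy fraction at the current OPP times that OPP's power cost. This is not the real EAS energy model and the OPP numbers are invented, but it shows how moving a 200-unit task off a cluster stuck at a high OPP can save more energy at the source than it adds at a more efficient destination, even though the source OPP stays put.

```c
/* Toy energy model, illustrative only: energy is the busy fraction at
 * the current OPP (util/cap, scaled by 1024) times that OPP's power
 * cost.  Not the kernel's actual energy model tables. */
struct opp {
	unsigned long cap;	/* compute capacity at this OPP */
	unsigned long power;	/* power cost at this OPP */
};

static unsigned long cluster_energy(unsigned long util, const struct opp *o)
{
	return (util * 1024 / o->cap) * o->power / 1024;
}
```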
> So in this case, we should take a global view and define some criteria:
> - CPUs are not staying at the lowest OPP, but the system still has idle CPUs;
> - a CPU's lower OPP can meet the capacity requirement of all tasks' average load;
> - a CPU's lower OPP can meet the capacity requirement of the highest-load task in the system.
> If these criteria are met, EAS can select an idle CPU from the sched group with the best power efficiency.
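For what it's worth, those criteria could be written as a simple predicate along these lines. The snapshot struct and its fields are invented for illustration; a real implementation would have to derive them from runqueue and OPP state.

```c
#include <stdbool.h>

/* Hypothetical system-wide snapshot; all fields are illustrative. */
struct sys_snapshot {
	int nr_idle_cpus;		/* idle CPUs in the system */
	bool any_cpu_above_min_opp;	/* some CPU runs above its lowest OPP */
	unsigned long lower_opp_cap;	/* capacity at the next lower OPP */
	unsigned long avg_task_load;	/* average load across tasks */
	unsigned long max_task_load;	/* load of the heaviest task */
};

/* True when the criteria above suggest spreading to an idle CPU in the
 * most power-efficient sched group. */
bool should_spread_to_idle(const struct sys_snapshot *s)
{
	return s->nr_idle_cpus > 0 &&
	       s->any_cpu_above_min_opp &&
	       s->lower_opp_cap >= s->avg_task_load &&
	       s->lower_opp_cap >= s->max_task_load;
}
```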
Though I agree this may address the issue you mention, it would be nice to avoid adding special cases that we must test for. Do you think it's possible to solve this problem with a more generic tweak to the wake-up logic, like the one I proposed above?
thanks,
Steve