Hi Steve,
I took a week's holiday, sorry for the late response.
On Sun, Nov 15, 2015 at 11:33:02AM -0800, Steve Muckle wrote:
> Hi Leo,
>
> On 11/11/2015 07:15 AM, Leo Yan wrote:
> > If we pack task_A onto a busy CPU, we may pay a power penalty caused
> > by the higher OPP; on the other hand, if we spread task_A to an idle
> > CPU (whose cluster may also be in an idle state), we pay a power
> > penalty for powering up an extra power domain.
> >
> > So I think we can enhance the energy calculation when waking up a
> > task in energy_aware_wake_cpu(). For example, we can select two
> > candidate CPUs for the woken task: one in the same scheduling group
> > as the task's original CPU, and another in a different scheduling
> > group (the group with the best, or equal, power efficiency in the
> > system). Then we can decide whether to spread the task to a different
> > cluster, to a different CPU in the same cluster, or to leave it on
> > the original CPU.
> When placing a waking task I'd think you only need to evaluate one CPU
> in each cluster:
>
> - If there are utilized CPUs in the cluster with enough free capacity
>   at the current OPP to fit the waking task, then one of these CPUs
>   should be evaluated as the candidate (it's debatable which one IMO,
>   perhaps the one with the most capacity, but perhaps also the least
>   to better pack CPUs).
> - If no busy CPUs have enough free capacity at the current OPP to
>   contain the waking task, then the least utilized CPU in the cluster.
>
> Looking briefly at patch 32 of EASv5 (energy-aware task placement)
> this seems to be what is done, but we're only evaluating the smallest
> cluster that can fit the task, and we are also evaluating the task's
> prev CPU.
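If I read the per-cluster rule correctly, it amounts to something like
the following sketch (Python pseudocode; all the names here are
illustrative, not actual EAS code):

```python
from dataclasses import dataclass

@dataclass
class Cpu:
    util: int       # current utilization of the CPU
    cap_curr: int   # capacity available at the current OPP

def free_cap(cpu):
    return cpu.cap_curr - cpu.util

def pick_candidate(cluster, task_util):
    """Pick the one CPU in the cluster worth evaluating for the task."""
    # Busy CPUs that can still fit the task at their current OPP.
    fitting = [i for i, c in enumerate(cluster)
               if c.util > 0 and free_cap(c) >= task_util]
    if fitting:
        # Debatable choice: here, the one with the most free capacity.
        return max(fitting, key=lambda i: free_cap(cluster[i]))
    # Otherwise, fall back to the least utilized CPU in the cluster.
    return min(range(len(cluster)), key=lambda i: cluster[i].util)
```

So each cluster contributes exactly one candidate, and the final choice
is made between those candidates.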
In patch 32 of EASv5 there is an assumption: it will "find group with
sufficient capacity". This works well on a big.LITTLE system, but it
will spread tasks across two clusters on an SMP platform.
> Perhaps we could stop evaluating the prev CPU (if it's not the best
> CPU in the cluster as described above). Instead, we could evaluate
> placing the task in at least one other cluster in the system, and
> choose between those options.
Thanks for the suggestion; agreed. We can select CPUs with the below
priority (from high to low):

- Select CPUs in the most power efficient groups (so there may be more
  than one group on an SMP platform);
- Select CPUs with the lowest OPP that meets the capacity requirement;
- Select CPUs with the highest utilization (as you said, here we try to
  use the least free one; I think this is more suitable for the rt-app
  cases, since even rt-app-6 will take 35% of a CPU's utilization when
  the CPU runs at the lowest OPP);
- Select the CPU with the lowest CPU ID.

If you see no obvious logic error here, I will try it in the next 1~2
weeks and post results after finishing the related testing.
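The priority order above could be sketched as a single comparison key
(Python pseudocode; the fields and names are illustrative only):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    cpu_id: int
    group_rank: int  # power-efficiency rank of the CPU's group (0 = best)
    opp: int         # lowest OPP index that meets the capacity requirement
    util: int        # current utilization

def select_cpu(candidates):
    # Priority, high to low: most efficient group, lowest sufficient
    # OPP, highest utilization (to pack), lowest CPU ID as tie-break.
    return min(candidates,
               key=lambda c: (c.group_rank, c.opp, -c.util, c.cpu_id))
```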
> > I also observed another possible scenario. For example, tasks may
> > already have been packed onto several CPUs; though every task's
> > workload is not very high (such as rt-app-13), together they
> > accumulate load on one CPU, so finally the CPU runs at a high OPP.
> >
> > So if EAS picks only one of these tasks and tries to migrate it to
> > another CPU, it usually will not migrate. The reason is that even if
> > the target CPU is already running at a high OPP, it usually still
> > has capacity to run more work at the highest OPP; but energy_diff()
> > will compute a worse power result after the OPP increase, so the
> > task will stay on its original CPU. [1][2]
> Hopefully applying a policy like the one above would prevent us from
> getting into a situation like this.
> > Even picking one idle CPU from another cluster cannot resolve this
> > issue, because if we spread a task to another cluster, the original
> > cluster's and CPU's OPP will not decrease, yet the new cluster and
> > CPU introduce extra power.
> I would've expected the current algorithm to deal with this one
> properly. Although the original cluster OPP doesn't decrease, the
> utilization of it does, so the power consumed by it goes down. The
> power should decrease by more than the power increase in the new
> cluster, assuming the new cluster is indeed operating more
> efficiently.
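For illustration only, that argument can be checked with a toy energy
model (the numbers below are invented, not measurements): energy scales
with busy time, and busy time at a given OPP scales with
util / capacity.

```python
def busy_energy(util, cap_at_opp, power_at_opp):
    # Toy model: energy ~ (busy time) * (power at the OPP),
    # where busy time ~ util / capacity at that OPP.
    return util / cap_at_opp * power_at_opp

# Source cluster stuck at a high OPP: capacity 1024, busy power 600.
# Moving a util-200 task away leaves the OPP unchanged, but the
# cluster is busy less often, so its energy still drops.
saved = busy_energy(200, 1024, 600)
# Destination cluster at a low, efficient OPP: capacity 512, power 150.
added = busy_energy(200, 512, 150)
# saved (~117) > added (~59): the migration wins even without a
# source-side OPP drop, provided the destination is more efficient.
```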
> > So in this case, we should take a global view and define some
> > criteria:
> >
> > - CPUs are not staying at the lowest OPP, but the system has idle
> >   CPUs;
> > - a CPU's lower OPP can meet the capacity requirement for the
> >   average load of all tasks;
> > - a CPU's lower OPP can meet the capacity requirement for the
> >   highest-load task in the system.
> >
> > If these criteria are met, EAS can select an idle CPU from the
> > scheduling group with the best power efficiency.
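Those three criteria could be sketched as a single predicate (Python
pseudocode; the names and data layout are illustrative only):

```python
def should_spread(cpus, task_loads, lower_opp_cap):
    """cpus: list of (at_lowest_opp, is_idle) flags, one per CPU;
    task_loads: per-task load; lower_opp_cap: capacity at the lower OPP.
    """
    # (a) some CPU is above the lowest OPP while idle CPUs exist;
    above_lowest_opp = any(not at_lowest for at_lowest, _ in cpus)
    have_idle = any(idle for _, idle in cpus)
    # (b) the lower OPP fits the average task load;
    avg_fits = sum(task_loads) / len(task_loads) <= lower_opp_cap
    # (c) the lower OPP fits the highest-load task.
    max_fits = max(task_loads) <= lower_opp_cap
    return above_lowest_opp and have_idle and avg_fits and max_fits
```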
> Though I agree this may address the issue you mentioned, it'd be nice
> to avoid adding special cases that we must test for. Do you think it's
> possible to solve this problem with a more generic tweak to the
> wake-up logic like I proposed above?
After applying the above logic, it may help the rt-app cases, since
the rt-app cases have consistent load. But in reality a task's
instantaneous load may change: several tasks may be woken up with low
load, so EAS will pack them; after the tasks run for a while, their
load will increase, and EAS should detect this and spread the tasks
from a global view.

So EAS will select the most power efficient CPU for the woken task,
but this is purely from the task's perspective. I just wonder whether
we need to do one more calculation at the whole-system level.

I'd like to take this with lower priority; we can resolve the first
issue based on your above suggestion :).
Thanks,
Leo Yan