Hi all,
At Connect, Steve also raised a related question: should we pack tasks at a higher OPP, or spread tasks at a lower OPP? I'd like to summarize this and combine it with recent profiling results:
- When task_A wakes up, the scheduler needs to decide whether to pack task_A onto a busy CPU or spread it to an idle CPU.
Packing task_A onto a busy CPU may introduce a power penalty from running at a higher OPP; on the other hand, spreading task_A to an idle CPU (whose cluster may itself be idle) introduces a power penalty from powering up an extra power domain.
So I think we can enhance the energy calculation at task wakeup in energy_aware_wake_cpu(). For example, we could select two candidate CPUs for the waking task: one in the same sched group as the task's original CPU, and one in another sched group (the group with the best, or equal, power efficiency in the system). Comparing these would tell us whether to spread the task to a different cluster, spread it to a different CPU within the same cluster, or keep it on its original CPU.
- I have also observed another possible scenario. If tasks have already been packed onto a few CPUs, then even though each task's workload is not very high (as in rt-app-13), their load accumulates on one CPU and that CPU ends up running at a high OPP.
If EAS then picks one of these tasks and evaluates migrating it to another CPU, the migration usually does not happen. Even though the target CPU runs at a high OPP, it typically still has capacity for more work at its highest OPP, so energy_diff() reports a worse energy result after the OPP increase and the task stays on its original CPU. [1][2]
Even picking an idle CPU from another cluster does not resolve this issue. If the task spreads to another cluster, the original cluster's and CPU's OPP do not decrease, while the new cluster and CPU add extra power.
So in this case we should take a global view and define some criteria:
- CPUs are not at the lowest OPP, but the system has idle CPUs;
- a CPU's lower OPP can meet the capacity requirement for the average load of all tasks;
- a CPU's lower OPP can meet the capacity requirement for the highest-load task in the system.
If these criteria are met, EAS can select an idle CPU from the sched group with the best power efficiency.
I think you may have discussed this topic already, so before I start trying these ideas I want to check whether I missed an earlier discussion. Any suggestions are welcome.
[1] http://people.linaro.org/~leo.yan/eas_profiling/eas_tasks_in_one_cluster_hig...
[2] http://people.linaro.org/~leo.yan/eas_profiling/eas_tasks_in_one_cluster_ene...
Thanks, Leo Yan
Hi Leo,
On 11/11/2015 07:15 AM, Leo Yan wrote:
Packing task_A onto a busy CPU may introduce a power penalty from running at a higher OPP; on the other hand, spreading task_A to an idle CPU (whose cluster may itself be idle) introduces a power penalty from powering up an extra power domain.
So I think we can enhance the energy calculation at task wakeup in energy_aware_wake_cpu(). For example, we could select two candidate CPUs for the waking task: one in the same sched group as the task's original CPU, and one in another sched group (the group with the best, or equal, power efficiency in the system). Comparing these would tell us whether to spread the task to a different cluster, spread it to a different CPU within the same cluster, or keep it on its original CPU.
When placing a waking task I'd think you only need to evaluate one CPU in each cluster:
- If there are utilized CPUs in the cluster with enough free capacity at the current OPP to fit the waking task, then one of these CPUs should be evaluated as the candidate (it's debatable which one IMO, perhaps the one with the most capacity, but perhaps also the least to better pack CPUs).
- If no busy CPUs have enough free capacity at the current OPP to contain the waking task, then the least utilized CPU in the cluster.
Looking briefly at patch 32 of EASv5 (energy-aware task placement) this seems to be what is done but we're only evaluating the smallest cluster that can fit the task, and we are also evaluating the task's prev CPU.
Perhaps we could stop evaluating the prev CPU (if it's not the best CPU in the cluster as described above). Instead, we could evaluate placing the task in at least one other cluster in the system, and choose between those options.
I have also observed another possible scenario. If tasks have already been packed onto a few CPUs, then even though each task's workload is not very high (as in rt-app-13), their load accumulates on one CPU and that CPU ends up running at a high OPP.
If EAS then picks one of these tasks and evaluates migrating it to another CPU, the migration usually does not happen. Even though the target CPU runs at a high OPP, it typically still has capacity for more work at its highest OPP, so energy_diff() reports a worse energy result after the OPP increase and the task stays on its original CPU. [1][2]
Hopefully applying a policy like the one above would prevent us from getting into a situation like this.
Even picking an idle CPU from another cluster does not resolve this issue. If the task spreads to another cluster, the original cluster's and CPU's OPP do not decrease, while the new cluster and CPU add extra power.
I would've expected the current algorithm to deal with this one properly. Although the original cluster OPP doesn't decrease, the utilization of it does, so the power consumed by it goes down. The power should decrease by more than the power increase in the new cluster, assuming the new cluster is indeed operating more efficiently.
So in this case we should take a global view and define some criteria:
- CPUs are not at the lowest OPP, but the system has idle CPUs;
- a CPU's lower OPP can meet the capacity requirement for the average load of all tasks;
- a CPU's lower OPP can meet the capacity requirement for the highest-load task in the system.
If these criteria are met, EAS can select an idle CPU from the sched group with the best power efficiency.
Though I agree this may address the issue you mentioned it'd be nice to avoid adding special cases that we must test for. Do you think it's possible to solve this problem with a more generic tweak to the wake up logic like I proposed above?
thanks, Steve
Hi Steve,
I took a week's holiday; sorry for the late response.
On Sun, Nov 15, 2015 at 11:33:02AM -0800, Steve Muckle wrote:
Hi Leo,
On 11/11/2015 07:15 AM, Leo Yan wrote:
Packing task_A onto a busy CPU may introduce a power penalty from running at a higher OPP; on the other hand, spreading task_A to an idle CPU (whose cluster may itself be idle) introduces a power penalty from powering up an extra power domain.
So I think we can enhance the energy calculation at task wakeup in energy_aware_wake_cpu(). For example, we could select two candidate CPUs for the waking task: one in the same sched group as the task's original CPU, and one in another sched group (the group with the best, or equal, power efficiency in the system). Comparing these would tell us whether to spread the task to a different cluster, spread it to a different CPU within the same cluster, or keep it on its original CPU.
When placing a waking task I'd think you only need to evaluate one CPU in each cluster:
- If there are utilized CPUs in the cluster with enough free capacity at the current OPP to fit the waking task, then one of these CPUs should be evaluated as the candidate (it's debatable which one IMO, perhaps the one with the most capacity, but perhaps also the least to better pack CPUs).
- If no busy CPUs have enough free capacity at the current OPP to contain the waking task, then the least utilized CPU in the cluster.
Looking briefly at patch 32 of EASv5 (energy-aware task placement) this seems to be what is done but we're only evaluating the smallest cluster that can fit the task, and we are also evaluating the task's prev CPU.
In patch 32 of EASv5 there is an assumption: it will "find group with sufficient capacity". This works well on a big.LITTLE system, but on an SMP platform it will spread tasks across two clusters.
Perhaps we could stop evaluating the prev CPU (if it's not the best CPU in the cluster as described above). Instead, we could evaluate placing the task in at least one other cluster in the system, and choose between those options.
Thanks for the suggestion, and agreed. We can select CPUs with the following priority (from high to low):
- select CPUs in the most power-efficient groups (on an SMP platform there may be more than one such group);
- select CPUs that can meet the capacity requirement at the lowest OPP;
- select CPUs with the highest utilization (as you said, we should try to use the fewest CPUs; I think this suits the rt-app cases, since even rt-app-6 takes 35% of a CPU's utilization when the CPU runs at the lowest OPP);
- select the CPU with the lowest CPU ID.
If you see no obvious logic error here, I will try this in the next one to two weeks and post results after finishing the related testing.
I have also observed another possible scenario. If tasks have already been packed onto a few CPUs, then even though each task's workload is not very high (as in rt-app-13), their load accumulates on one CPU and that CPU ends up running at a high OPP.
If EAS then picks one of these tasks and evaluates migrating it to another CPU, the migration usually does not happen. Even though the target CPU runs at a high OPP, it typically still has capacity for more work at its highest OPP, so energy_diff() reports a worse energy result after the OPP increase and the task stays on its original CPU. [1][2]
Hopefully applying a policy like the one above would prevent us from getting into a situation like this.
Even picking an idle CPU from another cluster does not resolve this issue. If the task spreads to another cluster, the original cluster's and CPU's OPP do not decrease, while the new cluster and CPU add extra power.
I would've expected the current algorithm to deal with this one properly. Although the original cluster OPP doesn't decrease, the utilization of it does, so the power consumed by it goes down. The power should decrease by more than the power increase in the new cluster, assuming the new cluster is indeed operating more efficiently.
So in this case we should take a global view and define some criteria:
- CPUs are not at the lowest OPP, but the system has idle CPUs;
- a CPU's lower OPP can meet the capacity requirement for the average load of all tasks;
- a CPU's lower OPP can meet the capacity requirement for the highest-load task in the system.
If these criteria are met, EAS can select an idle CPU from the sched group with the best power efficiency.
Though I agree this may address the issue you mentioned it'd be nice to avoid adding special cases that we must test for. Do you think it's possible to solve this problem with a more generic tweak to the wake up logic like I proposed above?
Applying the above logic may help the rt-app cases, since rt-app tasks have consistent load. But in reality a task's instantaneous load changes: several tasks woken with low load will be packed by EAS, and after running for a while their load grows, at which point EAS should detect this and spread the tasks from a global view.
So EAS selects the most power-efficient CPU for the waking task purely from the task's perspective; I just wonder whether we need one more calculation at the whole-system level.
I'd like to take this at a lower priority; we can resolve the first issue based on your suggestion above :).
Thanks, Leo Yan
Hi Leo,
On 11/24/2015 07:55 PM, Leo Yan wrote:
Thanks for the suggestion, and agreed. We can select CPUs with the following priority (from high to low):
- select CPUs in the most power-efficient groups (on an SMP platform there may be more than one such group);
Let's say we are placing a small task on a big.LITTLE system, and that small task could fit on both the big and little cluster.
Does the above statement imply that we would not evaluate the best CPU in the big cluster? I'd think we should, in addition to the best CPU in the little cluster, and decide between those two options. This is because we can have cases where the big cluster is actually the most efficient place to run a task due to current task loads and the OPP of the little cluster.
- select CPUs that can meet the capacity requirement at the lowest OPP;
- select CPUs with the highest utilization (as you said, we should try to use the fewest CPUs; I think this suits the rt-app cases, since even rt-app-6 takes 35% of a CPU's utilization when the CPU runs at the lowest OPP);
- select the CPU with the lowest CPU ID.
If you see no obvious logic error here, I will try this in the next one to two weeks and post results after finishing the related testing.
Could you post your draft changes here prior to testing? It'll help ensure I'm following your proposal correctly.
...
So in this case, should consider as a global view and define some criteria:
- CPUs don't stay on lowest OPP, but system have idle CPUs;
- CPU's lower OPP can meet capacity requirement for all task's average load;
- CPU's lower OPP can meet capacity requirement for the highest load task in system.
If meet these criteria, EAS can select idle CPU from schedule group with best power efficiency.
Though I agree this may address the issue you mentioned it'd be nice to avoid adding special cases that we must test for. Do you think it's possible to solve this problem with a more generic tweak to the wake up logic like I proposed above?
Applying the above logic may help the rt-app cases, since rt-app tasks have consistent load. But in reality a task's instantaneous load changes: several tasks woken with low load will be packed by EAS, and after running for a while their load grows, at which point EAS should detect this and spread the tasks from a global view.
So EAS selects the most power-efficient CPU for the waking task purely from the task's perspective; I just wonder whether we need one more calculation at the whole-system level.
I'd like to take this at a lower priority; we can resolve the first issue based on your suggestion above :).
Sure - I'm still of the opinion that implementing the logic above would hopefully address this, because as the tasks run and grow over time they are still waking up and sleeping. As they wake up they would at some point be spread due to the preference in the above logic to utilize all CPUs at the current OPP before making a placement that would raise the OPP. But we should see for sure if you're going to be evaluating the above proposal.
cheers, Steve
Hi Steve,
On Wed, Nov 25, 2015 at 10:41:32AM -0800, Steve Muckle wrote:
On 11/24/2015 07:55 PM, Leo Yan wrote:
[...]
Thanks for the suggestion, and agreed. We can select CPUs with the following priority (from high to low):
- select CPUs in the most power-efficient groups (on an SMP platform there may be more than one such group);
Let's say we are placing a small task on a big.LITTLE system, and that small task could fit on both the big and little cluster.
Does the above statement imply that we would not evaluate the best CPU in the big cluster? I'd think we should, in addition to the best CPU in the little cluster, and decide between those two options. This is because we can have cases where the big cluster is actually the most efficient place to run a task due to current task loads and the OPP of the little cluster.
Okay, we should select a CPU from every cluster and compare them (BTW, a general implementation should handle systems with more than two clusters).
- select CPUs that can meet the capacity requirement at the lowest OPP;
- select CPUs with the highest utilization (as you said, we should try to use the fewest CPUs; I think this suits the rt-app cases, since even rt-app-6 takes 35% of a CPU's utilization when the CPU runs at the lowest OPP);
- select the CPU with the lowest CPU ID.
If you see no obvious logic error here, I will try this in the next one to two weeks and post results after finishing the related testing.
Could you post your draft changes here prior to testing? It'll help ensure I'm following your proposal correctly.
Sure, I will post the code here for you to review first. I have another task in hand, so I will do this next week.
[...]
Thanks, Leo Yan
Hi Steve,
Sorry for the somewhat late reply.
On Wed, Nov 25, 2015 at 10:41:32AM -0800, Steve Muckle wrote:
On 11/24/2015 07:55 PM, Leo Yan wrote:
[...]
Let's say we are placing a small task on a big.LITTLE system, and that small task could fit on both the big and little cluster.
Does the above statement imply that we would not evaluate the best CPU in the big cluster? I'd think we should, in addition to the best CPU in the little cluster, and decide between those two options. This is because we can have cases where the big cluster is actually the most efficient place to run a task due to current task loads and the OPP of the little cluster.
- select CPUs that can meet the capacity requirement at the lowest OPP;
- select CPUs with the highest utilization (as you said, we should try to use the fewest CPUs; I think this suits the rt-app cases, since even rt-app-6 takes 35% of a CPU's utilization when the CPU runs at the lowest OPP);
- select the CPU with the lowest CPU ID.
If you see no obvious logic error here, I will try this in the next one to two weeks and post results after finishing the related testing.
Could you post your draft changes here prior to testing? It'll help ensure I'm following your proposal correctly.
Below is the code from our discussion; please help review. I also attached the patch in case you want to check the diff format.
---<8---
static int find_cpu_new_capacity(int cpu, unsigned long util)
{
        struct sched_domain *sd;
        struct sched_group_energy *sge;
        int idx;

        sd = rcu_dereference(per_cpu(sd_ea, cpu));
        sge = sd->groups->sge;

        for (idx = 0; idx < sge->nr_cap_states; idx++)
                if (sge->cap_states[idx].cap >= util)
                        break;

        if (idx == sge->nr_cap_states)
                idx = idx - 1;

        return idx;
}
static void find_best_cpu_in_sg(struct cpumask *mask, struct sched_group *sg,
                                struct task_struct *p)
{
        int min_opp = INT_MAX, max_usage = 0, new_usage;
        int target_cpu = -1, i;

        for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg)) {
                int opp;

                /*
                 * p's blocked utilization is still accounted for on prev_cpu
                 * so prev_cpu will receive a negative bias due to the double
                 * accounting. However, the blocked utilization may be zero.
                 */
                new_usage = get_cpu_usage(i) + task_utilization(p);
                opp = find_cpu_new_capacity(i, new_usage);

                /* Skip CPUs which would need a higher OPP */
                if (min_opp < opp)
                        continue;

                /* A CPU with a lower OPP: take it */
                if (min_opp > opp) {
                        min_opp = opp;
                        max_usage = new_usage;
                        target_cpu = i;
                        continue;
                }

                /* Same OPP: prefer the more utilized CPU */
                if (max_usage < new_usage) {
                        max_usage = new_usage;
                        target_cpu = i;
                        continue;
                }

                /* Tie: keep the lowest CPU ID */
                if (i < target_cpu) {
                        target_cpu = i;
                        continue;
                }
        }

        BUG_ON(target_cpu == -1);

        cpumask_set_cpu(target_cpu, mask);
}
static int find_power_efficient_cpu(struct cpumask *mask, struct task_struct *p)
{
        int i, target_cpu;
        int min_energy = 0, diff;
        struct energy_env eenv;

        target_cpu = task_cpu(p);

        for_each_cpu(i, mask) {
                if (i == task_cpu(p))
                        continue;

                memset(&eenv, 0, sizeof(eenv));
                eenv.usage_delta = task_utilization(p);
                eenv.src_cpu = task_cpu(p);
                eenv.dst_cpu = i;
                eenv.task = p;

                diff = energy_diff(&eenv);
                if (diff < min_energy) {
                        target_cpu = i;
                        min_energy = diff;
                }
        }

        return target_cpu;
}
static int energy_aware_wake_cpu(struct task_struct *p, int target)
{
        struct sched_domain *sd;
        struct sched_group *sg, *sg_target;
        int target_cpu;
        struct cpumask target_cpus;

        sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
        if (!sd)
                return target;

        sg = sd->groups;
        sg_target = sg;

        cpumask_clear(&target_cpus);

        do {
                find_best_cpu_in_sg(&target_cpus, sg, p);
        } while (sg = sg->next, sg != sd->groups);

        if (cpumask_empty(&target_cpus))
                cpumask_set_cpu(task_cpu(p), &target_cpus);

        target_cpu = find_power_efficient_cpu(&target_cpus, p);

        return target_cpu;
}
--->8---
Thanks, Leo Yan
Hi all,
On Thu, Dec 10, 2015 at 10:53:06AM +0800, Leo Yan wrote:
On Wed, Nov 25, 2015 at 10:41:32AM -0800, Steve Muckle wrote:
On 11/24/2015 07:55 PM, Leo Yan wrote:
[...]
Let's say we are placing a small task on a big.LITTLE system, and that small task could fit on both the big and little cluster.
Does the above statement imply that we would not evaluate the best CPU in the big cluster? I'd think we should, in addition to the best CPU in the little cluster, and decide between those two options. This is because we can have cases where the big cluster is actually the most efficient place to run a task due to current task loads and the OPP of the little cluster.
- select CPUs that can meet the capacity requirement at the lowest OPP;
- select CPUs with the highest utilization (as you said, we should try to use the fewest CPUs; I think this suits the rt-app cases, since even rt-app-6 takes 35% of a CPU's utilization when the CPU runs at the lowest OPP);
- select the CPU with the lowest CPU ID.
If you see no obvious logic error here, I will try this in the next one to two weeks and post results after finishing the related testing.
Could you post your draft changes here prior to testing? It'll help ensure I'm following your proposal correctly.
Below is the code from our discussion; please help review. I also attached the patch in case you want to check the diff format.
---<8---
static int find_cpu_new_capacity(int cpu, unsigned long util)
{
        struct sched_domain *sd;
        struct sched_group_energy *sge;
        int idx;

        sd = rcu_dereference(per_cpu(sd_ea, cpu));
        sge = sd->groups->sge;

        for (idx = 0; idx < sge->nr_cap_states; idx++)
                if (sge->cap_states[idx].cap >= util)
                        break;

        if (idx == sge->nr_cap_states)
                idx = idx - 1;

        return idx;
}
static void find_best_cpu_in_sg(struct cpumask *mask, struct sched_group *sg,
                                struct task_struct *p)
{
        int min_opp = INT_MAX, max_usage = 0, new_usage;
        int target_cpu = -1, i;

        for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg)) {
                int opp;

                /*
                 * p's blocked utilization is still accounted for on prev_cpu
                 * so prev_cpu will receive a negative bias due to the double
                 * accounting. However, the blocked utilization may be zero.
                 */
                new_usage = get_cpu_usage(i) + task_utilization(p);
When I continued profiling with this patch, I could not get the expected result: tasks are migrated in a mess once the patch is applied.
Target CPU selection depends strongly on CPU utilization, but the ftrace data shows cfs_rq::utilization_load_avg increasing sharply, which ultimately affects the migration decisions.
So in [2] we can see the task is first migrated to CPU2 by the energy calculation, but it ends up on CPU3 because CPU2's utilization rises sharply and meets the cpu_overutilized() condition.
The decay of CPU utilization makes sense, but utilization should increase step-wise while the CPU runs tasks. So I want to confirm: is it expected behavior for CPU utilization to jump sharply when a task is enqueued on the CPU's runqueue?
I saw there has been a lot of polishing of CPU and task load tracking recently; do you think this issue is fixed in a newer kernel (I'm using 4.2-rc6)?
Any comments and suggestions are welcome.
Thanks, Leo Yan
[1] http://people.linaro.org/~leo.yan/eas_profiling/eas_cpu_utilization_increase...
[2] http://people.linaro.org/~leo.yan/eas_profiling/eas_task_migrate_with_high_c...
                opp = find_cpu_new_capacity(i, new_usage);

                /* Skip CPUs which would need a higher OPP */
                if (min_opp < opp)
                        continue;

                /* A CPU with a lower OPP: take it */
                if (min_opp > opp) {
                        min_opp = opp;
                        max_usage = new_usage;
                        target_cpu = i;
                        continue;
                }

                /* Same OPP: prefer the more utilized CPU */
                if (max_usage < new_usage) {
                        max_usage = new_usage;
                        target_cpu = i;
                        continue;
                }

                /* Tie: keep the lowest CPU ID */
                if (i < target_cpu) {
                        target_cpu = i;
                        continue;
                }
        }

        BUG_ON(target_cpu == -1);

        cpumask_set_cpu(target_cpu, mask);
}
static int find_power_efficient_cpu(struct cpumask *mask, struct task_struct *p)
{
        int i, target_cpu;
        int min_energy = 0, diff;
        struct energy_env eenv;

        target_cpu = task_cpu(p);

        for_each_cpu(i, mask) {
                if (i == task_cpu(p))
                        continue;

                memset(&eenv, 0, sizeof(eenv));
                eenv.usage_delta = task_utilization(p);
                eenv.src_cpu = task_cpu(p);
                eenv.dst_cpu = i;
                eenv.task = p;

                diff = energy_diff(&eenv);
                if (diff < min_energy) {
                        target_cpu = i;
                        min_energy = diff;
                }
        }

        return target_cpu;
}
static int energy_aware_wake_cpu(struct task_struct *p, int target)
{
        struct sched_domain *sd;
        struct sched_group *sg, *sg_target;
        int target_cpu;
        struct cpumask target_cpus;

        sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
        if (!sd)
                return target;

        sg = sd->groups;
        sg_target = sg;

        cpumask_clear(&target_cpus);

        do {
                find_best_cpu_in_sg(&target_cpus, sg, p);
        } while (sg = sg->next, sg != sd->groups);

        if (cpumask_empty(&target_cpus))
                cpumask_set_cpu(task_cpu(p), &target_cpus);

        target_cpu = find_power_efficient_cpu(&target_cpus, p);

        return target_cpu;
}
--->8---
Thanks, Leo Yan
From c9dfdeb5b9f38e94eca3c489091314a4e82f4864 Mon Sep 17 00:00:00 2001
From: Leo Yan <leo.yan@linaro.org>
Date: Thu, 10 Dec 2015 10:41:39 +0800
Subject: [PATCH] sched/fair: EASv5: Spread Tasks With Lower OPP

With this patch, we select the best CPU from every sched group with the
following priority:

- select CPUs with the lowest OPP that meets the capacity requirement
- select CPUs with the highest utilization
- select the CPU with the lowest CPU ID

After these selections, we compare the candidate CPUs and pick the best
one based on the energy data.

Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 157 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 104 insertions(+), 53 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ce293ff..127a354 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5038,6 +5038,9 @@ static int find_new_capacity(struct energy_env *eenv,
 		}
 	}
 
+	if (idx == sge->nr_cap_states)
+		idx = idx - 1;
+
 	eenv->cap_idx = idx;
 	return idx;
 }
@@ -5557,87 +5560,135 @@ done:
 	return target;
 }
 
-static int energy_aware_wake_cpu(struct task_struct *p, int target)
+static int find_cpu_new_capacity(int cpu, unsigned long util)
+{
+	struct sched_domain *sd;
+	struct sched_group_energy *sge;
+	int idx;
+
+	sd = rcu_dereference(per_cpu(sd_ea, cpu));
+	sge = sd->groups->sge;
+
+	for (idx = 0; idx < sge->nr_cap_states; idx++)
+		if (sge->cap_states[idx].cap >= util)
+			break;
+
+	if (idx == sge->nr_cap_states)
+		idx = idx - 1;
+
+	return idx;
+}
+
+static void find_best_cpu_in_sg(struct cpumask *mask, struct sched_group *sg,
+				struct task_struct *p)
 {
-	struct sched_domain *sd;
-	struct sched_group *sg, *sg_target;
-	int target_max_cap = INT_MAX;
-	int target_cpu = task_cpu(p);
-	int i;
+	int min_opp = INT_MAX, max_usage = 0, new_usage;
+	int target_cpu = -1, i;
 
-	sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
-
-	if (!sd)
-		return target;
-
-	sg = sd->groups;
-	sg_target = sg;
-
-	/*
-	 * Find group with sufficient capacity. We only get here if no cpu is
-	 * overutilized. We may end up overutilizing a cpu by adding the task,
-	 * but that should not be any worse than select_idle_sibling().
-	 * load_balance() should sort it out later as we get above the tipping
-	 * point.
-	 */
-	do {
-		/* Assuming all cpus are the same in group */
-		int max_cap_cpu = group_first_cpu(sg);
-
+	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg)) {
+		int opp;
+
 		/*
-		 * Assume smaller max capacity means more energy-efficient.
-		 * Ideally we should query the energy model for the right
-		 * answer but it easily ends up in an exhaustive search.
+		 * p's blocked utilization is still accounted for on prev_cpu
+		 * so prev_cpu will receive a negative bias due to the double
+		 * accounting. However, the blocked utilization may be zero.
 		 */
-		if (capacity_of(max_cap_cpu) < target_max_cap &&
-		    task_fits_capacity(p, max_cap_cpu)) {
-			sg_target = sg;
-			target_max_cap = capacity_of(max_cap_cpu);
-		}
-	} while (sg = sg->next, sg != sd->groups);
+		new_usage = get_cpu_usage(i) + task_utilization(p);
+		opp = find_cpu_new_capacity(i, new_usage);
 
-	/* Find cpu with sufficient capacity */
-	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
-		/*
-		 * p's blocked utilization is still accounted for on prev_cpu
-		 * so prev_cpu will receive a negative bias due the double
-		 * accouting. However, the blocked utilization may be zero.
-		 */
-		int new_usage = get_cpu_usage(i) + task_utilization(p);
+		/* Skip CPUs which would need a higher OPP */
+		if (min_opp < opp)
+			continue;
 
-		if (new_usage > capacity_orig_of(i))
+		/* A CPU with a lower OPP: take it */
+		if (min_opp > opp) {
+			min_opp = opp;
+			max_usage = new_usage;
+			target_cpu = i;
 			continue;
+		}
 
-		if (new_usage < capacity_curr_of(i)) {
+		/* Same OPP: prefer the more utilized CPU */
+		if (max_usage < new_usage) {
+			max_usage = new_usage;
 			target_cpu = i;
-			if (cpu_rq(i)->nr_running)
-				break;
+			continue;
 		}
 
-		/* cpu has capacity at higher OPP, keep it as fallback */
-		if (target_cpu == task_cpu(p))
+		/* Tie: keep the lowest CPU ID */
+		if (i < target_cpu) {
 			target_cpu = i;
+			continue;
+		}
 	}
 
-	if (target_cpu != task_cpu(p)) {
-		struct energy_env eenv = {
-			.usage_delta	= task_utilization(p),
-			.src_cpu	= task_cpu(p),
-			.dst_cpu	= target_cpu,
-			.task		= p,
-		};
+	BUG_ON(target_cpu == -1);
 
-		/* Not enough spare capacity on previous cpu */
-		if (cpu_overutilized(task_cpu(p)))
-			return target_cpu;
+	cpumask_set_cpu(target_cpu, mask);
+}
 
-		if (energy_diff(&eenv) >= 0)
-			return task_cpu(p);
+static int find_power_efficient_cpu(struct cpumask *mask, struct task_struct *p)
+{
+	int i, target_cpu;
+	int min_energy = 0, diff;
+	struct energy_env eenv;
+
+	target_cpu = task_cpu(p);
+
+	for_each_cpu(i, mask) {
+		if (i == task_cpu(p))
+			continue;
+
+		memset(&eenv, 0, sizeof(eenv));
+		eenv.usage_delta = task_utilization(p);
+		eenv.src_cpu = task_cpu(p);
+		eenv.dst_cpu = i;
+		eenv.task = p;
+
+		diff = energy_diff(&eenv);
+		if (diff < min_energy) {
+			target_cpu = i;
+			min_energy = diff;
+		}
 	}
 
 	return target_cpu;
 }
 
+static int energy_aware_wake_cpu(struct task_struct *p, int target)
+{
+	struct sched_domain *sd;
+	struct sched_group *sg, *sg_target;
+	int target_cpu;
+	struct cpumask target_cpus;
+
+	sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
+	if (!sd)
+		return target;
+
+	sg = sd->groups;
+	sg_target = sg;
+
+	cpumask_clear(&target_cpus);
+
+	do {
+		find_best_cpu_in_sg(&target_cpus, sg, p);
+	} while (sg = sg->next, sg != sd->groups);
+
+	if (cpumask_empty(&target_cpus))
+		cpumask_set_cpu(task_cpu(p), &target_cpus);
+
+	target_cpu = find_power_efficient_cpu(&target_cpus, p);
+
+	return target_cpu;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
-- 
1.9.1
Hi Leo,
On 12/15/2015 06:17 AM, Leo Yan wrote:
When I continued profiling with this patch, I could not get the expected result: tasks are migrated in a mess once the patch is applied.
Target CPU selection depends strongly on CPU utilization, but the ftrace data shows cfs_rq::utilization_load_avg increasing sharply, which ultimately affects the migration decisions.
So in [2] we can see the task is first migrated to CPU2 by the energy calculation, but it ends up on CPU3 because CPU2's utilization rises sharply and meets the cpu_overutilized() condition.
The decay of CPU utilization makes sense, but utilization should increase step-wise while the CPU runs tasks. So I want to confirm: is it expected behavior for CPU utilization to jump sharply when a task is enqueued on the CPU's runqueue?
I saw there has been a lot of polishing of CPU and task load tracking recently; do you think this issue is fixed in a newer kernel (I'm using 4.2-rc6)?
Any comments and suggestions are welcome.
Though I don't know that it will fix it, I recommend moving to tip since per-entity load tracking has been rewritten. It will be easier and more worthwhile to discuss any problems in the context of the latest code.
thanks, Steve
Hi Leo,
On 12/09/2015 06:53 PM, Leo Yan wrote:
Could you post your draft changes here prior to testing? It'll help ensure I'm following your proposal correctly.
Below are the code with our discussion, please help review; I also enclosed the patch in case you want to check with diff format.
I've looked over your changes; they're consistent with what was discussed earlier. One minor comment inline below.
---<8---
static int find_cpu_new_capacity(int cpu, unsigned long util)
{
	struct sched_domain *sd;
	struct sched_group_energy *sge;
	int idx;

	sd = rcu_dereference(per_cpu(sd_ea, cpu));
	sge = sd->groups->sge;

	for (idx = 0; idx < sge->nr_cap_states; idx++)
		if (sge->cap_states[idx].cap >= util)
			break;

	if (idx == sge->nr_cap_states)
		idx = idx - 1;

	return idx;
}
static void find_best_cpu_in_sg(struct cpumask *mask, struct sched_group *sg,
				struct task_struct *p)
{
	int min_opp = INT_MAX, max_usage = 0, new_usage;
	int target_cpu = -1, i;

	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg)) {
		int opp;

		/*
		 * p's blocked utilization is still accounted for on prev_cpu
		 * so prev_cpu will receive a negative bias due to the double
		 * accounting. However, the blocked utilization may be zero.
		 */
		new_usage = get_cpu_usage(i) + task_utilization(p);
		opp = find_cpu_new_capacity(i, new_usage);

		/* If a higher OPP would be needed, skip this CPU */
		if (min_opp < opp)
			continue;

		/* If this CPU can run at a lower OPP, prefer it */
		if (min_opp > opp) {
			min_opp = opp;
			max_usage = new_usage;
			target_cpu = i;
			continue;
		}

		if (max_usage < new_usage) {
			max_usage = new_usage;
			target_cpu = i;
			continue;
		}

		if (i < target_cpu) {
			target_cpu = i;
			continue;
		}
	}

	BUG_ON(target_cpu == -1);

	cpumask_set_cpu(target_cpu, mask);
}
static int find_power_efficient_cpu(struct cpumask *mask, struct task_struct *p)
{
	int i, target_cpu;
	int min_energy = 0, diff;
	struct energy_env eenv;

	target_cpu = task_cpu(p);

	for_each_cpu(i, mask) {
		if (i == task_cpu(p))
			continue;

		memset(&eenv, 0, sizeof(eenv));
		eenv.usage_delta = task_utilization(p);
		eenv.src_cpu = task_cpu(p);
		eenv.dst_cpu = i;
		eenv.task = p;

		diff = energy_diff(&eenv);
		if (diff < min_energy) {
			target_cpu = i;
			min_energy = diff;
		}
	}

	return target_cpu;
}
static int energy_aware_wake_cpu(struct task_struct *p, int target)
{
	struct sched_domain *sd;
	struct sched_group *sg, *sg_target;
	int target_cpu;
	struct cpumask target_cpus;

	sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
	if (!sd)
		return target;

	sg = sd->groups;
	sg_target = sg;

	cpumask_clear(&target_cpus);

	do {
		find_best_cpu_in_sg(&target_cpus, sg, p);
	} while (sg = sg->next, sg != sd->groups);

	if (cpumask_empty(&target_cpus))
I think you could just return task_cpu(p) here rather than setting the mask and continuing into find_power_efficient_cpu?
		cpumask_set_cpu(task_cpu(p), &target_cpus);

	target_cpu = find_power_efficient_cpu(&target_cpus, p);

	return target_cpu;
}
thanks, Steve
Hi Steve,
Thanks for review.
On Tue, Dec 22, 2015 at 04:32:11PM -0800, Steve Muckle wrote:
On 12/09/2015 06:53 PM, Leo Yan wrote:
[...]
static int energy_aware_wake_cpu(struct task_struct *p, int target)
{
	struct sched_domain *sd;
	struct sched_group *sg, *sg_target;
	int target_cpu;
	struct cpumask target_cpus;

	sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
	if (!sd)
		return target;

	sg = sd->groups;
	sg_target = sg;

	cpumask_clear(&target_cpus);

	do {
		find_best_cpu_in_sg(&target_cpus, sg, p);
	} while (sg = sg->next, sg != sd->groups);

	if (cpumask_empty(&target_cpus))
I think you could just return task_cpu(p) here rather than setting the mask and continuing into find_power_efficient_cpu?
Good point. I will fix it.
I also saw your suggestion to use the latest kernel to verify this patch; Dietmar also suggested syncing with EAS RFC 5.2 in another email. This patch depends heavily on CPU utilization to select the target CPU, so I will first rebase onto EAS RFC 5.2 and then continue profiling.
Thanks, Leo Yan
Hi Leo & Steve,
On 12/23/2015 03:14 AM, Leo Yan wrote:
Hi Steve,
Thanks for review.
On Tue, Dec 22, 2015 at 04:32:11PM -0800, Steve Muckle wrote:
On 12/09/2015 06:53 PM, Leo Yan wrote:
[...]
static int energy_aware_wake_cpu(struct task_struct *p, int target)
{
	struct sched_domain *sd;
	struct sched_group *sg, *sg_target;
	int target_cpu;
	struct cpumask target_cpus;

	sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
	if (!sd)
		return target;

	sg = sd->groups;
	sg_target = sg;

	cpumask_clear(&target_cpus);

	do {
		find_best_cpu_in_sg(&target_cpus, sg, p);
	} while (sg = sg->next, sg != sd->groups);
Here we would break with the scalability model in cfs, where we first find the most suitable (busiest, idlest, most energy-efficient, ...) sg and then look within that group for the most suitable cpu. This is definitely a problem if EAS is to be applicable out of the box to systems other than 2/2 or 2/4 (clusters/cpus).
One EAS design goal was that all the differences between platforms (topology-wise: big.LITTLE vs. SMP, different numbers of sg's, different numbers of sd levels, different power and frequency domain layouts) are expressed through different values in the appropriate energy model. It is clearly not possible to introduce special EAS code paths for each of the various topologies we want to support. We could introduce special behavior based on sd topology flags (existing or new ones), but not entirely new, platform-specific policies.
if (cpumask_empty(&target_cpus))
I think you could just return task_cpu(p) here rather than setting the mask and continuing into find_power_efficient_cpu?
Good point. I will fix it.
Just to try to sync up: these patches address the issue related to patch 32/46 of RFCv5 on Hikey (SMP), where you both saw that the low-intensity rt-app tests didn't show enough packing. The initial behavior on Hikey was that task_cpu(p) determined which cluster would be chosen to find the target cpu. But then there was also the argument that too much packing could lead to a higher OPP and more energy consumption, especially on a platform with a system-wide frequency domain like Hikey. So maybe the initial (vanilla) RFCv5 approach wasn't so bad after all?
Another open issue was against 22/46 of RFCv5, the one with the missing sg spanning the entire frequency domain on Hikey. This one is related to '[Eas-dev] [PATCH 1/4] sched/fair: EASv5: Fix CPU shared capacity issue'. Currently we have two proposals (struct cpumask sg_cap on the stack, or '[PATCH] sched: EAS & cpu hotplug interoperability').
IMHO, we should address both in RFCv6 early next year.
I also saw your suggestion to use the latest kernel to verify this patch; Dietmar also suggested syncing with EAS RFC 5.2 in another email. This patch depends heavily on CPU utilization to select the target CPU, so I will first rebase onto EAS RFC 5.2 and then continue profiling.
This is very important, because everybody who has already played with this code on multiple systems knows how fragile the whole thing is.
E.g., currently Patrick is trying to make sched_freq more responsive, so he's proposing to change the way we use the per-cpu utilization signal (so far only shared internally at ARM). This is one of those cases where a change could have a negative effect on other EAS functionality on a specific platform.
That's why we should always use the latest code and make sure that we at least run the EAS tests in schedtest on big.LITTLE and SMP platforms, even when working on related functionality like sched_freq or sched_tune. I'm planning to have Hikey integrated as an SMP platform into ARM's schedtest environment before RFCv6 hits LKML.
-- Dietmar