This patch series optimizes performance and refines patches according
to review comments.
- Patch 0001 adds more chances to select the previous CPU, to benefit
from cache hotness;
- In the EAS code, the critical path is task wakeup via the function
energy_aware_wake_cpu(); this function selects the target CPU with the
most energy saving. It includes two pieces of underlying functionality:
the first is to select the most power-efficient CPU for the task
within one cluster; the second is to migrate a task from a big core to
a little core if the little core can meet the performance requirement.
For the first functionality (selecting the most power-efficient CPU
within a cluster), EAS prefers a non-idle CPU, with the result that it
packs tasks onto one CPU as much as possible. This is not an optimal
solution for two reasons: first, it introduces long scheduling latency
when multiple tasks queue on the same rq; second, it easily ends up
packing small tasks onto one CPU running at a higher operating point.
This is the foremost observed issue: when there are multiple tasks,
neither power nor performance achieves an optimal result.
Patch 0002 solves this issue by trying to select a CPU that can be
kept at the lowest OPP.
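The selection policy in patch 0002 can be sketched as a small userspace
model (all names, arrays, and capacity values here are illustrative
assumptions, not the kernel's data structures): for each candidate CPU,
compute the utilization after placing the task, find the lowest OPP
whose capacity covers it, and prefer the CPU that stays at the lowest
OPP.

```c
/* Hypothetical model of "select CPU based on using lowest capacity":
 * cpu_util[] holds current per-CPU utilization, opp_cap[] holds the
 * capacity available at each operating point, lowest first. */
static int pick_lowest_opp_cpu(const int *cpu_util, int nr_cpus,
                               const int *opp_cap, int nr_opps,
                               int task_util)
{
	int best_cpu = -1, best_opp = nr_opps;

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		int new_util = cpu_util[cpu] + task_util;

		/* Find the lowest OPP whose capacity covers new_util. */
		for (int opp = 0; opp < nr_opps; opp++) {
			if (new_util <= opp_cap[opp]) {
				if (opp < best_opp) {
					best_opp = opp;
					best_cpu = cpu;
				}
				break;
			}
		}
	}
	return best_cpu;
}
```

With cpu_util = {100, 300}, opp_cap = {200, 400, 600} and a task of
utilization 50, CPU0 (new util 150) stays at the lowest OPP while CPU1
(new util 350) would need the second OPP, so CPU0 is chosen even though
CPU1 is the busier, non-idle CPU that the old policy would pack onto.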
- The current code has no mechanism to spread tasks throughout the
little cluster, so tasks are packed onto one CPU when that CPU is not
"over-utilized". In this case, only one CPU is very busy while the
other CPUs in the same cluster are idle.
Patch 0003 spreads tasks in the lowest scheduling domain (at cluster
level) after adding an intermediate state named "half-utilized". This
may be a temporary solution; a better solution is likely to unify the
flag with "over-utilized".
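A minimal sketch of the "half-utilized" idea, under the assumption that
the threshold is 50% of CPU capacity and that spreading means preferring
an idle sibling in the same cluster (the helper names and the idle test
are hypothetical):

```c
/* Hypothetical: a CPU is "half-utilized" once its utilization passes
 * half of its capacity. */
static int cpu_half_utilized(unsigned long util, unsigned long capacity)
{
	return util * 2 > capacity;
}

/* If the target CPU is half-utilized, prefer an idle sibling in the
 * same cluster; otherwise keep the target. Illustration only. */
static int spread_target(int target, const unsigned long *util,
			 const unsigned long *cap, int nr_cpus)
{
	if (!cpu_half_utilized(util[target], cap[target]))
		return target;

	for (int cpu = 0; cpu < nr_cpus; cpu++)
		if (util[cpu] == 0)	/* idle sibling */
			return cpu;
	return target;
}
```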
- In CFS, PELT signals take a long time to ramp up to a high value and
to decay down to a small value; on the other hand, EAS does not take
the load_avg value (runnable time) into account but only focuses on
the util_avg value (running time). So these issues really depend on
the fundamental signals.
We hope for a more advanced method to accelerate PELT signals and
dismiss the issue introduced by long runnable time. Patch 0004 can be
taken as a temporary solution: when there is a big difference between
load_avg and util_avg, we can switch to the inflated value, and also
use it to reflect runnable time.
Patch 0004 also has a side effect on the misfit flag. If any CPU has a
"misfit" task on it, EAS sets the imbalance value to the CPU capacity
and migrates that load from the little core to the big core. "Misfit"
works well when there is only one big task on a little CPU and the CPU
cannot meet the task's performance requirement per
task_fits_max(p, rq->cpu); but if there are two tasks on the little
CPU, each task's utilization is only half of the CPU capacity, so in
the end EAS considers that the CPU can meet the task requirement.
Patch 0004 makes it easier to set the misfit flag to true:
rq->misfit_task = !task_fits_max(p, rq->cpu)
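The two-task scenario above can be made concrete with a fit check
modelled loosely on task_fits_max(). The 1.25x margin (1280/1024) and
the capacity values are illustrative assumptions, not the kernel's exact
code:

```c
/* Hypothetical fit check: a task fits a CPU when its utilization,
 * scaled by a ~1.25x margin, stays below the CPU capacity. */
static int task_fits(unsigned long task_util, unsigned long cpu_cap)
{
	return task_util * 1280 / 1024 < cpu_cap;
}
```

On a little CPU of capacity 430, one big task of utilization 400 does
not fit (400 * 1.25 = 500 > 430), so misfit is set and the task is
up-migrated. But two tasks of utilization 215 each (half the capacity)
both "fit" individually, so misfit stays clear even though the CPU is
saturated; this is the gap patch 0004 tries to close.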
- In the function energy_aware_wake_cpu(), it is possible to directly
migrate a task from a little core to a big core, but the conditions
are rigid: condition 1 is that the CPU capacity cannot meet the task's
requirement; condition 2 is that the source CPU is "over-utilized". If
the source CPU is not "over-utilized" (condition 2), then even though
the little CPU cannot meet the task's requirement, EAS will compare
CPU energy and in the end still select the previous little CPU.
Patch 0005 adds an extra path to directly migrate tasks from a little
core to a big core.
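The two rigid conditions can be sketched as follows (hypothetical
names, illustrative 1.25x margin): up-migration happens only when both
conditions hold, which is exactly why a task that cannot fit on a
source CPU that is not over-utilized stays put.

```c
/* Sketch of the rigid up-migration gate described above:
 * condition 1: the source CPU capacity cannot meet the task;
 * condition 2: the source CPU is "over-utilized". */
static int should_upmigrate(unsigned long task_util,
			    unsigned long src_cap,
			    unsigned long src_util)
{
	int fits = task_util * 1280 / 1024 < src_cap;	/* condition 1 */
	int over = src_util * 1280 / 1024 >= src_cap;	/* condition 2 */

	return !fits && over;
}
```

With task_util = 400 on a little CPU of capacity 430: if the CPU's
total utilization is 420 (over-utilized), migration triggers; if it is
only 300, condition 2 fails and the task stays on the little CPU even
though it does not fit — the case patch 0005 adds a path for.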
- For very heavy multi-threaded workloads, we observed that tasks are
not migrated within the big cluster, and tasks are hard to migrate
from the big cluster to the little cluster even when the little
cluster has idle CPUs available to run them. So EAS needs optimization
to handle this case, likely by falling back to CFS behaviour.
Patches 0006 and 0008 fix these related issues.
- SMP load balance may migrate a small task onto a big core, but at
that point we are usually only looking for big-task migration, so this
hurts both power and performance. Patch 0007 prevents small tasks from
migrating to a higher-capacity CPU, which gives real big tasks more
chance to migrate there.
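The filter in patch 0007 can be sketched like this; the 25% "small
task" threshold is an example value, not taken from the actual patch:

```c
/* Hypothetical load-balance filter: refuse to migrate a "small" task
 * (below 25% of the source CPU capacity) to a higher-capacity CPU,
 * leaving that slot for a genuine big task. */
static int can_migrate_up(unsigned long task_util,
			  unsigned long src_cap,
			  unsigned long dst_cap)
{
	if (dst_cap > src_cap && task_util * 4 < src_cap)
		return 0;	/* small task: keep on the little core */
	return 1;
}
```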
Leo Yan (8):
sched/fair: optimize to more chance to select previous CPU
sched/fair: select CPU based on using lowest capacity
sched/fair: support to spread task in lowest schedule domain
sched/fair: use load metrics to replace util when have big difference
sched/fair: add path to migrate to higher capacity CPU
sched/fair: force idle balance when busiest group is overloaded
sched/fair: avoid small task to migrate to higher capacity CPU
sched/fair: set imbalance for too many tasks on rq
kernel/sched/fair.c | 193 ++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 173 insertions(+), 20 deletions(-)
--
1.9.1
This patch series optimizes performance.
Patch 0001 optimizes the CPU selection flow so that a task has more
chance to stay on its previous CPU. Patch 0002 is actually a big
change to EAS's CPU selection policy: it tries to select an idle CPU
whenever possible. Profiling results show patch 0002 has the good
effect of spreading tasks out when many tasks are running at the same
time.
Patches 0003~0004 optimize the single-thread scenario. In this case
the thread has a relatively high utilization value, but the value
cannot easily cross the tipping point. So patch 0004 tries to set
criteria so that, under some conditions, load_avg is used rather than
util_avg to boost the single thread.
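The signal switch described above can be sketched as follows; the
"big difference" ratio of 2x is an illustrative assumption, not the
criterion from the actual patch:

```c
/* Sketch of the patch 0004 idea: when load_avg (runnable time) is
 * much larger than util_avg (running time), use load_avg as the
 * boosted utilization signal so a single busy thread can cross the
 * tipping point sooner. */
static unsigned long boosted_util(unsigned long util_avg,
				  unsigned long load_avg)
{
	if (load_avg > 2 * util_avg)	/* big difference: use load */
		return load_avg;
	return util_avg;
}
```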
Patch 0005 optimizes the flow for spreading tasks within the big
cluster.
Patches 0006~0007 fix the signal for avg_load.
Leo Yan (8):
sched/fair: optimize to more chance to select previous CPU
sched/fair: select idle CPU for waken up task
sched/fair: add path to migrate to higher capacity CPU
sched/fair: use load to replace util when have big difference
sched/fair: spread tasks in cluster when over tipping point
sched/fair: correct avg_load as CPU average load
sched/fair: fix to calculate average load cross cluster
sched/fair: set imbn to 1 for too many tasks on rq
include/linux/sched.h | 1 +
kernel/sched/fair.c | 93 +++++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 84 insertions(+), 10 deletions(-)
--
1.9.1
Hi Patrick,
[ + eas-dev ]
I have a general question about how to define the schedTune threshold
array for payoff. Basically I want to check the questions below:
- Every CGroup has its own perf_boost_idx for the PB region and
perf_constrain_idx for the PC region. Do you have a suggestion or
guideline for defining these indexes?
And for different CGroups like "background", "foreground" or
"performance", will every CGroup have its own dedicated index, or can
the platform share the same index value?
- How should the values in the "threshold_gains" array be defined?
IIUC this array is platform dependent, but what is a reasonable
method to generate this table? Is there some suggested testing for
generating it?
Or is my understanding wrong, so that this array is fixed and it is
enough to just adjust perf_boost_idx/perf_constrain_idx per platform?
- So far we cannot set these payoff parameters (including
perf_boost_idx/perf_constrain_idx and threshold_gains) from sysfs
dynamically, so how can we initialize these values in a
platform-specific way? I suppose for now we can only set these values
in the kernel's init flow, right?
Thanks,
Leo Yan