o This patch series evaluates whether an rb tree can be used to track
task load and util on the rq. One concern with this method is that an
rb tree has O(log(N)) computation complexity, so its maintenance
introduces extra overhead. To address this concern we use hackbench
for stress testing: hackbench generates masses of message sender and
receiver tasks, producing many enqueue and dequeue operations, so it
tells us whether or not the rb tree introduces significant overhead
(thanks a lot to Chris for suggesting this).
Another concern is that the scheduler already provides the LB_MIN
feature; with LB_MIN enabled the scheduler avoids migrating tasks
with load < 16, which to some extent also filters small tasks out of
migration. So we need to compare power data between this patch series
and simply enabling LB_MIN.
o Testing result:
Tested hackbench on Hikey (8x Cortex-A53 CPUs) with SMP load balance:
time sh -c 'for i in `seq 100`; do /data/hackbench -p -P > /dev/null; done'
real user system
baseline 6m00.57s 1m41.72s 34m38.18s
rb tree 5m55.79s 1m33.68s 34m08.38s
For the hackbench test case, the rb tree kernel even gives a slightly
better result than the baseline kernel.
Tested video playback on Juno, comparing LB_MIN vs rb tree:
LB_MIN Nrg:LITTLE Nrg:Big Nrg:Sum
---------------------------------------------------------
11.3122 8.983429 20.295629
11.337446 8.174061 19.511507
11.256941 8.547895 19.804836
10.994329 9.633028 20.627357
11.483148 8.522364 20.005512
avg. 11.2768128 8.7721554 20.0489682
rb tree Nrg:LITTLE Nrg:Big Nrg:Sum
---------------------------------------------------------
11.384301 8.412714 19.797015
11.673992 8.455219 20.129211
11.586081 8.414606 20.000687
11.423509 8.64781 20.071319
11.43709 8.595252 20.032342
avg. 11.5009946 8.5051202 20.0061148
vs LB_MIN +1.99% -3.04% -0.21%
o Known issues:
For patch 2, detach_tasks() iterates the rb tree for tasks; once a
task has been detached, it calls rb_first() to fetch the first node
and iterates again from there. It would be better to use rb_next(),
but switching to rb_next() introduces a kernel panic.
Any suggestion for a better implementation is welcome.
Leo Yan (3):
sched/fair: support to track biggest task on rq
sched/fair: select biggest task for migration
sched: remove unused rq::cfs_tasks
include/linux/sched.h | 1 +
include/linux/sched/sysctl.h | 1 +
kernel/sched/core.c | 2 -
kernel/sched/fair.c | 123 ++++++++++++++++++++++++++++++++++++-------
kernel/sched/sched.h | 5 +-
kernel/sysctl.c | 7 +++
6 files changed, 116 insertions(+), 23 deletions(-)
--
1.9.1
o This patch series includes performance optimizations and some
fixes. One main purpose is to resolve performance issues for
multi-threading; this is done by patches 0001, 0003, 0005 and 0006.
It also includes one main fix for the tipping point, in patch 0007.
o All these patches have been tested on the Juno R2 board. Especially
for the performance optimization patches, the test results are
consistent and repeatable on Juno. This gives us more confidence to
upstream these patches into the Android common kernel and the
mainline kernel.
The testing environment is based on the ARM LT git tree:
https://git.linaro.org/landing-teams/working/arm/kernel-release.git
branch: origin/lsk-4.4-armlt-experimental
Test case: Geekbench with workload-automation
Test setting:
echo 0 > /proc/sys/kernel/sched_migration_cost_ns
echo 1 > /proc/sys/kernel/sched_domain/cpu0/domain0/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu0/domain1/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu1/domain0/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu1/domain1/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu2/domain0/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu2/domain1/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu3/domain0/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu3/domain1/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu4/domain0/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu4/domain1/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu5/domain0/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu5/domain1/busy_factor
o Test result:
Optimization with Patch 0001:
baseline Patch 0001 Opt.
Geekbench ST: 953.2 966.2 1.36%
Geekbench MT: 2175.8 2280.8 4.83%
Optimization with Patch 0003:
baseline Patch 0001+0003 Opt.
Geekbench ST: 953.2 969.2 1.68%
Geekbench MT: 2175.8 2356.8 8.32%
Optimization with all patches:
baseline All Patch Opt.
Geekbench ST: 953.2 968.6 1.62%
Geekbench MT: 2175.8 2371.2 8.98%
For the performance improvement, the three main contributing patches
are: 0001: ~4.83%, 0003: ~3.3%, 0005: ~0.7%.
One more thing to note: sched_migration_cost_ns usually also has a
big impact on multi-threading performance, but we cannot see a
prominent boost from it on the Juno board; the main reason is that
Juno has only 2 big cores.
o Compared to the RFCv4 version [1], I have dropped all power
optimization related patches. Those patches are important for power
saving, but they contain a lot of hard-coded logic and are not
general enough, so I'd like to split them out into a separate patch
set.
[1] https://lists.linaro.org/pipermail/eas-dev/2016-September/000543.html
Leo Yan (7):
sched/fair: kick nohz idle balance for misfit task
sched/fair: replace capacity_of by capacity_orig_of
sched/fair: fall back to traditional wakeup migration when system is
busy
sched/fair: fix build error for schedtune_task_margin
sched/fair: force load balance when busiest group is overloaded
Documentation: use sysfs for EAS performance tunning
sched/fair: consider CPU overutilized only when it is not idle
Documentation/scheduler/sched-energy.txt | 24 ++++++++++++++
kernel/sched/fair.c | 57 +++++++++++++++++++++++++++-----
2 files changed, 72 insertions(+), 9 deletions(-)
--
1.9.1
o This patch series optimizes power. Power optimization has to
address two factors: on one hand, find ways to save power and avoid
unnecessary task migrations to the big cores; on the other hand,
performance must not be degraded. So this patch series builds on the
performance optimization patch series [1] to finish the further work
for power saving and achieve the target: optimize power without
degrading performance.
RFCv3 introduced power optimization related patches, but they were
not general enough. E.g. RFCv3 defined the criterion for a small task
as: task_util(p) < 1/4 * cpu_capacity(cpu); it is very hard to apply
this criterion across all SoCs. This patch series tries to figure out
a more general method.
o Background for the power optimization:
As a first step, we should make sure tasks spread out within the
cluster. This has two benefits: every cluster can run at a lower
frequency, and spreading tasks within a cluster exploits the CPU
capacity as much as possible and avoids CPUs becoming overutilized,
which in turn avoids migrating tasks to the big cores. This is done
by patch 0001.
If there are big tasks which really need to be migrated onto a big
core, we should ensure the big tasks are migrated before the small
ones. So patch 0002 introduces an rb tree to track the biggest task
on the rq, and patch 0003 uses the rb tree to migrate the biggest
tasks to a higher capacity CPU.
Patch 0004 contributes most to power saving: it checks whether a
wakeup task can run on a low capacity CPU; if so, it forces the
energy aware scheduling path even when the system is over the tipping
point. The criterion for a wakeup task being able to run on a low
capacity CPU is that some CPU's spare bandwidth can meet the waking
task's requirement; this ensures that even though the task keeps
running on a low capacity CPU, performance is not sacrificed.
o Test result:
First apply the patch series "EASv5.2+: Performance Optimization And
Fixes" and measure power and performance; then apply this power
saving patch series on top of that code base. Finally compare the
power data and performance data.
For the power comparison the test case is video playback (1080p);
below are the results on the Juno board:
Items | LITTLE Nrg | big Nrg | Nrg
----------------------------------------------------------------
Perf opt | 11.0520992 | 9.7118762 | 20.7639754
Perf + Power opt | 11.4157602 | 8.7319138 | 20.147674
Comparison       | +3.29%     | -10.09%   | -2.97%
[1] https://lists.linaro.org/pipermail/eas-dev/2016-October/000610.html
Leo Yan (4):
sched/fair: select lowest capacity CPU with packing tasks
sched/fair: support to track biggest task util on rq
sched/fair: migrate highest utilization task to higher capacity CPU
sched/fair: check if wakeup task can run low capacity CPU
include/linux/sched.h | 1 +
kernel/sched/fair.c | 213 +++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sched/sched.h | 4 +
3 files changed, 200 insertions(+), 18 deletions(-)
--
1.9.1