The new context tracking subsystem unconditionally includes kvm_host.h
headers for the guest enter/exit macros. This causes a compile
failure when KVM is not enabled.
Fix by adding an IS_ENABLED(CONFIG_KVM) check to kvm_host so it can
be included/compiled even when KVM is not enabled.
Cc: Frederic Weisbecker <fweisbec(a)gmail.com>
Signed-off-by: Kevin Hilman <khilman(a)linaro.org>
---
Applies on v3.9-rc2
include/linux/kvm_host.h | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index cad77fe..a942863 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1,6 +1,8 @@
#ifndef __KVM_HOST_H
#define __KVM_HOST_H
+#if IS_ENABLED(CONFIG_KVM)
+
/*
* This work is licensed under the terms of the GNU GPL, version 2. See
* the COPYING file in the top-level directory.
@@ -1055,5 +1057,8 @@ static inline bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
}
#endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
+#else
+static inline void __guest_enter(void) { return; }
+static inline void __guest_exit(void) { return; }
+#endif /* IS_ENABLED(CONFIG_KVM) */
#endif
-
--
1.8.1.2
Hi,
This patchset takes advantage of the new per-task load tracking that is
available in the kernel for packing the small tasks in as few as possible
CPU/Cluster/Core. The main goal of packing small tasks is to reduce the power
consumption in the low load use cases by minimizing the number of power domain
that are enabled. The packing is done in 2 steps:
The 1st step looks for the best place to pack tasks in a system according to
its topology and it defines a pack buddy CPU for each CPU if there is one
available. We define the best CPU during the build of the sched_domain instead
of evaluating it at runtime because it can be difficult to define a stable
buddy CPU in a low CPU load situation. The policy for defining a buddy CPU is
that we pack at all levels inside a node where a group of CPU can be power
gated independently from others. For describing this capability, a new flag
has been introduced SD_SHARE_POWERDOMAIN that is used to indicate whether the
groups of CPUs of a scheduling domain are sharing their power state. By
default, this flag has been set in all sched_domain in order to keep unchanged
the current behavior of the scheduler and only ARM platform clears the
SD_SHARE_POWERDOMAIN flag for MC and CPU level.
In a 2nd step, the scheduler checks the load average of a task which wakes up
as well as the load average of the buddy CPU and it can decide to migrate the
light tasks on a not busy buddy. This check is done during the wake up because
small tasks tend to wake up between periodic load balance and asynchronously
to each other which prevents the default mechanism to catch and migrate them
efficiently. A light task is defined by a runnable_avg_sum that is less than
20% of the runnable_avg_period. In fact, the former condition encloses 2 ones:
The average CPU load of the task must be less than 20% and the task must have
been runnable less than 10ms when it woke up last time in order to be
electable for the packing migration. So, a task than runs 1 ms each 5ms will
be considered as a small task but a task that runs 50 ms with a period of
500ms, will not.
Then, the business of the buddy CPU depends of the load average for the rq and
the number of running tasks. A CPU with a load average greater than 50% will
be considered as busy CPU whatever the number of running tasks is and this
threshold will be reduced by the number of running tasks in order to not
increase too much the wake up latency of a task. When the buddy CPU is busy,
the scheduler falls back to default CFS policy.
Change since V2:
- Migrate only a task that wakes up
- Change the light tasks threshold to 20%
- Change the loaded CPU threshold to not pull tasks if the current number of
running tasks is null but the load average is already greater than 50%
- Fix the algorithm for selecting the buddy CPU.
Change since V1:
Patch 2/6
- Change the flag name which was not clear. The new name is
SD_SHARE_POWERDOMAIN.
- Create an architecture dependent function to tune the sched_domain flags
Patch 3/6
- Fix issues in the algorithm that looks for the best buddy CPU
- Use pr_debug instead of pr_info
- Fix for uniprocessor
Patch 4/6
- Remove the use of usage_avg_sum which has not been merged
Patch 5/6
- Change the way the coherency of runnable_avg_sum and runnable_avg_period is
ensured
Patch 6/6
- Use the arch dependent function to set/clear SD_SHARE_POWERDOMAIN for ARM
platform
New results for v3:
This series has been tested with hackbench on ARM platform and the results
don't show any performance regression
Hackbench 3.9-rc2 +patches
Mean Time (10 tests): 2.048 2.015
stdev : 0.047 0.068
Previous results for V2:
This series has been tested with MP3 play back on ARM platform:
TC2 HMP (dual CA-15 and 3xCA-7 cluster).
The measurements have been done on an Ubuntu image during 60 seconds of
playback and the result has been normalized to 100.
| CA15 | CA7 | total |
-------------------------------------
default | 81 | 97 | 178 |
pack | 13 | 100 | 113 |
-------------------------------------
Previous results for V1:
The patch-set has been tested on ARM platforms: quad CA-9 SMP and TC2 HMP
(dual CA-15 and 3xCA-7 cluster). For ARM platform, the results have
demonstrated that it's worth packing small tasks at all topology levels.
The performance tests have been done on both platforms with sysbench. The
results don't show any performance regressions. These results are aligned with
the policy which uses the normal behavior with heavy use cases.
test: sysbench --test=cpu --num-threads=N --max-requests=R run
Results below is the average duration of 3 tests on the quad CA-9.
default is the current scheduler behavior (pack buddy CPU is -1)
pack is the scheduler with the pack mechanism
| default | pack |
-----------------------------------
N=8; R=200 | 3.1999 | 3.1921 |
N=8; R=2000 | 31.4939 | 31.4844 |
N=12; R=200 | 3.2043 | 3.2084 |
N=12; R=2000 | 31.4897 | 31.4831 |
N=16; R=200 | 3.1774 | 3.1824 |
N=16; R=2000 | 31.4899 | 31.4897 |
-----------------------------------
The power consumption tests have been done only on TC2 platform which has got
accessible power lines and I have used cyclictest to simulate small tasks. The
tests show some power consumption improvements.
test: cyclictest -t 8 -q -e 1000000 -D 20 & cyclictest -t 8 -q -e 1000000 -D 20
The measurements have been done during 16 seconds and the result has been
normalized to 100
| CA15 | CA7 | total |
-------------------------------------
default | 100 | 40 | 140 |
pack | <1 | 45 | <46 |
-------------------------------------
The A15 cluster is less power efficient than the A7 cluster but if we assume
that the tasks is well spread on both clusters, we can guest estimate that the
power consumption on a dual cluster of CA7 would have been for a default
kernel:
| CA7 | CA7 | total |
-------------------------------------
default | 40 | 40 | 80 |
-------------------------------------
Vincent Guittot (6):
Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for
load-tracking"
sched: add a new SD_SHARE_POWERDOMAIN flag for sched_domain
sched: pack small tasks
sched: secure access to other CPU statistics
sched: pack the idle load balance
ARM: sched: clear SD_SHARE_POWERDOMAIN
arch/arm/kernel/topology.c | 9 +++
arch/ia64/include/asm/topology.h | 1 +
arch/tile/include/asm/topology.h | 1 +
include/linux/sched.h | 9 +--
include/linux/topology.h | 4 +
kernel/sched/core.c | 14 ++--
kernel/sched/fair.c | 149 +++++++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 14 ++--
8 files changed, 169 insertions(+), 32 deletions(-)
--
1.7.9.5
This patchset was called: "Create sched_select_cpu() and use it for workqueues"
for the first three versions.
Earlier discussions over v3, v2 and v1 can be found here:
https://lkml.org/lkml/2013/3/18/364http://lists.linaro.org/pipermail/linaro-dev/2012-November/014344.htmlhttp://www.mail-archive.com/linaro-dev@lists.linaro.org/msg13342.html
For power saving it is better to schedule work on cpus that aren't idle, as
bringing a cpu/cluster from idle state can be very costly (both performance and
power wise). Earlier we tried to use timer infrastructure to take this decision
but we found out later that scheduler gives even better results and so we should
use scheduler for choosing cpu for scheduling work.
In workqueue subsystem workqueues with flag WQ_UNBOUND are the ones which uses
cpu to select target cpu.
Here we are migrating few users of workqueues to WQ_UNBOUND. These drivers are
found to be very much active on idle or lightly busy system and using WQ_UNBOUND
for these gave impressive results.
Setup:
-----
- ARM Vexpress TC2 - big.LITTLE CPU
- Core 0-1: A15, 2-4: A7
- rootfs: linaro-ubuntu-devel
This patchset has been tested on a big LITTLE system (heterogeneous) but is
useful for all other homogeneous systems as well. During these tests audio was
played in background using aplay.
Results:
-------
Cluster A15 Energy Cluster A7 Energy Total
------------------------- ----------------------- ------
Without this patchset (Energy in Joules):
---------------------------------------------------
0.151162 2.183545 2.334707
0.223730 2.687067 2.910797
0.289687 2.732702 3.022389
0.454198 2.745908 3.200106
0.495552 2.746465 3.242017
Average:
0.322866 2.619137 2.942003
With this patchset (Energy in Joules):
-----------------------------------------------
0.226421 2.283658 2.510079
0.151361 2.236656 2.388017
0.197726 2.249849 2.447575
0.221915 2.229446 2.451361
0.347098 2.257707 2.604805
Average:
0.2289042 2.2514632 2.4803674
Above tests are repeated multiple times and events are tracked using trace-cmd
and analysed using kernelshark. And it was easily noticeable that idle time for
many cpus has increased considerably, which eventually saved some power.
PS: All the earlier Acks we got for drivers are reverted here as patches have
been updated significantly.
V3->V4:
-------
- Dropped changes to kernel/sched directory and hence
sched_select_non_idle_cpu().
- Dropped queue_work_on_any_cpu()
- Created system_freezable_unbound_wq
- Changed all patches accordingly.
V2->V3:
-------
- Dropped changes into core queue_work() API, rather create *_on_any_cpu()
APIs
- Dropped running timers migration patch as that was broken
- Migrated few users of workqueues to use *_on_any_cpu() APIs.
Viresh Kumar (4):
workqueue: Add system wide system_freezable_unbound_wq
PHYLIB: queue work on unbound wq
block: queue work on unbound wq
fbcon: queue work on unbound wq
block/blk-core.c | 3 ++-
block/blk-ioc.c | 2 +-
block/genhd.c | 10 ++++++----
drivers/net/phy/phy.c | 9 +++++----
drivers/video/console/fbcon.c | 2 +-
include/linux/workqueue.h | 4 ++++
kernel/workqueue.c | 7 ++++++-
7 files changed, 25 insertions(+), 12 deletions(-)
--
1.7.12.rc2.18.g61b472e
This series is a set of prerequistes for getting the new context
tracking subsystem, and adaptive tickless support working on ARM.
Kevin Hilman (3):
cputime_nsecs: use math64.h for nsec resolution conversion helpers
init/Kconfig: virt CPU accounting: drop 64-bit requirment
ARM: Kconfig: allow virt CPU accounting
arch/arm/Kconfig | 1 +
include/asm-generic/cputime_nsecs.h | 28 +++++++++++++++++++---------
init/Kconfig | 2 +-
3 files changed, 21 insertions(+), 10 deletions(-)
--
1.8.2
The noop functions code is not necessary because the header file is
included in files which are compiled when CONFIG_CPU_IDLE is on.
Signed-off-by: Daniel Lezcano <daniel.lezcano(a)linaro.org>
---
arch/arm/include/asm/cpuidle.h | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/arch/arm/include/asm/cpuidle.h b/arch/arm/include/asm/cpuidle.h
index 2fca60a..7367787 100644
--- a/arch/arm/include/asm/cpuidle.h
+++ b/arch/arm/include/asm/cpuidle.h
@@ -1,13 +1,8 @@
#ifndef __ASM_ARM_CPUIDLE_H
#define __ASM_ARM_CPUIDLE_H
-#ifdef CONFIG_CPU_IDLE
extern int arm_cpuidle_simple_enter(struct cpuidle_device *dev,
- struct cpuidle_driver *drv, int index);
-#else
-static inline int arm_cpuidle_simple_enter(struct cpuidle_device *dev,
- struct cpuidle_driver *drv, int index) { return -ENODEV; }
-#endif
+ struct cpuidle_driver *drv, int index);
/* Common ARM WFI state */
#define ARM_CPUIDLE_WFI_STATE_PWR(p) {\
--
1.7.9.5