I tried to verify kgdb in the vanilla kernel on a fast model, and it seems
that single stepping with kgdb has not worked correctly since its first
appearance in v3.15.
On v3.15, a 'stepi' command issued after breaking the kernel at some
breakpoint steps forward to the next instruction, but any subsequent
'stepi' never goes beyond that point.
On v3.16, 'stepi' moves forward and stops at the instruction just after
enable_dbg in el1_dbg, and never goes beyond that. This change in
behavior seems to have been introduced by the following patch in v3.16:
commit 2a2830703a23 ("arm64: debug: avoid accessing mdscr_el1 on fault
paths where possible")
This patch
(1) moves kgdb_disable_single_step() from the 'c' command handling to the
single step handler.
This makes sure that single stepping takes effect at every 's' command.
Please note that, under the current implementation, the single step bit
in spsr, which is cleared by the first single step, will not be set again
for consecutive 's' commands because the single step bit in mdscr is
still set (that is, kernel_active_single_step() in
kgdb_arch_handle_exception() is true); see the sketch after this list.
(2) re-implements kgdb_roundup_cpus() because the current implementation
naively enables interrupts. See below.
(3) removes 'enable_dbg' in el1_dbg.
The single step bit in mdscr is turned on in do_debug_exception()->
kgdb_handle_exception() before returning to the debugged context, so if
debug exceptions are re-enabled in el1_dbg, we will see unexpected
single-stepping in el1_dbg.
Since v3.18, the following patch does the same:
commit 1059c6bf8534 ("arm64: debug: don't re-enable debug exceptions
on return from el1_dbg")
(4) masks interrupts while single-stepping one instruction.
If an interrupt is caught while processing a single step, debug
exceptions are unintentionally re-enabled by el1_irq's 'enable_dbg'
before returning to the debugged context.
Thus, as in (3), we will see unexpected single-stepping in el1_irq.
Basically (1) and (2) are for v3.15, (3) and (4) for v3.1[67].
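For reference, here is a minimal sketch (not a verbatim copy of the
kernel source) of the current 's' handling that the note in (1) refers
to; kernel_enable_single_step() sets the SS bit in both mdscr_el1 and the
saved spsr:

	case 's':
		kgdb_arch_update_addr(linux_regs, remcom_in_buffer);
		atomic_set(&kgdb_cpu_doing_single_step,
			   raw_smp_processor_id());
		/*
		 * On the second consecutive 's', the SS bit in mdscr is
		 * still set, so kernel_active_single_step() is true and
		 * the spsr SS bit -- already consumed by the first step
		 * -- is never set again: the debugged context stops
		 * advancing.
		 */
		if (!kernel_active_single_step())
			kernel_enable_single_step(linux_regs);
		break;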
* issue fixed by (2):
Without (2), we would see another problem if a breakpoint is set at an
interrupt-sensitive place, like gic_handle_irq():
KGDB: re-enter error: breakpoint removed ffffffc000081258
------------[ cut here ]------------
WARNING: CPU: 0 PID: 650 at kernel/debug/debug_core.c:435
kgdb_handle_exception+0x1dc/0x1f4()
Modules linked in:
CPU: 0 PID: 650 Comm: sh Not tainted 3.17.0-rc2+ #177
Call trace:
[<ffffffc000087fac>] dump_backtrace+0x0/0x130
[<ffffffc0000880ec>] show_stack+0x10/0x1c
[<ffffffc0004d683c>] dump_stack+0x74/0xb8
[<ffffffc0000ab824>] warn_slowpath_common+0x8c/0xb4
[<ffffffc0000ab90c>] warn_slowpath_null+0x14/0x20
[<ffffffc000121bfc>] kgdb_handle_exception+0x1d8/0x1f4
[<ffffffc000092ffc>] kgdb_brk_fn+0x18/0x28
[<ffffffc0000821c8>] brk_handler+0x9c/0xe8
[<ffffffc0000811e8>] do_debug_exception+0x3c/0xac
Exception stack(0xffffffc07e027650 to 0xffffffc07e027770)
...
[<ffffffc000083cac>] el1_dbg+0x14/0x68
[<ffffffc00012178c>] kgdb_cpu_enter+0x464/0x5c0
[<ffffffc000121bb4>] kgdb_handle_exception+0x190/0x1f4
[<ffffffc000092ffc>] kgdb_brk_fn+0x18/0x28
[<ffffffc0000821c8>] brk_handler+0x9c/0xe8
[<ffffffc0000811e8>] do_debug_exception+0x3c/0xac
Exception stack(0xffffffc07e027ac0 to 0xffffffc07e027be0)
...
[<ffffffc000083cac>] el1_dbg+0x14/0x68
[<ffffffc00032e4b4>] __handle_sysrq+0x11c/0x190
[<ffffffc00032e93c>] write_sysrq_trigger+0x4c/0x60
[<ffffffc0001e7d58>] proc_reg_write+0x54/0x84
[<ffffffc000192fa4>] vfs_write+0x98/0x1c8
[<ffffffc0001939b0>] SyS_write+0x40/0xa0
Once some interrupt occurs, the breakpoint at gic_handle_irq() triggers
kgdb.
Kgdb then calls kgdb_roundup_cpus() to sync with the other cpus.
The current kgdb_roundup_cpus() temporarily unmasks interrupts in order
to use smp_call_function().
This eventually allows another interrupt to occur, which will likely hit
the breakpoint at gic_handle_irq() again since debug exceptions are
always enabled in el1_irq.
We can avoid this issue by passing "nokgdbroundup" on the kernel command
line, but that would also leave the other cpus in an unknown state as far
as kgdb is concerned, and may interfere with kgdb activity.
Signed-off-by: AKASHI Takahiro <takahiro.akashi(a)linaro.org>
---
arch/arm64/kernel/kgdb.c | 60 +++++++++++++++++++++++++++++++++++-----------
1 file changed, 46 insertions(+), 14 deletions(-)
diff --git a/arch/arm64/kernel/kgdb.c b/arch/arm64/kernel/kgdb.c
index a0d10c5..81b5910 100644
--- a/arch/arm64/kernel/kgdb.c
+++ b/arch/arm64/kernel/kgdb.c
@@ -19,9 +19,13 @@
* along with this program. If not, see <http://www.gnu.org/licenses/>.
*/
+#include <linux/cpumask.h>
#include <linux/irq.h>
+#include <linux/irq_work.h>
#include <linux/kdebug.h>
#include <linux/kgdb.h>
+#include <linux/percpu.h>
+#include <asm/ptrace.h>
#include <asm/traps.h>
struct dbg_reg_def_t dbg_reg_def[DBG_MAX_REG_NUM] = {
@@ -95,6 +99,9 @@ struct dbg_reg_def_t dbg_reg_def[DBG_MAX_REG_NUM] = {
{ "fpcr", 4, -1 },
};
+static DEFINE_PER_CPU(unsigned int, kgdb_pstate);
+static DEFINE_PER_CPU(struct irq_work, kgdb_irq_work);
+
char *dbg_get_reg(int regno, void *mem, struct pt_regs *regs)
{
if (regno >= DBG_MAX_REG_NUM || regno < 0)
@@ -176,18 +183,14 @@ int kgdb_arch_handle_exception(int exception_vector, int signo,
* over and over again.
*/
kgdb_arch_update_addr(linux_regs, remcom_in_buffer);
- atomic_set(&kgdb_cpu_doing_single_step, -1);
- kgdb_single_step = 0;
-
- /*
- * Received continue command, disable single step
- */
- if (kernel_active_single_step())
- kernel_disable_single_step();
err = 0;
break;
case 's':
+ /* mask interrupts while single stepping */
+ __this_cpu_write(kgdb_pstate, linux_regs->pstate);
+ linux_regs->pstate |= PSR_I_BIT;
+
/*
* Update step address value with address passed
* with step packet.
@@ -198,8 +201,6 @@ int kgdb_arch_handle_exception(int exception_vector, int signo,
*/
kgdb_arch_update_addr(linux_regs, remcom_in_buffer);
atomic_set(&kgdb_cpu_doing_single_step, raw_smp_processor_id());
- kgdb_single_step = 1;
-
/*
* Enable single step handling
*/
@@ -229,6 +230,18 @@ static int kgdb_compiled_brk_fn(struct pt_regs *regs, unsigned int esr)
static int kgdb_step_brk_fn(struct pt_regs *regs, unsigned int esr)
{
+ unsigned int pstate;
+
+ kernel_disable_single_step();
+ atomic_set(&kgdb_cpu_doing_single_step, -1);
+
+ /* restore interrupt mask status */
+ pstate = __this_cpu_read(kgdb_pstate);
+ if (pstate & PSR_I_BIT)
+ regs->pstate |= PSR_I_BIT;
+ else
+ regs->pstate &= ~PSR_I_BIT;
+
kgdb_handle_exception(1, SIGTRAP, 0, regs);
return 0;
}
@@ -249,16 +262,27 @@ static struct step_hook kgdb_step_hook = {
.fn = kgdb_step_brk_fn
};
-static void kgdb_call_nmi_hook(void *ignored)
+static void kgdb_roundup_hook(struct irq_work *work)
{
kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs());
}
void kgdb_roundup_cpus(unsigned long flags)
{
- local_irq_enable();
- smp_call_function(kgdb_call_nmi_hook, NULL, 0);
- local_irq_disable();
+ int cpu;
+ struct cpumask mask;
+ struct irq_work *work;
+
+ mask = *cpu_online_mask;
+ cpumask_clear_cpu(smp_processor_id(), &mask);
+ cpu = cpumask_first(&mask);
+ if (cpu >= nr_cpu_ids)
+ return;
+
+ for_each_cpu(cpu, &mask) {
+ work = per_cpu_ptr(&kgdb_irq_work, cpu);
+ irq_work_queue_on(work, cpu);
+ }
}
static int __kgdb_notify(struct die_args *args, unsigned long cmd)
@@ -299,6 +323,8 @@ static struct notifier_block kgdb_notifier = {
int kgdb_arch_init(void)
{
int ret = register_die_notifier(&kgdb_notifier);
+ int cpu;
+ struct irq_work *work;
if (ret != 0)
return ret;
@@ -306,6 +332,12 @@ int kgdb_arch_init(void)
register_break_hook(&kgdb_brkpt_hook);
register_break_hook(&kgdb_compiled_brkpt_hook);
register_step_hook(&kgdb_step_hook);
+
+ for_each_possible_cpu(cpu) {
+ work = per_cpu_ptr(&kgdb_irq_work, cpu);
+ init_irq_work(work, kgdb_roundup_hook);
+ }
+
return 0;
}
--
1.7.9.5
From: Kevin Hilman <khilman(a)linaro.org>
The option CONFIG_EXYNOS5420_MCPM is causing imprecise external aborts
during boot testing, leading to various userspace startup failures.
Disable it until it has gotten more testing.
Cc: Kukjin Kim <kgene.kim(a)samsung.com>
Cc: Javier Martinez Canillas <javier.martinez(a)collabora.co.uk>
Cc: Sachin Kamat <sachin.kamat(a)samsung.com>
Cc: Doug Anderson <dianders(a)chromium.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie(a)samsung.com>
Cc: Krzysztof Kozlowski <k.kozlowski(a)samsung.com>
Cc: Tushar Behera <tushar.behera(a)linaro.org>
Cc: stable(a)vger.kernel.org # v3.17+
Signed-off-by: Kevin Hilman <khilman(a)linaro.org>
---
This has been reported by a few people[1], but not investigated or fixed, so it's
time to disable this feature until it can be fixed.
[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-September/288344…
arch/arm/configs/exynos_defconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm/configs/exynos_defconfig b/arch/arm/configs/exynos_defconfig
index 72058b8a6f4d..a250dcbf34cd 100644
--- a/arch/arm/configs/exynos_defconfig
+++ b/arch/arm/configs/exynos_defconfig
@@ -10,7 +10,7 @@ CONFIG_MODULE_UNLOAD=y
CONFIG_PARTITION_ADVANCED=y
CONFIG_ARCH_EXYNOS=y
CONFIG_ARCH_EXYNOS3=y
-CONFIG_EXYNOS5420_MCPM=y
+# CONFIG_EXYNOS5420_MCPM is not set
CONFIG_SMP=y
CONFIG_BIG_LITTLE=y
CONFIG_BL_SWITCHER=y
--
2.1.0
This series demonstrates cpu frequency scaling via a simple policy
driven by the scheduler. Specifically, the policy evaluates cpu frequency
when cpu utilization is updated from enqueue_task_fair and
dequeue_task_fair. The policy itself uses a simple up/down threshold
scheme based on the same 80%/20% cpu utilization boundaries that are used
by default in the ondemand cpufreq governor.
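As a rough illustration, the policy reduces to something like the sketch
below; capacity_of() and usage_util_of() come from this series, while the
threshold macros and request_freq_change() are made-up names, not the
actual RFC code:

	#define UP_THRESHOLD_PCT	80
	#define DOWN_THRESHOLD_PCT	20

	static void energy_model_eval_cpu(int cpu)
	{
		unsigned long cap  = capacity_of(cpu);    /* max usable capacity */
		unsigned long util = usage_util_of(cpu);  /* tracked utilization */

		if (util * 100 > cap * UP_THRESHOLD_PCT)
			request_freq_change(cpu, 1);      /* scale frequency up */
		else if (util * 100 < cap * DOWN_THRESHOLD_PCT)
			request_freq_change(cpu, -1);     /* scale frequency down */
	}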
This series is not intended for merging, but instead to ignite some
discussion around scheduler-driven cpu frequency selection. Of
particular interest to me is the policy itself and how it might
integrate with task placement in CFS's load_balance. Additionally I'd
like to ask the scheduler experts about which call sites in CFS are
right for evaluating cpu frequency selection; maybe
{en,de}queue_task_fair are not such a good idea?
The messiest part of this series is the cpumask stuff, where I tried to
track which cpus have updated statistics in the case of a sched_entity
which contains several other sched_entities that are spread across cpus.
As discussed at Linux Plumbers Conference 2014, I will replace this
complexity with simpler logic that ignores scheduler cgroups in the next
version. In any case I am posting the code I have now.
This code is experimental and bugs are guaranteed.
These patches are based on the scale invariance series from Morten[0].
The variable names in this RFC will doubtless change once that work is
rebased onto Vincent's series[1].
[0] http://lkml.kernel.org/r/<1411403047-32010-1-git-send-email-morten.rasmussen(a)arm.com>
[1] http://lkml.kernel.org/r/<1412684017-16595-1-git-send-email-vincent.guittot(a)linaro.org>
Mike Turquette (6):
sched: cfs: declare capacity_of in sched.h
sched: fair: add usage_util_of helper
cpufreq: add per-governor private data
sched: cfs: cpu frequency scaling arch functions
sched: cfs: cpu frequency scaling based on task placement
sched: energy_model: simple cpu frequency scaling policy
Morten Rasmussen (1):
sched: Make energy awareness a sched feature
drivers/cpufreq/Kconfig | 21 +++
include/linux/cpufreq.h | 6 +
kernel/sched/Makefile | 1 +
kernel/sched/energy_model.c | 341 ++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 69 ++++++++-
kernel/sched/features.h | 6 +
kernel/sched/sched.h | 3 +
7 files changed, 445 insertions(+), 2 deletions(-)
create mode 100644 kernel/sched/energy_model.c
--
1.8.3.2
Part of this patchset was previously part of the larger tasks packing
patchset [1]. I have split the latter into (at least) 3 different
patchsets to make things easier.
-configuration of sched_domain topology [2]
-update and consolidation of cpu_capacity (this patchset)
-tasks packing algorithm
SMT systems are no longer the only systems that can have CPUs with an
original capacity different from the default value. We need to extend the
use of cpu_capacity_orig to all kinds of platforms so the scheduler will
have both the maximum capacity (cpu_capacity_orig/capacity_orig) and the
current capacity (cpu_capacity/capacity) of CPUs and sched_groups. A new
function arch_scale_cpu_capacity has been created to replace
arch_scale_smt_capacity, which is SMT-specific, in the computation of the
capacity of a CPU.
During load balance, the scheduler evaluates the number of tasks that a
group of CPUs can handle. The current method assumes that tasks have a
fixed load of SCHED_LOAD_SCALE and CPUs have a default capacity of
SCHED_CAPACITY_SCALE. This assumption generates wrong decisions by
creating ghost cores and by removing real ones when the original capacity
of CPUs is different from the default SCHED_CAPACITY_SCALE. We no longer
try to evaluate the number of available cores based on the
group_capacity; instead we detect when the group is fully utilized.
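As a concrete illustration with typical big.LITTLE numbers (a sketch, not
code from this series): take SCHED_CAPACITY_SCALE = 1024 and a group of 4
little CPUs with a capacity of 447 each.

	/*
	 * Old method being replaced: a rounded capacity_factor.
	 *
	 *   group_capacity  = 4 * 447 = 1788
	 *   capacity_factor = DIV_ROUND_CLOSEST(1788, 1024) = 2
	 *
	 * The load balancer believes only 2 tasks fit on 4 real CPUs,
	 * effectively removing two cores; rounding in the other
	 * direction can likewise create ghost cores that don't exist.
	 */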
Now that we have the original capacity of CPUs and their
activity/utilization, we can evaluate more accurately the capacity and
the level of utilization of a group of CPUs.
This patchset mainly replaces the old capacity method by a new one and
keeps the policy almost unchanged, whereas we could certainly take
advantage of this new statistic in several other parts of the load
balancer.
Test results:
I have put below the results of 4 kinds of tests:
- hackbench -l 500 -s 4096
- perf bench sched pipe -l 400000
- scp of 100MB file on the platform
- ebizzy with various number of threads
on 4 kernels:
- tip = tip/sched/core
- step1 = tip + patches(1-8)
- patchset = tip + whole patchset
- patchset+irq = tip + this patchset + irq accounting
each test has been run 6 times and the figures below show the stdev and
the diff compared to the tip kernel
Dual A7 tip | +step1 | +patchset | patchset+irq
stdev | results stdev | results stdev | results stdev
hackbench (lower is better) (+/-)0.64% | -0.19% (+/-)0.73% | 0.58% (+/-)1.29% | 0.20% (+/-)1.00%
perf (lower is better) (+/-)0.28% | 1.22% (+/-)0.17% | 1.29% (+/-)0.06% | 2.85% (+/-)0.33%
scp (+/-)4.81% | 2.61% (+/-)0.28% | 2.39% (+/-)0.22% | 82.18% (+/-)3.30%
ebizzy -t 1 (+/-)2.31% | -1.32% (+/-)1.90% | -0.79% (+/-)2.88% | 3.10% (+/-)2.32%
ebizzy -t 2 (+/-)0.70% | 8.29% (+/-)6.66% | 1.93% (+/-)5.47% | 2.72% (+/-)5.72%
ebizzy -t 4 (+/-)3.54% | 5.57% (+/-)8.00% | 0.36% (+/-)9.00% | 2.53% (+/-)3.17%
ebizzy -t 6 (+/-)2.36% | -0.43% (+/-)3.29% | -1.93% (+/-)3.47% | 0.57% (+/-)0.75%
ebizzy -t 8 (+/-)1.65% | -0.45% (+/-)0.93% | -1.95% (+/-)1.52% | -1.18% (+/-)1.61%
ebizzy -t 10 (+/-)2.55% | -0.98% (+/-)3.06% | -1.18% (+/-)6.17% | -2.33% (+/-)3.28%
ebizzy -t 12 (+/-)6.22% | 0.17% (+/-)5.63% | 2.98% (+/-)7.11% | 1.19% (+/-)4.68%
ebizzy -t 14 (+/-)5.38% | -0.14% (+/-)5.33% | 2.49% (+/-)4.93% | 1.43% (+/-)6.55%
Quad A15 tip | +step1 | +patchset | patchset+irq
stdev | results stdev | results stdev | results stdev
hackbench (lower is better) (+/-)0.78% | 0.87% (+/-)1.72% | 0.91% (+/-)2.02% | 3.30% (+/-)2.02%
perf (lower is better) (+/-)2.03% | -0.31% (+/-)0.76% | -2.38% (+/-)1.37% | 1.42% (+/-)3.14%
scp (+/-)0.04% | 0.51% (+/-)1.37% | 1.79% (+/-)0.84% | 1.77% (+/-)0.38%
ebizzy -t 1 (+/-)0.41% | 2.05% (+/-)0.38% | 2.08% (+/-)0.24% | 0.17% (+/-)0.62%
ebizzy -t 2 (+/-)0.78% | 0.60% (+/-)0.63% | 0.43% (+/-)0.48% | 1.61% (+/-)0.38%
ebizzy -t 4 (+/-)0.58% | -0.10% (+/-)0.97% | -0.65% (+/-)0.76% | -0.75% (+/-)0.86%
ebizzy -t 6 (+/-)0.31% | 1.07% (+/-)1.12% | -0.16% (+/-)0.87% | -0.76% (+/-)0.22%
ebizzy -t 8 (+/-)0.95% | -0.30% (+/-)0.85% | -0.79% (+/-)0.28% | -1.66% (+/-)0.21%
ebizzy -t 10 (+/-)0.31% | 0.04% (+/-)0.97% | -1.44% (+/-)1.54% | -0.55% (+/-)0.62%
ebizzy -t 12 (+/-)8.35% | -1.89% (+/-)7.64% | 0.75% (+/-)5.30% | -1.18% (+/-)8.16%
ebizzy -t 14 (+/-)13.17% | 6.22% (+/-)4.71% | 5.25% (+/-)9.14% | 5.87% (+/-)5.77%
I haven't been able to fully test the patchset on an SMT system to check
that the regression reported by Preethi has been solved, but the various
tests that I have done don't show any regression so far.
The correction of the SD_PREFER_SIBLING mode and its use at SMT level
should have fixed the regression.
The usage_avg_contrib is based on the current implementation of load
average tracking. I also have a version of usage_avg_contrib that is
based on the new implementation [3], but I haven't provided the patches
and results as [3] is still under review. I can provide changes on top
of [3] to change how usage_avg_contrib is computed and adapt to the new
mechanism.
TODO: manage conflicts with the next version of [4]
Change since V3:
- add usage_avg_contrib statistic which sums the running time of tasks on a rq
- use usage_avg_contrib instead of runnable_avg_sum for cpu_utilization
- fix replacement power by capacity
- update some comments
Change since V2:
- rebase on top of capacity renaming
- fix wake_affine statistic update
- rework nohz_kick_needed
- optimize the active migration of a task from CPU with reduced capacity
- rename group_activity by group_utilization and remove unused total_utilization
- repair SD_PREFER_SIBLING and use it for SMT level
- reorder patchset to gather patches with same topics
Change since V1:
- add 3 fixes
- correct some commit messages
- replace capacity computation by activity
- take into account current cpu capacity
[1] https://lkml.org/lkml/2013/10/18/121
[2] https://lkml.org/lkml/2014/3/19/377
[3] https://lkml.org/lkml/2014/7/18/110
[4] https://lkml.org/lkml/2014/7/25/589
Vincent Guittot (12):
sched: fix imbalance flag reset
sched: remove a wake_affine condition
sched: fix avg_load computation
sched: Allow all archs to set the capacity_orig
ARM: topology: use new cpu_capacity interface
sched: add per rq cpu_capacity_orig
sched: test the cpu's capacity in wake affine
sched: move cfs task on a CPU with higher capacity
sched: add usage_load_avg
sched: get CPU's utilization statistic
sched: replace capacity_factor by utilization
sched: add SD_PREFER_SIBLING for SMT level
arch/arm/kernel/topology.c | 4 +-
include/linux/sched.h | 4 +-
kernel/sched/core.c | 3 +-
kernel/sched/fair.c | 350 ++++++++++++++++++++++++++-------------------
kernel/sched/sched.h | 3 +-
5 files changed, 207 insertions(+), 157 deletions(-)
--
1.9.1
Part of this patchset was previously part of the larger tasks packing
patchset [1]. I have split the latter into (at least) 3 different
patchsets to make things easier.
-configuration of sched_domain topology [2]
-update and consolidation of cpu_power (this patchset)
-tasks packing algorithm
SMT systems are no longer the only systems that can have CPUs with an
original capacity different from the default value. We need to extend the
use of cpu_power_orig to all kinds of platforms so the scheduler will
have both the maximum capacity (cpu_power_orig/power_orig) and the
current capacity (cpu_power/power) of CPUs and sched_groups. A new
function arch_scale_cpu_power has been created to replace
arch_scale_smt_power, which is SMT-specific, in the computation of the
capacity of a CPU.
During load balance, the scheduler evaluates the number of tasks that a
group of CPUs can handle. The current method assumes that tasks have a
fixed load of SCHED_LOAD_SCALE and CPUs have a default capacity of
SCHED_POWER_SCALE. This assumption generates wrong decisions by creating
ghost cores and by removing real ones when the original capacity of CPUs
is different from the default SCHED_POWER_SCALE.
Now that we have the original capacity of CPUs and their
activity/utilization, we can evaluate more accurately the capacity of a
group of CPUs.
This patchset mainly replaces the old capacity method by a new one and
keeps the policy almost unchanged, whereas we can certainly take
advantage of this new statistic in several other parts of the load
balancer.
TODO:
- align variable and field names with the renaming [3]
Test results:
I have put below the results of 2 tests:
- hackbench -l 500 -s 4096
- scp of 100MB file on the platform
on a dual cortex-A7
hackbench scp
tip/master 25.75s(+/-0.25) 5.16MB/s(+/-1.49)
+ patches 1,2 25.89s(+/-0.31) 5.18MB/s(+/-1.45)
+ patches 3-10 25.68s(+/-0.22) 7.00MB/s(+/-1.88)
+ irq accounting 25.80s(+/-0.25) 8.06MB/s(+/-0.05)
on a quad cortex-A15
hackbench scp
tip/master 15.69s(+/-0.16) 9.70MB/s(+/-0.04)
+ patches 1,2 15.53s(+/-0.13) 9.72MB/s(+/-0.05)
+ patches 3-10 15.56s(+/-0.22) 9.88MB/s(+/-0.05)
+ irq accounting 15.99s(+/-0.08) 10.37MB/s(+/-0.03)
The improvement in scp bandwidth happens when tasks and irqs are using
different CPUs, which is a bit random without the irq accounting config.
Change since V1:
- add 3 fixes
- correct some commit messages
- replace capacity computation by activity
- take into account current cpu capacity
[1] https://lkml.org/lkml/2013/10/18/121
[2] https://lkml.org/lkml/2014/3/19/377
[3] https://lkml.org/lkml/2014/5/14/622
Vincent Guittot (11):
sched: fix imbalance flag reset
sched: remove a wake_affine condition
sched: fix avg_load computation
sched: Allow all archs to set the power_orig
ARM: topology: use new cpu_power interface
sched: add per rq cpu_power_orig
Revert "sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED"
sched: get CPU's activity statistic
sched: test the cpu's capacity in wake affine
sched: move cfs task on a CPU with higher capacity
sched: replace capacity by activity
arch/arm/kernel/topology.c | 4 +-
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 229 ++++++++++++++++++++++-----------------------
kernel/sched/sched.h | 5 +-
4 files changed, 118 insertions(+), 122 deletions(-)
--
1.9.1
During load balance, the scheduler evaluates the number of tasks that a
group of CPUs can handle. The current method assumes that tasks have a
fixed load of SCHED_LOAD_SCALE and CPUs have a default capacity of
SCHED_CAPACITY_SCALE. This assumption generates wrong decisions by
creating ghost cores or by removing real ones when the original capacity
of CPUs is different from the default SCHED_CAPACITY_SCALE. We no longer
try to evaluate the number of available cores based on the
group_capacity; instead we evaluate the usage of a group and compare it
with its capacity.
This patchset mainly replaces the old capacity method by a new one and
keeps the policy almost unchanged, whereas we could certainly take
advantage of this new statistic in several other parts of the load
balancer.
The utilization_avg_contrib is based on the current implementation of
load average tracking. I also have a version of utilization_avg_contrib
that is based on the new implementation proposal [1], but I haven't
provided the patches and results as [1] is still under review. I can
provide changes on top of [1] to change how utilization_avg_contrib is
computed and adapt to the new mechanism.
Change since V6
- add group usage tracking
- fix some commits' messages
- minor fix like comments and argument order
Change since V5
- remove patches that have been merged since v5 : patches 01, 02, 03, 04, 05, 07
- update commit log and add more details on the purpose of the patches
- fix/remove useless code with the rebase on patchset [2]
- remove capacity_orig in sched_group_capacity as it is not used
- move code in the right patch
- add some helper function to factorize code
Change since V4
- rebase to manage conflicts with changes in selection of busiest group [4]
Change since V3:
- add usage_avg_contrib statistic which sums the running time of tasks on a rq
- use usage_avg_contrib instead of runnable_avg_sum for cpu_utilization
- fix replacement power by capacity
- update some comments
Change since V2:
- rebase on top of capacity renaming
- fix wake_affine statistic update
- rework nohz_kick_needed
- optimize the active migration of a task from CPU with reduced capacity
- rename group_activity by group_utilization and remove unused total_utilization
- repair SD_PREFER_SIBLING and use it for SMT level
- reorder patchset to gather patches with same topics
Change since V1:
- add 3 fixes
- correct some commit messages
- replace capacity computation by activity
- take into account current cpu capacity
[1] https://lkml.org/lkml/2014/7/18/110
[2] https://lkml.org/lkml/2014/7/25/589
Morten Rasmussen (1):
sched: Track group sched_entity usage contributions
Vincent Guittot (6):
sched: add per rq cpu_capacity_orig
sched: move cfs task on a CPU with higher capacity
sched: add utilization_avg_contrib
sched: get CPU's usage statistic
sched: replace capacity_factor by usage
sched: add SD_PREFER_SIBLING for SMT level
include/linux/sched.h | 21 +++-
kernel/sched/core.c | 15 +--
kernel/sched/debug.c | 12 ++-
kernel/sched/fair.c | 283 ++++++++++++++++++++++++++++++--------------------
kernel/sched/sched.h | 11 +-
5 files changed, 209 insertions(+), 133 deletions(-)
--
1.9.1
During load balance, the scheduler evaluates the number of tasks that a
group of CPUs can handle. The current method assumes that tasks have a
fixed load of SCHED_LOAD_SCALE and CPUs have a default capacity of
SCHED_CAPACITY_SCALE. This assumption generates wrong decisions by
creating ghost cores or by removing real ones when the original capacity
of CPUs is different from the default SCHED_CAPACITY_SCALE. We no longer
try to evaluate the number of available cores based on the
group_capacity; instead we evaluate the usage of a group and compare it
with its capacity.
This patchset mainly replaces the old capacity method by a new one and
keeps the policy almost unchanged, whereas we could certainly take
advantage of this new statistic in several other parts of the load
balancer.
The utilization_avg_contrib is based on the current implementation of
load average tracking. I also have a version of utilization_avg_contrib
that is based on the new implementation proposal [1], but I haven't
provided the patches and results as [1] is still under review. I can
provide changes on top of [1] to change how utilization_avg_contrib is
computed and adapt to the new mechanism.
Change since V5
- remove patches that have been merged since v5 : patches 01, 02, 03, 04, 05, 07
- update commit log and add more details on the purpose of the patches
- fix/remove useless code with the rebase on patchset [2]
- remove capacity_orig in sched_group_capacity as it is not used
- move code in the right patch
- add some helper function to factorize code
Change since V4
- rebase to manage conflicts with changes in selection of busiest group [4]
Change since V3:
- add usage_avg_contrib statistic which sums the running time of tasks on a rq
- use usage_avg_contrib instead of runnable_avg_sum for cpu_utilization
- fix replacement power by capacity
- update some comments
Change since V2:
- rebase on top of capacity renaming
- fix wake_affine statistic update
- rework nohz_kick_needed
- optimize the active migration of a task from CPU with reduced capacity
- rename group_activity by group_utilization and remove unused total_utilization
- repair SD_PREFER_SIBLING and use it for SMT level
- reorder patchset to gather patches with same topics
Change since V1:
- add 3 fixes
- correct some commit messages
- replace capacity computation by activity
- take into account current cpu capacity
[1] https://lkml.org/lkml/2014/7/18/110
[2] https://lkml.org/lkml/2014/7/25/589
Vincent Guittot (6):
sched: add per rq cpu_capacity_orig
sched: move cfs task on a CPU with higher capacity
sched: add utilization_avg_contrib
sched: get CPU's usage statistic
sched: replace capacity_factor by usage
sched: add SD_PREFER_SIBLING for SMT level
include/linux/sched.h | 19 +++-
kernel/sched/core.c | 15 +--
kernel/sched/debug.c | 9 +-
kernel/sched/fair.c | 276 ++++++++++++++++++++++++++++++--------------------
kernel/sched/sched.h | 11 +-
5 files changed, 199 insertions(+), 131 deletions(-)
--
1.9.1
This patchset consolidates several changes in the capacity and the usage
tracking of the CPU. It provides a frequency invariant metric of the usage of
CPUs and generally improves the accuracy of load/usage tracking in the
scheduler. The frequency invariant metric is the foundation required for the
consolidation of cpufreq and implementation of a fully invariant load tracking.
These are currently WIP and require several changes to the load balancer
(including how it will use and interpret load and capacity metrics) and
extensive validation. The frequency invariance is done with
arch_scale_freq_capacity, and this patchset doesn't provide the backends
of the function, which are architecture dependent.
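As a rough sketch of what the frequency-invariant accumulation means
(simplified, and assuming arch_scale_freq_capacity() returns
cur_freq/max_freq scaled by SCHED_CAPACITY_SCALE):

	unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);

	/* running time is scaled by the current frequency before being
	 * accumulated, so 10ms at half speed counts like 5ms at full
	 * speed */
	scaled_delta = (delta * scale_freq) >> SCHED_CAPACITY_SHIFT;
	sa->running_avg_sum += scaled_delta;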
As discussed at LPC14, Morten and I have consolidated our changes into a single
patchset to make it easier to review and merge.
During load balance, the scheduler evaluates the number of tasks that a
group of CPUs can handle. The current method assumes that tasks have a
fixed load of SCHED_LOAD_SCALE and CPUs have a default capacity of
SCHED_CAPACITY_SCALE. This assumption generates wrong decisions by
creating ghost cores or by removing real ones when the original capacity
of CPUs is different from the default SCHED_CAPACITY_SCALE. With this
patch set, we no longer try to evaluate the number of available cores
based on the group_capacity; instead we evaluate the usage of a group and
compare it with its capacity.
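The comparison itself reduces to something like the sketch below (the
helper name and the 25% margin are illustrative; the series uses an
imbalance_pct-style margin):

	static inline bool group_is_fully_utilized(unsigned long group_usage,
						   unsigned long group_capacity)
	{
		/* the margin absorbs noise in the tracked usage metric */
		return group_usage * 125 >= group_capacity * 100;
	}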
This patchset mainly replaces the old capacity_factor method by a new one
and keeps the general policy almost unchanged. These new metrics will
also be used in later patches.
The CPU usage is based on a running-time-tracking version of the current
implementation of load average tracking. I also have a version that is
based on the new implementation proposal [1], but I haven't provided the
patches and results as [1] is still under review. I can provide changes
on top of [1] to change how CPU usage is computed and to adapt to the new
mechanism.
Change since V7
- add freq invariance for usage tracking
- add freq invariance for scale_rt
- update comments and commits' message
- fix init of utilization_avg_contrib
- fix prefer_sibling
Change since V6
- add group usage tracking
- fix some commits' messages
- minor fix like comments and argument order
Change since V5
- remove patches that have been merged since v5 : patches 01, 02, 03, 04, 05, 07
- update commit log and add more details on the purpose of the patches
- fix/remove useless code with the rebase on patchset [2]
- remove capacity_orig in sched_group_capacity as it is not used
- move code in the right patch
- add some helper function to factorize code
Change since V4
- rebase to manage conflicts with changes in selection of busiest group
Change since V3:
- add usage_avg_contrib statistic which sums the running time of tasks on a rq
- use usage_avg_contrib instead of runnable_avg_sum for cpu_utilization
- fix replacement power by capacity
- update some comments
Change since V2:
- rebase on top of capacity renaming
- fix wake_affine statistic update
- rework nohz_kick_needed
- optimize the active migration of a task from CPU with reduced capacity
- rename group_activity by group_utilization and remove unused total_utilization
- repair SD_PREFER_SIBLING and use it for SMT level
- reorder patchset to gather patches with same topics
Change since V1:
- add 3 fixes
- correct some commit messages
- replace capacity computation by activity
- take into account current cpu capacity
[1] https://lkml.org/lkml/2014/10/10/131
[2] https://lkml.org/lkml/2014/7/25/589
Morten Rasmussen (2):
sched: Track group sched_entity usage contributions
sched: Make sched entity usage tracking scale-invariant
Vincent Guittot (8):
sched: add per rq cpu_capacity_orig
sched: remove frequency scaling from cpu_capacity
sched: move cfs task on a CPU with higher capacity
sched: add utilization_avg_contrib
sched: get CPU's usage statistic
sched: replace capacity_factor by usage
sched: add SD_PREFER_SIBLING for SMT level
sched: make scale_rt invariant with frequency
include/linux/sched.h | 21 ++-
kernel/sched/core.c | 15 +-
kernel/sched/debug.c | 12 +-
kernel/sched/fair.c | 369 ++++++++++++++++++++++++++++++++------------------
kernel/sched/sched.h | 15 +-
5 files changed, 276 insertions(+), 156 deletions(-)
--
1.9.1
From: Mark Brown <broonie(a)linaro.org>
The only requirement the scheduler has on cluster IDs is that they must
be unique. When enumerating the topology based on MPIDR information the
kernel currently generates cluster IDs by using the first level of
affinity above the core ID (either level one or two depending on if the
core has multiple threads); however, the ARMv8 architecture allows for up
to three levels of affinity. This means that an ARMv8 system may contain
cores whose MPIDRs are identical except for affinity level three, which
with the current code will cause us to report multiple cores with the
same identification to the scheduler, in violation of its uniqueness
requirement.
Ensure that we do not violate the scheduler's requirements on systems
that use all the affinity levels by incorporating both affinity levels
two and three into the cluster ID when the cores are not threaded.
While no currently known hardware uses multi-level clusters, it is
better to program defensively: this will help ease bringup of systems
that have them, and will ensure that things like distribution install
media do not need to be respun to replace kernels in order to deploy
such systems.
In the worst case the system will work but perform suboptimally until a
kernel modified to handle the new topology better is installed; in the
best case this will be an adequate description of such topologies for
the scheduler to perform well.
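A worked example with hypothetical MPIDR values, for two non-threaded
cores that differ only in Aff3:

	core A: Aff3=0, Aff2=1, Aff1=0, Aff0=0
	core B: Aff3=1, Aff2=1, Aff1=0, Aff0=0

	before: cluster_id = Aff2              -> 1 for both (collision)
	after:  cluster_id = Aff2 | Aff3 << 8  -> 1 and 257  (unique)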
Signed-off-by: Mark Brown <broonie(a)linaro.org>
---
arch/arm64/kernel/topology.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index b6ee26b..5752c1b 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -255,7 +255,8 @@ void store_cpu_topology(unsigned int cpuid)
/* Multiprocessor system : Multi-threads per core */
cpuid_topo->thread_id = MPIDR_AFFINITY_LEVEL(mpidr, 0);
cpuid_topo->core_id = MPIDR_AFFINITY_LEVEL(mpidr, 1);
- cpuid_topo->cluster_id = MPIDR_AFFINITY_LEVEL(mpidr, 2);
+ cpuid_topo->cluster_id = MPIDR_AFFINITY_LEVEL(mpidr, 2) |
+ MPIDR_AFFINITY_LEVEL(mpidr, 3) << 8;
} else {
/* Multiprocessor system : Single-thread per core */
cpuid_topo->thread_id = -1;
--
2.1.0.rc1
Currently kdb's ftdump command unconditionally crashes due to a NULL
pointer dereference whenever the command is run. This in turn causes
the kernel to panic.
The abridged stacktrace (gathered with ARCH=arm) is:
--- cut here ---
[<c09535ac>] (panic) from [<c02132dc>] (die+0x264/0x440)
[<c02132dc>] (die) from [<c0952eb8>]
(__do_kernel_fault.part.11+0x74/0x84)
[<c0952eb8>] (__do_kernel_fault.part.11) from [<c021f954>]
(do_page_fault+0x1d0/0x3c4)
[<c021f954>] (do_page_fault) from [<c020846c>] (do_DataAbort+0x48/0xac)
[<c020846c>] (do_DataAbort) from [<c0213c58>] (__dabt_svc+0x38/0x60)
Exception stack(0xc0deba88 to 0xc0debad0)
ba80: e8c29180 00000001 e9854304 e9854300 c0f567d8
c0df2580
baa0: 00000000 00000000 00000000 c0f117b8 c0e3a3c0 c0debb0c 00000000
c0debad0
bac0: 0000672e c02f4d60 60000193 ffffffff
[<c0213c58>] (__dabt_svc) from [<c02f4d60>] (kdb_ftdump+0x1e4/0x3d8)
[<c02f4d60>] (kdb_ftdump) from [<c02ce328>] (kdb_parse+0x2b8/0x698)
[<c02ce328>] (kdb_parse) from [<c02ceef0>] (kdb_main_loop+0x52c/0x784)
[<c02ceef0>] (kdb_main_loop) from [<c02d1b0c>] (kdb_stub+0x238/0x490)
--- cut here ---
The NULL deref occurs due to the uninitialized use of struct
trace_iterator's buffer_iter member.
This patch solves this by providing a collection of ring_buffer_iter(s)
and using it to initialize buffer_iter. Note that static allocation is
used solely because the trace_iterator itself is also statically
allocated.
Signed-off-by: Daniel Thompson <daniel.thompson(a)linaro.org>
Cc: Jason Wessel <jason.wessel(a)windriver.com>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Cc: Ingo Molnar <mingo(a)redhat.com>
---
kernel/trace/trace_kdb.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/trace/trace_kdb.c b/kernel/trace/trace_kdb.c
index bd90e1b..989288a 100644
--- a/kernel/trace/trace_kdb.c
+++ b/kernel/trace/trace_kdb.c
@@ -20,10 +20,12 @@ static void ftrace_dump_buf(int skip_lines, long cpu_file)
{
/* use static because iter can be a bit big for the stack */
static struct trace_iterator iter;
+ static struct ring_buffer_iter *buffer_iter[CONFIG_NR_CPUS];
unsigned int old_userobj;
int cnt = 0, cpu;
trace_init_global_iter(&iter);
+ iter.buffer_iter = buffer_iter;
for_each_tracing_cpu(cpu) {
atomic_inc(&per_cpu_ptr(iter.trace_buffer->data, cpu)->disabled);
--
1.9.3