linaro-kernel October 2014

linaro-kernel@lists.linaro.org

57 participants
210 discussions

[PATCH] arm64: kgdb: fix single stepping

by AKASHI Takahiro

I tried to verify kgdb in vanilla kernel on fast model, but it seems that the single stepping with kgdb doesn't work correctly since its first appearance at v3.15. On v3.15, 'stepi' command after breaking the kernel at some breakpoint steps forward to the next instruction, but the succeeding 'stepi' never goes beyond that. On v3.16, 'stepi' moves forward and stops at the next instruction just after enable_dbg in el1_dbg, and never goes beyond that. This variance of behavior seems to come in with the following patch in v3.16: commit 2a2830703a23 ("arm64: debug: avoid accessing mdscr_el1 on fault paths where possible") This patch (1) moves kgdb_disable_single_step() from 'c' command handling to single step handler. This makes sure that single stepping gets effective at every 's' command. Please note that, under the current implementation, single step bit in spsr, which is cleared by the first single stepping, will not be set again for the consecutive 's' commands because single step bit in mdscr is still kept on (that is, kernel_active_single_step() in kgdb_arch_handle_exception() is true). (2) re-implements kgdb_roundup_cpus() because the current implementation enabled interrupts naively. See below. (3) removes 'enable_dbg' in el1_dbg. Single step bit in mdscr is turned on in do_handle_exception()-> kgdb_handle_expection() before returning to debugged context, and if debug exception is enabled in el1_dbg, we will see unexpected single- stepping in el1_dbg. Since v3.18, the following patch does the same: commit 1059c6bf8534 ("arm64: debug: don't re-enable debug exceptions on return from el1_dbg) (4) masks interrupts while single-stepping one instruction. If an interrupt is caught during processing a single-stepping, debug exception is unintentionally enabled by el1_irq's 'enable_dbg' before returning to debugged context. Thus, like in (2), we will see unexpected single-stepping in el1_irq. Basically (1) and (2) are for v3.15, (3) and (4) for v3.1[67]. * issue fixed by (2): Without (2), we would see another problem if a breakpoint is set at interrupt-sensible places, like gic_handle_irq(): KGDB: re-enter error: breakpoint removed ffffffc000081258 ------------[ cut here ]------------ WARNING: CPU: 0 PID: 650 at kernel/debug/debug_core.c:435 kgdb_handle_exception+0x1dc/0x1f4() Modules linked in: CPU: 0 PID: 650 Comm: sh Not tainted 3.17.0-rc2+ #177 Call trace: [<ffffffc000087fac>] dump_backtrace+0x0/0x130 [<ffffffc0000880ec>] show_stack+0x10/0x1c [<ffffffc0004d683c>] dump_stack+0x74/0xb8 [<ffffffc0000ab824>] warn_slowpath_common+0x8c/0xb4 [<ffffffc0000ab90c>] warn_slowpath_null+0x14/0x20 [<ffffffc000121bfc>] kgdb_handle_exception+0x1d8/0x1f4 [<ffffffc000092ffc>] kgdb_brk_fn+0x18/0x28 [<ffffffc0000821c8>] brk_handler+0x9c/0xe8 [<ffffffc0000811e8>] do_debug_exception+0x3c/0xac Exception stack(0xffffffc07e027650 to 0xffffffc07e027770) ... [<ffffffc000083cac>] el1_dbg+0x14/0x68 [<ffffffc00012178c>] kgdb_cpu_enter+0x464/0x5c0 [<ffffffc000121bb4>] kgdb_handle_exception+0x190/0x1f4 [<ffffffc000092ffc>] kgdb_brk_fn+0x18/0x28 [<ffffffc0000821c8>] brk_handler+0x9c/0xe8 [<ffffffc0000811e8>] do_debug_exception+0x3c/0xac Exception stack(0xffffffc07e027ac0 to 0xffffffc07e027be0) ... [<ffffffc000083cac>] el1_dbg+0x14/0x68 [<ffffffc00032e4b4>] __handle_sysrq+0x11c/0x190 [<ffffffc00032e93c>] write_sysrq_trigger+0x4c/0x60 [<ffffffc0001e7d58>] proc_reg_write+0x54/0x84 [<ffffffc000192fa4>] vfs_write+0x98/0x1c8 [<ffffffc0001939b0>] SyS_write+0x40/0xa0 Once some interrupt occurs, a breakpoint at gic_handle_irq() triggers kgdb. Kgdb then calls kgdb_roundup_cpus() to sync with other cpus. Current kgdb_roundup_cpus() unmasks interrupts temporarily to use smp_call_function(). This eventually allows another interrupt to occur and likely results in hitting a breakpoint at gic_handle_irq() again since debug exception is always enabled in el1_irq. We can avoid this issue by specifying "nokgdbroundup" in kernel parameter, but this will also leave other cpus be in unknown state in terms of kgdb, and may result in interfering with kgdb activity. Signed-off-by: AKASHI Takahiro <takahiro.akashi(a)linaro.org> --- arch/arm64/kernel/kgdb.c | 60 +++++++++++++++++++++++++++++++++++----------- 1 file changed, 46 insertions(+), 14 deletions(-) diff --git a/arch/arm64/kernel/kgdb.c b/arch/arm64/kernel/kgdb.c index a0d10c5..81b5910 100644 --- a/arch/arm64/kernel/kgdb.c +++ b/arch/arm64/kernel/kgdb.c @@ -19,9 +19,13 @@ * along with this program. If not, see <http://www.gnu.org/licenses/>. */ +#include <linux/cpumask.h> #include <linux/irq.h> +#include <linux/irq_work.h> #include <linux/kdebug.h> #include <linux/kgdb.h> +#include <linux/percpu.h> +#include <asm/ptrace.h> #include <asm/traps.h> struct dbg_reg_def_t dbg_reg_def[DBG_MAX_REG_NUM] = { @@ -95,6 +99,9 @@ struct dbg_reg_def_t dbg_reg_def[DBG_MAX_REG_NUM] = { { "fpcr", 4, -1 }, }; +static DEFINE_PER_CPU(unsigned int, kgdb_pstate); +static DEFINE_PER_CPU(struct irq_work, kgdb_irq_work); + char *dbg_get_reg(int regno, void *mem, struct pt_regs *regs) { if (regno >= DBG_MAX_REG_NUM || regno < 0) @@ -176,18 +183,14 @@ int kgdb_arch_handle_exception(int exception_vector, int signo, * over and over again. */ kgdb_arch_update_addr(linux_regs, remcom_in_buffer); - atomic_set(&kgdb_cpu_doing_single_step, -1); - kgdb_single_step = 0; - - /* - * Received continue command, disable single step - */ - if (kernel_active_single_step()) - kernel_disable_single_step(); err = 0; break; case 's': + /* mask interrupts while single stepping */ + __this_cpu_write(kgdb_pstate, linux_regs->pstate); + linux_regs->pstate |= PSR_I_BIT; + /* * Update step address value with address passed * with step packet. @@ -198,8 +201,6 @@ int kgdb_arch_handle_exception(int exception_vector, int signo, */ kgdb_arch_update_addr(linux_regs, remcom_in_buffer); atomic_set(&kgdb_cpu_doing_single_step, raw_smp_processor_id()); - kgdb_single_step = 1; - /* * Enable single step handling */ @@ -229,6 +230,18 @@ static int kgdb_compiled_brk_fn(struct pt_regs *regs, unsigned int esr) static int kgdb_step_brk_fn(struct pt_regs *regs, unsigned int esr) { + unsigned int pstate; + + kernel_disable_single_step(); + atomic_set(&kgdb_cpu_doing_single_step, -1); + + /* restore interrupt mask status */ + pstate = __this_cpu_read(kgdb_pstate); + if (pstate & PSR_I_BIT) + regs->pstate |= PSR_I_BIT; + else + regs->pstate &= ~PSR_I_BIT; + kgdb_handle_exception(1, SIGTRAP, 0, regs); return 0; } @@ -249,16 +262,27 @@ static struct step_hook kgdb_step_hook = { .fn = kgdb_step_brk_fn }; -static void kgdb_call_nmi_hook(void *ignored) +static void kgdb_roundup_hook(struct irq_work *work) { kgdb_nmicallback(raw_smp_processor_id(), get_irq_regs()); } void kgdb_roundup_cpus(unsigned long flags) { - local_irq_enable(); - smp_call_function(kgdb_call_nmi_hook, NULL, 0); - local_irq_disable(); + int cpu; + struct cpumask mask; + struct irq_work *work; + + mask = *cpu_online_mask; + cpumask_clear_cpu(smp_processor_id(), &mask); + cpu = cpumask_first(&mask); + if (cpu >= nr_cpu_ids) + return; + + for_each_cpu(cpu, &mask) { + work = per_cpu_ptr(&kgdb_irq_work, cpu); + irq_work_queue_on(work, cpu); + } } static int __kgdb_notify(struct die_args *args, unsigned long cmd) @@ -299,6 +323,8 @@ static struct notifier_block kgdb_notifier = { int kgdb_arch_init(void) { int ret = register_die_notifier(&kgdb_notifier); + int cpu; + struct irq_work *work; if (ret != 0) return ret; @@ -306,6 +332,12 @@ int kgdb_arch_init(void) register_break_hook(&kgdb_brkpt_hook); register_break_hook(&kgdb_compiled_brkpt_hook); register_step_hook(&kgdb_step_hook); + + for_each_possible_cpu(cpu) { + work = per_cpu_ptr(&kgdb_irq_work, cpu); + init_irq_work(work, kgdb_roundup_hook); + } + return 0; } -- 1.7.9.5

10 years, 6 months

[PATCH] ARM: exynos_defconfig: disable CONFIG_EXYNOS5420_MCPM; not stable

by Kevin Hilman

From: Kevin Hilman <khilman(a)linaro.org> The option CONFIG_EXYNOS5420_MCPM is causing imprecise external aborts during boot testing, causing various userspace startup failures. Disable until it has gotten more testing. Cc: Kukjin Kim <kgene.kim(a)samsung.com>, Cc: Javier Martinez Canillas <javier.martinez(a)collabora.co.uk>, Cc: Sachin Kamat <sachin.kamat(a)samsung.com>, Cc: Doug Anderson <dianders(a)chromium.org>, Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie(a)samsung.com>, Cc: Krzysztof Kozlowski <k.kozlowski(a)samsung.com>, Cc: Tushar Behera <tushar.behera(a)linaro.org>, Cc: stable(a)vger.kernel.org # v3.17+ Signed-off-by: Kevin Hilman <khilman(a)linaro.org> --- This has been reported by a few people[1], but not investigated or fixed, so it's time to disable this feature until it can be fixed. [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-September/288344… arch/arm/configs/exynos_defconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm/configs/exynos_defconfig b/arch/arm/configs/exynos_defconfig index 72058b8a6f4d..a250dcbf34cd 100644 --- a/arch/arm/configs/exynos_defconfig +++ b/arch/arm/configs/exynos_defconfig @@ -10,7 +10,7 @@ CONFIG_MODULE_UNLOAD=y CONFIG_PARTITION_ADVANCED=y CONFIG_ARCH_EXYNOS=y CONFIG_ARCH_EXYNOS3=y -CONFIG_EXYNOS5420_MCPM=y +CONFIG_EXYNOS5420_MCPM=n CONFIG_SMP=y CONFIG_BIG_LITTLE=y CONFIG_BL_SWITCHER=y -- 2.1.0

10 years, 7 months

[PATCH RFC 0/7] scheduler-driven cpu frequency scaling

by Mike Turquette

This series demonstrates cpu frequency scaling via a simple policy driven by the scheduler. Specifically the policy evaluates cpu frequency when cpu utilization is updated from enqueue_task_fair and dequeue_task_fair. The policy itself uses a simple up/down threshold scheme using the same 80%/20% cpu utilization boundaries that are used by default in the ondemand cpufreq governor. This series is not intended for merging, but instead to ignite some discussion around scheduler-driven cpu frequency selection. Of particular interest to me is the policy itself and how it might integrate with task placement in CFS's load_balance. Additionally I'd like to ask the scheduler experts about which call sites in CFS are right for evaluating cpu frequency selection; maybe {en,de}queue_task_fair are not such a good idea? The messiest part of this series is the cpumask stuff, where I tried to track which cpus have updated statistics in the case of a sched_entity which contains several other sched_entities that are spread across cpus. As discussed at Linux Plumbers Conference 2014, I will replace this complexity with simpler logic that ignores scheduler cgroups in the next version. In any case I am posting the code I have now. This code is experiemental and bugs are Guaranteed. These patches are based on the scale invariance series from Morten[0]. The variables names in this RFC will doubtless change once that work is rebased onto Vincent's series[1]. [0] http://lkml.kernel.org/r/<1411403047-32010-1-git-send-email-morten.rasmussen(a)arm.com> [1] http://lkml.kernel.org/r/<1412684017-16595-1-git-send-email-vincent.guittot(a)linaro.org> Mike Turquette (6): sched: cfs: declare capacity_of in sched.h sched: fair: add usage_util_of helper cpufreq: add per-governor private data sched: cfs: cpu frequency scaling arch functions sched: cfs: cpu frequency scaling based on task placement sched: energy_model: simple cpu frequency scaling policy Morten Rasmussen (1): sched: Make energy awareness a sched feature drivers/cpufreq/Kconfig | 21 +++ include/linux/cpufreq.h | 6 + kernel/sched/Makefile | 1 + kernel/sched/energy_model.c | 341 ++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/fair.c | 69 ++++++++- kernel/sched/features.h | 6 + kernel/sched/sched.h | 3 + 7 files changed, 445 insertions(+), 2 deletions(-) create mode 100644 kernel/sched/energy_model.c -- 1.8.3.2

10 years, 7 months

[PATCH v4 00/12] sched: consolidation of cpu_capacity

by Vincent Guittot

Part of this patchset was previously part of the larger tasks packing patchset [1]. I have splitted the latter in 3 different patchsets (at least) to make the thing easier. -configuration of sched_domain topology [2] -update and consolidation of cpu_capacity (this patchset) -tasks packing algorithm SMT system is no more the only system that can have a CPUs with an original capacity that is different from the default value. We need to extend the use of cpu_capacity_orig to all kind of platform so the scheduler will have both the maximum capacity (cpu_capacity_orig/capacity_orig) and the current capacity (cpu_capacity/capacity) of CPUs and sched_groups. A new function arch_scale_cpu_capacity has been created and replace arch_scale_smt_capacity, which is SMT specifc in the computation of the capapcity of a CPU. During load balance, the scheduler evaluates the number of tasks that a group of CPUs can handle. The current method assumes that tasks have a fix load of SCHED_LOAD_SCALE and CPUs have a default capacity of SCHED_CAPACITY_SCALE. This assumption generates wrong decision by creating ghost cores and by removing real ones when the original capacity of CPUs is different from the default SCHED_CAPACITY_SCALE. We don't try anymore to evaluate the number of available cores based on the group_capacity but instead we detect when the group is fully utilized Now that we have the original capacity of CPUS and their activity/utilization, we can evaluate more accuratly the capacity and the level of utilization of a group of CPUs. This patchset mainly replaces the old capacity method by a new one and has kept the policy almost unchanged whereas we could certainly take advantage of this new statistic in several other places of the load balance. Tests results: I have put below results of 4 kind of tests: - hackbench -l 500 -s 4096 - perf bench sched pipe -l 400000 - scp of 100MB file on the platform - ebizzy with various number of threads on 4 kernels : - tip = tip/sched/core - step1 = tip + patches(1-8) - patchset = tip + whole patchset - patchset+irq = tip + this patchset + irq accounting each test has been run 6 times and the figure below show the stdev and the diff compared to the tip kernel Dual A7 tip | +step1 | +patchset | patchset+irq stdev | results stdev | results stdev | results stdev hackbench (lower is better) (+/-)0.64% | -0.19% (+/-)0.73% | 0.58% (+/-)1.29% | 0.20% (+/-)1.00% perf (lower is better) (+/-)0.28% | 1.22% (+/-)0.17% | 1.29% (+/-)0.06% | 2.85% (+/-)0.33% scp (+/-)4.81% | 2.61% (+/-)0.28% | 2.39% (+/-)0.22% | 82.18% (+/-)3.30% ebizzy -t 1 (+/-)2.31% | -1.32% (+/-)1.90% | -0.79% (+/-)2.88% | 3.10% (+/-)2.32% ebizzy -t 2 (+/-)0.70% | 8.29% (+/-)6.66% | 1.93% (+/-)5.47% | 2.72% (+/-)5.72% ebizzy -t 4 (+/-)3.54% | 5.57% (+/-)8.00% | 0.36% (+/-)9.00% | 2.53% (+/-)3.17% ebizzy -t 6 (+/-)2.36% | -0.43% (+/-)3.29% | -1.93% (+/-)3.47% | 0.57% (+/-)0.75% ebizzy -t 8 (+/-)1.65% | -0.45% (+/-)0.93% | -1.95% (+/-)1.52% | -1.18% (+/-)1.61% ebizzy -t 10 (+/-)2.55% | -0.98% (+/-)3.06% | -1.18% (+/-)6.17% | -2.33% (+/-)3.28% ebizzy -t 12 (+/-)6.22% | 0.17% (+/-)5.63% | 2.98% (+/-)7.11% | 1.19% (+/-)4.68% ebizzy -t 14 (+/-)5.38% | -0.14% (+/-)5.33% | 2.49% (+/-)4.93% | 1.43% (+/-)6.55% Quad A15 tip | +patchset1 | +patchset2 | patchset+irq stdev | results stdev | results stdev | results stdev hackbench (lower is better) (+/-)0.78% | 0.87% (+/-)1.72% | 0.91% (+/-)2.02% | 3.30% (+/-)2.02% perf (lower is better) (+/-)2.03% | -0.31% (+/-)0.76% | -2.38% (+/-)1.37% | 1.42% (+/-)3.14% scp (+/-)0.04% | 0.51% (+/-)1.37% | 1.79% (+/-)0.84% | 1.77% (+/-)0.38% ebizzy -t 1 (+/-)0.41% | 2.05% (+/-)0.38% | 2.08% (+/-)0.24% | 0.17% (+/-)0.62% ebizzy -t 2 (+/-)0.78% | 0.60% (+/-)0.63% | 0.43% (+/-)0.48% | 1.61% (+/-)0.38% ebizzy -t 4 (+/-)0.58% | -0.10% (+/-)0.97% | -0.65% (+/-)0.76% | -0.75% (+/-)0.86% ebizzy -t 6 (+/-)0.31% | 1.07% (+/-)1.12% | -0.16% (+/-)0.87% | -0.76% (+/-)0.22% ebizzy -t 8 (+/-)0.95% | -0.30% (+/-)0.85% | -0.79% (+/-)0.28% | -1.66% (+/-)0.21% ebizzy -t 10 (+/-)0.31% | 0.04% (+/-)0.97% | -1.44% (+/-)1.54% | -0.55% (+/-)0.62% ebizzy -t 12 (+/-)8.35% | -1.89% (+/-)7.64% | 0.75% (+/-)5.30% | -1.18% (+/-)8.16% ebizzy -t 14 (+/-)13.17% | 6.22% (+/-)4.71% | 5.25% (+/-)9.14% | 5.87% (+/-)5.77% I haven't been able to fully test the patchset for a SMT system to check that the regression that has been reported by Preethi has been solved but the various tests that i have done, don't show any regression so far. The correction of SD_PREFER_SIBLING mode and the use of the latter at SMT level should have fix the regression. The usage_avg_contrib is based on the current implementation of the load avg tracking. I also have a version of the usage_avg_contrib that is based on the new implementation [3] but haven't provide the patches and results as [3] is still under review. I can provide change above [3] to change how usage_avg_contrib is computed and adapt to new mecanism. TODO: manage conflict with the next version of [4] Change since V3: - add usage_avg_contrib statistic which sums the running time of tasks on a rq - use usage_avg_contrib instead of runnable_avg_sum for cpu_utilization - fix replacement power by capacity - update some comments Change since V2: - rebase on top of capacity renaming - fix wake_affine statistic update - rework nohz_kick_needed - optimize the active migration of a task from CPU with reduced capacity - rename group_activity by group_utilization and remove unused total_utilization - repair SD_PREFER_SIBLING and use it for SMT level - reorder patchset to gather patches with same topics Change since V1: - add 3 fixes - correct some commit messages - replace capacity computation by activity - take into account current cpu capacity [1] https://lkml.org/lkml/2013/10/18/121 [2] https://lkml.org/lkml/2014/3/19/377 [3] https://lkml.org/lkml/2014/7/18/110 [4] https://lkml.org/lkml/2014/7/25/589 Vincent Guittot (12): sched: fix imbalance flag reset sched: remove a wake_affine condition sched: fix avg_load computation sched: Allow all archs to set the capacity_orig ARM: topology: use new cpu_capacity interface sched: add per rq cpu_capacity_orig sched: test the cpu's capacity in wake affine sched: move cfs task on a CPU with higher capacity sched: add usage_load_avg sched: get CPU's utilization statistic sched: replace capacity_factor by utilization sched: add SD_PREFER_SIBLING for SMT level arch/arm/kernel/topology.c | 4 +- include/linux/sched.h | 4 +- kernel/sched/core.c | 3 +- kernel/sched/fair.c | 350 ++++++++++++++++++++++++++------------------- kernel/sched/sched.h | 3 +- 5 files changed, 207 insertions(+), 157 deletions(-) -- 1.9.1

10 years, 7 months

[PATCH v2 00/11] sched: consolidation of cpu_power

by Vincent Guittot

10 years, 7 months

[PATCH v7 0/7] sched: consolidation of cpu_capacity

by Vincent Guittot

During load balance, the scheduler evaluates the number of tasks that a group of CPUs can handle. The current method assumes that tasks have a fix load of SCHED_LOAD_SCALE and CPUs have a default capacity of SCHED_CAPACITY_SCALE. This assumption generates wrong decision by creating ghost cores or by removing real ones when the original capacity of CPUs is different from the default SCHED_CAPACITY_SCALE. We don't try anymore to evaluate the number of available cores based on the group_capacity but instead we evaluate the usage of a group and compare it with its capacity. This patchset mainly replaces the old capacity method by a new one and has kept the policy almost unchanged whereas we could certainly take advantage of this new statistic in several other places of the load balance. The utilization_avg_contrib is based on the current implementation of the load avg tracking. I also have a version of the utilization_avg_contrib that is based on the new implementation proposal [1] but haven't provide the patches and results as [1] is still under review. I can provide change above [1] to change how utilization_avg_contrib is computed and adapt to new mecanism. Change since V6 - add group usage tracking - fix some commits' messages - minor fix like comments and argument order Change since V5 - remove patches that have been merged since v5 : patches 01, 02, 03, 04, 05, 07 - update commit log and add more details on the purpose of the patches - fix/remove useless code with the rebase on patchset [2] - remove capacity_orig in sched_group_capacity as it is not used - move code in the right patch - add some helper function to factorize code Change since V4 - rebase to manage conflicts with changes in selection of busiest group [4] Change since V3: - add usage_avg_contrib statistic which sums the running time of tasks on a rq - use usage_avg_contrib instead of runnable_avg_sum for cpu_utilization - fix replacement power by capacity - update some comments Change since V2: - rebase on top of capacity renaming - fix wake_affine statistic update - rework nohz_kick_needed - optimize the active migration of a task from CPU with reduced capacity - rename group_activity by group_utilization and remove unused total_utilization - repair SD_PREFER_SIBLING and use it for SMT level - reorder patchset to gather patches with same topics Change since V1: - add 3 fixes - correct some commit messages - replace capacity computation by activity - take into account current cpu capacity [1] https://lkml.org/lkml/2014/7/18/110 [2] https://lkml.org/lkml/2014/7/25/589 Morten Rasmussen (1): sched: Track group sched_entity usage contributions Vincent Guittot (6): sched: add per rq cpu_capacity_orig sched: move cfs task on a CPU with higher capacity sched: add utilization_avg_contrib sched: get CPU's usage statistic sched: replace capacity_factor by usage sched: add SD_PREFER_SIBLING for SMT level include/linux/sched.h | 21 +++- kernel/sched/core.c | 15 +-- kernel/sched/debug.c | 12 ++- kernel/sched/fair.c | 283 ++++++++++++++++++++++++++++++-------------------- kernel/sched/sched.h | 11 +- 5 files changed, 209 insertions(+), 133 deletions(-) -- 1.9.1

10 years, 7 months

[PATCH v6 0/6] sched: consolidation of cpu_capacity

by Vincent Guittot

During load balance, the scheduler evaluates the number of tasks that a group of CPUs can handle. The current method assumes that tasks have a fix load of SCHED_LOAD_SCALE and CPUs have a default capacity of SCHED_CAPACITY_SCALE. This assumption generates wrong decision by creating ghost cores or by removing real ones when the original capacity of CPUs is different from the default SCHED_CAPACITY_SCALE. We don't try anymore to evaluate the number of available cores based on the group_capacity but instead we evaluate the usage of a group and compare it with its capacity This patchset mainly replaces the old capacity method by a new one and has kept the policy almost unchanged whereas we could certainly take advantage of this new statistic in several other places of the load balance. The utilization_avg_contrib is based on the current implementation of the load avg tracking. I also have a version of the utilization_avg_contrib that is based on the new implementation proposal [1] but haven't provide the patches and results as [1] is still under review. I can provide change above [1] to change how utilization_avg_contrib is computed and adapt to new mecanism. Change since V5 - remove patches that have been merged since v5 : patches 01, 02, 03, 04, 05, 07 - update commit log and add more details on the purpose of the patches - fix/remove useless code with the rebase on patchset [2] - remove capacity_orig in sched_group_capacity as it is not used - move code in the right patch - add some helper function to factorize code Change since V4 - rebase to manage conflicts with changes in selection of busiest group [4] Change since V3: - add usage_avg_contrib statistic which sums the running time of tasks on a rq - use usage_avg_contrib instead of runnable_avg_sum for cpu_utilization - fix replacement power by capacity - update some comments Change since V2: - rebase on top of capacity renaming - fix wake_affine statistic update - rework nohz_kick_needed - optimize the active migration of a task from CPU with reduced capacity - rename group_activity by group_utilization and remove unused total_utilization - repair SD_PREFER_SIBLING and use it for SMT level - reorder patchset to gather patches with same topics Change since V1: - add 3 fixes - correct some commit messages - replace capacity computation by activity - take into account current cpu capacity [1] https://lkml.org/lkml/2014/7/18/110 [2] https://lkml.org/lkml/2014/7/25/589 Vincent Guittot (6): sched: add per rq cpu_capacity_orig sched: move cfs task on a CPU with higher capacity sched: add utilization_avg_contrib sched: get CPU's usage statistic sched: replace capacity_factor by usage sched: add SD_PREFER_SIBLING for SMT level include/linux/sched.h | 19 +++- kernel/sched/core.c | 15 +-- kernel/sched/debug.c | 9 +- kernel/sched/fair.c | 276 ++++++++++++++++++++++++++++++-------------------- kernel/sched/sched.h | 11 +- 5 files changed, 199 insertions(+), 131 deletions(-) -- 1.9.1

10 years, 7 months

[PATCH v8 00/10] sched: consolidation of CPU capacity and usage

by Vincent Guittot

This patchset consolidates several changes in the capacity and the usage tracking of the CPU. It provides a frequency invariant metric of the usage of CPUs and generally improves the accuracy of load/usage tracking in the scheduler. The frequency invariant metric is the foundation required for the consolidation of cpufreq and implementation of a fully invariant load tracking. These are currently WIP and require several changes to the load balancer (including how it will use and interprets load and capacity metrics) and extensive validation. The frequency invariance is done with arch_scale_freq_capacity and this patchset doesn't provide the backends of the function which are architecture dependent. As discussed at LPC14, Morten and I have consolidated our changes into a single patchset to make it easier to review and merge. During load balance, the scheduler evaluates the number of tasks that a group of CPUs can handle. The current method assumes that tasks have a fix load of SCHED_LOAD_SCALE and CPUs have a default capacity of SCHED_CAPACITY_SCALE. This assumption generates wrong decision by creating ghost cores or by removing real ones when the original capacity of CPUs is different from the default SCHED_CAPACITY_SCALE. With this patch set, we don't try anymore to evaluate the number of available cores based on the group_capacity but instead we evaluate the usage of a group and compare it with its capacity. This patchset mainly replaces the old capacity_factor method by a new one and keeps the general policy almost unchanged. These new metrics will be also used in later patches. The CPU usage is based on a running time tracking version of the current implementation of the load average tracking. I also have a version that is based on the new implementation proposal [1] but I haven't provide the patches and results as [1] is still under review. I can provide change above [1] to change how CPU usage is computed and to adapt to new mecanism. Change since V7 - add freq invariance for usage tracking - add freq invariance for scale_rt - update comments and commits' message - fix init of utilization_avg_contrib - fix prefer_sibling Change since V6 - add group usage tracking - fix some commits' messages - minor fix like comments and argument order Change since V5 - remove patches that have been merged since v5 : patches 01, 02, 03, 04, 05, 07 - update commit log and add more details on the purpose of the patches - fix/remove useless code with the rebase on patchset [2] - remove capacity_orig in sched_group_capacity as it is not used - move code in the right patch - add some helper function to factorize code Change since V4 - rebase to manage conflicts with changes in selection of busiest group Change since V3: - add usage_avg_contrib statistic which sums the running time of tasks on a rq - use usage_avg_contrib instead of runnable_avg_sum for cpu_utilization - fix replacement power by capacity - update some comments Change since V2: - rebase on top of capacity renaming - fix wake_affine statistic update - rework nohz_kick_needed - optimize the active migration of a task from CPU with reduced capacity - rename group_activity by group_utilization and remove unused total_utilization - repair SD_PREFER_SIBLING and use it for SMT level - reorder patchset to gather patches with same topics Change since V1: - add 3 fixes - correct some commit messages - replace capacity computation by activity - take into account current cpu capacity [1] https://lkml.org/lkml/2014/10/10/131 [2] https://lkml.org/lkml/2014/7/25/589 Morten Rasmussen (2): sched: Track group sched_entity usage contributions sched: Make sched entity usage tracking scale-invariant Vincent Guittot (8): sched: add per rq cpu_capacity_orig sched: remove frequency scaling from cpu_capacity sched: move cfs task on a CPU with higher capacity sched: add utilization_avg_contrib sched: get CPU's usage statistic sched: replace capacity_factor by usage sched: add SD_PREFER_SIBLING for SMT level sched: make scale_rt invariant with frequency include/linux/sched.h | 21 ++- kernel/sched/core.c | 15 +- kernel/sched/debug.c | 12 +- kernel/sched/fair.c | 369 ++++++++++++++++++++++++++++++++------------------ kernel/sched/sched.h | 15 +- 5 files changed, 276 insertions(+), 156 deletions(-) -- 1.9.1

10 years, 7 months

[PATCH] arm64: topology: Fix handling of multi-level cluster MPIDR-based detection

by Mark Brown

From: Mark Brown <broonie(a)linaro.org> The only requirement the scheduler has on cluster IDs is that they must be unique. When enumerating the topology based on MPIDR information the kernel currently generates cluster IDs by using the first level of affinity above the core ID (either level one or two depending on if the core has multiple threads) however the ARMv8 architecture allows for up to three levels of affinity. This means that an ARMv8 system may contain cores which have MPIDRs identical other than affinity level three which with current code will cause us to report multiple cores with the same identification to the scheduler in violation of its uniqueness requirement. Ensure that we do not violate the scheduler requirements on systems that uses all the affinity levels by incorporating both affinity levels two and three into the cluser ID when the cores are not threaded. While no currently known hardware uses multi-level clusters it is better to program defensively, this will help ease bringup of systems that have them and will ensure that things like distribution install media do not need to be respun to replace kernels in order to deploy such systems. In the worst case the system will work but perform suboptimally until a kernel modified to handle the new topology better is installed, in the best case this will be an adequate description of such topologies for the scheduler to perform well. Signed-off-by: Mark Brown <broonie(a)linaro.org> --- arch/arm64/kernel/topology.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c index b6ee26b..5752c1b 100644 --- a/arch/arm64/kernel/topology.c +++ b/arch/arm64/kernel/topology.c @@ -255,7 +255,8 @@ void store_cpu_topology(unsigned int cpuid) /* Multiprocessor system : Multi-threads per core */ cpuid_topo->thread_id = MPIDR_AFFINITY_LEVEL(mpidr, 0); cpuid_topo->core_id = MPIDR_AFFINITY_LEVEL(mpidr, 1); - cpuid_topo->cluster_id = MPIDR_AFFINITY_LEVEL(mpidr, 2); + cpuid_topo->cluster_id = MPIDR_AFFINITY_LEVEL(mpidr, 2) | + MPIDR_AFFINITY_LEVEL(mpidr, 3) << 8; } else { /* Multiprocessor system : Single-thread per core */ cpuid_topo->thread_id = -1; -- 2.1.0.rc1

10 years, 7 months

[PATCH 3.17-rc4] trace: kdb: Fix kernel panic during ftdump

by Daniel Thompson

Currently kdb's ftdump command unconditionally crashes due to a null pointer de-reference whenever the command is run. This in turn causes the kernel to panic. The abridged stacktrace (gathered with ARCH=arm) is: --- cut here --- [<c09535ac>] (panic) from [<c02132dc>] (die+0x264/0x440) [<c02132dc>] (die) from [<c0952eb8>] (__do_kernel_fault.part.11+0x74/0x84) [<c0952eb8>] (__do_kernel_fault.part.11) from [<c021f954>] (do_page_fault+0x1d0/0x3c4) [<c021f954>] (do_page_fault) from [<c020846c>] (do_DataAbort+0x48/0xac) [<c020846c>] (do_DataAbort) from [<c0213c58>] (__dabt_svc+0x38/0x60) Exception stack(0xc0deba88 to 0xc0debad0) ba80: e8c29180 00000001 e9854304 e9854300 c0f567d8 c0df2580 baa0: 00000000 00000000 00000000 c0f117b8 c0e3a3c0 c0debb0c 00000000 c0debad0 bac0: 0000672e c02f4d60 60000193 ffffffff [<c0213c58>] (__dabt_svc) from [<c02f4d60>] (kdb_ftdump+0x1e4/0x3d8) [<c02f4d60>] (kdb_ftdump) from [<c02ce328>] (kdb_parse+0x2b8/0x698) [<c02ce328>] (kdb_parse) from [<c02ceef0>] (kdb_main_loop+0x52c/0x784) [<c02ceef0>] (kdb_main_loop) from [<c02d1b0c>] (kdb_stub+0x238/0x490) --- cut here --- The NULL deref occurs due to the initialized use of struct trace_iter's buffer_iter member. This patch solves this by providing a collection of ring_buffer_iter(s) and using this to initialized buffer_iter. Note that static allocation is used solely because the trace_iter itself is also static allocated. Signed-off-by: Daniel Thompson <daniel.thompson(a)linaro.org> Cc: Jason Wessel <jason.wessel(a)windriver.com> Cc: Steven Rostedt <rostedt(a)goodmis.org> Cc: Ingo Molnar <mingo(a)redhat.com> --- kernel/trace/trace_kdb.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/trace/trace_kdb.c b/kernel/trace/trace_kdb.c index bd90e1b..989288a 100644 --- a/kernel/trace/trace_kdb.c +++ b/kernel/trace/trace_kdb.c @@ -20,10 +20,12 @@ static void ftrace_dump_buf(int skip_lines, long cpu_file) { /* use static because iter can be a bit big for the stack */ static struct trace_iterator iter; + static struct ring_buffer_iter *buffer_iter[CONFIG_NR_CPUS]; unsigned int old_userobj; int cnt = 0, cpu; trace_init_global_iter(&iter); + iter.buffer_iter = buffer_iter; for_each_tracing_cpu(cpu) { atomic_inc(&per_cpu_ptr(iter.trace_buffer->data, cpu)->disabled); -- 1.9.3

10 years, 7 months

Jump to page:

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

linaro-kernel October 2014