[PATCH v2 0/11] remove cpu_load in rq

List overview All Threads
Download

newer

older

How to mount rootfs on FastModel

queue boot: 35 pass, 0 fail...

Alex Shi

17 Feb 2014 17 Feb '14

1:55 a.m.

The cpu_load decays on time according past cpu load of rq. The sched_avg also decays tasks' load on time. Now we has 2 kind decay for cpu_load. That is a kind of redundancy. And increase the system load by decay calculation. This patch try to remove the cpu_load decay.

There are 5 load_idx used for cpu_load in sched_domain. busy_idx and idle_idx are not zero usually, but newidle_idx, wake_idx and forkexec_idx are all zero on every arch. A shortcut to remove cpu_Load decay in the first patch. just one line patch for this change.

V2, 1, This version do some tuning on load bias of target load, to maximum match current code logical. 2, Got further to remove the cpu_load in rq. 3, Revert the patch 'Limit sd->*_idx range on sysctl' since no needs

Any testing/comments are appreciated.

This patch rebase on latest tip/master. The git tree for this patchset at: git@github.com:alexshi/power-scheduling.git noload

Thanks Alex

Show replies by date

Alex Shi

17 Feb 17 Feb

1:55 a.m.

New subject: [PATCH v2 01/11] sched: shortcut to remove load_idx

Shortcut to remove rq->cpu_load[load_idx] effect in scheduler. In five load idx, only busy_idx, idle_idx are not zero. Newidle_idx, wake_idx and fork_idx are all zero in all archs.

So, change the idx to zero here can fully remove load_idx effect.

Signed-off-by: Alex Shi alex.shi@linaro.org --- kernel/sched/fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 235cfa7..4fcc3a3 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5908,7 +5908,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd if (child && child->flags & SD_PREFER_SIBLING) prefer_sibling = 1;

- load_idx = get_sd_load_idx(env->sd, env->idle); + load_idx = 0;

do { struct sg_lb_stats *sgs = &tmp_sgs;

-- 1.8.1.2

Alex Shi

1:55 a.m.

New subject: [PATCH v2 02/11] sched: remove rq->cpu_load[load_idx] array

Since load_idx effect removed in load balance, we don't need the load_idx decays in scheduler. that will save some process in sched_tick and others places.

Signed-off-by: Alex Shi alex.shi@linaro.org --- arch/ia64/include/asm/topology.h | 5 --- arch/metag/include/asm/topology.h | 5 --- arch/tile/include/asm/topology.h | 6 --- include/linux/sched.h | 5 --- include/linux/topology.h | 8 ---- kernel/sched/core.c | 58 +++++++----------------- kernel/sched/debug.c | 6 +-- kernel/sched/fair.c | 79 +++++++++------------------------ kernel/sched/proc.c | 92 ++------------------------------------- kernel/sched/sched.h | 3 +- 10 files changed, 42 insertions(+), 225 deletions(-)

diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h index a2496e4..54e5b17 100644 --- a/arch/ia64/include/asm/topology.h +++ b/arch/ia64/include/asm/topology.h @@ -55,11 +55,6 @@ void build_cpu_to_node_map(void); .busy_factor = 64, \ .imbalance_pct = 125, \ .cache_nice_tries = 2, \ - .busy_idx = 2, \ - .idle_idx = 1, \ - .newidle_idx = 0, \ - .wake_idx = 0, \ - .forkexec_idx = 0, \ .flags = SD_LOAD_BALANCE \ | SD_BALANCE_NEWIDLE \ | SD_BALANCE_EXEC \ diff --git a/arch/metag/include/asm/topology.h b/arch/metag/include/asm/topology.h index 8e9c0b3..d1d15cd 100644 --- a/arch/metag/include/asm/topology.h +++ b/arch/metag/include/asm/topology.h @@ -13,11 +13,6 @@ .busy_factor = 32, \ .imbalance_pct = 125, \ .cache_nice_tries = 2, \ - .busy_idx = 3, \ - .idle_idx = 2, \ - .newidle_idx = 0, \ - .wake_idx = 0, \ - .forkexec_idx = 0, \ .flags = SD_LOAD_BALANCE \ | SD_BALANCE_FORK \ | SD_BALANCE_EXEC \ diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h index d15c0d8..05f6ffe 100644 --- a/arch/tile/include/asm/topology.h +++ b/arch/tile/include/asm/topology.h @@ -57,12 +57,6 @@ static inline const struct cpumask *cpumask_of_node(int node) .busy_factor = 64, \ .imbalance_pct = 125, \ .cache_nice_tries = 1, \ - .busy_idx = 2, \ - .idle_idx = 1, \ - .newidle_idx = 0, \ - .wake_idx = 0, \ - .forkexec_idx = 0, \ - \ .flags = 1*SD_LOAD_BALANCE \ | 1*SD_BALANCE_NEWIDLE \ | 1*SD_BALANCE_EXEC \ diff --git a/include/linux/sched.h b/include/linux/sched.h index c49a258..6c416c8 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -892,11 +892,6 @@ struct sched_domain { unsigned int busy_factor; /* less balancing by factor if busy */ unsigned int imbalance_pct; /* No balance until over watermark */ unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */ - unsigned int busy_idx; - unsigned int idle_idx; - unsigned int newidle_idx; - unsigned int wake_idx; - unsigned int forkexec_idx; unsigned int smt_gain;

int nohz_idle; /* NOHZ IDLE status */ diff --git a/include/linux/topology.h b/include/linux/topology.h index 12ae6ce..863fad3 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -121,9 +121,6 @@ int arch_update_cpu_topology(void); .busy_factor = 64, \ .imbalance_pct = 125, \ .cache_nice_tries = 1, \ - .busy_idx = 2, \ - .wake_idx = 0, \ - .forkexec_idx = 0, \ \ .flags = 1*SD_LOAD_BALANCE \ | 1*SD_BALANCE_NEWIDLE \ @@ -151,11 +148,6 @@ int arch_update_cpu_topology(void); .busy_factor = 64, \ .imbalance_pct = 125, \ .cache_nice_tries = 1, \ - .busy_idx = 2, \ - .idle_idx = 1, \ - .newidle_idx = 0, \ - .wake_idx = 0, \ - .forkexec_idx = 0, \ \ .flags = 1*SD_LOAD_BALANCE \ | 1*SD_BALANCE_NEWIDLE \ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index fb9764f..ac2f10c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4787,64 +4787,45 @@ static void sd_free_ctl_entry(struct ctl_table **tablep) *tablep = NULL; }

-static int min_load_idx = 0; -static int max_load_idx = CPU_LOAD_IDX_MAX-1; - static void set_table_entry(struct ctl_table *entry, const char *procname, void *data, int maxlen, - umode_t mode, proc_handler *proc_handler, - bool load_idx) + umode_t mode, proc_handler *proc_handler) { entry->procname = procname; entry->data = data; entry->maxlen = maxlen; entry->mode = mode; entry->proc_handler = proc_handler; - - if (load_idx) { - entry->extra1 = &min_load_idx; - entry->extra2 = &max_load_idx; - } }

static struct ctl_table * sd_alloc_ctl_domain_table(struct sched_domain *sd) { - struct ctl_table *table = sd_alloc_ctl_entry(14); + struct ctl_table *table = sd_alloc_ctl_entry(9);

if (table == NULL) return NULL;

set_table_entry(&table[0], "min_interval", &sd->min_interval, - sizeof(long), 0644, proc_doulongvec_minmax, false); + sizeof(long), 0644, proc_doulongvec_minmax); set_table_entry(&table[1], "max_interval", &sd->max_interval, - sizeof(long), 0644, proc_doulongvec_minmax, false); - set_table_entry(&table[2], "busy_idx", &sd->busy_idx, - sizeof(int), 0644, proc_dointvec_minmax, true); - set_table_entry(&table[3], "idle_idx", &sd->idle_idx, - sizeof(int), 0644, proc_dointvec_minmax, true); - set_table_entry(&table[4], "newidle_idx", &sd->newidle_idx, - sizeof(int), 0644, proc_dointvec_minmax, true); - set_table_entry(&table[5], "wake_idx", &sd->wake_idx, - sizeof(int), 0644, proc_dointvec_minmax, true); - set_table_entry(&table[6], "forkexec_idx", &sd->forkexec_idx, - sizeof(int), 0644, proc_dointvec_minmax, true); - set_table_entry(&table[7], "busy_factor", &sd->busy_factor, - sizeof(int), 0644, proc_dointvec_minmax, false); - set_table_entry(&table[8], "imbalance_pct", &sd->imbalance_pct, - sizeof(int), 0644, proc_dointvec_minmax, false); - set_table_entry(&table[9], "cache_nice_tries", + sizeof(long), 0644, proc_doulongvec_minmax); + set_table_entry(&table[2], "busy_factor", &sd->busy_factor, + sizeof(int), 0644, proc_dointvec_minmax); + set_table_entry(&table[3], "imbalance_pct", &sd->imbalance_pct, + sizeof(int), 0644, proc_dointvec_minmax); + set_table_entry(&table[4], "cache_nice_tries", &sd->cache_nice_tries, - sizeof(int), 0644, proc_dointvec_minmax, false); + sizeof(int), 0644, proc_dointvec_minmax); set_table_entry(&table[10], "flags", &sd->flags, - sizeof(int), 0644, proc_dointvec_minmax, false); + sizeof(int), 0644, proc_dointvec_minmax); set_table_entry(&table[11], "max_newidle_lb_cost", &sd->max_newidle_lb_cost, - sizeof(long), 0644, proc_doulongvec_minmax, false); + sizeof(long), 0644, proc_doulongvec_minmax); set_table_entry(&table[12], "name", sd->name, - CORENAME_MAX_SIZE, 0444, proc_dostring, false); - /* &table[13] is terminator */ + CORENAME_MAX_SIZE, 0444, proc_dostring); + /* &table[8] is terminator */

return table; } @@ -5967,11 +5948,6 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu) .busy_factor = 32, .imbalance_pct = 125, .cache_nice_tries = 2, - .busy_idx = 3, - .idle_idx = 2, - .newidle_idx = 0, - .wake_idx = 0, - .forkexec_idx = 0,

.flags = 1*SD_LOAD_BALANCE | 1*SD_BALANCE_NEWIDLE @@ -6721,7 +6697,7 @@ DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);

void __init sched_init(void) { - int i, j; + int i; unsigned long alloc_size = 0, ptr;

#ifdef CONFIG_FAIR_GROUP_SCHED @@ -6825,9 +6801,7 @@ void __init sched_init(void) init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL); #endif

- for (j = 0; j < CPU_LOAD_IDX_MAX; j++) - rq->cpu_load[j] = 0; - + rq->cpu_load = 0; rq->last_load_update_tick = jiffies;

#ifdef CONFIG_SMP diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index f3344c3..a24d549 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -303,11 +303,7 @@ do { \ PN(next_balance); SEQ_printf(m, " .%-30s: %ld\n", "curr->pid", (long)(task_pid_nr(rq->curr))); PN(clock); - P(cpu_load[0]); - P(cpu_load[1]); - P(cpu_load[2]); - P(cpu_load[3]); - P(cpu_load[4]); + P(cpu_load); #undef P #undef PN

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4fcc3a3..eeffe75 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1015,8 +1015,8 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page, }

static unsigned long weighted_cpuload(const int cpu); -static unsigned long source_load(int cpu, int type); -static unsigned long target_load(int cpu, int type); +static unsigned long source_load(int cpu); +static unsigned long target_load(int cpu); static unsigned long power_of(int cpu); static long effective_load(struct task_group *tg, int cpu, long wl, long wg);

@@ -3952,30 +3952,30 @@ static unsigned long weighted_cpuload(const int cpu) * We want to under-estimate the load of migration sources, to * balance conservatively. */ -static unsigned long source_load(int cpu, int type) +static unsigned long source_load(int cpu) { struct rq *rq = cpu_rq(cpu); unsigned long total = weighted_cpuload(cpu);

- if (type == 0 || !sched_feat(LB_BIAS)) + if (!sched_feat(LB_BIAS)) return total;

- return min(rq->cpu_load[type-1], total); + return min(rq->cpu_load, total); }

/* * Return a high guess at the load of a migration-target cpu weighted * according to the scheduling class and "nice" value. */ -static unsigned long target_load(int cpu, int type) +static unsigned long target_load(int cpu) { struct rq *rq = cpu_rq(cpu); unsigned long total = weighted_cpuload(cpu);

- if (type == 0 || !sched_feat(LB_BIAS)) + if (!sched_feat(LB_BIAS)) return total;

- return max(rq->cpu_load[type-1], total); + return max(rq->cpu_load, total); }

static unsigned long power_of(int cpu) @@ -4175,7 +4175,7 @@ static int wake_wide(struct task_struct *p) static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) { s64 this_load, load; - int idx, this_cpu, prev_cpu; + int this_cpu, prev_cpu; unsigned long tl_per_task; struct task_group *tg; unsigned long weight; @@ -4188,11 +4188,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) if (wake_wide(p)) return 0;

- idx = sd->wake_idx; this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); - load = source_load(prev_cpu, idx); - this_load = target_load(this_cpu, idx); + load = source_load(prev_cpu); + this_load = target_load(this_cpu);

/* * If sync wakeup then subtract the (maximum possible) @@ -4248,7 +4247,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)

if (balanced || (this_load <= load && - this_load + target_load(prev_cpu, idx) <= tl_per_task)) { + this_load + target_load(prev_cpu) <= tl_per_task)) { /* * This domain has SD_WAKE_AFFINE and * p is cache cold in this domain, and @@ -4267,17 +4266,12 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) * domain. */ static struct sched_group * -find_idlest_group(struct sched_domain *sd, struct task_struct *p, - int this_cpu, int sd_flag) +find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) { struct sched_group *idlest = NULL, *group = sd->groups; unsigned long min_load = ULONG_MAX, this_load = 0; - int load_idx = sd->forkexec_idx; int imbalance = 100 + (sd->imbalance_pct-100)/2;

- if (sd_flag & SD_BALANCE_WAKE) - load_idx = sd->wake_idx; - do { unsigned long load, avg_load; int local_group; @@ -4297,9 +4291,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, for_each_cpu(i, sched_group_cpus(group)) { /* Bias balancing toward cpus of our domain */ if (local_group) - load = source_load(i, load_idx); + load = source_load(i); else - load = target_load(i, load_idx); + load = target_load(i);

avg_load += load; } @@ -4453,7 +4447,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f continue; }

- group = find_idlest_group(sd, p, cpu, sd_flag); + group = find_idlest_group(sd, p, cpu); if (!group) { sd = sd->child; continue; @@ -5495,34 +5489,6 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds) }; }

-/** - * get_sd_load_idx - Obtain the load index for a given sched domain. - * @sd: The sched_domain whose load_idx is to be obtained. - * @idle: The idle status of the CPU for whose sd load_idx is obtained. - * - * Return: The load index. - */ -static inline int get_sd_load_idx(struct sched_domain *sd, - enum cpu_idle_type idle) -{ - int load_idx; - - switch (idle) { - case CPU_NOT_IDLE: - load_idx = sd->busy_idx; - break; - - case CPU_NEWLY_IDLE: - load_idx = sd->newidle_idx; - break; - default: - load_idx = sd->idle_idx; - break; - } - - return load_idx; -} - static unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu) { return SCHED_POWER_SCALE; @@ -5770,12 +5736,11 @@ static inline int sg_capacity(struct lb_env *env, struct sched_group *group) * update_sg_lb_stats - Update sched_group's statistics for load balancing. * @env: The load balancing environment. * @group: sched_group whose statistics are to be updated. - * @load_idx: Load index of sched_domain of this_cpu for load calc. * @local_group: Does group contain this_cpu. * @sgs: variable to hold the statistics for this group. */ static inline void update_sg_lb_stats(struct lb_env *env, - struct sched_group *group, int load_idx, + struct sched_group *group, int local_group, struct sg_lb_stats *sgs) { unsigned long load; @@ -5788,9 +5753,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,

/* Bias balancing toward cpus of our domain */ if (local_group) - load = target_load(i, load_idx); + load = target_load(i); else - load = source_load(i, load_idx); + load = source_load(i);

sgs->group_load += load; sgs->sum_nr_running += rq->nr_running; @@ -5903,13 +5868,11 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd struct sched_domain *child = env->sd->child; struct sched_group *sg = env->sd->groups; struct sg_lb_stats tmp_sgs; - int load_idx, prefer_sibling = 0; + int prefer_sibling = 0;

if (child && child->flags & SD_PREFER_SIBLING) prefer_sibling = 1;

- load_idx = 0; - do { struct sg_lb_stats *sgs = &tmp_sgs; int local_group; @@ -5924,7 +5887,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd update_group_power(env->sd, env->dst_cpu); }

- update_sg_lb_stats(env, sg, load_idx, local_group, sgs); + update_sg_lb_stats(env, sg, local_group, sgs);

if (local_group) goto next_group; diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c index 16f5a30..a2435c5 100644 --- a/kernel/sched/proc.c +++ b/kernel/sched/proc.c @@ -11,7 +11,7 @@ unsigned long this_cpu_load(void) { struct rq *this = this_rq(); - return this->cpu_load[0]; + return this->cpu_load; }

@@ -398,105 +398,19 @@ static void calc_load_account_active(struct rq *this_rq) * End of global load-average stuff */

-/* - * The exact cpuload at various idx values, calculated at every tick would be - * load = (2^idx - 1) / 2^idx * load + 1 / 2^idx * cur_load - * - * If a cpu misses updates for n-1 ticks (as it was idle) and update gets called - * on nth tick when cpu may be busy, then we have: - * load = ((2^idx - 1) / 2^idx)^(n-1) * load - * load = (2^idx - 1) / 2^idx) * load + 1 / 2^idx * cur_load - * - * decay_load_missed() below does efficient calculation of - * load = ((2^idx - 1) / 2^idx)^(n-1) * load - * avoiding 0..n-1 loop doing load = ((2^idx - 1) / 2^idx) * load - * - * The calculation is approximated on a 128 point scale. - * degrade_zero_ticks is the number of ticks after which load at any - * particular idx is approximated to be zero. - * degrade_factor is a precomputed table, a row for each load idx. - * Each column corresponds to degradation factor for a power of two ticks, - * based on 128 point scale. - * Example: - * row 2, col 3 (=12) says that the degradation at load idx 2 after - * 8 ticks is 12/128 (which is an approximation of exact factor 3^8/4^8). - * - * With this power of 2 load factors, we can degrade the load n times - * by looking at 1 bits in n and doing as many mult/shift instead of - * n mult/shifts needed by the exact degradation. - */ -#define DEGRADE_SHIFT 7 -static const unsigned char - degrade_zero_ticks[CPU_LOAD_IDX_MAX] = {0, 8, 32, 64, 128}; -static const unsigned char - degrade_factor[CPU_LOAD_IDX_MAX][DEGRADE_SHIFT + 1] = { - {0, 0, 0, 0, 0, 0, 0, 0}, - {64, 32, 8, 0, 0, 0, 0, 0}, - {96, 72, 40, 12, 1, 0, 0}, - {112, 98, 75, 43, 15, 1, 0}, - {120, 112, 98, 76, 45, 16, 2} };

/* - * Update cpu_load for any missed ticks, due to tickless idle. The backlog - * would be when CPU is idle and so we just decay the old load without - * adding any new load. - */ -static unsigned long -decay_load_missed(unsigned long load, unsigned long missed_updates, int idx) -{ - int j = 0; - - if (!missed_updates) - return load; - - if (missed_updates >= degrade_zero_ticks[idx]) - return 0; - - if (idx == 1) - return load >> missed_updates; - - while (missed_updates) { - if (missed_updates % 2) - load = (load * degrade_factor[idx][j]) >> DEGRADE_SHIFT; - - missed_updates >>= 1; - j++; - } - return load; -} - -/* - * Update rq->cpu_load[] statistics. This function is usually called every + * Update rq->cpu_load statistics. This function is usually called every * scheduler tick (TICK_NSEC). With tickless idle this will not be called * every tick. We fix it up based on jiffies. */ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load, unsigned long pending_updates) { - int i, scale; - this_rq->nr_load_updates++;

/* Update our load: */ - this_rq->cpu_load[0] = this_load; /* Fasttrack for idx 0 */ - for (i = 1, scale = 2; i < CPU_LOAD_IDX_MAX; i++, scale += scale) { - unsigned long old_load, new_load; - - /* scale is effectively 1 << i now, and >> i divides by scale */ - - old_load = this_rq->cpu_load[i]; - old_load = decay_load_missed(old_load, pending_updates - 1, i); - new_load = this_load; - /* - * Round up the averaging division if load is increasing. This - * prevents us from getting stuck on 9 if the load is 10, for - * example. - */ - if (new_load > old_load) - new_load += scale - 1; - - this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i; - } + this_rq->cpu_load = this_load; /* Fasttrack for idx 0 */

sched_avg_update(this_rq); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 1bf34c2..5b2d4a1 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -517,8 +517,7 @@ struct rq { unsigned int nr_numa_running; unsigned int nr_preferred_running; #endif - #define CPU_LOAD_IDX_MAX 5 - unsigned long cpu_load[CPU_LOAD_IDX_MAX]; + unsigned long cpu_load; unsigned long last_load_update_tick; #ifdef CONFIG_NO_HZ_COMMON u64 nohz_stamp;

-- 1.8.1.2

Alex Shi

1:55 a.m.

New subject: [PATCH v2 03/11] sched: clean up cpu_load update

Since we don't decay the rq->cpu_load, so we don't need the pending_updates. But we still want update rq->rt_avg, so still keep rq->last_load_update_tick and func __update_cpu_load.

Signed-off-by: Alex Shi alex.shi@linaro.org --- kernel/sched/proc.c | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c index a2435c5..057bb9b 100644 --- a/kernel/sched/proc.c +++ b/kernel/sched/proc.c @@ -404,8 +404,7 @@ static void calc_load_account_active(struct rq *this_rq) * scheduler tick (TICK_NSEC). With tickless idle this will not be called * every tick. We fix it up based on jiffies. */ -static void __update_cpu_load(struct rq *this_rq, unsigned long this_load, - unsigned long pending_updates) +static void __update_cpu_load(struct rq *this_rq, unsigned long this_load) { this_rq->nr_load_updates++;

@@ -449,7 +448,6 @@ void update_idle_cpu_load(struct rq *this_rq) { unsigned long curr_jiffies = ACCESS_ONCE(jiffies); unsigned long load = get_rq_runnable_load(this_rq); - unsigned long pending_updates;

/* * bail if there's load or we're actually up-to-date. @@ -457,10 +455,9 @@ void update_idle_cpu_load(struct rq *this_rq) if (load || curr_jiffies == this_rq->last_load_update_tick) return;

- pending_updates = curr_jiffies - this_rq->last_load_update_tick; this_rq->last_load_update_tick = curr_jiffies;

- __update_cpu_load(this_rq, load, pending_updates); + __update_cpu_load(this_rq, load); }

/* @@ -483,7 +480,7 @@ void update_cpu_load_nohz(void) * We were idle, this means load 0, the current load might be * !0 due to remote wakeups and the sort. */ - __update_cpu_load(this_rq, 0, pending_updates); + __update_cpu_load(this_rq, 0); } raw_spin_unlock(&this_rq->lock); } @@ -499,7 +496,7 @@ void update_cpu_load_active(struct rq *this_rq) * See the mess around update_idle_cpu_load() / update_cpu_load_nohz(). */ this_rq->last_load_update_tick = jiffies; - __update_cpu_load(this_rq, load, 1); + __update_cpu_load(this_rq, load);

calc_load_account_active(this_rq); }

-- 1.8.1.2

Alex Shi

1:55 a.m.

New subject: [PATCH v2 04/11] sched: unify imbalance bias for target group

Old code considers the bias in source/target_load already. but still use imbalance_pct as last check in idlest/busiest group finding. It is also a kind of redundant job. If we bias imbalance in source/target_load, we'd better not use imbalance_pct again.

After cpu_load array removed, it is nice time to unify the target bias consideration. So I remove the imbalance_pct from last check and add the live bias using.

Signed-off-by: Alex Shi alex.shi@linaro.org --- kernel/sched/fair.c | 34 +++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index eeffe75..a85a10b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1016,7 +1016,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,

static unsigned long weighted_cpuload(const int cpu); static unsigned long source_load(int cpu); -static unsigned long target_load(int cpu); +static unsigned long target_load(int cpu, int imbalance_pct); static unsigned long power_of(int cpu); static long effective_load(struct task_group *tg, int cpu, long wl, long wg);

@@ -3967,7 +3967,7 @@ static unsigned long source_load(int cpu) * Return a high guess at the load of a migration-target cpu weighted * according to the scheduling class and "nice" value. */ -static unsigned long target_load(int cpu) +static unsigned long target_load(int cpu, int imbalance_pct) { struct rq *rq = cpu_rq(cpu); unsigned long total = weighted_cpuload(cpu); @@ -3975,6 +3975,11 @@ static unsigned long target_load(int cpu) if (!sched_feat(LB_BIAS)) return total;

+ /* + * Bias target load with imbalance_pct. + */ + total = total * imbalance_pct / 100; + return max(rq->cpu_load, total); }

@@ -4180,6 +4185,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) struct task_group *tg; unsigned long weight; int balanced; + int bias = 100 + (sd->imbalance_pct - 100) / 2;

/* * If we wake multiple tasks be careful to not bounce @@ -4191,7 +4197,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); load = source_load(prev_cpu); - this_load = target_load(this_cpu); + this_load = target_load(this_cpu, bias);

/* * If sync wakeup then subtract the (maximum possible) @@ -4226,7 +4232,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) this_eff_load *= this_load + effective_load(tg, this_cpu, weight, weight);

- prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2; + prev_eff_load = bias; prev_eff_load *= power_of(this_cpu); prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight);

@@ -4247,7 +4253,8 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)

if (balanced || (this_load <= load && - this_load + target_load(prev_cpu) <= tl_per_task)) { + this_load + target_load(prev_cpu, sd->imbalance_pct) + <= tl_per_task)) { /* * This domain has SD_WAKE_AFFINE and * p is cache cold in this domain, and @@ -4293,7 +4300,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) if (local_group) load = source_load(i); else - load = target_load(i); + load = target_load(i, imbalance);

avg_load += load; } @@ -4309,7 +4316,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) } } while (group = group->next, group != sd->groups);

- if (!idlest || 100*this_load < imbalance*min_load) + if (!idlest || this_load < min_load) return NULL; return idlest; } @@ -5745,6 +5752,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, { unsigned long load; int i; + int bias = 100 + (env->sd->imbalance_pct - 100) / 2;

memset(sgs, 0, sizeof(*sgs));

@@ -5752,8 +5760,8 @@ static inline void update_sg_lb_stats(struct lb_env *env, struct rq *rq = cpu_rq(i);

/* Bias balancing toward cpus of our domain */ - if (local_group) - load = target_load(i); + if (local_group && env->idle != CPU_IDLE) + load = target_load(i, bias); else load = source_load(i);

@@ -6193,14 +6201,6 @@ static struct sched_group *find_busiest_group(struct lb_env *env) if ((local->idle_cpus < busiest->idle_cpus) && busiest->sum_nr_running <= busiest->group_weight) goto out_balanced; - } else { - /* - * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use - * imbalance_pct to be conservative. - */ - if (100 * busiest->avg_load <= - env->sd->imbalance_pct * local->avg_load) - goto out_balanced; }

force_balance:

-- 1.8.1.2

Morten Rasmussen

18 Feb 18 Feb

11:50 a.m.

New subject: [PATCH v2 04/11] sched: unify imbalance bias for target group

On Mon, Feb 17, 2014 at 01:55:10AM +0000, Alex Shi wrote:

...

Old code considers the bias in source/target_load already. but still use imbalance_pct as last check in idlest/busiest group finding. It is also a kind of redundant job. If we bias imbalance in source/target_load, we'd better not use imbalance_pct again.

After cpu_load array removed, it is nice time to unify the target bias consideration. So I remove the imbalance_pct from last check and add the live bias using.

Signed-off-by: Alex Shi alex.shi@linaro.org

kernel/sched/fair.c | 34 +++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index eeffe75..a85a10b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1016,7 +1016,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page, static unsigned long weighted_cpuload(const int cpu); static unsigned long source_load(int cpu); -static unsigned long target_load(int cpu); +static unsigned long target_load(int cpu, int imbalance_pct); static unsigned long power_of(int cpu); static long effective_load(struct task_group *tg, int cpu, long wl, long wg); @@ -3967,7 +3967,7 @@ static unsigned long source_load(int cpu)

Return a high guess at the load of a migration-target cpu weighted

according to the scheduling class and "nice" value.

*/ -static unsigned long target_load(int cpu) +static unsigned long target_load(int cpu, int imbalance_pct) { struct rq *rq = cpu_rq(cpu); unsigned long total = weighted_cpuload(cpu); @@ -3975,6 +3975,11 @@ static unsigned long target_load(int cpu) if (!sched_feat(LB_BIAS)) return total;
/*
* Bias target load with imbalance_pct.
*/
total = total * imbalance_pct / 100;

return max(rq->cpu_load, total);
} @@ -4180,6 +4185,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) struct task_group *tg; unsigned long weight; int balanced;

int bias = 100 + (sd->imbalance_pct - 100) / 2;

/* * If we wake multiple tasks be careful to not bounce @@ -4191,7 +4197,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); load = source_load(prev_cpu);

this_load = target_load(this_cpu);

this_load = target_load(this_cpu, bias);

It seems that you now apply the bias to both sides of the comparison. The above should be:

+ this_load = target_load(this_cpu, 100);

to make sense.

...

/* * If sync wakeup then subtract the (maximum possible) @@ -4226,7 +4232,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) this_eff_load *= this_load + effective_load(tg, this_cpu, weight, weight);
prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
prev_eff_load = bias;
prev_eff_load *= power_of(this_cpu); prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight);
@@ -4247,7 +4253,8 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) if (balanced || (this_load <= load &&
    this_load + target_load(prev_cpu) <= tl_per_task)) {
     this_load + target_load(prev_cpu, sd->imbalance_pct)

I think it should be target_load(prev_cpu, 100) here instead. IIUC, it is an unbiased comparison.

Alex Shi

19 Feb 19 Feb

10:05 a.m.

New subject: [PATCH v2 04/11] sched: unify imbalance bias for target group

On 02/18/2014 07:50 PM, Morten Rasmussen wrote:

...

...
...

this_load = target_load(this_cpu);

this_load = target_load(this_cpu, bias);

It seems that you now apply the bias to both sides of the comparison. The above should be:
this_load = target_load(this_cpu, 100);
to make sense.

It do has some confusing meaning of wake_affine.

...

From reduce cpu cache miss point, I understand it prefer the prev cpu,

so it make this_cpu load a bit heavier than fact.

But in just the following eff_load computing, it set bit heavier of prev cpu: prev_eff_load = bias, while no bias on this_eff_load. That is a bit confusing. Anyone like to explain this?

...

...
...
this_eff_load *= this_load +
	effective_load(tg, this_cpu, weight, weight);
prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
prev_eff_load = bias;
prev_eff_load *= power_of(this_cpu); prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight);
@@ -4247,7 +4253,8 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) if (balanced || (this_load <= load &&
    this_load + target_load(prev_cpu) <= tl_per_task)) {
     this_load + target_load(prev_cpu, sd->imbalance_pct)
I think it should be target_load(prev_cpu, 100) here instead. IIUC, it is an unbiased comparison.

If we want to reduce cache miss, we needs to prefer prev cpu. From this point, guess it is right.

-- Thanks Alex

Alex Shi

10:12 a.m.

New subject: [PATCH v2 04/11] sched: unify imbalance bias for target group

On 02/19/2014 06:05 PM, Alex Shi wrote:

...

On 02/18/2014 07:50 PM, Morten Rasmussen wrote:

...
...
...
...
...
> - this_load = target_load(this_cpu); > + this_load = target_load(this_cpu, bias);

It seems that you now apply the bias to both sides of the comparison. The above should be:
this_load = target_load(this_cpu, 100);
to make sense.
It do has some confusing meaning of wake_affine.

From reduce cpu cache miss point, I understand it prefer the prev cpu, so it make this_cpu load a bit heavier than fact.

But in just the following eff_load computing, it set bit heavier of prev cpu: prev_eff_load = bias, while no bias on this_eff_load. That is a bit confusing. Anyone like to explain this?

Yes, if wake_affine prefer current cpu not prev, I can understand to set wake_idx 0 for nothing bias and heavier prev_eff_load later. But why we prefer this cpu not prev?

-- Thanks Alex

Alex Shi

20 Feb 20 Feb

5:32 a.m.

New subject: [PATCH v2 04/11] sched: unify imbalance bias for target group

On 02/19/2014 06:12 PM, Alex Shi wrote:

...

Yes, if wake_affine prefer current cpu not prev, I can understand to set wake_idx 0 for nothing bias and heavier prev_eff_load later. But why we prefer this cpu not prev?

I track down to this commit: 4ae7d5cefd4aa 'improve affine wakeups'. And wake_affine was changed many times with different reason. Anyway I wanna not to touch it unless some benchmark pop up.

So, the following is updated patch. I followed the current logical more closely.

=========

...

From 9a56491e701dcf6aaa12bd61963e81774a908ccd Mon Sep 17 00:00:00 2001

From: Alex Shi alex.shi@linaro.org Date: Sat, 23 Nov 2013 23:18:09 +0800 Subject: [PATCH 04/11] sched: unify imbalance bias for target group

After cpu_load array removed, it is nice time to unify the target bias consideration. So I remove the imbalance_pct from last check and add the live bias using.

On wake_affine, since all archs' wake_idx is 0, current logical is just want to prefer current cpu. so we follows this logical. but rename the target_load to source_load to show clear our bias. Thanks for reminding from Morten!

Signed-off-by: Alex Shi alex.shi@linaro.org --- kernel/sched/fair.c | 30 ++++++++++++++---------------- 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index eeffe75..7b910cf 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1016,7 +1016,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,

+ /* + * Bias target load with imbalance_pct. + */ + total = total * imbalance_pct / 100; + return max(rq->cpu_load, total); }

@@ -4191,7 +4196,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); load = source_load(prev_cpu); - this_load = target_load(this_cpu); + this_load = source_load(this_cpu);

/* * If sync wakeup then subtract the (maximum possible) @@ -4247,7 +4252,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)

if (balanced || (this_load <= load && - this_load + target_load(prev_cpu) <= tl_per_task)) { + this_load + source_load(prev_cpu) <= tl_per_task)) { /* * This domain has SD_WAKE_AFFINE and * p is cache cold in this domain, and @@ -4293,7 +4298,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) if (local_group) load = source_load(i); else - load = target_load(i); + load = target_load(i, imbalance);

avg_load += load; } @@ -4309,7 +4314,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) } } while (group = group->next, group != sd->groups);

- if (!idlest || 100*this_load < imbalance*min_load) + if (!idlest || this_load < min_load) return NULL; return idlest; } @@ -5745,6 +5750,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, { unsigned long load; int i; + int bias = 100 + (env->sd->imbalance_pct - 100) / 2;

memset(sgs, 0, sizeof(*sgs));

@@ -5752,8 +5758,8 @@ static inline void update_sg_lb_stats(struct lb_env *env, struct rq *rq = cpu_rq(i);

/* Bias balancing toward cpus of our domain */ - if (local_group) - load = target_load(i); + if (local_group && env->idle != CPU_IDLE) + load = target_load(i, bias); else load = source_load(i);

@@ -6193,14 +6199,6 @@ static struct sched_group *find_busiest_group(struct lb_env *env) if ((local->idle_cpus < busiest->idle_cpus) && busiest->sum_nr_running <= busiest->group_weight) goto out_balanced; - } else { - /* - * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use - * imbalance_pct to be conservative. - */ - if (100 * busiest->avg_load <= - env->sd->imbalance_pct * local->avg_load) - goto out_balanced; }

force_balance:

-- 1.8.1.2 -- Thanks Alex

Morten Rasmussen

2:27 p.m.

New subject: [PATCH v2 04/11] sched: unify imbalance bias for target group

On Thu, Feb 20, 2014 at 05:32:53AM +0000, Alex Shi wrote:

...

On 02/19/2014 06:12 PM, Alex Shi wrote:

...
Yes, if wake_affine prefer current cpu not prev, I can understand to set wake_idx 0 for nothing bias and heavier prev_eff_load later. But why we prefer this cpu not prev?

I track down to this commit: 4ae7d5cefd4aa 'improve affine wakeups'. And wake_affine was changed many times with different reason. Anyway I wanna not to touch it unless some benchmark pop up.

So, the following is updated patch. I followed the current logical more closely.

========= From 9a56491e701dcf6aaa12bd61963e81774a908ccd Mon Sep 17 00:00:00 2001 From: Alex Shi alex.shi@linaro.org Date: Sat, 23 Nov 2013 23:18:09 +0800 Subject: [PATCH 04/11] sched: unify imbalance bias for target group

Old code considers the bias in source/target_load already. but still use imbalance_pct as last check in idlest/busiest group finding. It is also a kind of redundant job. If we bias imbalance in source/target_load, we'd better not use imbalance_pct again.

After cpu_load array removed, it is nice time to unify the target bias consideration. So I remove the imbalance_pct from last check and add the live bias using.

On wake_affine, since all archs' wake_idx is 0, current logical is just want to prefer current cpu. so we follows this logical. but rename the target_load to source_load to show clear our bias. Thanks for reminding from Morten!

Signed-off-by: Alex Shi alex.shi@linaro.org

kernel/sched/fair.c | 30 ++++++++++++++---------------- 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index eeffe75..7b910cf 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1016,7 +1016,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page, static unsigned long weighted_cpuload(const int cpu); static unsigned long source_load(int cpu); -static unsigned long target_load(int cpu); +static unsigned long target_load(int cpu, int imbalance_pct); static unsigned long power_of(int cpu); static long effective_load(struct task_group *tg, int cpu, long wl, long wg); @@ -3967,7 +3967,7 @@ static unsigned long source_load(int cpu)

Return a high guess at the load of a migration-target cpu weighted

according to the scheduling class and "nice" value.

*/ -static unsigned long target_load(int cpu) +static unsigned long target_load(int cpu, int imbalance_pct) { struct rq *rq = cpu_rq(cpu); unsigned long total = weighted_cpuload(cpu); @@ -3975,6 +3975,11 @@ static unsigned long target_load(int cpu) if (!sched_feat(LB_BIAS)) return total;
/*
* Bias target load with imbalance_pct.
*/
total = total * imbalance_pct / 100;

return max(rq->cpu_load, total);
} @@ -4191,7 +4196,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); load = source_load(prev_cpu);

this_load = target_load(this_cpu);

this_load = source_load(this_cpu);

It looks a bit odd that both source and destination cpu loads are found using a function name source_load(). IMHO, it would be clearer if you got rid of source_load() and target_load() completely, and just used weighted_cpuload() instead. You only use target_load() twice (further down) anyway.

...

/* * If sync wakeup then subtract the (maximum possible) @@ -4247,7 +4252,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) if (balanced || (this_load <= load &&
    this_load + target_load(prev_cpu) <= tl_per_task)) {
     this_load + source_load(prev_cpu) <= tl_per_task)) {
/*

This domain has SD_WAKE_AFFINE and

p is cache cold in this domain, and
@@ -4293,7 +4298,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) if (local_group) load = source_load(i); else
		load = target_load(i);
		load = target_load(i, imbalance);

Here you could easily use weighted_cpuload() instead and apply the bias as before (below).

...

avg_load += load; } @@ -4309,7 +4314,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) } } while (group = group->next, group != sd->groups);

if (!idlest || 100*this_load < imbalance*min_load)

if (!idlest || this_load < min_load)

This change would go away if you used weighted_cpuload().

...

return NULL;
return idlest; } @@ -5745,6 +5750,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, { unsigned long load; int i;

int bias = 100 + (env->sd->imbalance_pct - 100) / 2;

memset(sgs, 0, sizeof(*sgs)); @@ -5752,8 +5758,8 @@ static inline void update_sg_lb_stats(struct lb_env *env, struct rq *rq = cpu_rq(i); /* Bias balancing toward cpus of our domain */
if (local_group)
	load = target_load(i);
if (local_group && env->idle != CPU_IDLE)
	load = target_load(i, bias);

Could be weighted_cpuload() instead, but you would have to keep the lines you delete below.

...

else
	load = source_load(i);
@@ -6193,14 +6199,6 @@ static struct sched_group *find_busiest_group(struct lb_env *env) if ((local->idle_cpus < busiest->idle_cpus) && busiest->sum_nr_running <= busiest->group_weight) goto out_balanced;
} else {
/*
 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
 * imbalance_pct to be conservative.
 */
if (100 * busiest->avg_load <=
		env->sd->imbalance_pct * local->avg_load)
	goto out_balanced;
}
force_balance:

I think it clearer now what this patch set does. It rips out cpu_load[] completely and changes all it users to use weighted_cpuload() (cfs.runnable_load_avg) instead. The longer term view provided by the cpu_load[] indexes is not replaced. Whether that is a loss, I'm not sure.

Morten

Alex Shi

24 Feb 24 Feb

2:58 a.m.

New subject: [PATCH v2 04/11] sched: unify imbalance bias for target group

...

...
@@ -4191,7 +4196,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); load = source_load(prev_cpu);

this_load = target_load(this_cpu);

this_load = source_load(this_cpu);

It looks a bit odd that both source and destination cpu loads are found using a function name source_load(). IMHO, it would be clearer if you got rid of source_load() and target_load() completely, and just used weighted_cpuload() instead. You only use target_load() twice (further down) anyway.

Yes, weighted_cpuload has better meaning.

...

...
/* * If sync wakeup then subtract the (maximum possible) @@ -4247,7 +4252,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) if (balanced || (this_load <= load &&
    this_load + target_load(prev_cpu) <= tl_per_task)) {
     this_load + source_load(prev_cpu) <= tl_per_task)) {
/*

This domain has SD_WAKE_AFFINE and

p is cache cold in this domain, and
@@ -4293,7 +4298,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) if (local_group) load = source_load(i); else
		load = target_load(i);
		load = target_load(i, imbalance);
Here you could easily use weighted_cpuload() instead and apply the bias as before (below).

...
avg_load += load; } @@ -4309,7 +4314,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) } } while (group = group->next, group != sd->groups);

if (!idlest || 100*this_load < imbalance*min_load)

if (!idlest || this_load < min_load)

This change would go away if you used weighted_cpuload().

Yes, but seem better to left the bias unified in target_load.

...

...
return NULL;
return idlest; } @@ -5745,6 +5750,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, { unsigned long load; int i;

int bias = 100 + (env->sd->imbalance_pct - 100) / 2;

memset(sgs, 0, sizeof(*sgs)); @@ -5752,8 +5758,8 @@ static inline void update_sg_lb_stats(struct lb_env *env, struct rq *rq = cpu_rq(i); /* Bias balancing toward cpus of our domain */
if (local_group)
	load = target_load(i);
if (local_group && env->idle != CPU_IDLE)
	load = target_load(i, bias);
Could be weighted_cpuload() instead, but you would have to keep the lines you delete below.

On current logical target_load may seek before, and do bias when idx is busy or idle. I am afraid weighted_cpuload is not good here. and I prefer to keep bias in a uniform mode, not spread in larger scope.

...

...
else
	load = source_load(i);
@@ -6193,14 +6199,6 @@ static struct sched_group *find_busiest_group(struct lb_env *env) if ((local->idle_cpus < busiest->idle_cpus) && busiest->sum_nr_running <= busiest->group_weight) goto out_balanced;
} else {
/*
 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
 * imbalance_pct to be conservative.
 */
if (100 * busiest->avg_load <=
		env->sd->imbalance_pct * local->avg_load)
	goto out_balanced;
}
force_balance:
I think it clearer now what this patch set does. It rips out cpu_load[] completely and changes all it users to use weighted_cpuload() (cfs.runnable_load_avg) instead. The longer term view provided by the cpu_load[] indexes is not replaced. Whether that is a loss, I'm not sure.

Thanks! Fengguang's testing system is monitor this branch. so no news is good news. :)

...

Morten

-- Thanks Alex

Morten Rasmussen

2:02 p.m.

New subject: [PATCH v2 04/11] sched: unify imbalance bias for target group

On Mon, Feb 24, 2014 at 02:58:35AM +0000, Alex Shi wrote:

...

...
...
@@ -4191,7 +4196,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); load = source_load(prev_cpu);

this_load = target_load(this_cpu);

this_load = source_load(this_cpu);

It looks a bit odd that both source and destination cpu loads are found using a function name source_load(). IMHO, it would be clearer if you got rid of source_load() and target_load() completely, and just used weighted_cpuload() instead. You only use target_load() twice (further down) anyway.

Yes, weighted_cpuload has better meaning.

...
...
/* * If sync wakeup then subtract the (maximum possible) @@ -4247,7 +4252,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) if (balanced || (this_load <= load &&
    this_load + target_load(prev_cpu) <= tl_per_task)) {
     this_load + source_load(prev_cpu) <= tl_per_task)) {
/*

This domain has SD_WAKE_AFFINE and

p is cache cold in this domain, and
@@ -4293,7 +4298,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) if (local_group) load = source_load(i); else
		load = target_load(i);
		load = target_load(i, imbalance);
Here you could easily use weighted_cpuload() instead and apply the bias as before (below).

...
avg_load += load; } @@ -4309,7 +4314,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) } } while (group = group->next, group != sd->groups);

if (!idlest || 100*this_load < imbalance*min_load)

if (!idlest || this_load < min_load)

This change would go away if you used weighted_cpuload().
Yes, but seem better to left the bias unified in target_load.

My point is that you could get rid of target_load() completely. source_load() is already gone in your patches.

...

...
...
return NULL;
return idlest; } @@ -5745,6 +5750,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, { unsigned long load; int i;

int bias = 100 + (env->sd->imbalance_pct - 100) / 2;

memset(sgs, 0, sizeof(*sgs)); @@ -5752,8 +5758,8 @@ static inline void update_sg_lb_stats(struct lb_env *env, struct rq *rq = cpu_rq(i); /* Bias balancing toward cpus of our domain */
if (local_group)
	load = target_load(i);
if (local_group && env->idle != CPU_IDLE)
	load = target_load(i, bias);
Could be weighted_cpuload() instead, but you would have to keep the lines you delete below.
On current logical target_load may seek before, and do bias when idx is busy or idle. I am afraid weighted_cpuload is not good here. and I prefer to keep bias in a uniform mode, not spread in larger scope.

Right, using weighted_cpuload() here would remove the local group versus other group biasing completely. The new bias is different from the existing one though. You have already discarded the bias from source_load(). The new bias is just a scaling factor, so it isn't exactly comparable to the existing one.

Now that source_load() is gone, would it then make sense to rename target_load() to biased_load() or something?

Morten

Alex Shi

25 Feb 25 Feb

1:26 a.m.

New subject: [PATCH v2 04/11] sched: unify imbalance bias for target group

On 02/24/2014 10:02 PM, Morten Rasmussen wrote:

...

...
...
...
...
Could be weighted_cpuload() instead, but you would have to keep the lines you delete below.

On current logical target_load may seek before, and do bias when idx is busy or idle. I am afraid weighted_cpuload is not good here. and I prefer to keep bias in a uniform mode, not spread in larger scope.

Right, using weighted_cpuload() here would remove the local group versus other group biasing completely. The new bias is different from the existing one though. You have already discarded the bias from source_load(). The new bias is just a scaling factor, so it isn't exactly comparable to the existing one.

Now that source_load() is gone, would it then make sense to rename target_load() to biased_load() or something?

Yes, I did think about this change. But just afraid it has no much sense.

Anyway since you said so. will send out another version. :)

-- Thanks Alex

Morten Rasmussen

20 Feb 20 Feb

2:45 p.m.

New subject: [PATCH v2 04/11] sched: unify imbalance bias for target group

On Wed, Feb 19, 2014 at 10:12:51AM +0000, Alex Shi wrote:

...

On 02/19/2014 06:05 PM, Alex Shi wrote:

...
On 02/18/2014 07:50 PM, Morten Rasmussen wrote:

...
...
...
...
> > - this_load = target_load(this_cpu); > > + this_load = target_load(this_cpu, bias);

It seems that you now apply the bias to both sides of the comparison. The above should be:
this_load = target_load(this_cpu, 100);
to make sense.
It do has some confusing meaning of wake_affine.

My point was just that if you apply the bias on both sides of the comparison it has no effect.

...

...
From reduce cpu cache miss point, I understand it prefer the prev cpu, so it make this_cpu load a bit heavier than fact.

But in just the following eff_load computing, it set bit heavier of prev cpu: prev_eff_load = bias, while no bias on this_eff_load. That is a bit confusing. Anyone like to explain this?

Yes, if wake_affine prefer current cpu not prev, I can understand to set wake_idx 0 for nothing bias and heavier prev_eff_load later. But why we prefer this cpu not prev?

My understanding is that we do have a slight preference for this_cpu by biasing the prev_cpu load. We are still comparing load, so if this_load is high we use prev_cpu. But since this_cpu is already interrupted we might as well just start the task there if this_load is equal or less than prev_load. If this_load = 0 it can start running immediately, while it would risk having to wait on the rq if we sent it back to prev_cpu.

AFAIU, we don't consider cache hotness at wakeup. Also, since we looking sibling cores when using wake_affine, the cache impact is probably quite limited. I believe it is much worse being scheduled on a cpu where you have to wait longer to get to run than having to migrate your L1 contents.

Morten

Alex Shi

17 Feb 17 Feb

1:55 a.m.

New subject: [PATCH v2 05/11] sched: rewrite update_cpu_load_nohz

After change to sched_avg, the cpu load in idle exit was decayed. So, it maybe near zero if waking a long time sleep task, or, a full non-decay load if waking a new forked task. Then, we can use it to reflect the cpu load, don't need to pretend 0.

Signed-off-by: Alex Shi alex.shi@linaro.org --- kernel/sched/proc.c | 19 ++----------------- 1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c index 057bb9b..383c4ba 100644 --- a/kernel/sched/proc.c +++ b/kernel/sched/proc.c @@ -461,28 +461,13 @@ void update_idle_cpu_load(struct rq *this_rq) }

/* - * Called from tick_nohz_idle_exit() -- try and fix up the ticks we missed. + * Called from tick_nohz_idle_exit() */ void update_cpu_load_nohz(void) { struct rq *this_rq = this_rq(); - unsigned long curr_jiffies = ACCESS_ONCE(jiffies); - unsigned long pending_updates; - - if (curr_jiffies == this_rq->last_load_update_tick) - return;

- raw_spin_lock(&this_rq->lock); - pending_updates = curr_jiffies - this_rq->last_load_update_tick; - if (pending_updates) { - this_rq->last_load_update_tick = curr_jiffies; - /* - * We were idle, this means load 0, the current load might be - * !0 due to remote wakeups and the sort. - */ - __update_cpu_load(this_rq, 0); - } - raw_spin_unlock(&this_rq->lock); + update_idle_cpu_load(this_rq); } #endif /* CONFIG_NO_HZ */

-- 1.8.1.2

Alex Shi

1:55 a.m.

New subject: [PATCH v2 06/11] sched: clean up source_load/target_load

Don't need 'rq' variable now.

Signed-off-by: Alex Shi alex.shi@linaro.org --- kernel/sched/fair.c | 13 ++----------- 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a85a10b..2da0e3b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3954,13 +3954,7 @@ static unsigned long weighted_cpuload(const int cpu) */ static unsigned long source_load(int cpu) { - struct rq *rq = cpu_rq(cpu); - unsigned long total = weighted_cpuload(cpu); - - if (!sched_feat(LB_BIAS)) - return total; - - return min(rq->cpu_load, total); + return weighted_cpuload(cpu); }

/* @@ -3969,7 +3963,6 @@ static unsigned long source_load(int cpu) */ static unsigned long target_load(int cpu, int imbalance_pct) { - struct rq *rq = cpu_rq(cpu); unsigned long total = weighted_cpuload(cpu);

if (!sched_feat(LB_BIAS)) @@ -3978,9 +3971,7 @@ static unsigned long target_load(int cpu, int imbalance_pct) /* * Bias target load with imbalance_pct. */ - total = total * imbalance_pct / 100; - - return max(rq->cpu_load, total); + return total * imbalance_pct / 100; }

static unsigned long power_of(int cpu)

-- 1.8.1.2

Alex Shi

1:55 a.m.

New subject: [PATCH v2 07/11] sched: clean up weighted_cpuload

weighted_cpuload is used instead of source_load() when !idx. Now idx is always 0. so unify the usage to soruce_load. That make code more readable.

Signed-off-by: Alex Shi alex.shi@linaro.org --- kernel/sched/fair.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2da0e3b..5cdc838 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1045,7 +1045,7 @@ static void update_numa_stats(struct numa_stats *ns, int nid) struct rq *rq = cpu_rq(cpu);

ns->nr_running += rq->nr_running; - ns->load += weighted_cpuload(cpu); + ns->load += source_load(cpu); ns->power += power_of(cpu);

cpus++; @@ -3940,7 +3940,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)

#ifdef CONFIG_SMP /* Used instead of source_load when we know the type == 0 */ -static unsigned long weighted_cpuload(const int cpu) +static inline unsigned long weighted_cpuload(const int cpu) { return cpu_rq(cpu)->cfs.runnable_load_avg; } @@ -4324,7 +4324,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)

/* Traverse only the allowed CPUs */ for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) { - load = weighted_cpuload(i); + load = source_load(i);

if (load < min_load || (load == min_load && i == this_cpu)) { min_load = load; @@ -5762,7 +5762,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs->nr_numa_running += rq->nr_numa_running; sgs->nr_preferred_running += rq->nr_preferred_running; #endif - sgs->sum_weighted_load += weighted_cpuload(i); + sgs->sum_weighted_load += source_load(i); if (idle_cpu(i)) sgs->idle_cpus++; } @@ -6248,10 +6248,10 @@ static struct rq *find_busiest_queue(struct lb_env *env, if (!capacity) capacity = fix_small_capacity(env->sd, group);

- wl = weighted_cpuload(i); + wl = source_load(i);

/* - * When comparing with imbalance, use weighted_cpuload() + * When comparing with imbalance, use source_load() * which is not scaled with the cpu power. */ if (capacity && rq->nr_running == 1 && wl > env->imbalance) @@ -6259,7 +6259,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,

/* * For the load comparisons with the other cpu's, consider - * the weighted_cpuload() scaled with the cpu power, so that + * the source_load() scaled with the cpu power, so that * the load can be moved away from the cpu that is potentially * running at a lower capacity. *

-- 1.8.1.2

Alex Shi

1:55 a.m.

New subject: [PATCH v2 08/11] sched: remove weighted_load()

Although weighted_load is a inline founction, it's not needed in fact. so remove it.

Signed-off-by: Alex Shi alex.shi@linaro.org --- kernel/sched/fair.c | 23 ++++------------------- 1 file changed, 4 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 5cdc838..6c37ee1 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3939,31 +3939,16 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) }

#ifdef CONFIG_SMP -/* Used instead of source_load when we know the type == 0 */ -static inline unsigned long weighted_cpuload(const int cpu) -{ - return cpu_rq(cpu)->cfs.runnable_load_avg; -} - -/* - * Return a low guess at the load of a migration-source cpu weighted - * according to the scheduling class and "nice" value. - * - * We want to under-estimate the load of migration sources, to - * balance conservatively. - */ +/* Return the real load of 'cpu' */ static unsigned long source_load(int cpu) { - return weighted_cpuload(cpu); + return cpu_rq(cpu)->cfs.runnable_load_avg; }

-/* - * Return a high guess at the load of a migration-target cpu weighted - * according to the scheduling class and "nice" value. - */ +/* Return a high bias at the load of a migration-target cpu weighted */ static unsigned long target_load(int cpu, int imbalance_pct) { - unsigned long total = weighted_cpuload(cpu); + unsigned long total = cpu_rq(cpu)->cfs.runnable_load_avg;

if (!sched_feat(LB_BIAS)) return total;

-- 1.8.1.2

Alex Shi

1:55 a.m.

New subject: [PATCH v2 09/11] sched: remove rq->cpu_load and rq->nr_load_updates

The cpu_load is the copy of rq->cfs.runnable_load_avg. And it updated on time. So we can use the latter directly. Thus saved 2 rq variables: cpu_load and nr_load_updates. Then don't need __update_cpu_load(), just keep sched_avg_update(). Thus removed get_rq_runnable_load() which used for update_cpu_load only.

Signed-off-by: Alex Shi alex.shi@linaro.org --- kernel/sched/core.c | 2 -- kernel/sched/debug.c | 2 -- kernel/sched/proc.c | 55 +++++++++++++--------------------------------------- kernel/sched/sched.h | 2 -- 4 files changed, 13 insertions(+), 48 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ac2f10c..32602595 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6800,8 +6800,6 @@ void __init sched_init(void) INIT_LIST_HEAD(&rq->leaf_rt_rq_list); init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL); #endif - - rq->cpu_load = 0; rq->last_load_update_tick = jiffies;

#ifdef CONFIG_SMP diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index a24d549..83737ce 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -298,12 +298,10 @@ do { \ SEQ_printf(m, " .%-30s: %lu\n", "load", rq->load.weight); P(nr_switches); - P(nr_load_updates); P(nr_uninterruptible); PN(next_balance); SEQ_printf(m, " .%-30s: %ld\n", "curr->pid", (long)(task_pid_nr(rq->curr))); PN(clock); - P(cpu_load); #undef P #undef PN

diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c index 383c4ba..dd3c2d9 100644 --- a/kernel/sched/proc.c +++ b/kernel/sched/proc.c @@ -8,12 +8,19 @@

#include "sched.h"

+#ifdef CONFIG_SMP unsigned long this_cpu_load(void) { - struct rq *this = this_rq(); - return this->cpu_load; + struct rq *rq = this_rq(); + return rq->cfs.runnable_load_avg; } - +#else +unsigned long this_cpu_load(void) +{ + struct rq *rq = this_rq(); + return rq->load.weight; +} +#endif

/* * Global load-average calculations @@ -398,34 +405,6 @@ static void calc_load_account_active(struct rq *this_rq) * End of global load-average stuff */

- -/* - * Update rq->cpu_load statistics. This function is usually called every - * scheduler tick (TICK_NSEC). With tickless idle this will not be called - * every tick. We fix it up based on jiffies. - */ -static void __update_cpu_load(struct rq *this_rq, unsigned long this_load) -{ - this_rq->nr_load_updates++; - - /* Update our load: */ - this_rq->cpu_load = this_load; /* Fasttrack for idx 0 */ - - sched_avg_update(this_rq); -} - -#ifdef CONFIG_SMP -static inline unsigned long get_rq_runnable_load(struct rq *rq) -{ - return rq->cfs.runnable_load_avg; -} -#else -static inline unsigned long get_rq_runnable_load(struct rq *rq) -{ - return rq->load.weight; -} -#endif - #ifdef CONFIG_NO_HZ_COMMON /* * There is no sane way to deal with nohz on smp when using jiffies because the @@ -447,17 +426,15 @@ static inline unsigned long get_rq_runnable_load(struct rq *rq) void update_idle_cpu_load(struct rq *this_rq) { unsigned long curr_jiffies = ACCESS_ONCE(jiffies); - unsigned long load = get_rq_runnable_load(this_rq);

/* * bail if there's load or we're actually up-to-date. */ - if (load || curr_jiffies == this_rq->last_load_update_tick) + if (curr_jiffies == this_rq->last_load_update_tick) return;

this_rq->last_load_update_tick = curr_jiffies; - - __update_cpu_load(this_rq, load); + sched_avg_update(this_rq); }

/* @@ -466,7 +443,6 @@ void update_idle_cpu_load(struct rq *this_rq) void update_cpu_load_nohz(void) { struct rq *this_rq = this_rq(); - update_idle_cpu_load(this_rq); } #endif /* CONFIG_NO_HZ */ @@ -476,12 +452,7 @@ void update_cpu_load_nohz(void) */ void update_cpu_load_active(struct rq *this_rq) { - unsigned long load = get_rq_runnable_load(this_rq); - /* - * See the mess around update_idle_cpu_load() / update_cpu_load_nohz(). - */ this_rq->last_load_update_tick = jiffies; - __update_cpu_load(this_rq, load); - + sched_avg_update(this_rq); calc_load_account_active(this_rq); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 5b2d4a1..c623131 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -517,7 +517,6 @@ struct rq { unsigned int nr_numa_running; unsigned int nr_preferred_running; #endif - unsigned long cpu_load; unsigned long last_load_update_tick; #ifdef CONFIG_NO_HZ_COMMON u64 nohz_stamp; @@ -530,7 +529,6 @@ struct rq {

/* capture load from *all* tasks on this cpu: */ struct load_weight load; - unsigned long nr_load_updates; u64 nr_switches;

struct cfs_rq cfs;

-- 1.8.1.2

Alex Shi

1:55 a.m.

New subject: [PATCH v2 10/11] sched: rename update_*_cpu_load

Since we have no cpu_load update, rename the related functions: s/update_idle_cpu_load/update_idle_rt_avg/ s/update_cpu_load_nohz/update_rt_avg_nohz/ s/update_cpu_load_active/update_avg_load_active/

No functional change.

diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt index bd36598..2fe46b5 100644 --- a/Documentation/trace/ftrace.txt +++ b/Documentation/trace/ftrace.txt @@ -1542,12 +1542,12 @@ Doing the same with chrt -r 5 and function-trace set. <idle>-0 3dN.1 12us : menu_hrtimer_cancel <-tick_nohz_idle_exit <idle>-0 3dN.1 12us : ktime_get <-tick_nohz_idle_exit <idle>-0 3dN.1 12us : tick_do_update_jiffies64 <-tick_nohz_idle_exit - <idle>-0 3dN.1 13us : update_cpu_load_nohz <-tick_nohz_idle_exit - <idle>-0 3dN.1 13us : _raw_spin_lock <-update_cpu_load_nohz + <idle>-0 3dN.1 13us : update_rt_avg_nohz <-tick_nohz_idle_exit + <idle>-0 3dN.1 13us : _raw_spin_lock <-update_rt_avg_nohz <idle>-0 3dN.1 13us : add_preempt_count <-_raw_spin_lock - <idle>-0 3dN.2 13us : __update_cpu_load <-update_cpu_load_nohz + <idle>-0 3dN.2 13us : __update_cpu_load <-update_rt_avg_nohz <idle>-0 3dN.2 14us : sched_avg_update <-__update_cpu_load - <idle>-0 3dN.2 14us : _raw_spin_unlock <-update_cpu_load_nohz + <idle>-0 3dN.2 14us : _raw_spin_unlock <-update_rt_avg_nohz <idle>-0 3dN.2 14us : sub_preempt_count <-_raw_spin_unlock <idle>-0 3dN.1 15us : calc_load_exit_idle <-tick_nohz_idle_exit <idle>-0 3dN.1 15us : touch_softlockup_watchdog <-tick_nohz_idle_exit diff --git a/include/linux/sched.h b/include/linux/sched.h index 6c416c8..f6afcb3 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -174,7 +174,7 @@ extern unsigned long this_cpu_load(void);

extern void calc_global_load(unsigned long ticks); -extern void update_cpu_load_nohz(void); +extern void update_rt_avg_nohz(void);

extern unsigned long get_parent_ip(unsigned long addr);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 32602595..74dae0e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2431,7 +2431,7 @@ void scheduler_tick(void) raw_spin_lock(&rq->lock); update_rq_clock(rq); curr->sched_class->task_tick(rq, curr, 0); - update_cpu_load_active(rq); + update_avg_load_active(rq); raw_spin_unlock(&rq->lock);

perf_event_task_tick(); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6c37ee1..1b008ac 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6986,7 +6986,7 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)

raw_spin_lock_irq(&rq->lock); update_rq_clock(rq); - update_idle_cpu_load(rq); + update_idle_rt_avg(rq); raw_spin_unlock_irq(&rq->lock);

rebalance_domains(rq, CPU_IDLE); diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c index dd3c2d9..42b7706 100644 --- a/kernel/sched/proc.c +++ b/kernel/sched/proc.c @@ -423,7 +423,7 @@ static void calc_load_account_active(struct rq *this_rq) * Called from nohz_idle_balance() to update the load ratings before doing the * idle balance. */ -void update_idle_cpu_load(struct rq *this_rq) +void update_idle_rt_avg(struct rq *this_rq) { unsigned long curr_jiffies = ACCESS_ONCE(jiffies);

@@ -440,17 +440,17 @@ void update_idle_cpu_load(struct rq *this_rq) /* * Called from tick_nohz_idle_exit() */ -void update_cpu_load_nohz(void) +void update_rt_avg_nohz(void) { struct rq *this_rq = this_rq(); - update_idle_cpu_load(this_rq); + update_idle_rt_avg(this_rq); } #endif /* CONFIG_NO_HZ */

/* * Called from scheduler_tick() */ -void update_cpu_load_active(struct rq *this_rq) +void update_avg_load_active(struct rq *this_rq) { this_rq->last_load_update_tick = jiffies; sched_avg_update(this_rq); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index c623131..ab310c2 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -21,7 +21,7 @@ extern unsigned long calc_load_update; extern atomic_long_t calc_load_tasks;

extern long calc_load_fold_active(struct rq *this_rq); -extern void update_cpu_load_active(struct rq *this_rq); +extern void update_avg_load_active(struct rq *this_rq);

/* * Helpers for converting nanosecond timing to jiffy resolution @@ -1194,7 +1194,7 @@ extern void init_dl_task_timer(struct sched_dl_entity *dl_se);

unsigned long to_ratio(u64 period, u64 runtime);

-extern void update_idle_cpu_load(struct rq *this_rq); +extern void update_idle_rt_avg(struct rq *this_rq);

extern void init_task_runnable_average(struct task_struct *p);

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 9f8af69..b1a400a 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -866,7 +866,7 @@ static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now) { /* Update jiffies first */ tick_do_update_jiffies64(now); - update_cpu_load_nohz(); + update_rt_avg_nohz();

calc_load_exit_idle(); touch_softlockup_watchdog();

-- 1.8.1.2

Alex Shi

1:55 a.m.

New subject: [PATCH v2 11/11] sched: clean up task_hot function

task_hot doesn't need the 'sched_domain' parameter, so remove it.

Signed-off-by: Alex Shi alex.shi@linaro.org --- kernel/sched/fair.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1b008ac..e81a790 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5003,7 +5003,7 @@ static void move_task(struct task_struct *p, struct lb_env *env) * Is this task likely cache-hot: */ static int -task_hot(struct task_struct *p, u64 now, struct sched_domain *sd) +task_hot(struct task_struct *p, u64 now) { s64 delta;

@@ -5164,7 +5164,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env) * 2) task is cache cold, or * 3) too many balance attempts have failed. */ - tsk_cache_hot = task_hot(p, rq_clock_task(env->src_rq), env->sd); + tsk_cache_hot = task_hot(p, rq_clock_task(env->src_rq)); if (!tsk_cache_hot) tsk_cache_hot = migrate_degrades_locality(p, env);

-- 1.8.1.2

Alex Shi

18 Feb 18 Feb

2:37 a.m.

On 02/17/2014 09:55 AM, Alex Shi wrote:

...

The cpu_load decays on time according past cpu load of rq. The sched_avg also decays tasks' load on time. Now we has 2 kind decay for cpu_load. That is a kind of redundancy. And increase the system load by decay calculation. This patch try to remove the cpu_load decay.

There are 5 load_idx used for cpu_load in sched_domain. busy_idx and idle_idx are not zero usually, but newidle_idx, wake_idx and forkexec_idx are all zero on every arch. A shortcut to remove cpu_Load decay in the first patch. just one line patch for this change.

V2, 1, This version do some tuning on load bias of target load, to maximum match current code logical. 2, Got further to remove the cpu_load in rq. 3, Revert the patch 'Limit sd->*_idx range on sysctl' since no needs

Any comments for this? :)

...

Any testing/comments are appreciated.

This patch rebase on latest tip/master. The git tree for this patchset at: git@github.com:alexshi/power-scheduling.git noload

Thanks Alex

-- Thanks Alex

Michael wang

4:52 a.m.

On 02/17/2014 09:55 AM, Alex Shi wrote:

...

The cpu_load decays on time according past cpu load of rq. The sched_avg also decays tasks' load on time. Now we has 2 kind decay for cpu_load. That is a kind of redundancy. And increase the system load by decay calculation. This patch try to remove the cpu_load decay.

There are 5 load_idx used for cpu_load in sched_domain. busy_idx and idle_idx are not zero usually, but newidle_idx, wake_idx and forkexec_idx are all zero on every arch. A shortcut to remove cpu_Load decay in the first patch. just one line patch for this change.

V2, 1, This version do some tuning on load bias of target load, to maximum match current code logical. 2, Got further to remove the cpu_load in rq. 3, Revert the patch 'Limit sd->*_idx range on sysctl' since no needs

Any testing/comments are appreciated.

Tested on 12-cpu-x86 box with tip/master, ebizzy and hackbench works fine, show little improvements for each time's testing.

ebizzy default:

BASE PATCHED

hackbench 10000 loops:

BASE PATCHED

Running with 48*40 (== 1920) tasks. |Running with 48*40 (== 1920) tasks. Time: 30.934 |Time: 29.965 Running with 48*40 (== 1920) tasks. |Running with 48*40 (== 1920) tasks. Time: 31.603 |Time: 30.410 Running with 48*40 (== 1920) tasks. |Running with 48*40 (== 1920) tasks. Time: 31.724 |Time: 30.627 Running with 48*40 (== 1920) tasks. |Running with 48*40 (== 1920) tasks. Time: 31.648 |Time: 30.596 Running with 48*40 (== 1920) tasks. |Running with 48*40 (== 1920) tasks. Time: 31.799 |Time: 30.763 Running with 48*40 (== 1920) tasks. |Running with 48*40 (== 1920) tasks. Time: 31.847 |Time: 30.532 Running with 48*40 (== 1920) tasks. |Running with 48*40 (== 1920) tasks. Time: 31.828 |Time: 30.871 Running with 24*40 (== 960) tasks. |Running with 24*40 (== 960) tasks. Time: 15.768 |Time: 15.284 Running with 24*40 (== 960) tasks. |Running with 24*40 (== 960) tasks. Time: 15.720 |Time: 15.228 Running with 24*40 (== 960) tasks. |Running with 24*40 (== 960) tasks. Time: 15.819 |Time: 15.373 Running with 24*40 (== 960) tasks. |Running with 24*40 (== 960) tasks. Time: 15.888 |Time: 15.184 Running with 24*40 (== 960) tasks. |Running with 24*40 (== 960) tasks. Time: 15.888 |Time: 15.184 Running with 24*40 (== 960) tasks. |Running with 24*40 (== 960) tasks. Time: 15.660 |Time: 15.525 Running with 24*40 (== 960) tasks. |Running with 24*40 (== 960) tasks. Time: 15.934 |Time: 15.337 Running with 24*40 (== 960) tasks. |Running with 24*40 (== 960) tasks. Time: 15.669 |Time: 15.357 Running with 12*40 (== 480) tasks. |Running with 12*40 (== 480) tasks. Time: 7.699 |Time: 7.458 Running with 12*40 (== 480) tasks. |Running with 12*40 (== 480) tasks. Time: 7.693 |Time: 7.498 Running with 12*40 (== 480) tasks. |Running with 12*40 (== 480) tasks. Time: 7.705 |Time: 7.439 Running with 12*40 (== 480) tasks. |Running with 12*40 (== 480) tasks. Time: 7.664 |Time: 7.553 Running with 12*40 (== 480) tasks. |Running with 12*40 (== 480) tasks. Time: 7.603 |Time: 7.470 Running with 12*40 (== 480) tasks. |Running with 12*40 (== 480) tasks. Time: 7.651 |Time: 7.491 Running with 12*40 (== 480) tasks. |Running with 12*40 (== 480) tasks. Time: 7.647 |Time: 7.535 Running with 12*40 (== 480) tasks. |Running with 12*40 (== 480) tasks. Time: 7.647 |Time: 7.535 Running with 6*40 (== 240) tasks. |Running with 6*40 (== 240) tasks. Time: 6.054 |Time: 5.293 Running with 6*40 (== 240) tasks. |Running with 6*40 (== 240) tasks. Time: 5.417 |Time: 5.701 Running with 6*40 (== 240) tasks. |Running with 6*40 (== 240) tasks. Time: 5.287 |Time: 5.240 Running with 6*40 (== 240) tasks. |Running with 6*40 (== 240) tasks. Time: 5.594 |Time: 5.571 Running with 6*40 (== 240) tasks. |Running with 6*40 (== 240) tasks. Time: 5.347 |Time: 6.136 Running with 6*40 (== 240) tasks. |Running with 6*40 (== 240) tasks. Time: 5.430 |Time: 5.323 Running with 6*40 (== 240) tasks. |Running with 6*40 (== 240) tasks. Time: 5.691 |Time: 5.481 Running with 1*40 (== 40) tasks. |Running with 1*40 (== 40) tasks. Time: 1.192 |Time: 1.140 Running with 1*40 (== 40) tasks. |Running with 1*40 (== 40) tasks. Time: 1.190 |Time: 1.125 Running with 1*40 (== 40) tasks. |Running with 1*40 (== 40) tasks. Time: 1.189 |Time: 1.013 Running with 1*40 (== 40) tasks. |Running with 1*40 (== 40) tasks. Time: 1.189 |Time: 1.013 Running with 1*40 (== 40) tasks. |Running with 1*40 (== 40) tasks. Time: 1.163 |Time: 1.060 Running with 1*40 (== 40) tasks. |Running with 1*40 (== 40) tasks. Time: 1.186 |Time: 1.131 Running with 1*40 (== 40) tasks. |Running with 1*40 (== 40) tasks. Time: 1.175 |Time: 1.125 Running with 1*40 (== 40) tasks. |Running with 1*40 (== 40) tasks. Time: 1.157 |Time: 0.998

BTW, I got panic while rebooting, but should not caused by this patch set, will recheck and post the report later.

Regards, Michael Wang

INFO: rcu_sched detected stalls on CPUs/tasks: { 7} (detected by 1, t=21002 jiffies, g=6707, c=6706, q=227) Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 7 CPU: 7 PID: 1040 Comm: bioset Not tainted 3.14.0-rc2-test+ #402 Hardware name: IBM System x3650 M3 -[794582A]-/94Y7614, BIOS -[D6E154AUS-1.13]- 09/23/2011 0000000000000000 ffff88097f2e7bd8 ffffffff8156b38a 0000000000004f27 ffffffff817ecb90 ffff88097f2e7c58 ffffffff81561d8d ffff88097f2e7c08 ffffffff00000010 ffff88097f2e7c68 ffff88097f2e7c08 ffff88097f2e7c78 Call Trace: <NMI> [<ffffffff8156b38a>] dump_stack+0x46/0x58 [<ffffffff81561d8d>] panic+0xbe/0x1ce [<ffffffff810e6b03>] watchdog_overflow_callback+0xb3/0xc0 [<ffffffff8111e928>] __perf_event_overflow+0x98/0x220 [<ffffffff8111f224>] perf_event_overflow+0x14/0x20 [<ffffffff8101eef2>] intel_pmu_handle_irq+0x1c2/0x2c0 [<ffffffff81089af9>] ? load_balance+0xf9/0x590 [<ffffffff81089b0d>] ? load_balance+0x10d/0x590 [<ffffffff81562ac2>] ? printk+0x4d/0x4f [<ffffffff815763b4>] perf_event_nmi_handler+0x34/0x60 [<ffffffff81575b6e>] nmi_handle+0x7e/0x140 [<ffffffff81575d1a>] default_do_nmi+0x5a/0x250 [<ffffffff81575fa0>] do_nmi+0x90/0xd0 [<ffffffff815751e7>] end_repeat_nmi+0x1e/0x2e [<ffffffff81089340>] ? find_busiest_group+0x120/0x7e0 [<ffffffff81089340>] ? find_busiest_group+0x120/0x7e0 [<ffffffff81089340>] ? find_busiest_group+0x120/0x7e0 <<EOE>> [<ffffffff81089b7c>] load_balance+0x17c/0x590 [<ffffffff8108a49f>] idle_balance+0x10f/0x1c0 [<ffffffff8108a66e>] pick_next_task_fair+0x11e/0x2a0 [<ffffffff8107ba53>] ? dequeue_task+0x73/0x90 [<ffffffff815712b7>] __schedule+0x127/0x670 [<ffffffff815718d9>] schedule+0x29/0x70 [<ffffffff8104e3b5>] do_exit+0x2a5/0x470 [<ffffffff81066c90>] ? process_scheduled_works+0x40/0x40 [<ffffffff8106e78a>] kthread+0xba/0xe0 [<ffffffff8106e6d0>] ? flush_kthread_worker+0xb0/0xb0 [<ffffffff8157d0ec>] ret_from_fork+0x7c/0xb0 [<ffffffff8106e6d0>] ? flush_kthread_worker+0xb0/0xb0 Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)

...

This patch rebase on latest tip/master. The git tree for this patchset at: git@github.com:alexshi/power-scheduling.git noload

Thanks Alex

-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/

Alex Shi

6:03 a.m.

On 02/18/2014 12:52 PM, Michael wang wrote:

...

On 02/17/2014 09:55 AM, Alex Shi wrote:

...
The cpu_load decays on time according past cpu load of rq. The sched_avg also decays tasks' load on time. Now we has 2 kind decay for cpu_load. That is a kind of redundancy. And increase the system load by decay calculation. This patch try to remove the cpu_load decay.

There are 5 load_idx used for cpu_load in sched_domain. busy_idx and idle_idx are not zero usually, but newidle_idx, wake_idx and forkexec_idx are all zero on every arch. A shortcut to remove cpu_Load decay in the first patch. just one line patch for this change.

V2, 1, This version do some tuning on load bias of target load, to maximum match current code logical. 2, Got further to remove the cpu_load in rq. 3, Revert the patch 'Limit sd->*_idx range on sysctl' since no needs

Any testing/comments are appreciated.

Tested on 12-cpu-x86 box with tip/master, ebizzy and hackbench works fine, show little improvements for each time's testing.

Thanks a lot for your data!

...

BTW, I got panic while rebooting, but should not caused by this patch set, will recheck and post the report later.

I reviewed my patch again. Also didn't find suspicious line for the following rcu stall. Will wait for your report. :)

...

-- Thanks Alex

Michael wang

6:17 a.m.

On 02/18/2014 02:03 PM, Alex Shi wrote: [snip]

...

...
I reviewed my patch again. Also didn't find suspicious line for the following rcu stall. Will wait for your report. :)

Posted, it will be triggered in pure tip/master, your patch set was innocent ;-)

Regards, Michael Wang

...

...

Morten Rasmussen

12:05 p.m.

On Mon, Feb 17, 2014 at 01:55:06AM +0000, Alex Shi wrote:

...

The cpu_load decays on time according past cpu load of rq. The sched_avg also decays tasks' load on time. Now we has 2 kind decay for cpu_load. That is a kind of redundancy. And increase the system load by decay calculation. This patch try to remove the cpu_load decay.

There are 5 load_idx used for cpu_load in sched_domain. busy_idx and idle_idx are not zero usually, but newidle_idx, wake_idx and forkexec_idx are all zero on every arch. A shortcut to remove cpu_Load decay in the first patch. just one line patch for this change.

V2, 1, This version do some tuning on load bias of target load, to maximum match current code logical. 2, Got further to remove the cpu_load in rq. 3, Revert the patch 'Limit sd->*_idx range on sysctl' since no needs

Any testing/comments are appreciated.

Removing cpu_load completely certainly makes things simpler, my worry is just how much was lost by doing it. I agree that cpu_load needs a cleanup, but I can't convince myself that just removing it completely and not having any longer term view of cpu load anymore is without any negative side-effects.

{source, target}_load() are now instantaneous views of the cpu load, which means that they may change very frequently. That could potentially lead to more task migrations at all levels in the domain hierarchy as we no longer have the more conservative cpu_load[] indexes that were used at NUMA level.

Maybe some of the NUMA experts have an opinion about this?

In the discussions around V1 I think blocked load came up again as a potential replacement for the current cpu_load array. There are some issues that need to be solved around blocked_load first though.

Morten

Vincent Guittot

12:28 p.m.

On 18 February 2014 13:05, Morten Rasmussen morten.rasmussen@arm.com wrote:

...

On Mon, Feb 17, 2014 at 01:55:06AM +0000, Alex Shi wrote:

...
The cpu_load decays on time according past cpu load of rq. The sched_avg also decays tasks' load on time. Now we has 2 kind decay for cpu_load. That is a kind of redundancy. And increase the system load by decay calculation. This patch try to remove the cpu_load decay.

There are 5 load_idx used for cpu_load in sched_domain. busy_idx and idle_idx are not zero usually, but newidle_idx, wake_idx and forkexec_idx are all zero on every arch. A shortcut to remove cpu_Load decay in the first patch. just one line patch for this change.

V2, 1, This version do some tuning on load bias of target load, to maximum match current code logical. 2, Got further to remove the cpu_load in rq. 3, Revert the patch 'Limit sd->*_idx range on sysctl' since no needs

Any testing/comments are appreciated.

Removing cpu_load completely certainly makes things simpler, my worry is just how much was lost by doing it. I agree that cpu_load needs a cleanup, but I can't convince myself that just removing it completely and not having any longer term view of cpu load anymore is without any negative side-effects.

Hi Alex,

Have you followed this thread about load_idx and the interest of using them to use different average period ? https://lkml.org/lkml/2014/1/6/499

Vincent

...

{source, target}_load() are now instantaneous views of the cpu load, which means that they may change very frequently. That could potentially lead to more task migrations at all levels in the domain hierarchy as we no longer have the more conservative cpu_load[] indexes that were used at NUMA level.

Maybe some of the NUMA experts have an opinion about this?

In the discussions around V1 I think blocked load came up again as a potential replacement for the current cpu_load array. There are some issues that need to be solved around blocked_load first though.

Morten

Alex Shi

19 Feb 19 Feb

10:23 a.m.

...

...
Removing cpu_load completely certainly makes things simpler, my worry is just how much was lost by doing it. I agree that cpu_load needs a cleanup, but I can't convince myself that just removing it completely and not having any longer term view of cpu load anymore is without any negative side-effects.

Hi Alex,

Have you followed this thread about load_idx and the interest of using them to use different average period ? https://lkml.org/lkml/2014/1/6/499

Yes, I hoped to use blocked load before. But I still can not figure out the correct usage of it. Or maybe we need more quick decay for blocked load? Or, maybe clean cpu_load is helpful to make room to reconsider this.

...

Vincent

...
{source, target}_load() are now instantaneous views of the cpu load, which means that they may change very frequently. That could potentially lead to more task migrations at all levels in the domain hierarchy as we no longer have the more conservative cpu_load[] indexes that were used at NUMA level.

Maybe some of the NUMA experts have an opinion about this?

In the discussions around V1 I think blocked load came up again as a potential replacement for the current cpu_load array. There are some issues that need to be solved around blocked_load first though.

Morten

-- Thanks Alex

Alex Shi

10:18 a.m.

On 02/18/2014 08:05 PM, Morten Rasmussen wrote:

...

On Mon, Feb 17, 2014 at 01:55:06AM +0000, Alex Shi wrote:

...
The cpu_load decays on time according past cpu load of rq. The sched_avg also decays tasks' load on time. Now we has 2 kind decay for cpu_load. That is a kind of redundancy. And increase the system load by decay calculation. This patch try to remove the cpu_load decay.

There are 5 load_idx used for cpu_load in sched_domain. busy_idx and idle_idx are not zero usually, but newidle_idx, wake_idx and forkexec_idx are all zero on every arch. A shortcut to remove cpu_Load decay in the first patch. just one line patch for this change.

V2, 1, This version do some tuning on load bias of target load, to maximum match current code logical. 2, Got further to remove the cpu_load in rq. 3, Revert the patch 'Limit sd->*_idx range on sysctl' since no needs

Any testing/comments are appreciated.

Removing cpu_load completely certainly makes things simpler, my worry is just how much was lost by doing it. I agree that cpu_load needs a cleanup, but I can't convince myself that just removing it completely and not having any longer term view of cpu load anymore is without any negative side-effects.

{source, target}_load() are now instantaneous views of the cpu load, which means that they may change very frequently. That could potentially lead to more task migrations at all levels in the domain hierarchy as we no longer have the more conservative cpu_load[] indexes that were used at NUMA level.

Maybe some of the NUMA experts have an opinion about this?

cc to Mel Gorman.

...

In the discussions around V1 I think blocked load came up again as a potential replacement for the current cpu_load array. There are some issues that need to be solved around blocked_load first though.

Morten

-- Thanks Alex

4269

days inactive

4277

days old

linaro-kernel@lists.linaro.org

28 comments

participants

tags (0)

participants (4)

Alex Shi
Michael wang
Morten Rasmussen
Vincent Guittot