This series backports the following patches to v6.6.

Link: https://lore.kernel.org/lkml/20251107160645.929564468@infradead.org/
Peter Zijlstra (3):
  sched/fair: Revert max_newidle_lb_cost bump
  sched/fair: Small cleanup to sched_balance_newidle()
  sched/fair: Small cleanup to update_newidle_cost()

Peter Zijlstra (Intel) (1):
  sched/fair: Proportional newidle balance
 include/linux/sched/topology.h |  3 ++
 kernel/sched/core.c            |  3 ++
 kernel/sched/fair.c            | 75 +++++++++++++++++++++++-----------
 kernel/sched/features.h        |  5 +++
 kernel/sched/sched.h           |  7 ++++
 kernel/sched/topology.c        |  6 +++
 6 files changed, 75 insertions(+), 24 deletions(-)
From: Peter Zijlstra <peterz@infradead.org>
commit d206fbad9328ddb68ebabd7cf7413392acd38081 upstream.
Many people reported regressions on their database workloads due to:
155213a2aed4 ("sched/fair: Bump sd->max_newidle_lb_cost when newidle balance fails")
For instance Adam Li reported a 6% regression on SpecJBB.
Conversely, this will regress schbench again; on my machine it drops from 2.22 Mrps down to 2.04 Mrps.
Reported-by: Joseph Salisbury <joseph.salisbury@oracle.com>
Reported-by: Adam Li <adamli@os.amperecomputing.com>
Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Chris Mason <clm@meta.com>
Link: https://lkml.kernel.org/r/20250626144017.1510594-2-clm@fb.com
Link: https://lkml.kernel.org/r/006c9df2-b691-47f1-82e6-e233c3f91faf@oracle.com
Link: https://patch.msgid.link/20251107161739.406147760@infradead.org
[ Ajay: Modified to apply on v6.6 ]
Signed-off-by: Ajay Kaher <ajay.kaher@broadcom.com>
---
 kernel/sched/fair.c | 19 +++----------------
 1 file changed, 3 insertions(+), 16 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7f23b866c..842d54a91 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11710,14 +11710,8 @@ static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
 		/*
 		 * Track max cost of a domain to make sure to not delay the
 		 * next wakeup on the CPU.
-		 *
-		 * sched_balance_newidle() bumps the cost whenever newidle
-		 * balance fails, and we don't want things to grow out of
-		 * control. Use the sysctl_sched_migration_cost as the upper
-		 * limit, plus a litle extra to avoid off by ones.
 		 */
-		sd->max_newidle_lb_cost =
-			min(cost, sysctl_sched_migration_cost + 200);
+		sd->max_newidle_lb_cost = cost;
 		sd->last_decay_max_lb_cost = jiffies;
 	} else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
 		/*
@@ -12403,17 +12397,10 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 
 			t1 = sched_clock_cpu(this_cpu);
 			domain_cost = t1 - t0;
+			update_newidle_cost(sd, domain_cost);
+
 			curr_cost += domain_cost;
 			t0 = t1;
-
-			/*
-			 * Failing newidle means it is not effective;
-			 * bump the cost so we end up doing less of it.
-			 */
-			if (!pulled_task)
-				domain_cost = (3 * sd->max_newidle_lb_cost) / 2;
-
-			update_newidle_cost(sd, domain_cost);
 		}
 
 		/*
From: Peter Zijlstra <peterz@infradead.org>
commit e78e70dbf603c1425f15f32b455ca148c932f6c1 upstream.
Pull out the !sd check to simplify code.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Chris Mason <clm@meta.com>
Link: https://patch.msgid.link/20251107161739.525916173@infradead.org
[ Ajay: Modified to apply on v6.6 ]
Signed-off-by: Ajay Kaher <ajay.kaher@broadcom.com>
---
 kernel/sched/fair.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 842d54a91..e47bf8d6c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12362,14 +12362,15 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 
 	rcu_read_lock();
 	sd = rcu_dereference_check_sched_domain(this_rq->sd);
+	if (!sd) {
+		rcu_read_unlock();
+		goto out;
+	}
 
 	if (!READ_ONCE(this_rq->rd->overload) ||
-	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
-
-		if (sd)
-			update_next_balance(sd, &next_balance);
+	    this_rq->avg_idle < sd->max_newidle_lb_cost) {
+		update_next_balance(sd, &next_balance);
 		rcu_read_unlock();
-
 		goto out;
 	}
 	rcu_read_unlock();
From: Peter Zijlstra (Intel) <peterz@infradead.org>
commit 33cf66d88306663d16e4759e9d24766b0aaa2e17 upstream.
Add a randomized algorithm that runs newidle balancing proportional to its success rate.
This improves schbench significantly:
  6.18-rc4:               2.22 Mrps
  6.18-rc4+revert:        2.04 Mrps
  6.18-rc4+revert+random: 2.18 Mrps
Conversely, per Adam Li this affects SpecJBB slightly, reducing it by 1%:
  6.17:               -6%
  6.17+revert:         0%
  6.17+revert+random: -1%
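To make the gating arithmetic easier to follow, here is a small stand-alone
user-space sketch of the idea (illustration only, not part of the patch):
the per-domain success ratio is kept on a 0..1024 scale, a newidle attempt
actually balances with probability roughly (ratio + 1)/1024, and each attempt
that does run is weighted by about the inverse of that probability so the
tracked success statistics are not skewed by the skipped attempts.

  /* Illustrative sketch only; rand() stands in for the kernel's per-CPU RNG. */
  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
          unsigned int ratio = 512;          /* initial ~50% success rate */
          unsigned int chance = 1 + ratio;   /* balance with probability chance/1024 */
          unsigned int runs = 0;

          for (int i = 0; i < 100000; i++) {
                  unsigned int d1k = rand() % 1024;  /* throw a 1k sided dice */

                  if (d1k > chance)
                          continue;          /* skip this newidle attempt */
                  runs++;                    /* the kernel would load_balance() here */
          }

          /* runs that do happen get a stat weight of ~1024/chance to stay unbiased */
          printf("balanced on %u of 100000 attempts, stat weight %u\n",
                 runs, (1024 + chance / 2) / chance);
          return 0;
  }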
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Chris Mason <clm@meta.com>
Link: https://lkml.kernel.org/r/6825c50d-7fa7-45d8-9b81-c6e7e25738e2@meta.com
Link: https://patch.msgid.link/20251107161739.770122091@infradead.org
[ Ajay: Modified to apply on v6.6 ]
Signed-off-by: Ajay Kaher <ajay.kaher@broadcom.com>
---
 include/linux/sched/topology.h |  3 +++
 kernel/sched/core.c            |  3 +++
 kernel/sched/fair.c            | 44 ++++++++++++++++++++++++++++++----
 kernel/sched/features.h        |  5 ++++
 kernel/sched/sched.h           |  7 ++++++
 kernel/sched/topology.c        |  6 +++++
 6 files changed, 64 insertions(+), 4 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 9671b7234..197039bab 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -106,6 +106,9 @@ struct sched_domain {
 	unsigned int nr_balance_failed; /* initialise to 0 */
 
 	/* idle_balance() stats */
+	unsigned int newidle_call;
+	unsigned int newidle_success;
+	unsigned int newidle_ratio;
 	u64 max_newidle_lb_cost;
 	unsigned long last_decay_max_lb_cost;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1b5e4389f..c4a9797e9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -116,6 +116,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU(struct rnd_state, sched_rnd_state);
 
 #ifdef CONFIG_SCHED_DEBUG
 /*
@@ -9872,6 +9873,8 @@ void __init sched_init_smp(void)
 {
 	sched_init_numa(NUMA_NO_NODE);
 
+	prandom_init_once(&sched_rnd_state);
+
 	/*
 	 * There's no userspace yet to cause hotplug operations; hence all the
 	 * CPU masks are stable and all blatant races in the below code cannot
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f93a6a12e..a10df85e4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11704,11 +11704,27 @@ void update_max_interval(void)
 	max_load_balance_interval = HZ*num_online_cpus()/10;
 }
 
-static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
+static inline void update_newidle_stats(struct sched_domain *sd, unsigned int success)
+{
+	sd->newidle_call++;
+	sd->newidle_success += success;
+
+	if (sd->newidle_call >= 1024) {
+		sd->newidle_ratio = sd->newidle_success;
+		sd->newidle_call /= 2;
+		sd->newidle_success /= 2;
+	}
+}
+
+static inline bool
+update_newidle_cost(struct sched_domain *sd, u64 cost, unsigned int success)
 {
 	unsigned long next_decay = sd->last_decay_max_lb_cost + HZ;
 	unsigned long now = jiffies;
 
+	if (cost)
+		update_newidle_stats(sd, success);
+
 	if (cost > sd->max_newidle_lb_cost) {
 		/*
 		 * Track max cost of a domain to make sure to not delay the
@@ -11756,7 +11772,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 		 * Decay the newidle max times here because this is a regular
 		 * visit to all the domains.
 		 */
-		need_decay = update_newidle_cost(sd, 0);
+		need_decay = update_newidle_cost(sd, 0, 0);
 		max_cost += sd->max_newidle_lb_cost;
 
 		/*
@@ -12394,6 +12410,22 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 			break;
 
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
+			unsigned int weight = 1;
+
+			if (sched_feat(NI_RANDOM)) {
+				/*
+				 * Throw a 1k sided dice; and only run
+				 * newidle_balance according to the success
+				 * rate.
+				 */
+				u32 d1k = sched_rng() % 1024;
+				weight = 1 + sd->newidle_ratio;
+				if (d1k > weight) {
+					update_newidle_stats(sd, 0);
+					continue;
+				}
+				weight = (1024 + weight/2) / weight;
+			}
 
 			pulled_task = load_balance(this_cpu, this_rq,
 						   sd, CPU_NEWLY_IDLE,
@@ -12401,10 +12433,14 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 
 			t1 = sched_clock_cpu(this_cpu);
 			domain_cost = t1 - t0;
-			update_newidle_cost(sd, domain_cost);
-
 			curr_cost += domain_cost;
 			t0 = t1;
+
+			/*
+			 * Track max cost of a domain to make sure to not delay the
+			 * next wakeup on the CPU.
+			 */
+			update_newidle_cost(sd, domain_cost, weight * !!pulled_task);
 		}
 
 		/*
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index f77016823..48b104ab5 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -88,4 +88,9 @@ SCHED_FEAT(UTIL_EST_FASTUP, true)
 
 SCHED_FEAT(LATENCY_WARN, false)
 
+/*
+ * Do newidle balancing proportional to its success rate using randomization.
+ */
+SCHED_FEAT(NI_RANDOM, true)
+
 SCHED_FEAT(HZ_BW, true)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 64634314a..e1913e253 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -5,6 +5,7 @@
 #ifndef _KERNEL_SCHED_SCHED_H
 #define _KERNEL_SCHED_SCHED_H
 
+#include <linux/prandom.h>
 #include <linux/sched/affinity.h>
 #include <linux/sched/autogroup.h>
 #include <linux/sched/cpufreq.h>
@@ -1205,6 +1206,12 @@ static inline bool is_migration_disabled(struct task_struct *p)
 }
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DECLARE_PER_CPU(struct rnd_state, sched_rnd_state);
+
+static inline u32 sched_rng(void)
+{
+	return prandom_u32_state(this_cpu_ptr(&sched_rnd_state));
+}
 
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
 #define this_rq()		this_cpu_ptr(&runqueues)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b87426b74..9fa77e7a6 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1600,6 +1600,12 @@ sd_init(struct sched_domain_topology_level *tl,
 
 		.last_balance		= jiffies,
 		.balance_interval	= sd_weight,
+
+		/* 50% success rate */
+		.newidle_call		= 512,
+		.newidle_success	= 256,
+		.newidle_ratio		= 512,
+
 		.max_newidle_lb_cost	= 0,
 		.last_decay_max_lb_cost	= jiffies,
 		.child			= child,
From: Peter Zijlstra <peterz@infradead.org>
commit 08d473dd8718e4a4d698b1113a14a40ad64a909b upstream.
Simplify code by adding a few variables.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Chris Mason <clm@meta.com>
Link: https://patch.msgid.link/20251107161739.655208666@infradead.org
[ Ajay: Modified to apply on v6.6 ]
Signed-off-by: Ajay Kaher <ajay.kaher@broadcom.com>
---
 kernel/sched/fair.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e47bf8d6c..f93a6a12e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11706,22 +11706,25 @@ void update_max_interval(void)
 
 static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
 {
+	unsigned long next_decay = sd->last_decay_max_lb_cost + HZ;
+	unsigned long now = jiffies;
+
 	if (cost > sd->max_newidle_lb_cost) {
 		/*
 		 * Track max cost of a domain to make sure to not delay the
 		 * next wakeup on the CPU.
 		 */
 		sd->max_newidle_lb_cost = cost;
-		sd->last_decay_max_lb_cost = jiffies;
-	} else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
+		sd->last_decay_max_lb_cost = now;
+
+	} else if (time_after(now, next_decay)) {
 		/*
 		 * Decay the newidle max times by ~1% per second to ensure that
 		 * it is not outdated and the current max cost is actually
 		 * shorter.
 		 */
 		sd->max_newidle_lb_cost = (sd->max_newidle_lb_cost * 253) / 256;
-		sd->last_decay_max_lb_cost = jiffies;
-
+		sd->last_decay_max_lb_cost = now;
 		return true;
 	}