The patch below does not apply to the 6.12-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to stable@vger.kernel.org.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.12.y git checkout FETCH_HEAD git cherry-pick -x 779b1a1cb13ae17028aeddb2fbbdba97357a1e15 # <resolve conflicts, build, test, etc.> git commit -s git send-email --to 'stable@vger.kernel.org' --in-reply-to '2025082214-daughter-popsicle-4c08@gregkh' --subject-prefix 'PATCH 6.12.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 779b1a1cb13ae17028aeddb2fbbdba97357a1e15 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" rafael.j.wysocki@intel.com Date: Wed, 13 Aug 2025 12:25:58 +0200 Subject: [PATCH] cpuidle: governors: menu: Avoid selecting states with too much latency
Occasionally, the exit latency of the idle state selected by the menu governor may exceed the PM QoS CPU wakeup latency limit. Namely, if the scheduler tick has been stopped already and predicted_ns is greater than the tick period length, the governor may return an idle state whose exit latency exceeds latency_req because that decision is made before checking the current idle state's exit latency.
For instance, say that there are 3 idle states, 0, 1, and 2. For idle states 0 and 1, the exit latency is equal to the target residency and the values are 0 and 5 us, respectively. State 2 is deeper and has the exit latency and target residency of 200 us and 2 ms (which is greater than the tick period length), respectively.
Say that predicted_ns is equal to TICK_NSEC and the PM QoS latency limit is 20 us. After the first two iterations of the main loop in menu_select(), idx becomes 1 and in the third iteration of it the target residency of the current state (state 2) is greater than predicted_ns. State 2 is not a polling one and predicted_ns is not less than TICK_NSEC, so the check on whether or not the tick has been stopped is done. Say that the tick has been stopped already and there are no imminent timers (that is, delta_tick is greater than the target residency of state 2). In that case, idx becomes 2 and it is returned immediately, but the exit latency of state 2 exceeds the latency limit.
Address this issue by modifying the code to compare the exit latency of the current idle state (idle state i) with the latency limit before comparing its target residency with predicted_ns, which allows one more exit_latency_ns check that becomes redundant to be dropped.
However, after the above change, latency_req cannot take the predicted_ns value any more, which takes place after commit 38f83090f515 ("cpuidle: menu: Remove iowait influence"), because it may cause a polling state to be returned prematurely.
In the context of the previous example say that predicted_ns is 3000 and the PM QoS latency limit is still 20 us. Additionally, say that idle state 0 is a polling one. Moving the exit_latency_ns check before the target_residency_ns one causes the loop to terminate in the second iteration, before the target_residency_ns check, so idle state 0 will be returned even though previously state 1 would be returned if there were no imminent timers.
For this reason, remove the assignment of the predicted_ns value to latency_req from the code.
Fixes: 5ef499cd571c ("cpuidle: menu: Handle stopped tick more aggressively") Cc: 4.17+ stable@vger.kernel.org # 4.17+ Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Reviewed-by: Christian Loehle christian.loehle@arm.com Link: https://patch.msgid.link/5043159.31r3eYUQgx@rafael.j.wysocki
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c index 81306612a5c6..b2e3d0b0a116 100644 --- a/drivers/cpuidle/governors/menu.c +++ b/drivers/cpuidle/governors/menu.c @@ -287,20 +287,15 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, return 0; }
- if (tick_nohz_tick_stopped()) { - /* - * If the tick is already stopped, the cost of possible short - * idle duration misprediction is much higher, because the CPU - * may be stuck in a shallow idle state for a long time as a - * result of it. In that case say we might mispredict and use - * the known time till the closest timer event for the idle - * state selection. - */ - if (predicted_ns < TICK_NSEC) - predicted_ns = data->next_timer_ns; - } else if (latency_req > predicted_ns) { - latency_req = predicted_ns; - } + /* + * If the tick is already stopped, the cost of possible short idle + * duration misprediction is much higher, because the CPU may be stuck + * in a shallow idle state for a long time as a result of it. In that + * case, say we might mispredict and use the known time till the closest + * timer event for the idle state selection. + */ + if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC) + predicted_ns = data->next_timer_ns;
/* * Find the idle state with the lowest power while satisfying @@ -316,13 +311,15 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, if (idx == -1) idx = i; /* first enabled state */
+ if (s->exit_latency_ns > latency_req) + break; + if (s->target_residency_ns > predicted_ns) { /* * Use a physical idle state, not busy polling, unless * a timer is going to trigger soon enough. */ if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) && - s->exit_latency_ns <= latency_req && s->target_residency_ns <= data->next_timer_ns) { predicted_ns = s->target_residency_ns; idx = i; @@ -354,8 +351,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
return idx; } - if (s->exit_latency_ns > latency_req) - break;
idx = i; }
From: Christian Loehle christian.loehle@arm.com
[ Upstream commit 38f83090f515b4b5d59382dfada1e7457f19aa47 ]
Remove CPU iowaiters influence on idle state selection.
Remove the menu notion of performance multiplier which increased with the number of tasks that went to iowait sleep on this CPU and haven't woken up yet.
Relying on iowait for cpuidle is problematic for a few reasons:
1. There is no guarantee that an iowaiting task will wake up on the same CPU.
2. The task being in iowait says nothing about the idle duration, we could be selecting shallower states for a long time.
3. The task being in iowait doesn't always imply a performance hit with increased latency.
4. If there is such a performance hit, the number of iowaiting tasks doesn't directly correlate.
5. The definition of iowait altogether is vague at best, it is sprinkled across kernel code.
Signed-off-by: Christian Loehle christian.loehle@arm.com Link: https://patch.msgid.link/20240905092645.2885200-2-christian.loehle@arm.com [ rjw: Minor edits in the changelog ] Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Stable-dep-of: 779b1a1cb13a ("cpuidle: governors: menu: Avoid selecting states with too much latency") Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/cpuidle/governors/menu.c | 76 ++++---------------------------- 1 file changed, 9 insertions(+), 67 deletions(-)
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c index 01322a905414..f07f76ccfc5e 100644 --- a/drivers/cpuidle/governors/menu.c +++ b/drivers/cpuidle/governors/menu.c @@ -19,7 +19,7 @@
#include "gov.h"
-#define BUCKETS 12 +#define BUCKETS 6 #define INTERVAL_SHIFT 3 #define INTERVALS (1UL << INTERVAL_SHIFT) #define RESOLUTION 1024 @@ -29,12 +29,11 @@ /* * Concepts and ideas behind the menu governor * - * For the menu governor, there are 3 decision factors for picking a C + * For the menu governor, there are 2 decision factors for picking a C * state: * 1) Energy break even point - * 2) Performance impact - * 3) Latency tolerance (from pmqos infrastructure) - * These three factors are treated independently. + * 2) Latency tolerance (from pmqos infrastructure) + * These two factors are treated independently. * * Energy break even point * ----------------------- @@ -75,30 +74,6 @@ * intervals and if the stand deviation of these 8 intervals is below a * threshold value, we use the average of these intervals as prediction. * - * Limiting Performance Impact - * --------------------------- - * C states, especially those with large exit latencies, can have a real - * noticeable impact on workloads, which is not acceptable for most sysadmins, - * and in addition, less performance has a power price of its own. - * - * As a general rule of thumb, menu assumes that the following heuristic - * holds: - * The busier the system, the less impact of C states is acceptable - * - * This rule-of-thumb is implemented using a performance-multiplier: - * If the exit latency times the performance multiplier is longer than - * the predicted duration, the C state is not considered a candidate - * for selection due to a too high performance impact. So the higher - * this multiplier is, the longer we need to be idle to pick a deep C - * state, and thus the less likely a busy CPU will hit such a deep - * C state. - * - * Currently there is only one value determining the factor: - * 10 points are added for each process that is waiting for IO on this CPU. - * (This value was experimentally determined.) - * Utilization is no longer a factor as it was shown that it never contributed - * significantly to the performance multiplier in the first place. - * */
struct menu_device { @@ -112,19 +87,10 @@ struct menu_device { int interval_ptr; };
-static inline int which_bucket(u64 duration_ns, unsigned int nr_iowaiters) +static inline int which_bucket(u64 duration_ns) { int bucket = 0;
- /* - * We keep two groups of stats; one with no - * IO pending, one without. - * This allows us to calculate - * E(duration)|iowait - */ - if (nr_iowaiters) - bucket = BUCKETS/2; - if (duration_ns < 10ULL * NSEC_PER_USEC) return bucket; if (duration_ns < 100ULL * NSEC_PER_USEC) @@ -138,19 +104,6 @@ static inline int which_bucket(u64 duration_ns, unsigned int nr_iowaiters) return bucket + 5; }
-/* - * Return a multiplier for the exit latency that is intended - * to take performance requirements into account. - * The more performance critical we estimate the system - * to be, the higher this multiplier, and thus the higher - * the barrier to go to an expensive C state. - */ -static inline int performance_multiplier(unsigned int nr_iowaiters) -{ - /* for IO wait tasks (per cpu!) we add 10x each */ - return 1 + 10 * nr_iowaiters; -} - static DEFINE_PER_CPU(struct menu_device, menu_devices);
static void menu_update_intervals(struct menu_device *data, unsigned int interval_us) @@ -277,8 +230,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, struct menu_device *data = this_cpu_ptr(&menu_devices); s64 latency_req = cpuidle_governor_latency_req(dev->cpu); u64 predicted_ns; - u64 interactivity_req; - unsigned int nr_iowaiters; ktime_t delta, delta_tick; int i, idx;
@@ -295,8 +246,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, menu_update_intervals(data, UINT_MAX); }
- nr_iowaiters = nr_iowait_cpu(dev->cpu); - /* Find the shortest expected idle interval. */ predicted_ns = get_typical_interval(data) * NSEC_PER_USEC; if (predicted_ns > RESIDENCY_THRESHOLD_NS) { @@ -310,7 +259,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, }
data->next_timer_ns = delta; - data->bucket = which_bucket(data->next_timer_ns, nr_iowaiters); + data->bucket = which_bucket(data->next_timer_ns);
/* Round up the result for half microseconds. */ timer_us = div_u64((RESOLUTION * DECAY * NSEC_PER_USEC) / 2 + @@ -328,7 +277,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, */ data->next_timer_ns = KTIME_MAX; delta_tick = TICK_NSEC / 2; - data->bucket = which_bucket(KTIME_MAX, nr_iowaiters); + data->bucket = which_bucket(KTIME_MAX); }
if (unlikely(drv->state_count <= 1 || latency_req == 0) || @@ -355,15 +304,8 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, */ if (predicted_ns < TICK_NSEC) predicted_ns = data->next_timer_ns; - } else { - /* - * Use the performance multiplier and the user-configurable - * latency_req to determine the maximum exit latency. - */ - interactivity_req = div64_u64(predicted_ns, - performance_multiplier(nr_iowaiters)); - if (latency_req > interactivity_req) - latency_req = interactivity_req; + } else if (latency_req > predicted_ns) { + latency_req = predicted_ns; }
/*
From: "Rafael J. Wysocki" rafael.j.wysocki@intel.com
[ Upstream commit 779b1a1cb13ae17028aeddb2fbbdba97357a1e15 ]
Occasionally, the exit latency of the idle state selected by the menu governor may exceed the PM QoS CPU wakeup latency limit. Namely, if the scheduler tick has been stopped already and predicted_ns is greater than the tick period length, the governor may return an idle state whose exit latency exceeds latency_req because that decision is made before checking the current idle state's exit latency.
For instance, say that there are 3 idle states, 0, 1, and 2. For idle states 0 and 1, the exit latency is equal to the target residency and the values are 0 and 5 us, respectively. State 2 is deeper and has the exit latency and target residency of 200 us and 2 ms (which is greater than the tick period length), respectively.
Say that predicted_ns is equal to TICK_NSEC and the PM QoS latency limit is 20 us. After the first two iterations of the main loop in menu_select(), idx becomes 1 and in the third iteration of it the target residency of the current state (state 2) is greater than predicted_ns. State 2 is not a polling one and predicted_ns is not less than TICK_NSEC, so the check on whether or not the tick has been stopped is done. Say that the tick has been stopped already and there are no imminent timers (that is, delta_tick is greater than the target residency of state 2). In that case, idx becomes 2 and it is returned immediately, but the exit latency of state 2 exceeds the latency limit.
Address this issue by modifying the code to compare the exit latency of the current idle state (idle state i) with the latency limit before comparing its target residency with predicted_ns, which allows one more exit_latency_ns check that becomes redundant to be dropped.
However, after the above change, latency_req cannot take the predicted_ns value any more, which takes place after commit 38f83090f515 ("cpuidle: menu: Remove iowait influence"), because it may cause a polling state to be returned prematurely.
In the context of the previous example say that predicted_ns is 3000 and the PM QoS latency limit is still 20 us. Additionally, say that idle state 0 is a polling one. Moving the exit_latency_ns check before the target_residency_ns one causes the loop to terminate in the second iteration, before the target_residency_ns check, so idle state 0 will be returned even though previously state 1 would be returned if there were no imminent timers.
For this reason, remove the assignment of the predicted_ns value to latency_req from the code.
Fixes: 5ef499cd571c ("cpuidle: menu: Handle stopped tick more aggressively") Cc: 4.17+ stable@vger.kernel.org # 4.17+ Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Reviewed-by: Christian Loehle christian.loehle@arm.com Link: https://patch.msgid.link/5043159.31r3eYUQgx@rafael.j.wysocki Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/cpuidle/governors/menu.c | 29 ++++++++++++----------------- 1 file changed, 12 insertions(+), 17 deletions(-)
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c index f07f76ccfc5e..3eb543b1644d 100644 --- a/drivers/cpuidle/governors/menu.c +++ b/drivers/cpuidle/governors/menu.c @@ -293,20 +293,15 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, return 0; }
- if (tick_nohz_tick_stopped()) { - /* - * If the tick is already stopped, the cost of possible short - * idle duration misprediction is much higher, because the CPU - * may be stuck in a shallow idle state for a long time as a - * result of it. In that case say we might mispredict and use - * the known time till the closest timer event for the idle - * state selection. - */ - if (predicted_ns < TICK_NSEC) - predicted_ns = data->next_timer_ns; - } else if (latency_req > predicted_ns) { - latency_req = predicted_ns; - } + /* + * If the tick is already stopped, the cost of possible short idle + * duration misprediction is much higher, because the CPU may be stuck + * in a shallow idle state for a long time as a result of it. In that + * case, say we might mispredict and use the known time till the closest + * timer event for the idle state selection. + */ + if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC) + predicted_ns = data->next_timer_ns;
/* * Find the idle state with the lowest power while satisfying @@ -322,13 +317,15 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, if (idx == -1) idx = i; /* first enabled state */
+ if (s->exit_latency_ns > latency_req) + break; + if (s->target_residency_ns > predicted_ns) { /* * Use a physical idle state, not busy polling, unless * a timer is going to trigger soon enough. */ if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) && - s->exit_latency_ns <= latency_req && s->target_residency_ns <= data->next_timer_ns) { predicted_ns = s->target_residency_ns; idx = i; @@ -360,8 +357,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
return idx; } - if (s->exit_latency_ns > latency_req) - break;
idx = i; }
linux-stable-mirror@lists.linaro.org