The nr_busy_cpus field of the sched_group_power is sometimes different from 0 even though the platform is fully idle. This series fixes 3 use cases:
 - when the SCHED softirq is raised on an idle core for idle load balance but the platform doesn't go out of the cpuidle state
 - when some CPUs enter idle state while booting all CPUs
 - when a CPU is unplugged and/or replugged
Vincent Guittot (3):
  sched: fix nr_busy_cpus with coupled cpuidle
  sched: fix init NOHZ_IDLE flag
  sched: fix update NOHZ_IDLE flag

 kernel/sched/core.c      | 1 +
 kernel/sched/fair.c      | 2 +-
 kernel/time/tick-sched.c | 2 ++
 3 files changed, 4 insertions(+), 1 deletion(-)
With the coupled cpuidle driver (but probably also with other drivers), a CPU loops in a temporary safe state while waiting for the other CPUs of its cluster to be ready to enter the coupled C-state. If an IRQ or a softirq occurs, the CPU will stay in this internal loop if there is no need to resched. The SCHED softirq clears the NOHZ_IDLE flag and increments nr_busy_cpus. If there is no need to resched, we will not call set_cpu_sd_state_idle because of this internal loop in a cpuidle state. We have to call set_cpu_sd_state_idle in tick_nohz_irq_exit, which is used to handle such a situation.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/time/tick-sched.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 955d35b..b8d74ea 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -570,6 +570,8 @@ void tick_nohz_irq_exit(void)
 	if (!ts->inidle)
 		return;
 
+	set_cpu_sd_state_idle();
+
 	/* Cancel the timer because CPU already waken up from the C-states*/
 	menu_hrtimer_cancel();
 	__tick_nohz_idle_enter(ts);
2012/12/3 Vincent Guittot vincent.guittot@linaro.org:
> With the coupled cpuidle driver (but probably also with other drivers), a CPU loops in a temporary safe state while waiting for other CPUs of its cluster to be ready to enter the coupled C-state. If an IRQ or a softirq occurs, the CPU will stay in this internal loop if there is no need to resched. The SCHED softirq clears the NOHZ and increases nr_busy_cpus. If there is no need to resched, we will not call set_cpu_sd_state_idle because of this internal loop in a cpuidle state. We have to call set_cpu_sd_state_idle in tick_nohz_irq_exit which is used to handle such situation.
I'm a bit confused with this.
set_cpu_sd_state_busy() is only called from nohz_kick_needed(), which checks idle_cpu() before doing anything. So if no task is going to be scheduled, the idle_cpu() check prevents set_cpu_sd_state_busy() from being called.
I'm probably missing something.
Thanks.
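For readers following the argument, here is a minimal user-space sketch of the guard described above. It is not the kernel code: struct cpu_model, the *_model functions and the global counter are hypothetical stand-ins, and the real nohz_kick_needed() goes on to take further decisions that are left out here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-CPU state; not a kernel structure. */
struct cpu_model {
	bool has_runnable_task;	/* what idle_cpu() would look at */
	bool nohz_idle;		/* models the NOHZ_IDLE flag */
};

static int nr_busy_cpus;	/* models sgp->nr_busy_cpus */

static bool idle_cpu_model(const struct cpu_model *c)
{
	return !c->has_runnable_task;
}

/* Busy transition: only counted if the CPU was flagged idle. */
static void set_busy_model(struct cpu_model *c)
{
	if (!c->nohz_idle)
		return;
	c->nohz_idle = false;
	nr_busy_cpus++;
}

/*
 * Models only the early part of nohz_kick_needed(): an idle CPU
 * returns before the busy accounting is touched. The return value
 * says whether the busy update was reached.
 */
static bool nohz_kick_needed_model(struct cpu_model *c)
{
	if (idle_cpu_model(c))
		return false;
	set_busy_model(c);
	return true;
}
```

Under these assumptions, a CPU with nothing to run never reaches the increment, which is the point being made: the counter can only move if idle_cpu() reports the CPU as not idle.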
On 24 January 2013 17:44, Frederic Weisbecker fweisbec@gmail.com wrote:
> I'm a bit confused with this.
> set_cpu_sd_state_busy() is only called from nohz_kick_needed(). And it checks idle_cpu() before doing anything. So if no task is going to be scheduled, idle_cpu() prevents from calling set_cpu_sd_state_busy().
> I'm probably missing something.
Hi Frederic
I can't find the trace that I had saved when I hit the issue, but IIRC the sequence is:
 - The CPU is kicked for ILB
 - The wake_list of the CPU becomes non-empty, so the CPU is not idle
 - The CPU wakes up, updates its timer framework and calls nohz_kick_needed, then executes the ILB sequence
 - We don't go out of the cpuidle driver function because we don't need to resched, so we don't clear the busy state
I'm going to look for the saved trace to check the sequence above
Vincent
On 24 Jan 2013 17:55, "Vincent Guittot" vincent.guittot@linaro.org wrote:
> I can't find the trace that I had saved when I hit the issue, but IIRC the sequence is:
>  - The CPU is kicked for ILB
>  - The wake_list of the CPU becomes non-empty, so the CPU is not idle
>  - The CPU wakes up, updates its timer framework and calls nohz_kick_needed, then executes the ILB sequence
>  - We don't go out of the cpuidle driver function because we don't need to resched, so we don't clear the busy state
This sequence is not the right one
> I'm going to look for the saved trace to check the sequence above
I haven't been able to reproduce the bug that this patch was supposed to solve. Patches 2 and 3 seem to be enough to fix the nr_busy_cpus field. I will continue to try to reproduce it, but it seems that it was a side effect of the issues fixed by the 2 other patches of the series.
Vincent
2013/1/25 Vincent Guittot vincent.guittot@linaro.org:
> This sequence is not the right one
>
>> I'm going to look for the saved trace to check the sequence above
>
> I haven't been able to reproduce the bug that this patch was supposed to solve. Patches 2 and 3 seem to be enough to fix the nr_busy_cpus field. I will continue to try to reproduce it, but it seems that it was a side effect of the issues fixed by the 2 other patches of the series.
Ok. I just checked again as well and I can't find a scenario where this can happen. If you find it out or trigger the bug again, don't hesitate to resend this patch.
Thanks.
On 25 Jan 2013 13:00, "Frederic Weisbecker" fweisbec@gmail.com wrote:
> Ok. I just checked again as well and I can't find a scenario where this can happen. If you find it out or trigger the bug again, don't hesitate to resend this patch.
Ok. I'm going to update the patch series without this patch.
Thanks
2013/1/25 Vincent Guittot vincent.guittot@linaro.org:
> Ok. I'm going to update the patch series without this patch
Actually, your second patch may cause this, as it clears the NOHZ_IDLE flag on CPUs that are idle at boot and that could stay that way for a while. And your second patch is spotting something serious. I'll reply to it after giving it more thought.
On my SMP platform, which is made of 5 cores in 2 clusters, the nr_busy_cpus field of the sched_group_power struct is not zero when the platform is fully idle. The root cause seems to be: during the boot sequence, some CPUs reach the idle loop and set their NOHZ_IDLE flag while waiting for the other CPUs to boot. But the nr_busy_cpus field is initialized later, with the assumption that all CPUs are in the busy state, whereas some CPUs have already set their NOHZ_IDLE flag. We clear the NOHZ_IDLE flag when nr_busy_cpus is initialized in order to have a coherent configuration.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bae620a..77a01c8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5875,6 +5875,7 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 
 	update_group_power(sd, cpu);
 	atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight);
+	clear_bit(NOHZ_IDLE, nohz_flags(cpu));
 }
 
 int __weak arch_sd_sibling_asym_packing(void)
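
The boot-time mismatch described in this patch can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the arrays, the *_model functions and the two-CPU group are hypothetical, and unlike the patch (which clears the flag of the CPU passed to init_sched_groups_power) the model clears the flag of every CPU of the group at init time for simplicity.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_MODEL 2	/* hypothetical group size */

static bool nohz_idle[NR_CPUS_MODEL];	/* models the per-CPU NOHZ_IDLE flag */
static int nr_busy_cpus;		/* models sgp->nr_busy_cpus */

/* Idle transition: a CPU that is already flagged leaves the counter alone. */
static void set_idle_model(int cpu)
{
	if (nohz_idle[cpu])
		return;
	nohz_idle[cpu] = true;
	nr_busy_cpus--;
}

/* Group init: assumes every CPU is busy; clear_flags models the fix. */
static void init_group_model(bool clear_flags)
{
	nr_busy_cpus = NR_CPUS_MODEL;
	if (clear_flags)
		for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
			nohz_idle[cpu] = false;
}

/* Returns nr_busy_cpus once every CPU has gone idle after group init. */
static int boot_then_idle_model(bool clear_flags)
{
	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		nohz_idle[cpu] = false;

	/* CPU0 reaches the idle loop before the group is initialized. */
	nohz_idle[0] = true;

	init_group_model(clear_flags);

	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		set_idle_model(cpu);

	return nr_busy_cpus;
}
```

In this model, boot_then_idle_model(false) leaves the counter at 1 although both CPUs are idle, because CPU0 was already flagged and never decrements it. With the flag cleared at init, boot_then_idle_model(true) ends at 0.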
The function nohz_kick_needed modifies the NOHZ_IDLE flag that is used to update the nr_busy_cpus of the sched_group. When the sched_domains are updated (because of the unplug of a CPU, for example), a null_domain is attached to the CPUs. We have to test likely(!on_null_domain(cpu)) first in order to detect such an initialization step and not modify the NOHZ_IDLE flag.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 24a5588..1ef57a8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6311,7 +6311,7 @@ void trigger_load_balance(struct rq *rq, int cpu)
 	    likely(!on_null_domain(cpu)))
 		raise_softirq(SCHED_SOFTIRQ);
 #ifdef CONFIG_NO_HZ
-	if (nohz_kick_needed(rq, cpu) && likely(!on_null_domain(cpu)))
+	if (likely(!on_null_domain(cpu)) && nohz_kick_needed(rq, cpu))
 		nohz_balancer_kick(cpu);
 #endif
 }
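
The one-line change above relies on the short-circuit evaluation of && in C: the right-hand operand is not evaluated when the left-hand one is false. A minimal sketch of why the order matters when one operand has a side effect follows. The names kick_needed_model, old_order, new_order and flag_updates are hypothetical and only stand in for nohz_kick_needed() and its update of the NOHZ_IDLE flag.

```c
#include <assert.h>
#include <stdbool.h>

static int flag_updates;	/* counts the side effects of the model below */

/* Stand-in for nohz_kick_needed(): has a side effect on every call. */
static bool kick_needed_model(void)
{
	flag_updates++;
	return true;
}

/* Old order: the side effect runs even when the CPU is on the null domain. */
static bool old_order(bool on_null_domain)
{
	return kick_needed_model() && !on_null_domain;
}

/* New order: && short-circuits, so the side effect is skipped on the null domain. */
static bool new_order(bool on_null_domain)
{
	return !on_null_domain && kick_needed_model();
}
```

Both orders return the same boolean result for a given input. Only the number of calls to the function with the side effect differs, which is what the patch depends on.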
On Mon, Dec 3, 2012 at 5:56 PM, Vincent Guittot vincent.guittot@linaro.org wrote:
> The nr_busy_cpus field of the sched_group_power is sometimes different from 0 even though the platform is fully idle. This series fixes 3 use cases:
>  - when the SCHED softirq is raised on an idle core for idle load balance but the platform doesn't go out of the cpuidle state
>  - when some CPUs enter idle state while booting all CPUs
>  - when a CPU is unplugged and/or replugged
You don't want me to pick these, correct? They aren't based off an -rc release, I believe.
-- viresh