I have faced a sequence where the Idle Load Balance was sometime not triggered for a while on my platform.
CPU 0 and CPU 1 are running tasks and CPU 2 is idle
CPU 1 kicks the Idle Load Balance CPU 1 selects CPU 2 as the new Idle Load Balancer CPU 1 sets NOHZ_BALANCE_KICK for CPU 2 CPU 1 sends a reschedule IPI to CPU 2 While CPU 2 wakes up, CPU 0 or CPU 1 migrates a waking task A on CPU 2 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance Task A quickly goes back to sleep (before a tick occurs on CPU 2) CPU 2 goes back to idle with NOHZ_BALANCE_KICK set
Whenever CPU 2 will be selected for the ILB, reschedule IPI will be not sent to CPU2, which is idle, because NOHZ_BALANCE_KICK is already set and no Idle Load Balance will be performed.
We must wait for the sched softirq to be raised on CPU 2 thanks to another part of the kernel to clear NOHZ_BALANCE_KICKand come back to a normal situation.
The proposed solution clears NOHZ_BALANCE_KICK in schedule_ipi if we can't raise the sched_softirq for the Idle Load Balance.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- kernel/sched/core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 58453b8..51fc715 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1420,7 +1420,8 @@ void scheduler_ipi(void) if (unlikely(got_nohz_idle_kick() && !need_resched())) { this_rq()->idle_balance = 1; raise_softirq_irqoff(SCHED_SOFTIRQ); - } + } else + clear_bit(NOHZ_BALANCE_KICK, nohz_flags(smp_processor_id())); irq_exit(); }
On Thu, May 30, 2013 at 05:23:05PM +0200, Vincent Guittot wrote:
I have faced a sequence where the Idle Load Balance was sometime not triggered for a while on my platform.
CPU 0 and CPU 1 are running tasks and CPU 2 is idle
CPU 1 kicks the Idle Load Balance CPU 1 selects CPU 2 as the new Idle Load Balancer CPU 1 sets NOHZ_BALANCE_KICK for CPU 2 CPU 1 sends a reschedule IPI to CPU 2 While CPU 2 wakes up, CPU 0 or CPU 1 migrates a waking task A on CPU 2 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance Task A quickly goes back to sleep (before a tick occurs on CPU 2) CPU 2 goes back to idle with NOHZ_BALANCE_KICK set
Whenever CPU 2 will be selected for the ILB, reschedule IPI will be not sent to CPU2, which is idle, because NOHZ_BALANCE_KICK is already set and no Idle Load Balance will be performed.
We must wait for the sched softirq to be raised on CPU 2 thanks to another part of the kernel to clear NOHZ_BALANCE_KICKand come back to a normal situation.
The proposed solution clears NOHZ_BALANCE_KICK in schedule_ipi if we can't raise the sched_softirq for the Idle Load Balance.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 58453b8..51fc715 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1420,7 +1420,8 @@ void scheduler_ipi(void) if (unlikely(got_nohz_idle_kick() && !need_resched())) { this_rq()->idle_balance = 1; raise_softirq_irqoff(SCHED_SOFTIRQ);
- }
- } else
clear_bit(NOHZ_BALANCE_KICK, nohz_flags(smp_processor_id()));
But then do we reach this if the IPI happens while running the non-idle task in CPU 2? The first got_nohz_idle_kick() test would drop us out early from scheduler_ipi() due to the idle_cpu() test. So the flag doesn't get cleared in this case.
On 4 June 2013 00:48, Frederic Weisbecker fweisbec@gmail.com wrote:
On Thu, May 30, 2013 at 05:23:05PM +0200, Vincent Guittot wrote:
I have faced a sequence where the Idle Load Balance was sometime not triggered for a while on my platform.
CPU 0 and CPU 1 are running tasks and CPU 2 is idle
CPU 1 kicks the Idle Load Balance CPU 1 selects CPU 2 as the new Idle Load Balancer CPU 1 sets NOHZ_BALANCE_KICK for CPU 2 CPU 1 sends a reschedule IPI to CPU 2 While CPU 2 wakes up, CPU 0 or CPU 1 migrates a waking task A on CPU 2 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance Task A quickly goes back to sleep (before a tick occurs on CPU 2) CPU 2 goes back to idle with NOHZ_BALANCE_KICK set
Whenever CPU 2 will be selected for the ILB, reschedule IPI will be not sent to CPU2, which is idle, because NOHZ_BALANCE_KICK is already set and no Idle Load Balance will be performed.
We must wait for the sched softirq to be raised on CPU 2 thanks to another part of the kernel to clear NOHZ_BALANCE_KICKand come back to a normal situation.
The proposed solution clears NOHZ_BALANCE_KICK in schedule_ipi if we can't raise the sched_softirq for the Idle Load Balance.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 58453b8..51fc715 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1420,7 +1420,8 @@ void scheduler_ipi(void) if (unlikely(got_nohz_idle_kick() && !need_resched())) { this_rq()->idle_balance = 1; raise_softirq_irqoff(SCHED_SOFTIRQ);
}
} else
clear_bit(NOHZ_BALANCE_KICK, nohz_flags(smp_processor_id()));
But then do we reach this if the IPI happens while running the non-idle task in CPU 2? The first got_nohz_idle_kick() test would drop us out early from scheduler_ipi() due to the idle_cpu() test. So the flag doesn't get cleared in this case.
The 1st point is that only idle cpu can be selected for idle load balance. But this doesn't prevent the cpu to wake up while it is kicked for idle load balance. I had added the clear_bit for the 1st got_nohz_idle_kick in the draft version of this patch but the test of the emptiness of the wake_list, the call to smp_send_reschedule in the various way to wake up the idle cpu and the results of the tests have convinced me (may be wrongly) that it was not necessary.
Vincent
The best I can seem to come up with is something like the below; but I think its ghastly. Surely we can do something saner with that bit.
Having to clear it at 3 different places is just wrong.
--- kernel/sched/core.c | 35 +++++++++++++++++++++++++---------- kernel/sched/fair.c | 6 ++++-- 2 files changed, 29 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 142c682..4846223 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -630,15 +630,23 @@ void wake_up_nohz_cpu(int cpu) wake_up_idle_cpu(cpu); }
-static inline bool got_nohz_idle_kick(void) +static inline bool got_nohz_idle_kick(int cpu) { - int cpu = smp_processor_id(); - return idle_cpu(cpu) && test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)); + bool nohz_kick = test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)); + + if (!nohz_kick) + return false; + + if (idle_cpu(cpu)) + return true; + + clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)); + return false; }
#else /* CONFIG_NO_HZ_COMMON */
-static inline bool got_nohz_idle_kick(void) +static inline bool got_nohz_idle_kick(int cpu) { return false; } @@ -1395,8 +1403,11 @@ static void sched_ttwu_pending(void)
void scheduler_ipi(void) { - if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick() - && !tick_nohz_full_cpu(smp_processor_id())) + int this_cpu = smp_processor_id(); + bool nohz_kick = got_nohz_idle_kick(this_cpu); + + if (llist_empty(&this_rq()->wake_list) && !nohz_kick && + !tick_nohz_full_cpu(this_cpu)) return;
/* @@ -1413,15 +1424,19 @@ void scheduler_ipi(void) * somewhat pessimize the simple resched case. */ irq_enter(); - tick_nohz_full_check(); + tick_nohz_full_check(); /* restart tick if required */ sched_ttwu_pending();
/* * Check if someone kicked us for doing the nohz idle load balance. */ - if (unlikely(got_nohz_idle_kick() && !need_resched())) { - this_rq()->idle_balance = 1; - raise_softirq_irqoff(SCHED_SOFTIRQ); + if (unlikely(nohz_kick)) { + if (!need_resched()) { + this_rq()->idle_balance = 1; + raise_softirq_irqoff(SCHED_SOFTIRQ); + } else { + clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)); + } } irq_exit(); } diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 143dcdb..4e6cff4 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5600,8 +5600,10 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) struct rq *rq; int balance_cpu;
- if (idle != CPU_IDLE || - !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu))) + if (!test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu))) + return; + + if (idle != CPU_IDLE) goto end;
for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
On Tue, Jun 04, 2013 at 11:36:11AM +0200, Peter Zijlstra wrote:
The best I can seem to come up with is something like the below; but I think its ghastly. Surely we can do something saner with that bit.
Having to clear it at 3 different places is just wrong.
We could clear the flag early in scheduler_ipi() and set some specific value in rq->idle_balance that tells we want nohz idle balancing from the softirq, something like this untested:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 58453b8..330136b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -630,15 +630,14 @@ void wake_up_nohz_cpu(int cpu) wake_up_idle_cpu(cpu); }
-static inline bool got_nohz_idle_kick(void) +static inline bool got_nohz_idle_kick(int cpu) { - int cpu = smp_processor_id(); - return idle_cpu(cpu) && test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)); + return test_and_clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)); }
#else /* CONFIG_NO_HZ_COMMON */
-static inline bool got_nohz_idle_kick(void) +static inline bool got_nohz_idle_kick(int cpu) { return false; } @@ -1393,8 +1392,12 @@ static void sched_ttwu_pending(void)
void scheduler_ipi(void) { - if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick() - && !tick_nohz_full_cpu(smp_processor_id())) + int cpu = smp_processor_id(); + bool idle_kick = got_nohz_idle_kick(cpu); + + if (!(idle_kick && idle_cpu(cpu)) + && llist_empty(&this_rq()->wake_list) + && !tick_nohz_full_cpu(cpu) return;
/* @@ -1417,8 +1420,8 @@ void scheduler_ipi(void) /* * Check if someone kicked us for doing the nohz idle load balance. */ - if (unlikely(got_nohz_idle_kick() && !need_resched())) { - this_rq()->idle_balance = 1; + if (unlikely(idle_kick && idle_cpu(cpu) && !need_resched())) { + this_rq()->idle_balance = IDLE_NOHZ_BALANCE; raise_softirq_irqoff(SCHED_SOFTIRQ); } irq_exit(); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c61a614..816e7b0 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5577,15 +5577,14 @@ out: * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the * rebalancing for all the cpus for whom scheduler ticks are stopped. */ -static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) +static void nohz_idle_balance(int this_cpu) { struct rq *this_rq = cpu_rq(this_cpu); struct rq *rq; int balance_cpu;
- if (idle != CPU_IDLE || - !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu))) - goto end; + if (this_rq->idle_balance != IDLE_NOHZ_BALANCE) + return;
for_each_cpu(balance_cpu, nohz.idle_cpus_mask) { if (balance_cpu == this_cpu || !idle_cpu(balance_cpu)) @@ -5612,8 +5611,12 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) this_rq->next_balance = rq->next_balance; } nohz.next_balance = this_rq->next_balance; -end: - clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)); + + /* There could be concurrent updates from irqs but we don't care */ + if (idle_cpu(this_cpu)) + this_rq->idle_balance = IDLE_BALANCE; + else + this_rq->idle_balance = 0; }
/* @@ -5679,7 +5682,7 @@ need_kick: return 1; } #else -static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) { } +static void nohz_idle_balance(int this_cpu) { } #endif
/* @@ -5700,7 +5703,7 @@ static void run_rebalance_domains(struct softirq_action *h) * balancing on behalf of the other idle cpus whose ticks are * stopped. */ - nohz_idle_balance(this_cpu, idle); + nohz_idle_balance(this_cpu); }
static inline int on_null_domain(int cpu) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index ce39224..e9de976 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -387,6 +387,11 @@ extern struct root_domain def_root_domain;
#endif /* CONFIG_SMP */
+enum idle_balance_type { + IDLE_BALANCE = 1, + IDLE_NOHZ_BALANCE = 2, +}; + /* * This is the main, per-CPU runqueue data structure. * @@ -458,7 +463,7 @@ struct rq {
unsigned long cpu_power;
- unsigned char idle_balance; + enum idle_balance_type idle_balance; /* For active balancing */ int post_schedule; int active_balance;
On 4 June 2013 12:26, Frederic Weisbecker fweisbec@gmail.com wrote:
On Tue, Jun 04, 2013 at 11:36:11AM +0200, Peter Zijlstra wrote:
The best I can seem to come up with is something like the below; but I think its ghastly. Surely we can do something saner with that bit.
Having to clear it at 3 different places is just wrong.
We could clear the flag early in scheduler_ipi() and set some specific value in rq->idle_balance that tells we want nohz idle balancing from the softirq, something like this untested:
I'm not sure that we can have less than 2 places to clear it: cancel place or acknowledge place otherwise we can face a situation where idle load balance will be triggered 2 consecutive times because NOHZ_BALANCE_KICK will be cleared before the idle load balance has been done and had a chance to migrate tasks.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 58453b8..330136b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -630,15 +630,14 @@ void wake_up_nohz_cpu(int cpu) wake_up_idle_cpu(cpu); }
-static inline bool got_nohz_idle_kick(void) +static inline bool got_nohz_idle_kick(int cpu) {
int cpu = smp_processor_id();
return idle_cpu(cpu) && test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
return test_and_clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
}
#else /* CONFIG_NO_HZ_COMMON */
-static inline bool got_nohz_idle_kick(void) +static inline bool got_nohz_idle_kick(int cpu) { return false; } @@ -1393,8 +1392,12 @@ static void sched_ttwu_pending(void)
void scheduler_ipi(void) {
if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick()
&& !tick_nohz_full_cpu(smp_processor_id()))
int cpu = smp_processor_id();
bool idle_kick = got_nohz_idle_kick(cpu);
if (!(idle_kick && idle_cpu(cpu))
&& llist_empty(&this_rq()->wake_list)
&& !tick_nohz_full_cpu(cpu) return; /*
@@ -1417,8 +1420,8 @@ void scheduler_ipi(void) /* * Check if someone kicked us for doing the nohz idle load balance. */
if (unlikely(got_nohz_idle_kick() && !need_resched())) {
this_rq()->idle_balance = 1;
if (unlikely(idle_kick && idle_cpu(cpu) && !need_resched())) {
this_rq()->idle_balance = IDLE_NOHZ_BALANCE; raise_softirq_irqoff(SCHED_SOFTIRQ); } irq_exit();
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c61a614..816e7b0 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5577,15 +5577,14 @@ out:
- In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
- rebalancing for all the cpus for whom scheduler ticks are stopped.
*/ -static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) +static void nohz_idle_balance(int this_cpu) { struct rq *this_rq = cpu_rq(this_cpu); struct rq *rq; int balance_cpu;
if (idle != CPU_IDLE ||
!test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
goto end;
if (this_rq->idle_balance != IDLE_NOHZ_BALANCE)
return; for_each_cpu(balance_cpu, nohz.idle_cpus_mask) { if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
@@ -5612,8 +5611,12 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) this_rq->next_balance = rq->next_balance; } nohz.next_balance = this_rq->next_balance; -end:
clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
/* There could be concurrent updates from irqs but we don't care */
if (idle_cpu(this_cpu))
this_rq->idle_balance = IDLE_BALANCE;
else
this_rq->idle_balance = 0;
}
/* @@ -5679,7 +5682,7 @@ need_kick: return 1; } #else -static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) { } +static void nohz_idle_balance(int this_cpu) { } #endif
/* @@ -5700,7 +5703,7 @@ static void run_rebalance_domains(struct softirq_action *h) * balancing on behalf of the other idle cpus whose ticks are * stopped. */
nohz_idle_balance(this_cpu, idle);
nohz_idle_balance(this_cpu);
}
static inline int on_null_domain(int cpu) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index ce39224..e9de976 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -387,6 +387,11 @@ extern struct root_domain def_root_domain;
#endif /* CONFIG_SMP */
+enum idle_balance_type {
IDLE_BALANCE = 1,
IDLE_NOHZ_BALANCE = 2,
+};
/*
- This is the main, per-CPU runqueue data structure.
@@ -458,7 +463,7 @@ struct rq {
unsigned long cpu_power;
unsigned char idle_balance;
enum idle_balance_type idle_balance; /* For active balancing */ int post_schedule; int active_balance;
On Tue, Jun 04, 2013 at 01:11:47PM +0200, Vincent Guittot wrote:
On 4 June 2013 12:26, Frederic Weisbecker fweisbec@gmail.com wrote:
On Tue, Jun 04, 2013 at 11:36:11AM +0200, Peter Zijlstra wrote:
The best I can seem to come up with is something like the below; but I think its ghastly. Surely we can do something saner with that bit.
Having to clear it at 3 different places is just wrong.
We could clear the flag early in scheduler_ipi() and set some specific value in rq->idle_balance that tells we want nohz idle balancing from the softirq, something like this untested:
I'm not sure that we can have less than 2 places to clear it: cancel place or acknowledge place otherwise we can face a situation where idle load balance will be triggered 2 consecutive times because NOHZ_BALANCE_KICK will be cleared before the idle load balance has been done and had a chance to migrate tasks.
I guess it depends what is the minimum value of rq->next_balance, it seems to be large enough to avoid this kind of incident. Although I don't know well the whole logic with rq->next_balance and ilb trigger so I must defer to you.
On 4 June 2013 13:19, Frederic Weisbecker fweisbec@gmail.com wrote:
On Tue, Jun 04, 2013 at 01:11:47PM +0200, Vincent Guittot wrote:
On 4 June 2013 12:26, Frederic Weisbecker fweisbec@gmail.com wrote:
On Tue, Jun 04, 2013 at 11:36:11AM +0200, Peter Zijlstra wrote:
The best I can seem to come up with is something like the below; but I think its ghastly. Surely we can do something saner with that bit.
Having to clear it at 3 different places is just wrong.
We could clear the flag early in scheduler_ipi() and set some specific value in rq->idle_balance that tells we want nohz idle balancing from the softirq, something like this untested:
I'm not sure that we can have less than 2 places to clear it: cancel place or acknowledge place otherwise we can face a situation where idle load balance will be triggered 2 consecutive times because NOHZ_BALANCE_KICK will be cleared before the idle load balance has been done and had a chance to migrate tasks.
I guess it depends what is the minimum value of rq->next_balance, it seems to be large enough to avoid this kind of incident. Although I don't know well the whole logic with rq->next_balance and ilb trigger so I must defer to you.
In the trace that was showing the issue, i can see that both CPU0 and CPU1 were trying to trig ILB almost simultaneously and the test_and_set NOHZ_BALANCE_KICK filters one request so i would say that clearing the bit before the end of the idle load balance sequence can generate such sequence
In the sequence below, i have minimized the clear of NOHZ_BALANCE_KICK in 2 places : acknowledge and cancel. I have reused part of the proposal from peter which clears the bit if the condition doesn't match but i have reordered the tests to done that only if all other condition are matching
static inline bool got_nohz_idle_kick(void) { - int cpu = smp_processor_id(); - return idle_cpu(cpu) && test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)); + bool nohz_kick = test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)); + + if (!nohz_kick) + return false; + + if (idle_cpu(cpu) && !need_resched()) + return true; + + clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)); + return false; }
#else /* CONFIG_NO_HZ_COMMON */ @@ -1393,8 +1401,9 @@ static void sched_ttwu_pending(void)
void scheduler_ipi(void) { - if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick() - && !tick_nohz_full_cpu(smp_processor_id())) + if (llist_empty(&this_rq()->wake_list) + && !tick_nohz_full_cpu(smp_processor_id()) + && !got_nohz_idle_kick()) return;
/* @@ -1417,7 +1426,7 @@ void scheduler_ipi(void) /* * Check if someone kicked us for doing the nohz idle load balance. */ - if (unlikely(got_nohz_idle_kick() && !need_resched())) { + if (unlikely(got_nohz_idle_kick())) { this_rq()->idle_balance = 1; raise_softirq_irqoff(SCHED_SOFTIRQ); }
On Tue, Jun 04, 2013 at 01:48:47PM +0200, Vincent Guittot wrote:
On 4 June 2013 13:19, Frederic Weisbecker fweisbec@gmail.com wrote:
On Tue, Jun 04, 2013 at 01:11:47PM +0200, Vincent Guittot wrote:
On 4 June 2013 12:26, Frederic Weisbecker fweisbec@gmail.com wrote:
On Tue, Jun 04, 2013 at 11:36:11AM +0200, Peter Zijlstra wrote:
The best I can seem to come up with is something like the below; but I think its ghastly. Surely we can do something saner with that bit.
Having to clear it at 3 different places is just wrong.
We could clear the flag early in scheduler_ipi() and set some specific value in rq->idle_balance that tells we want nohz idle balancing from the softirq, something like this untested:
I'm not sure that we can have less than 2 places to clear it: cancel place or acknowledge place otherwise we can face a situation where idle load balance will be triggered 2 consecutive times because NOHZ_BALANCE_KICK will be cleared before the idle load balance has been done and had a chance to migrate tasks.
I guess it depends what is the minimum value of rq->next_balance, it seems to be large enough to avoid this kind of incident. Although I don't know well the whole logic with rq->next_balance and ilb trigger so I must defer to you.
In the trace that was showing the issue, i can see that both CPU0 and CPU1 were trying to trig ILB almost simultaneously and the test_and_set NOHZ_BALANCE_KICK filters one request so i would say that clearing the bit before the end of the idle load balance sequence can generate such sequence
I see.
In the sequence below, i have minimized the clear of NOHZ_BALANCE_KICK in 2 places : acknowledge and cancel. I have reused part of the proposal from peter which clears the bit if the condition doesn't match but i have reordered the tests to done that only if all other condition are matching
static inline bool got_nohz_idle_kick(void) {
- int cpu = smp_processor_id();
- return idle_cpu(cpu) && test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
- bool nohz_kick = test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
if (!nohz_kick)
return false;
if (idle_cpu(cpu) && !need_resched())
return true;
clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
return false;
}
#else /* CONFIG_NO_HZ_COMMON */ @@ -1393,8 +1401,9 @@ static void sched_ttwu_pending(void)
void scheduler_ipi(void) {
- if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick()
- && !tick_nohz_full_cpu(smp_processor_id()))
- if (llist_empty(&this_rq()->wake_list)
- && !tick_nohz_full_cpu(smp_processor_id())
- && !got_nohz_idle_kick()) return;
But we still need got_nohz_idle_kick() to be the first check, don't we? Otherwise if we run an "idle -> quick task slice -> idle" sequence we may keep the flag but lose the notifying IPI in between.
On 4 June 2013 16:44, Frederic Weisbecker fweisbec@gmail.com wrote:
On Tue, Jun 04, 2013 at 01:48:47PM +0200, Vincent Guittot wrote:
On 4 June 2013 13:19, Frederic Weisbecker fweisbec@gmail.com wrote:
On Tue, Jun 04, 2013 at 01:11:47PM +0200, Vincent Guittot wrote:
On 4 June 2013 12:26, Frederic Weisbecker fweisbec@gmail.com wrote:
On Tue, Jun 04, 2013 at 11:36:11AM +0200, Peter Zijlstra wrote:
The best I can seem to come up with is something like the below; but I think its ghastly. Surely we can do something saner with that bit.
Having to clear it at 3 different places is just wrong.
We could clear the flag early in scheduler_ipi() and set some specific value in rq->idle_balance that tells we want nohz idle balancing from the softirq, something like this untested:
I'm not sure that we can have less than 2 places to clear it: cancel place or acknowledge place otherwise we can face a situation where idle load balance will be triggered 2 consecutive times because NOHZ_BALANCE_KICK will be cleared before the idle load balance has been done and had a chance to migrate tasks.
I guess it depends what is the minimum value of rq->next_balance, it seems to be large enough to avoid this kind of incident. Although I don't know well the whole logic with rq->next_balance and ilb trigger so I must defer to you.
In the trace that was showing the issue, i can see that both CPU0 and CPU1 were trying to trig ILB almost simultaneously and the test_and_set NOHZ_BALANCE_KICK filters one request so i would say that clearing the bit before the end of the idle load balance sequence can generate such sequence
I see.
In the sequence below, i have minimized the clear of NOHZ_BALANCE_KICK in 2 places : acknowledge and cancel. I have reused part of the proposal from peter which clears the bit if the condition doesn't match but i have reordered the tests to done that only if all other condition are matching
static inline bool got_nohz_idle_kick(void) {
- int cpu = smp_processor_id();
- return idle_cpu(cpu) && test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
- bool nohz_kick = test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
if (!nohz_kick)
return false;
if (idle_cpu(cpu) && !need_resched())
return true;
clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
return false;
}
#else /* CONFIG_NO_HZ_COMMON */ @@ -1393,8 +1401,9 @@ static void sched_ttwu_pending(void)
void scheduler_ipi(void) {
- if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick()
- && !tick_nohz_full_cpu(smp_processor_id()))
- if (llist_empty(&this_rq()->wake_list)
- && !tick_nohz_full_cpu(smp_processor_id())
- && !got_nohz_idle_kick()) return;
But we still need got_nohz_idle_kick() to be the first check, don't we? Otherwise if we run an "idle -> quick task slice -> idle" sequence we may keep the flag but lose the notifying IPI in between.
I'm not sure to catch the sequence you are describing above: "idle -> quick task slice -> idle". In addition, got_nohz_idle_kick must be the last tested condition (in my proposal) in order to clear NOHZ_BALANCE_KICK only if we are sure that we are going to return without possibility to trig the Idle load balance
Vincent
On Tue, Jun 04, 2013 at 05:29:39PM +0200, Vincent Guittot wrote:
On 4 June 2013 16:44, Frederic Weisbecker fweisbec@gmail.com wrote:
On Tue, Jun 04, 2013 at 01:48:47PM +0200, Vincent Guittot wrote:
On 4 June 2013 13:19, Frederic Weisbecker fweisbec@gmail.com wrote:
On Tue, Jun 04, 2013 at 01:11:47PM +0200, Vincent Guittot wrote:
On 4 June 2013 12:26, Frederic Weisbecker fweisbec@gmail.com wrote:
On Tue, Jun 04, 2013 at 11:36:11AM +0200, Peter Zijlstra wrote: > > The best I can seem to come up with is something like the below; but I think > its ghastly. Surely we can do something saner with that bit. > > Having to clear it at 3 different places is just wrong.
We could clear the flag early in scheduler_ipi() and set some specific value in rq->idle_balance that tells we want nohz idle balancing from the softirq, something like this untested:
I'm not sure that we can have less than 2 places to clear it: cancel place or acknowledge place otherwise we can face a situation where idle load balance will be triggered 2 consecutive times because NOHZ_BALANCE_KICK will be cleared before the idle load balance has been done and had a chance to migrate tasks.
I guess it depends what is the minimum value of rq->next_balance, it seems to be large enough to avoid this kind of incident. Although I don't know well the whole logic with rq->next_balance and ilb trigger so I must defer to you.
In the trace that was showing the issue, i can see that both CPU0 and CPU1 were trying to trig ILB almost simultaneously and the test_and_set NOHZ_BALANCE_KICK filters one request so i would say that clearing the bit before the end of the idle load balance sequence can generate such sequence
I see.
In the sequence below, i have minimized the clear of NOHZ_BALANCE_KICK in 2 places : acknowledge and cancel. I have reused part of the proposal from peter which clears the bit if the condition doesn't match but i have reordered the tests to done that only if all other condition are matching
static inline bool got_nohz_idle_kick(void) {
- int cpu = smp_processor_id();
- return idle_cpu(cpu) && test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
- bool nohz_kick = test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
if (!nohz_kick)
return false;
if (idle_cpu(cpu) && !need_resched())
return true;
clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
return false;
}
#else /* CONFIG_NO_HZ_COMMON */ @@ -1393,8 +1401,9 @@ static void sched_ttwu_pending(void)
void scheduler_ipi(void) {
- if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick()
- && !tick_nohz_full_cpu(smp_processor_id()))
- if (llist_empty(&this_rq()->wake_list)
- && !tick_nohz_full_cpu(smp_processor_id())
- && !got_nohz_idle_kick()) return;
But we still need got_nohz_idle_kick() to be the first check, don't we? Otherwise if we run an "idle -> quick task slice -> idle" sequence we may keep the flag but lose the notifying IPI in between.
I'm not sure to catch the sequence you are describing above: "idle -> quick task slice -> idle". In addition, got_nohz_idle_kick must be the last tested condition (in my proposal) in order to clear NOHZ_BALANCE_KICK only if we are sure that we are going to return without possibility to trig the Idle load balance
Right, sorry for the confusion.
On Tue, Jun 04, 2013 at 12:26:22PM +0200, Frederic Weisbecker wrote:
On Tue, Jun 04, 2013 at 11:36:11AM +0200, Peter Zijlstra wrote:
The best I can seem to come up with is something like the below; but I think its ghastly. Surely we can do something saner with that bit.
Having to clear it at 3 different places is just wrong.
We could clear the flag early in scheduler_ipi() and set some specific value in rq->idle_balance that tells we want nohz idle balancing from the softirq, something like this untested:
Yeah, I suppose something like that is a little better.. a few nits though:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 58453b8..330136b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -630,15 +630,14 @@ void wake_up_nohz_cpu(int cpu) wake_up_idle_cpu(cpu); } -static inline bool got_nohz_idle_kick(void) +static inline bool got_nohz_idle_kick(int cpu) {
- int cpu = smp_processor_id();
- return idle_cpu(cpu) && test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
- return test_and_clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
} #else /* CONFIG_NO_HZ_COMMON */ -static inline bool got_nohz_idle_kick(void) +static inline bool got_nohz_idle_kick(int cpu) { return false; } @@ -1393,8 +1392,12 @@ static void sched_ttwu_pending(void) void scheduler_ipi(void) {
- if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick()
&& !tick_nohz_full_cpu(smp_processor_id()))
- int cpu = smp_processor_id();
- bool idle_kick = got_nohz_idle_kick(cpu);
This puts an unconditional atomic instruction in the IPI path. if (test) clear(); is lots cheaper, esp. since most IPIs won't have this flag set.
- if (!(idle_kick && idle_cpu(cpu))
&& llist_empty(&this_rq()->wake_list)
&& !tick_nohz_full_cpu(cpu)
What's with this weird operator first split style?
return;
/*
+enum idle_balance_type {
- IDLE_BALANCE = 1,
- IDLE_NOHZ_BALANCE = 2,
+};
You might want to update the rq->idle_balance assignment in scheduler_tick() to make sure it uses the right value (it does now, but there's nothing stopping people from changing the values).
On Tue, Jun 04, 2013 at 01:15:10PM +0200, Peter Zijlstra wrote:
On Tue, Jun 04, 2013 at 12:26:22PM +0200, Frederic Weisbecker wrote:
@@ -1393,8 +1392,12 @@ static void sched_ttwu_pending(void) void scheduler_ipi(void) {
- if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick()
&& !tick_nohz_full_cpu(smp_processor_id()))
- int cpu = smp_processor_id();
- bool idle_kick = got_nohz_idle_kick(cpu);
This puts an unconditional atomic instruction in the IPI path. if (test) clear(); is lots cheaper, esp. since most IPIs won't have this flag set.
Agreed but I'm a bit worried about ordering:
CPU 0 CPU 1
test_and_set_bit(nohz_kick, CPU 1) scheduler_ipi smp_send_reschedule(CPU 1) if (test_and_clear_bit(nohz_kick)) do_something
I'm not sure what base guarantee we have with ordering against raw IPIs such as the the scheduler ipi. But unless both IPI trigger and IPI receive imply a full barrier (or just IPI receive implies read barrier, it seems that's all we need), we need test_and_set_bit() or smp_rmb()/smp_mb__before_clear_bit() && smp_mb__after_clear_bit().
- if (!(idle_kick && idle_cpu(cpu))
&& llist_empty(&this_rq()->wake_list)
&& !tick_nohz_full_cpu(cpu)
What's with this weird operator first split style?
Yeah ugly, I'll fix.
return;
/*
+enum idle_balance_type {
- IDLE_BALANCE = 1,
- IDLE_NOHZ_BALANCE = 2,
+};
You might want to update the rq->idle_balance assignment in scheduler_tick() to make sure it uses the right value (it does now, but there's nothing stopping people from changing the values).
Agreed!
Thanks.
On Tue, Jun 04, 2013 at 10:21:06AM +0200, Vincent Guittot wrote:
On 4 June 2013 00:48, Frederic Weisbecker fweisbec@gmail.com wrote:
On Thu, May 30, 2013 at 05:23:05PM +0200, Vincent Guittot wrote:
I have faced a sequence where the Idle Load Balance was sometime not triggered for a while on my platform.
CPU 0 and CPU 1 are running tasks and CPU 2 is idle
CPU 1 kicks the Idle Load Balance CPU 1 selects CPU 2 as the new Idle Load Balancer CPU 1 sets NOHZ_BALANCE_KICK for CPU 2 CPU 1 sends a reschedule IPI to CPU 2 While CPU 2 wakes up, CPU 0 or CPU 1 migrates a waking task A on CPU 2 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance Task A quickly goes back to sleep (before a tick occurs on CPU 2) CPU 2 goes back to idle with NOHZ_BALANCE_KICK set
Whenever CPU 2 will be selected for the ILB, reschedule IPI will be not sent to CPU2, which is idle, because NOHZ_BALANCE_KICK is already set and no Idle Load Balance will be performed.
We must wait for the sched softirq to be raised on CPU 2 thanks to another part of the kernel to clear NOHZ_BALANCE_KICKand come back to a normal situation.
The proposed solution clears NOHZ_BALANCE_KICK in schedule_ipi if we can't raise the sched_softirq for the Idle Load Balance.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 58453b8..51fc715 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1420,7 +1420,8 @@ void scheduler_ipi(void) if (unlikely(got_nohz_idle_kick() && !need_resched())) { this_rq()->idle_balance = 1; raise_softirq_irqoff(SCHED_SOFTIRQ);
}
} else
clear_bit(NOHZ_BALANCE_KICK, nohz_flags(smp_processor_id()));
But then do we reach this if the IPI happens while running the non-idle task in CPU 2? The first got_nohz_idle_kick() test would drop us out early from scheduler_ipi() due to the idle_cpu() test. So the flag doesn't get cleared in this case.
The 1st point is that only idle cpu can be selected for idle load balance. But this doesn't prevent the cpu to wake up while it is kicked for idle load balance.
Yep.
I had added the clear_bit for the 1st got_nohz_idle_kick in the draft version of this patch but the test of the emptiness of the wake_list, the call to smp_send_reschedule in the various way to wake up the idle cpu and the results of the tests have convinced me (may be wrongly) that it was not necessary.
Hmm, if the CPU is idle, get selected as an ilb, but then the CPU schedules a non-idle task and receive the IPI in this non-idle context then finally it goes back to idle for a long time. It can stay idle without ever been notified with this NOHZ_BALANCE_KICK flag set.
But I can be missing something that clears the flag somewhere in that scenario. In any case it's not obvious.
linaro-kernel@lists.linaro.org