On 4 June 2013 16:44, Frederic Weisbecker fweisbec@gmail.com wrote:
On Tue, Jun 04, 2013 at 01:48:47PM +0200, Vincent Guittot wrote:
On 4 June 2013 13:19, Frederic Weisbecker fweisbec@gmail.com wrote:
On Tue, Jun 04, 2013 at 01:11:47PM +0200, Vincent Guittot wrote:
On 4 June 2013 12:26, Frederic Weisbecker fweisbec@gmail.com wrote:
On Tue, Jun 04, 2013 at 11:36:11AM +0200, Peter Zijlstra wrote:
The best I can seem to come up with is something like the below; but I think it's ghastly. Surely we can do something saner with that bit.
Having to clear it at 3 different places is just wrong.
We could clear the flag early in scheduler_ipi() and set some specific value in rq->idle_balance that tells us we want nohz idle balancing from the softirq, something like this untested:
I'm not sure we can have fewer than 2 places to clear it: the cancel place and the acknowledge place. Otherwise we can face a situation where idle load balance is triggered 2 consecutive times, because NOHZ_BALANCE_KICK would be cleared before the idle load balance has been done and has had a chance to migrate tasks.
I guess it depends on the minimum value of rq->next_balance; it seems to be large enough to avoid this kind of incident. Although I don't know the whole logic around rq->next_balance and the ilb trigger well, so I must defer to you.
In the trace that was showing the issue, I can see that both CPU0 and CPU1 were trying to trigger ILB almost simultaneously, and the test_and_set of NOHZ_BALANCE_KICK filters out one request. So I would say that clearing the bit before the end of the idle load balance sequence can generate such a sequence.
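The filtering effect described above can be illustrated in user space. This is a minimal sketch using C11 atomics, not the kernel code: `ilb_flags`, `KICK_BIT`, `try_kick_ilb()` and `ack_kick()` are hypothetical names standing in for the nohz flags word, NOHZ_BALANCE_KICK, the test_and_set on the kicker side and the clear on the acknowledge side.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for NOHZ_BALANCE_KICK in the target CPU's flags. */
#define KICK_BIT 1UL

/* Hypothetical stand-in for the idle load balancer CPU's nohz flags word. */
static atomic_ulong ilb_flags;

/*
 * Kicker side: atomically set the bit and look at the previous value.
 * Returns true if this caller should send the IPI, false if a kick was
 * already pending and the request is filtered out.
 */
static bool try_kick_ilb(void)
{
	unsigned long old = atomic_fetch_or(&ilb_flags, KICK_BIT);

	return !(old & KICK_BIT);
}

/* Acknowledge (or cancel) side: clear the pending kick. */
static void ack_kick(void)
{
	atomic_fetch_and(&ilb_flags, ~KICK_BIT);
}
```

While the bit stays set, a second `try_kick_ilb()` returns false, which models CPU1's request being filtered. If `ack_kick()` runs before the balance has finished, the next `try_kick_ilb()` succeeds again, which is the back-to-back trigger the paragraph above is concerned about.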
I see.
In the sequence below, I have reduced the clearing of NOHZ_BALANCE_KICK to 2 places: acknowledge and cancel. I have reused part of the proposal from Peter, which clears the bit if the condition doesn't match, but I have reordered the tests so that this is done only if all the other conditions match.
 static inline bool got_nohz_idle_kick(void)
 {
 	int cpu = smp_processor_id();
-	return idle_cpu(cpu) && test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
+	bool nohz_kick = test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
+	if (!nohz_kick)
+		return false;
+	if (idle_cpu(cpu) && !need_resched())
+		return true;
+	clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
+	return false;
 }
 #else /* CONFIG_NO_HZ_COMMON */
@@ -1393,8 +1401,9 @@ static void sched_ttwu_pending(void)
 void scheduler_ipi(void)
 {
-	if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick()
-	    && !tick_nohz_full_cpu(smp_processor_id()))
+	if (llist_empty(&this_rq()->wake_list)
+	    && !tick_nohz_full_cpu(smp_processor_id())
+	    && !got_nohz_idle_kick())
 		return;
But we still need got_nohz_idle_kick() to be the first check, don't we? Otherwise if we run an "idle -> quick task slice -> idle" sequence we may keep the flag but lose the notifying IPI in between.
I'm not sure I follow the sequence you are describing above: "idle -> quick task slice -> idle". In addition, got_nohz_idle_kick must be the last tested condition (in my proposal) in order to clear NOHZ_BALANCE_KICK only if we are sure that we are going to return without any possibility of triggering the idle load balance.
Vincent
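
The ordering argument in the last reply relies on C's short-circuit evaluation of `&&`: a check that has a side effect runs only if every check before it passed. This is a minimal user-space sketch of that point, not the kernel code. `struct cpu_model` and the `model_*` functions are hypothetical, with plain booleans standing in for the wake list, tick_nohz_full_cpu(), idle_cpu(), need_resched() and the NOHZ_BALANCE_KICK bit.

```c
#include <stdbool.h>

/* Hypothetical model of the state consulted by the IPI handler. */
struct cpu_model {
	bool wake_list_empty;	/* stands in for llist_empty(&rq->wake_list) */
	bool nohz_full;		/* stands in for tick_nohz_full_cpu() */
	bool idle;		/* stands in for idle_cpu() */
	bool need_resched;	/* stands in for need_resched() */
	bool kick_pending;	/* stands in for NOHZ_BALANCE_KICK */
};

/* Same shape as the proposed check: report the kick, or cancel it. */
static bool model_got_nohz_idle_kick(struct cpu_model *c)
{
	if (!c->kick_pending)
		return false;
	if (c->idle && !c->need_resched)
		return true;
	c->kick_pending = false;	/* cancel: the only side effect */
	return false;
}

/*
 * Returns true when the handler would return early. Because the kick
 * check is last, the cancel above can only run when the first two
 * conditions already hold.
 */
static bool model_ipi_returns_early(struct cpu_model *c)
{
	return c->wake_list_empty
	    && !c->nohz_full
	    && !model_got_nohz_idle_kick(c);
}
```

With a non-empty wake list the kick check is never evaluated, so a pending kick survives even on a busy CPU. With an empty wake list on a busy CPU the kick is cancelled and the handler returns early.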