On systems with weaker memory ordering (e.g. Power) commit dbfb089d360b ("sched: Fix loadavg accounting race") causes increasing values of load average (via rq->calc_load_active and calc_load_tasks) because the wakeup CPU does not always see the write to task->sched_contributes_to_load made in __schedule(). Missing that write, we fail to decrement nr_uninterruptible when waking up a task which incremented nr_uninterruptible when it slept.
The rq->lock serialization is insufficient across different rq->locks.
Add an smp_wmb() to __schedule() and an smp_rmb() before the read in ttwu_do_activate().
Fixes: dbfb089d360b ("sched: Fix loadavg accounting race")
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Waiman Long <longman@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Phil Auld <pauld@redhat.com>
---
 kernel/sched/core.c | 7 +++++++
 1 file changed, 7 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4ca80df205ce..ced7074716eb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2992,6 +2992,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 
 	lockdep_assert_held(&rq->lock);
 
+	/* Pairs with smp_wmb in __schedule() */
+	smp_rmb();
 	if (p->sched_contributes_to_load)
 		rq->nr_uninterruptible--;
 
@@ -5084,6 +5086,11 @@ static void __sched notrace __schedule(bool preempt)
 				!(prev_state & TASK_NOLOAD) &&
 				!(prev->flags & PF_FROZEN);
 
+			/*
+			 * Make sure the previous write is ordered before p->on_rq etc so
+			 * that it is visible to other cpus in the wakeup path (ttwu_do_activate()).
+			 */
+			smp_wmb();
 			if (prev->sched_contributes_to_load)
 				rq->nr_uninterruptible++;
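For readers who want to see the pattern outside the scheduler: the ordering concern above can be modeled in plain userspace C11 (purely illustrative, not kernel code; the variable names just stand in for p->sched_contributes_to_load and p->on_rq). The fences mark where the patch places smp_wmb() and smp_rmb(); with them removed, a weakly ordered machine such as Power is allowed to let the waker observe the on_rq store while still reading a stale contributes_to_load value.

/*
 * Illustrative userspace model of the pattern above -- NOT kernel code.
 * sleeper() plays the __schedule() side: record that the sleeper
 * contributes to load, then publish the sleep via on_rq = 0.  waker()
 * plays the wakeup side: observe on_rq == 0, then read the flag.  The
 * two fences sit where the patch adds smp_wmb() and smp_rmb(); without
 * them, a weakly ordered CPU may let the waker see on_rq == 0 and still
 * read contributes_to_load == 0, i.e. the missed decrement.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int contributes_to_load;	/* stands in for p->sched_contributes_to_load */
static atomic_int on_rq = 1;		/* stands in for p->on_rq */

static void *sleeper(void *arg)
{
	atomic_store_explicit(&contributes_to_load, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* where the patch puts smp_wmb() */
	atomic_store_explicit(&on_rq, 0, memory_order_relaxed);
	return NULL;
}

static void *waker(void *arg)
{
	if (atomic_load_explicit(&on_rq, memory_order_relaxed) == 0) {
		atomic_thread_fence(memory_order_acquire);	/* where the patch puts smp_rmb() */
		printf("contributes_to_load = %d\n",
		       atomic_load_explicit(&contributes_to_load, memory_order_relaxed));
	}
	return NULL;
}

int main(void)
{
	pthread_t s, w;

	pthread_create(&s, NULL, sleeper, NULL);
	pthread_create(&w, NULL, waker, NULL);
	pthread_join(s, NULL);
	pthread_join(w, NULL);
	return 0;
}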
On Wed, Jul 07, 2021 at 03:04:57PM -0400, Phil Auld wrote:
@@ -2992,6 +2992,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 	lockdep_assert_held(&rq->lock);
 
+	/* Pairs with smp_wmb in __schedule() */
+	smp_rmb();
 	if (p->sched_contributes_to_load)
 		rq->nr_uninterruptible--;
Is this really needed?! (this question is a big fat clue the comment is insufficient). AFAICT try_to_wake_up() has a LOAD-ACQUIRE on p->on_rq and hence the p->sched_contributes_to_load read must already happen after it.
@@ -5084,6 +5086,11 @@ static void __sched notrace __schedule(bool preempt)
 				!(prev_state & TASK_NOLOAD) &&
 				!(prev->flags & PF_FROZEN);
 
+			/*
+			 * Make sure the previous write is ordered before p->on_rq etc so
+			 * that it is visible to other cpus in the wakeup path (ttwu_do_activate()).
+			 */
+			smp_wmb();
 			if (prev->sched_contributes_to_load)
 				rq->nr_uninterruptible++;
That comment is terrible, look at all the other barrier comments around there for clues; in effect you're worrying about:
  p->sched_contributes_to_load = X        R1 = p->on_rq
  WMB                                     RMB
  p->on_rq = Y                            R2 = p->sched_contributes_to_load
Right?
Bah bah bah.. I so detest having to add barriers here for silly accounting. Let me think about this a little.
On Thu, Jul 08, 2021 at 09:26:26AM +0200, Peter Zijlstra wrote:
I got the below:
  __schedule()                            ttwu()

  rq_lock()                               raw_spin_lock(&p->pi_lock)
  smp_mb__after_spinlock();               smp_mb__after_spinlock();

  p->sched_contributes_to_load = X;       if (READ_ONCE(p->on_rq) && ...)
                                            goto unlock;
                                          smp_acquire__after_ctrl_dep();

                                          smp_cond_load_acquire(&p->on_cpu, !VAL)

  deactivate_task()
    p->on_rq = 0;

  context_switch()
    finish_task_switch()
      finish_task()
        smp_store_release(p->on_cpu, 0);

                                          ttwu_queue()
                                            rq_lock()
                                            ttwu_do_activate()
                                              if (p->sched_contributes_to_load)
                                                ...
                                            rq_unlock()
                                          raw_spin_unlock(&p->pi_lock);
      finish_lock_switch()
        rq_unlock();
The only way for ttwu() to end up in an enqueue, is if it did a LOAD-ACQUIRE on ->on_cpu, but that orders with the STORE-RELEASE on the same, which ensures the p->sched_contributes_to_load LOAD must happen after the STORE.
What am I missing? Your Changelog/comments provide insufficient clues..
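To make that last step concrete, here is a minimal userspace C11 sketch of the release/acquire pairing in the diagram (again not kernel code; on_cpu and contributes_to_load stand in for p->on_cpu and p->sched_contributes_to_load). Once the waker's acquire load has observed on_cpu == 0, every write the previous CPU did before the release store, including the plain write to the flag, must be visible:

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int contributes_to_load;		/* plain (non-atomic) store, like the bitfield */
static atomic_int on_cpu = 1;		/* stands in for p->on_cpu */

/* __schedule() / finish_task() side */
static void *prev_cpu(void *arg)
{
	contributes_to_load = 1;			/* plain write in __schedule()      */
	atomic_store_explicit(&on_cpu, 0,		/* smp_store_release(&p->on_cpu, 0) */
			      memory_order_release);
	return NULL;
}

/* try_to_wake_up() side */
static void *waker_cpu(void *arg)
{
	/* like smp_cond_load_acquire(&p->on_cpu, !VAL): spin until 0, then acquire */
	while (atomic_load_explicit(&on_cpu, memory_order_acquire))
		;
	/* the acquire load pairs with the release store above, so this must hold */
	assert(contributes_to_load == 1);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, prev_cpu, NULL);
	pthread_create(&b, NULL, waker_cpu, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}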
On Thu, Jul 08, 2021 at 09:48:03AM +0200, Peter Zijlstra wrote:
On Thu, Jul 08, 2021 at 09:26:26AM +0200, Peter Zijlstra wrote:
The only way for ttwu() to end up in an enqueue, is if it did a LOAD-ACQUIRE on ->on_cpu,
That's not completely true; there's the WF_ON_CPU case, but in that scenario we IPI the CPU doing __schedule and it becomes simple UP/PO and everything must trivially work.
but that orders with the STORE-RELEASE on the same, which ensures the p->sched_contributes_to_load LOAD must happen after the STORE.
What am I missing? Your Changelog/comments provide insufficient clues..
On Thu, Jul 08, 2021 at 09:54:06AM +0200 Peter Zijlstra wrote:
What am I missing? Your Changelog/comments provide insufficient clues..
Sorry... I don't have a nice diagram. I'm still looking at what all those macros actually mean on the various architectures.
"Works great in practice but how does it work in theory?" :)
Using what you have above I get the same thing. It looks like it should be ordered but in practice it's not, and ordering it "more" as I did in the patch, fixes it.
Is it possible that the bit field is causing some of the assumptions about ordering in those various macros to be off?
I notice in all the comments about smp_mb__after_spinlock etc, it's always WRITE_ONCE/READ_ONCE on the variables in question but we can't do that with the bit field.
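For what it's worth, the reason the *_ONCE() macros can't be used on sched_contributes_to_load can be shown with a few lines of plain userspace C (illustrative only, not the kernel's task_struct): a one-bit bitfield has no address of its own, and a store to it is in general a read-modify-write of the whole containing word, so a neighbouring bit written without the same lock can be lost.

#include <stdio.h>

struct sched_flags {
	unsigned contributes_to_load:1;	/* like p->sched_contributes_to_load */
	unsigned other_bit:1;		/* some neighbouring bit in the same word */
};

int main(void)
{
	struct sched_flags f = { 0, 0 };

	/*
	 * This store typically compiles to: load the word, set bit 0,
	 * store the word back.  A concurrent unsynchronized store to
	 * f.other_bit on another CPU does the same read-modify-write
	 * and can overwrite this bit.
	 */
	f.contributes_to_load = 1;

	/*
	 * And this is why WRITE_ONCE()/READ_ONCE() don't apply: a
	 * bitfield is not addressable, so e.g. &f.contributes_to_load
	 * does not even compile.
	 */
	printf("%u %u\n", f.contributes_to_load, f.other_bit);
	return 0;
}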
I appreciate your time on this.
Cheers, Phil
On Thu, Jul 08, 2021 at 10:54:58AM -0400, Phil Auld wrote:
Sorry... I don't have a nice diagram. I'm still looking at what all those macros actually mean on the various architectures.
Don't worry about other architectures, let's focus on Power, because that's the case where you can reproduce funnies. Now Power only has 2 barrier ops (not quite true, but close enough for all this):
- SYNC is the full barrier
- LWSYNC is a TSO-like barrier
Pretty much everything (LOAD-ACQUIRE, STORE-RELEASE, WMB, RMB) uses LWSYNC. Only MB results in SYNC.
Power is 'funny' because their spinlocks are weaker than everybody else's, but AFAICT that doesn't seem relevant here.
Using what you have above I get the same thing. It looks like it should be ordered but in practice it's not, and ordering it "more" as I did in the patch, fixes it.
And you're running Linus' tree, not some franken-kernel from RHT, right? As asked in that other email, can you try with just the WMB added? I really don't believe that RMB you added can make a difference.
Also, can you try with TTWU_QUEUE disabled (without any additional barriers added), that simplifies the wakeup path a lot.
Is it possible that the bit field is causing some of the assumptions about ordering in those various macros to be off?
*should* not matter...
prev->sched_contributes_to_load = X;
smp_store_release(&prev->on_cpu, 0);
	/* which on Power expands to roughly: */
	asm("LWSYNC" : : : "memory");
	WRITE_ONCE(prev->on_cpu, 0);
due to that memory clobber, the compiler must emit whatever stores are required for the bitfield prior to the LWSYNC.
I notice in all the comments about smp_mb__after_spinlock etc, it's always WRITE_ONCE/READ_ONCE on the variables in question but we can't do that with the bit field.
Yeah, but both ->on_rq and ->sched_contributes_to_load are 'normal' stores. That said, given that ttwu() does a READ_ONCE() on ->on_rq, we should match that with WRITE_ONCE()...
So I think we should do the below, but I don't believe it'll make a difference. Let me stare more.
---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca9a523c9a6c..da93551b298d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1973,12 +1973,12 @@ void activate_task(struct rq *rq, struct task_struct *p, int flags)
 {
 	enqueue_task(rq, p, flags);
 
-	p->on_rq = TASK_ON_RQ_QUEUED;
+	WRITE_ONCE(p->on_rq, TASK_ON_RQ_QUEUED);
 }
 
 void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
 {
-	p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
+	WRITE_ONCE(p->on_rq, (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING);
 
 	dequeue_task(rq, p, flags);
 }
@@ -5662,11 +5662,11 @@ static bool try_steal_cookie(int this, int that)
 		if (p->core_occupation > dst->idle->core_occupation)
 			goto next;
 
-		p->on_rq = TASK_ON_RQ_MIGRATING;
+		WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
 		deactivate_task(src, p, 0);
 		set_task_cpu(p, this);
 		activate_task(dst, p, 0);
-		p->on_rq = TASK_ON_RQ_QUEUED;
+		WRITE_ONCE(p->on_rq, TASK_ON_RQ_QUEUED);
 
 		resched_curr(dst);
Hi Peter,
On Fri, Jul 09, 2021 at 02:57:10PM +0200 Peter Zijlstra wrote:
Thanks.
And you're running Linus' tree, not some franken-kernel from RHT, right? As asked in that other email, can you try with just the WMB added? I really don't believe that RMB you added can make a difference.
So, no. Right now the reproducer is on the franken-kernel :(
As far as I can tell the relevant code paths (schedule, barriers, wakeup etc) are all current and the same. I traced through your diagram and it all matches exactly.
I have a suspicion that Linus's tree may hide it. I believe this is tickled by NFS I/O, which I _think_ is affected by the unbound workqueue changes that may make it less likely to do the wakeup on a different CPU. But that's just speculation.
The issue is that the systems under test here are in a partner's lab to which I have no direct access.
I will try to get an upstream build on there, if possible, as soon as I can.
Also, can you try with TTWU_QUEUE disabled (without any additional barriers added), that simplifies the wakeup path a lot.
Will do.
So I think we should do the below, but I don't believe it'll make a difference. Let me stare more.
I'm out of the office for the next week+ so don't stare too hard. I'll try to get the tests you asked for as soon as I get back in the (home) office.
I'm not sure the WRITE_ONCE patch will make a difference either, but will try it too.
Thanks again for the help. And sorry for the timing.
Cheers, Phil
Hi Peter,
On Thu, Jul 08, 2021 at 09:26:26AM +0200 Peter Zijlstra wrote:
Is this really needed?! (this question is a big fat clue the comment is insufficient). AFAICT try_to_wake_up() has a LOAD-ACQUIRE on p->on_rq and hence the p->sched_contributes_to_load read must already happen after it.
Yes, it is needed. We've got idle power systems with load average of 530.21. Calc_load_tasks is 530, and the sum of both nr_uninterruptible and calc_load_active across all the runqueues is 530. Basically monotonically non-decreasing load average. With the patch this no longer happens.
We need the sched_contributes_to_load write to "happen before" so that it's seen on the other CPU on wakeup.
The only way I can see that decrement being missed is if the write to sched_contributes_to_load is not being seen on the wakeup CPU.
Before the previous patch the _state condition was checked again on the wakeup CPU and that is ordered.
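A toy model of why this shows up as a monotonically non-decreasing load average (plain C, numbers made up, not the kernel's calc_load code): every uninterruptible sleep bumps nr_uninterruptible, and a wakeup whose CPU misses the flag never takes it back down, so the sum that gets folded into calc_load_tasks, and with it the idle load average, only ratchets upward.

#include <stdio.h>

int main(void)
{
	long nr_uninterruptible = 0;	/* sum over all runqueues */
	long missed = 0;

	for (int i = 0; i < 1000; i++) {
		/* task sleeps uninterruptibly: __schedule() increments */
		nr_uninterruptible++;

		/* wakeup: decrement only if the flag write was seen */
		if (i % 3 == 0)
			missed++;		/* flag not visible: leak one */
		else
			nr_uninterruptible--;	/* normal case */
	}

	/*
	 * With every task long since woken and the machine idle, this
	 * residue is what keeps feeding calc_load_tasks, so loadavg
	 * floors at roughly this value instead of decaying to 0.
	 */
	printf("leaked nr_uninterruptible = %ld (missed wakeups: %ld)\n",
	       nr_uninterruptible, missed);
	return 0;
}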
Thanks, Phil
On Thu, Jul 08, 2021 at 09:25:45AM -0400, Phil Auld wrote:
Yes, it is needed. We've got idle power systems with load average of 530.21. Calc_load_tasks is 530, and the sum of both nr_uninterruptible and calc_load_active across all the runqueues is 530. Basically monotonically non-decreasing load average. With the patch this no longer happens.
Have you tried without the rmb here? Do you really need both barriers?
On Fri, Jul 09, 2021 at 01:38:20PM +0200 Peter Zijlstra wrote:
Have you tried without the rmb here? Do you really need both barriers?
No and no. It was originally a READ_ONCE which didn't compile. I was trying to pair them up but did think that was overkill. I'll try that as soon as I get back to it.
Thanks, Phil
On Fri, Jul 09, 2021 at 01:38:20PM +0200 Peter Zijlstra wrote:
Have you tried without the rmb here? Do you really need both barriers?
You're right here. (I see now that you were asking about the rmb specifically in the first question.) The rmb is not needed.
I was unable to reproduce it with the upstream kernel. I still think it is a problem though since the code in question is all the same. The recent changes to unbound workqueues which make it more likely to run on the submitting CPU may be masking the problem since it obviously requires multiple CPUs to hit.
If I can isolate those changes I can try to revert them in upstream and see if I can get it there.
I suppose pulling those changes back could get us past this but I'm not a fan of just hiding it by making it harder to hit.
I've not gotten to the disable TTWU_QUEUE test, that's next...
Cheers, Phil
On Fri, Jul 23, 2021 at 09:38:44AM -0400 Phil Auld wrote:
ETOOMANYTREES
Sorry for the noise. I was using the wrong tree to compare with upstream. The offending tree is missing f97bb5272d9e ("sched: Fix data-race in wakeup"). I thought the loadavg increase sounded familiar...
Cheers, Phil