Re: [PATCH] sched: Fix nr_uninterruptible race causing increasing load average

9 Jul 2021

On Thu, Jul 08, 2021 at 09:25:45AM -0400, Phil Auld wrote:
...
Hi Peter,
On Thu, Jul 08, 2021 at 09:26:26AM +0200 Peter Zijlstra wrote:
...
On Wed, Jul 07, 2021 at 03:04:57PM -0400, Phil Auld wrote:
...
On systems with weaker memory ordering (e.g. power) commit dbfb089d360b
("sched: Fix loadavg accounting race") causes increasing values of load
average (via rq->calc_load_active and calc_load_tasks) due to the wakeup
CPU not always seeing the write to task->sched_contributes_to_load in
__schedule(). When it misses that write, we fail to decrement
nr_uninterruptible when waking up a task which incremented
nr_uninterruptible when it slept.
The rq->lock serialization is insufficient across different rq->locks.
Add smp_wmb() to __schedule() and smp_rmb() before the read in
ttwu_do_activate().
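
For readers who want to poke at the pairing outside the kernel, here is a minimal userspace sketch of the write-barrier/read-barrier handoff described above. It is not kernel code. The names on_rq, contributes_to_load, sleeper(), waker() and run_handoff() are hypothetical stand-ins, and the C11 release and acquire fences are only the closest portable analogue of smp_wmb() and smp_rmb() (they are somewhat stronger). The model also assumes the flag is published via a later store to on_rq.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for the task_struct fields discussed in the thread. */
static atomic_int on_rq;               /* models p->on_rq */
static atomic_int contributes_to_load; /* models p->sched_contributes_to_load */

/* Models the __schedule() side: record the flag, then leave the runqueue. */
static void *sleeper(void *unused)
{
	(void)unused;
	atomic_store_explicit(&contributes_to_load, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release); /* stands in for smp_wmb() */
	atomic_store_explicit(&on_rq, 0, memory_order_relaxed);
	return NULL;
}

/* Models the wakeup side: wait for the dequeue, then read the flag. */
static void *waker(void *seen)
{
	while (atomic_load_explicit(&on_rq, memory_order_relaxed) != 0)
		; /* spin until the sleeper has cleared on_rq */
	atomic_thread_fence(memory_order_acquire); /* stands in for smp_rmb() */
	*(int *)seen = atomic_load_explicit(&contributes_to_load,
					    memory_order_relaxed);
	return NULL;
}

/* Runs one handoff; returns the flag value the waker saw, or -1 on error. */
int run_handoff(void)
{
	pthread_t s, w;
	int seen = -1;

	atomic_store(&on_rq, 1);
	atomic_store(&contributes_to_load, 0);
	if (pthread_create(&s, NULL, sleeper, NULL) != 0)
		return -1;
	if (pthread_create(&w, NULL, waker, &seen) != 0) {
		pthread_join(s, NULL);
		return -1;
	}
	pthread_join(s, NULL);
	pthread_join(w, NULL);
	assert(seen == 0 || seen == 1);
	return seen;
}
```

With both fences in place the waker is guaranteed to observe the flag once it has seen on_rq == 0. Dropping either fence removes that guarantee in the C11 model, although a strongly ordered machine may never show the failure in practice.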
...

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4ca80df205ce..ced7074716eb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2992,6 +2992,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 
 	lockdep_assert_held(&rq->lock);
 
+	/* Pairs with smp_wmb in __schedule() */
+	smp_rmb();
 	if (p->sched_contributes_to_load)
 		rq->nr_uninterruptible--;

Is this really needed?! (this question is a big fat clue the comment is
insufficient). AFAICT try_to_wake_up() has a LOAD-ACQUIRE on p->on_rq
and hence the read of p->sched_contributes_to_load must already happen
after it.

Yes, it is needed. We've got idle power systems with a load average of 530.21.
calc_load_tasks is 530, and the sum of both nr_uninterruptible and
calc_load_active across all the runqueues is 530. Basically a monotonically
non-decreasing load average. With the patch this no longer happens.

Have you tried without the rmb here? Do you really need both barriers?
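
To make the LOAD-ACQUIRE argument concrete, here is a minimal userspace sketch, not kernel code. It assumes the writer publishes with release ordering on the same variable the reader acquires. Whether the kernel's write side actually provides that is a separate question this model does not answer. The names task_on_rq, load_flag and run_acquire_handoff() are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int task_on_rq; /* models p->on_rq */
static int load_flag;         /* models p->sched_contributes_to_load (plain) */

/* Writer: set the flag, then publish it with a release store. */
static void *dequeue_side(void *unused)
{
	(void)unused;
	load_flag = 1;
	atomic_store_explicit(&task_on_rq, 0, memory_order_release);
	return NULL;
}

/* Reader: the acquire load orders the later plain read after it,
 * so no separate read barrier is needed on this side. */
static void *wakeup_side(void *seen)
{
	while (atomic_load_explicit(&task_on_rq, memory_order_acquire) != 0)
		; /* spin until the writer has cleared task_on_rq */
	*(int *)seen = load_flag;
	return NULL;
}

/* Runs one handoff; returns the flag value the reader saw, or -1 on error. */
int run_acquire_handoff(void)
{
	pthread_t wr, rd;
	int seen = -1;

	load_flag = 0;
	atomic_store(&task_on_rq, 1);
	if (pthread_create(&wr, NULL, dequeue_side, NULL) != 0)
		return -1;
	if (pthread_create(&rd, NULL, wakeup_side, &seen) != 0) {
		pthread_join(wr, NULL);
		return -1;
	}
	pthread_join(wr, NULL);
	pthread_join(rd, NULL);
	assert(seen == 0 || seen == 1);
	return seen;
}
```

In this model the reader needs nothing beyond the acquire load. The guarantee holds only because the writer's store to task_on_rq is a release that comes after the flag write.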

    
