On Fri, Apr 11, 2025 at 10:51:34AM -0400, Rik van Riel wrote:
> On Wed, 9 Apr 2025 17:27:03 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, Apr 09, 2025 at 10:29:43AM -0400, Rik van Riel wrote:
> > > Our trouble workload still makes the scheduler crash with this patch.
> > > I'll go put the debugging patch on our kernel.
> > > Should I try to get debugging data with this patch part of the mix, or with the debugging patch just on top of what's in 6.13 already?
> > Whatever is more convenient I suppose.
> > If you can dump the full tree that would be useful. Typically the se::{vruntime,weight} and cfs_rq::{zero_vruntime,avg_vruntime,avg_load} such that we can do full manual validation of the numbers.
> Here is a dump of the scheduler tree of the crashing CPU.
> Unfortunately the CPU crashed in pick_next_entity, and not in your debugging code. I'll add two more calls to avg_vruntime_validate(), one from avg_vruntime_update(), and one from __update_min_vruntime() when we skip the call to avg_vruntime_update(). The line numbers in the backtrace could be a clue.
> I have edited the cgroup names to make things more readable, but everything else is untouched.
Hmm, I didn't think you guys used the cgroup stuff.
Anyway, given cgroups, which group pick is the one that went boom? Also, what is curr (for that cgroup)?
curr lives outside of the tree, but is included in the eligibility consideration (when still on_rq and all that).
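As an aside, here is a minimal standalone sketch of that folding-in -- not kernel code, the keys and weights are made up (they loosely echo the scaled weights in the dump below) -- mirroring roughly what avg_vruntime() does when curr is still on_rq:

/* eligibility-sketch.c -- illustration only, not kernel code */
#include <stdio.h>

struct entity { long key; long weight; };       /* key = vruntime - min_vruntime */

int main(void)
{
        /* entities that are actually in the rbtree */
        struct entity tree[] = { { -500, 311 }, { 1000, 466 } };
        /* curr: running, still on_rq, but not in the tree */
        struct entity curr = { 200, 9 };
        long V = 0, W = 0;

        for (int i = 0; i < 2; i++) {
                V += tree[i].key * tree[i].weight;
                W += tree[i].weight;
        }

        /* fold curr back in, as avg_vruntime() does when curr->on_rq */
        V += curr.key * curr.weight;
        W += curr.weight;

        /* an entity is eligible iff it sits at or left of the 0-lag point,
         * i.e. key * W <= V (no division needed) */
        long key = 200;
        printf("V = %ld, W = %ld, eligible(key=%ld): %d\n",
               V, W, key, key * W <= V);
        return 0;
}

With curr folded into the sums like that, the eligibility test stays a simple multiply-and-compare against the running totals.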
> nr_running = 3
> min_vruntime = 107772371139014
> avg_vruntime = -1277161882867784752
> avg_load = 786
> tasks_timeline = [
>   { cgroup /A
>     weight = 10230 => 9
No vruntime, I'll assume !on_rq, but that makes avg_load above not match :/ So something is off here.
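(For reference, the avg_load arithmetic from the dump: /B and /C contribute 311 + 466 = 777, and 786 - 777 = 9 is exactly /A's scaled weight (10230/1024), so /A does still appear to be counted in avg_load even though no vruntime was printed for it.)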
>     rq = {
>       nr_running = 0
>       min_vruntime = 458975898004
>       avg_vruntime = 0
>       avg_load = 0
>       tasks_timeline = [ ]
>     }
>   },
>   { cgroup /B
>     vruntime = 18445226958208703357
>     weight = 319394 => 311
>     rq = {
>       nr_running = 2
>       min_vruntime = 27468255210769
>       avg_vruntime = 0
>       avg_load = 93
>       tasks_timeline = [
>         { cgroup /B/a
>           vruntime = 27468255210769
>           weight = 51569 => 50
>           rq = {
>             nr_running = 1
>             min_vruntime = 820449693961
>             avg_vruntime = 0
>             avg_load = 15
>             tasks_timeline = [
>               { task = 3653382 (fc0)
>                 vruntime = 820449693961
>                 weight = 15360 => 15
>               },
>             ]
>           }
>         },
>         { cgroup /B/b
>           vruntime = 27468255210769
>           weight = 44057 => 43
>           rq = {
>             nr_running = 1
>             min_vruntime = 563178567930
>             avg_vruntime = 0
>             avg_load = 15
>             tasks_timeline = [
>               { task = 3706454 (fc0)
>                 vruntime = 563178567930
>                 weight = 15360 => 15
>               },
>             ]
>           }
>         },
>       ]
>     }
>   },
>   { cgroup /C
>     vruntime = 18445539757376619550
>     weight = 477855 => 466
>     rq = {
>       nr_running = 0
>       min_vruntime = 17163581720739
>       avg_vruntime = 0
>       avg_load = 0
>       tasks_timeline = [ ]
>     }
>   },
> ]
So given the above, I've created the below files, and that gives:
$ ./vruntime < root.txt
 k: 0 w: 311 k*w: 0
 k: 312799167916193 w: 466 k*w: 145764412248945938
 v': 107772371139014 = v: 18445226958208703357 + d: 1624887871987273
 V': -1116773464285165183 = V: 145764412248945938 - d: 1624887871987273 * W: 777
min_vruntime: 107772371139014
avg_vruntime: -1116773464285165183
avg_load: 777
> One thing that stands out to me is how the vruntimes of the entities on the CPU's cfs_rq are really large negative numbers.
> vruntime = 18429030910682621789 equals 0xffc111f8d9ee675d
> I do not know how those se->vruntime numbers got to that point, but they look like a plausible cause of the overflow.
> I'll go comb through the se->vruntime updating code to see how those large numbers could end up as the vruntime for these sched entities.
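(For reference: interpreted as signed 64-bit values, the /B and /C vruntimes in the dump are -1517115500848259 and -1204316332932066, so min_vruntime at 107772371139014 sits roughly 1.3-1.6e15 ahead of them; the d: 1624887871987273 in the tool output above is exactly min_vruntime minus the signed /B vruntime.)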
As you can see from the output here, the large negative value is the result of min_vruntime being significantly ahead of the actual entities.
This can happen due to the monotonicity filter min_vruntime has -- it doesn't want to go backwards. The 0-lag point, on the other hand, can move backwards, seeing how it is the weighted average: inserting a task with positive lag places it left of the middle, which moves the middle left.
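To make that concrete, here is a tiny standalone demo -- made-up numbers, not kernel code, using the same keys-relative-to-min_vruntime bookkeeping as the tool below: min_vruntime only ratchets forward, while enqueueing an entity left of the current average pulls the weighted average, i.e. the 0-lag point, backwards past it.

/* zero-lag-demo.c -- illustration only */
#include <stdio.h>

int main(void)
{
        long min_vruntime = 1000;       /* monotonic: never moves backwards */

        /* two queued entities with keys 0 and 100 (relative to min_vruntime),
         * both weight 1 */
        long V = 0 * 1 + 100 * 1;
        long W = 2;
        printf("0-lag point: %ld\n", min_vruntime + V / W);     /* 1050 */

        /* enqueue an entity with positive lag: vruntime 600, i.e. key -400,
         * weight 1 -- it lands left of the average */
        V += -400 * 1;
        W += 1;
        printf("0-lag point: %ld\n", min_vruntime + V / W);     /* 900 */

        /* The 0-lag point moved from 1050 to 900, below min_vruntime, while
         * min_vruntime stayed at 1000; the relative sum V is now negative,
         * and with real weights and nanosecond-scale vruntime gaps such
         * offsets can grow very large. */
        printf("avg_vruntime (relative to min_vruntime): %ld\n", V);    /* -300 */
        return 0;
}

That relative sum is the quantity avg_vruntime tracks, which is why it can end up so far negative in the dump above.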
The zero_vruntime patch I gave earlier should avoid this particular issue.
$ ./vruntime < B.txt
 k: 0 w: 50 k*w: 0
 k: 0 w: 43 k*w: 0
 v': 27468255210769 = v: 27468255210769 + d: 0
 V': 0 = V: 0 - d: 0 * W: 93
min_vruntime: 27468255210769
avg_vruntime: 0
avg_load: 93
C, B/a and B/b are not really interesting; they're single-entry queues where min_vruntime == vruntime, and boring.
---8<---(root.txt)---8<---
entity 18445226958208703357 319394
entity 18445539757376619550 477855
group 107772371139014
---8<---(B.txt)---8<---
entity 27468255210769 51569
entity 27468255210769 44057
group 27468255210769
---8<---(vruntime.c)---8<---
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>

int main(int argc, char **argv)
{
        unsigned long V, W;     /* avg_vruntime and avg_load: sum of key*weight, sum of weights */
        unsigned long V0;       /* the reference point the keys are relative to */
        bool init = false;

        for (;;) {
                unsigned long vruntime, weight;
                char type[32];

                int r = scanf("%31s %lu %lu", type, &vruntime, &weight);
                if (r == EOF)
                        break;

                if (!strcmp(type, "entity")) {
                        if (!init) {
                                /* use the first entity's vruntime as the reference point */
                                V = W = 0;
                                V0 = vruntime;
                                init = true;
                        }

                        unsigned long k = vruntime - V0;        /* entity key, relative to V0 */
                        unsigned long w = weight / 1024;        /* scale_load_down() */

                        V += k * w;
                        W += w;

                        /* printed signed so wrapped values show up as negatives */
                        printf(" k: %ld w: %lu k*w: %ld\n", k, w, k*w);
                }

                if (!strcmp(type, "group")) {
                        /* group lines carry only the cfs_rq's min_vruntime */
                        unsigned long d = vruntime - V0;

                        printf(" v': %lu = v: %lu + d: %lu\n", V0 + d, V0, d);
                        printf(" V': %ld = V: %ld - d: %ld * W: %lu\n", V - d * W, V, d, W);

                        /* re-base onto min_vruntime, like avg_vruntime_update()
                         * does when min_vruntime moves */
                        V0 += d;
                        V -= d * W;
                }
        }

        printf("min_vruntime: %lu\n", V0);
        printf("avg_vruntime: %ld\n", V);
        printf("avg_load: %lu\n", W);

        return 0;
}