pick_eevdf() can return NULL, resulting in a NULL pointer dereference crash in pick_next_entity().
The other call site of pick_eevdf() can already handle a null pointer, and pick_next_entity() can already return null as well. Add an extra check to handle the null return here.
Cc: stable@vger.kernel.org
Fixes: f12e148892ed ("sched/fair: Prepare pick_next_task() for delayed dequeue")
Signed-off-by: Pat Cody <pat@patcody.io>
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a553181dc764..f2157298cbce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5560,6 +5560,8 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
 	}
 
 	struct sched_entity *se = pick_eevdf(cfs_rq);
+	if (!se)
+		return NULL;
 	if (se->sched_delayed) {
 		dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
 		/*
On 3/20/25 20:53, Pat Cody wrote:
pick_eevdf() can return null, resulting in a null pointer dereference crash in pick_next_entity()
The other call site of pick_eevdf() can already handle a null pointer, and pick_next_entity() can already return null as well. Add an extra check to handle the null return here.
Cc: stable@vger.kernel.org
Fixes: f12e148892ed ("sched/fair: Prepare pick_next_task() for delayed dequeue")
Signed-off-by: Pat Cody <pat@patcody.io>
Did this happen on mainline? Any chance it's reproducible?
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a553181dc764..f2157298cbce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5560,6 +5560,8 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
 	}
 
 	struct sched_entity *se = pick_eevdf(cfs_rq);
+	if (!se)
+		return NULL;
 	if (se->sched_delayed) {
 		dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
 		/*
On Thu, Mar 20, 2025 at 10:42:44PM +0000, Christian Loehle wrote:
Did this happen on mainline? Any chance it's reproducible?
We have the following back-ported on top of 6.13:
sched/fair: Adhere to place_entity() constraints
    (not upstreamed, see https://lore.kernel.org/all/20250207111141.GD7145@noisy.programming.kicks-as...)
a430d99e3490 sched/fair: Fix value reported by hot tasks pulled in /proc/schedstat
2a77e4be12cb sched/fair: Untangle NEXT_BUDDY and pick_next_task()
We don't have a repro, but the crash in pick_task_fair happens more often than MCE crashes with our limited deployment of 6.13.
On Thu, Mar 20, 2025 at 01:53:10PM -0700, Pat Cody wrote:
pick_eevdf() can return null, resulting in a null pointer dereference crash in pick_next_entity()
If it returns NULL while nr_queued, something is really badly wrong.
Your check will hide this badness.
Does something like:
https://lkml.kernel.org/r/20250128143949.GD7145@noisy.programming.kicks-ass....
help?
On Mon, Mar 24, 2025 at 12:56:13PM +0100, Peter Zijlstra wrote:
Does something like:
https://lkml.kernel.org/r/20250128143949.GD7145@noisy.programming.kicks-ass....
help?
To clarify: are you asking whether we've tried reverting 4423af84b297? We have not tried that yet.
Or whether we've included "sched/fair: Adhere to place_entity() constraints", which we have already done: https://lore.kernel.org/all/20250207-tunneling-tested-koel-c59d33@leitao/
On Tue, Mar 25, 2025 at 08:12:30AM -0700, Pat Cody wrote:
On Mon, Mar 24, 2025 at 12:56:13PM +0100, Peter Zijlstra wrote:
Does something like:
https://lkml.kernel.org/r/20250128143949.GD7145@noisy.programming.kicks-ass....
help?
To clarify: are you asking whether we've tried reverting 4423af84b297? We have not tried that yet.
Or whether we've included "sched/fair: Adhere to place_entity() constraints", which we have already done: https://lore.kernel.org/all/20250207-tunneling-tested-koel-c59d33@leitao/
This; it seems it got lost. I'll try and get it queued up.
On Tue, Mar 25, 2025 at 07:59:07PM +0100, Peter Zijlstra wrote:
To clarify: are you asking whether we've tried reverting 4423af84b297? We have not tried that yet.
Or whether we've included "sched/fair: Adhere to place_entity() constraints", which we have already done: https://lore.kernel.org/all/20250207-tunneling-tested-koel-c59d33@leitao/
This; it seems it got lost. I'll try and get it queued up.
Given that we've already included the patch in what we're running, and we're seeing this null pointer exception, any suggestions for how to proceed?
On Mon, 2025-03-24 at 12:56 +0100, Peter Zijlstra wrote:
On Thu, Mar 20, 2025 at 01:53:10PM -0700, Pat Cody wrote:
pick_eevdf() can return null, resulting in a null pointer dereference crash in pick_next_entity()
If it returns NULL while nr_queued, something is really badly wrong.
Your check will hide this badness.
Looking at the numbers, I suspect vruntime_eligible() is simply not allowing us to run the left-most entity in the rb tree.
At the root level we are seeing these numbers:
*(struct cfs_rq *)0xffff8882b3b80000 = {
	.load = (struct load_weight){
		.weight = (unsigned long)4750106,
		.inv_weight = (u32)0,
	},
	.nr_running = (unsigned int)3,
	.h_nr_running = (unsigned int)3,
	.idle_nr_running = (unsigned int)0,
	.idle_h_nr_running = (unsigned int)0,
	.h_nr_delayed = (unsigned int)0,
	.avg_vruntime = (s64)-2206158374744070955,
	.avg_load = (u64)4637,
	.min_vruntime = (u64)12547674988423219,
Meanwhile, the cfs_rq->curr entity has a weight of 4699124, a vruntime of 12071905127234526, and a vlag of -2826239998
The left node entity in the cfs_rq has a weight of 107666, a vruntime of 16048555717648580, and a vlag of -1338888
I cannot for the life of me figure out how the avg_vruntime number is so out of whack from what the vruntime numbers of the sched entities on the runqueue look like.
The avg_vruntime code is confusing me. On the one hand the vruntime number is multiplied by the sched entity weight when adding to or subtracting to avg_vruntime, but on the other hand vruntime_eligible scales the comparison by the cfs_rq->avg_load number.
What even protects the load number in vruntime_eligible from going negative in certain cases, when the current entity's entity_key is a negative value?
The latter is probably not the bug we're seeing now, but I don't understand how that is supposed to behave.
On Wed, Apr 02, 2025 at 10:59:09AM -0400, Rik van Riel wrote:
On Mon, 2025-03-24 at 12:56 +0100, Peter Zijlstra wrote:
On Thu, Mar 20, 2025 at 01:53:10PM -0700, Pat Cody wrote:
pick_eevdf() can return null, resulting in a null pointer dereference crash in pick_next_entity()
If it returns NULL while nr_queued, something is really badly wrong.
Your check will hide this badness.
Looking at the numbers, I suspect vruntime_eligible() is simply not allowing us to run the left-most entity in the rb tree.
At the root level we are seeing these numbers:
*(struct cfs_rq *)0xffff8882b3b80000 = {
	.load = (struct load_weight){
		.weight = (unsigned long)4750106,
		.inv_weight = (u32)0,
	},
	.nr_running = (unsigned int)3,
	.h_nr_running = (unsigned int)3,
	.idle_nr_running = (unsigned int)0,
	.idle_h_nr_running = (unsigned int)0,
	.h_nr_delayed = (unsigned int)0,
	.avg_vruntime = (s64)-2206158374744070955,
	.avg_load = (u64)4637,
	.min_vruntime = (u64)12547674988423219,
Meanwhile, the cfs_rq->curr entity has a weight of 4699124, a vruntime of 12071905127234526, and a vlag of -2826239998
The left node entity in the cfs_rq has a weight of 107666, a vruntime of 16048555717648580, and a vlag of -1338888
The thing that stands out is that min_vruntime is a lot smaller than the leftmost vruntime. This in turn leads to keys being large.
This is undesirable, as it can lead to overflow.
I cannot for the life of me figure out how the avg_vruntime number is so out of whack from what the vruntime numbers of the sched entities on the runqueue look like.
The avg_vruntime code is confusing me. On the one hand the vruntime number is multiplied by the sched entity weight when adding to or subtracting to avg_vruntime, but on the other hand vruntime_eligible scales the comparison by the cfs_rq->avg_load number.
What even protects the load number in vruntime_eligible from going negative in certain cases, when the current entity's entity_key is a negative value?
The latter is probably not the bug we're seeing now, but I don't understand how that is supposed to behave.
So there is this giant comment right above avg_vruntime_add() that tries to explain things.
Basically, from the constraint that the sum of lag is zero, you can infer that the 0-lag point is the weighted average of the individual vruntime, which is what we're trying to compute:
          \Sum w_i * v_i
    avg = --------------
             \Sum w_i
Now, since vruntime takes the whole u64 (worse, it wraps), this multiplication term in the numerator is not something we can compute; instead we do the min_vruntime (v0 henceforth) thing like:
v_i = (v_i - v0) + v0
(edit -- this does two things:
   - it keeps the key 'small';
   - it creates a relative 0-point in the modular space)
If you do that substitution and work it all out, you end up with:
          \Sum w_i * (v_i - v0)
    avg = --------------------- + v0
                 \Sum w_i
Since you cannot very well track a ratio like that (and not suffer terrible numerical problems) we simply track the numerator and denominator individually and only perform the division when strictly needed.
Notably, the numerator lives in cfs_rq->avg_vruntime and the denominator lives in cfs_rq->avg_load.
The one extra 'funny' is that these numbers track the entities in the tree, and current is typically outside of the tree, so avg_vruntime() adds current when needed before doing the division.
(vruntime_eligible() elides the division by cross-wise multiplication)
Anyway... we got s64 to track avg_vruntime, that sum over weighted keys, if you have a ton of entities and large keys and large weights (your case) you can overflow and things go to shit.
Anyway, seeing how your min_vruntime is weird, let me ask you to try the below; it removes the old min_vruntime and instead tracks zero_vruntime as the 'current' avg_vruntime. We don't need the monotonicity filter; all we really need is something 'near' all the other vruntimes in order to compute this relative key, so we can preserve order across the wrap.
This *should* get us near minimal sized keys. If you can still reproduce, you should probably add something like that patch I sent you privately earlier, which checks for the overflows.
The below builds and boots (provided SCHED_CORE=n).
---
 kernel/sched/debug.c |  8 ++---
 kernel/sched/fair.c  | 92 ++++++++--------------------------------------------
 kernel/sched/sched.h |  2 +-
 3 files changed, 19 insertions(+), 83 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 56ae54e0ce6a..5ca512d50e3a 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -807,7 +807,7 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
 
 void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 {
-	s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, left_deadline = -1, spread;
+	s64 left_vruntime = -1, zero_vruntime, right_vruntime = -1, left_deadline = -1, spread;
 	struct sched_entity *last, *first, *root;
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
@@ -830,15 +830,15 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	last = __pick_last_entity(cfs_rq);
 	if (last)
 		right_vruntime = last->vruntime;
-	min_vruntime = cfs_rq->min_vruntime;
+	zero_vruntime = cfs_rq->zero_vruntime;
 	raw_spin_rq_unlock_irqrestore(rq, flags);
 
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_deadline",
 			SPLIT_NS(left_deadline));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_vruntime",
 			SPLIT_NS(left_vruntime));
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
-			SPLIT_NS(min_vruntime));
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "zero_vruntime",
+			SPLIT_NS(zero_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
 			SPLIT_NS(avg_vruntime(cfs_rq)));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "right_vruntime",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e43993a4e580..17e43980f5a3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -526,24 +526,6 @@ void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
  * Scheduling class tree data structure manipulation methods:
  */
 
-static inline __maybe_unused u64 max_vruntime(u64 max_vruntime, u64 vruntime)
-{
-	s64 delta = (s64)(vruntime - max_vruntime);
-	if (delta > 0)
-		max_vruntime = vruntime;
-
-	return max_vruntime;
-}
-
-static inline __maybe_unused u64 min_vruntime(u64 min_vruntime, u64 vruntime)
-{
-	s64 delta = (s64)(vruntime - min_vruntime);
-	if (delta < 0)
-		min_vruntime = vruntime;
-
-	return min_vruntime;
-}
-
 static inline bool entity_before(const struct sched_entity *a,
 				 const struct sched_entity *b)
 {
@@ -556,7 +538,7 @@ static inline bool entity_before(const struct sched_entity *a,
 
 static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	return (s64)(se->vruntime - cfs_rq->min_vruntime);
+	return (s64)(se->vruntime - cfs_rq->zero_vruntime);
 }
 
 #define __node_2_se(node) \
@@ -608,13 +590,13 @@ static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
  *
  * Which we track using:
  *
- *                    v0 := cfs_rq->min_vruntime
+ *                    v0 := cfs_rq->zero_vruntime
  * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime
  *              \Sum w_i := cfs_rq->avg_load
  *
- * Since min_vruntime is a monotonic increasing variable that closely tracks
- * the per-task service, these deltas: (v_i - v), will be in the order of the
- * maximal (virtual) lag induced in the system due to quantisation.
+ * Since zero_vruntime closely tracks the per-task service, these
+ * deltas: (v_i - v), will be in the order of the maximal (virtual) lag
+ * induced in the system due to quantisation.
  *
  * Also, we use scale_load_down() to reduce the size.
  *
@@ -673,7 +655,7 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 		avg = div_s64(avg, load);
 	}
 
-	return cfs_rq->min_vruntime + avg;
+	return cfs_rq->zero_vruntime + avg;
 }
 
 /*
@@ -734,7 +716,7 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 		load += weight;
 	}
 
-	return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
+	return avg >= (s64)(vruntime - cfs_rq->zero_vruntime) * load;
 }
 
 int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -742,42 +724,14 @@ int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	return vruntime_eligible(cfs_rq, se->vruntime);
 }
 
-static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
+static void update_zero_vruntime(struct cfs_rq *cfs_rq)
 {
-	u64 min_vruntime = cfs_rq->min_vruntime;
-	/*
-	 * open coded max_vruntime() to allow updating avg_vruntime
-	 */
-	s64 delta = (s64)(vruntime - min_vruntime);
-	if (delta > 0) {
-		avg_vruntime_update(cfs_rq, delta);
-		min_vruntime = vruntime;
-	}
-	return min_vruntime;
-}
+	u64 vruntime = avg_vruntime(cfs_rq);
+	s64 delta = (s64)(vruntime - cfs_rq->zero_vruntime);
 
-static void update_min_vruntime(struct cfs_rq *cfs_rq)
-{
-	struct sched_entity *se = __pick_root_entity(cfs_rq);
-	struct sched_entity *curr = cfs_rq->curr;
-	u64 vruntime = cfs_rq->min_vruntime;
+	avg_vruntime_update(cfs_rq, delta);
 
-	if (curr) {
-		if (curr->on_rq)
-			vruntime = curr->vruntime;
-		else
-			curr = NULL;
-	}
-
-	if (se) {
-		if (!curr)
-			vruntime = se->min_vruntime;
-		else
-			vruntime = min_vruntime(vruntime, se->min_vruntime);
-	}
-
-	/* ensure we never gain time by being placed backwards. */
-	cfs_rq->min_vruntime = __update_min_vruntime(cfs_rq, vruntime);
+	cfs_rq->zero_vruntime = vruntime;
 }
 
 static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
@@ -850,6 +804,7 @@ RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity,
 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	avg_vruntime_add(cfs_rq, se);
+	update_zero_vruntime(cfs_rq);
 	se->min_vruntime = se->vruntime;
 	se->min_slice = se->slice;
 	rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
@@ -1239,7 +1194,6 @@ static void update_curr(struct cfs_rq *cfs_rq)
 
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
 	resched = update_deadline(cfs_rq, curr);
-	update_min_vruntime(cfs_rq);
 
 	if (entity_is_task(curr)) {
 		struct task_struct *p = task_of(curr);
@@ -3825,15 +3779,6 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		place_entity(cfs_rq, se, 0);
 		if (!curr)
 			__enqueue_entity(cfs_rq, se);
-
-		/*
-		 * The entity's vruntime has been adjusted, so let's check
-		 * whether the rq-wide min_vruntime needs updated too. Since
-		 * the calculations above require stable min_vruntime rather
-		 * than up-to-date one, we do the update at the end of the
-		 * reweight process.
-		 */
-		update_min_vruntime(cfs_rq);
 	}
 }
 
@@ -5511,15 +5456,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	update_cfs_group(se);
 
-	/*
-	 * Now advance min_vruntime if @se was the entity holding it back,
-	 * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
-	 * put back on, and if we advance min_vruntime, we'll be placed back
-	 * further than we started -- i.e. we'll be penalized.
-	 */
-	if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
-		update_min_vruntime(cfs_rq);
-
 	if (flags & DEQUEUE_DELAYED)
 		finish_delayed_dequeue_entity(se);
 
@@ -13312,7 +13248,7 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 void init_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->tasks_timeline = RB_ROOT_CACHED;
-	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
+	cfs_rq->zero_vruntime = (u64)(-(1LL << 20));
 #ifdef CONFIG_SMP
 	raw_spin_lock_init(&cfs_rq->removed.lock);
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 47972f34ea70..41b312f17f22 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -652,7 +652,7 @@ struct cfs_rq {
 	s64			avg_vruntime;
 	u64			avg_load;
 
-	u64			min_vruntime;
+	u64			zero_vruntime;
 #ifdef CONFIG_SCHED_CORE
 	unsigned int		forceidle_seq;
 	u64			min_vruntime_fi;
On Wed, 2025-04-02 at 20:07 +0200, Peter Zijlstra wrote:
Anyway, seeing how your min_vruntime is weird, let me ask you to try the below; it removes the old min_vruntime and instead tracks zero_vruntime as the 'current' avg_vruntime. We don't need the monotonicity filter; all we really need is something 'near' all the other vruntimes in order to compute this relative key, so we can preserve order across the wrap.
This *should* get us near minimal sized keys. If you can still reproduce, you should probably add something like that patch I sent you privately earlier, which checks for the overflows.
Our trouble workload still makes the scheduler crash with this patch.
I'll go put the debugging patch on our kernel.
Should I try to get debugging data with this patch part of the mix, or with the debugging patch just on top of what's in 6.13 already?
Digging through our kernel crash history, this particular crash seems to go back at least to 6.11. The crashes just happen much more frequently on 6.13 for some as-yet-unknown reason.
On Wed, Apr 09, 2025 at 10:29:43AM -0400, Rik van Riel wrote:
On Wed, 2025-04-02 at 20:07 +0200, Peter Zijlstra wrote:
Anyway, seeing how your min_vruntime is weird, let me ask you to try the below; it removes the old min_vruntime and instead tracks zero_vruntime as the 'current' avg_vruntime. We don't need the monotonicity filter; all we really need is something 'near' all the other vruntimes in order to compute this relative key, so we can preserve order across the wrap.
This *should* get us near minimal sized keys. If you can still reproduce, you should probably add something like that patch I sent you privately earlier, which checks for the overflows.
Our trouble workload still makes the scheduler crash with this patch.
I'll go put the debugging patch on our kernel.
Should I try to get debugging data with this patch part of the mix, or with the debugging patch just on top of what's in 6.13 already?
Whatever is more convenient I suppose.
If you can dump the full tree that would be useful. Typically the se::{vruntime,weight} and cfs_rq::{zero_vruntime,avg_vruntime,avg_load} such that we can do full manual validation of the numbers.
On Wed, 9 Apr 2025 17:27:03 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
On Wed, Apr 09, 2025 at 10:29:43AM -0400, Rik van Riel wrote:
Our trouble workload still makes the scheduler crash with this patch.
I'll go put the debugging patch on our kernel.
Should I try to get debugging data with this patch part of the mix, or with the debugging patch just on top of what's in 6.13 already?
Whatever is more convenient I suppose.
If you can dump the full tree that would be useful. Typically the se::{vruntime,weight} and cfs_rq::{zero_vruntime,avg_vruntime,avg_load} such that we can do full manual validation of the numbers.
Here is a dump of the scheduler tree of the crashing CPU.
Unfortunately the CPU crashed in pick_next_entity, and not in your debugging code. I'll add two more calls to avg_vruntime_validate(): one from avg_vruntime_update(), and one from __update_min_vruntime() when we skip the call to avg_vruntime_update(). The line numbers in the backtrace could be a clue.
I have edited the cgroup names to make things more readable, but everything else is untouched.
One thing that stands out to me is how the vruntime of each of the entities on the CPU's cfs_rq are really large negative numbers.
vruntime = 18429030910682621789 equals 0xffc111f8d9ee675d
I do not know how those se->vruntime numbers got to that point, but they are a suggestive cause of the overflow.
I'll go comb through the se->vruntime updating code to see how those large numbers could end up as the vruntime for these sched entities.
nr_running = 3
min_vruntime = 107772371139014
avg_vruntime = -1277161882867784752
avg_load = 786
tasks_timeline = [
  { cgroup /A
    weight = 10230 => 9
    rq = {
      nr_running = 0
      min_vruntime = 458975898004
      avg_vruntime = 0
      avg_load = 0
      tasks_timeline = [ ]
    }
  },
  { cgroup /B
    vruntime = 18445226958208703357
    weight = 319394 => 311
    rq = {
      nr_running = 2
      min_vruntime = 27468255210769
      avg_vruntime = 0
      avg_load = 93
      tasks_timeline = [
        { cgroup /B/a
          vruntime = 27468255210769
          weight = 51569 => 50
          rq = {
            nr_running = 1
            min_vruntime = 820449693961
            avg_vruntime = 0
            avg_load = 15
            tasks_timeline = [
              { task = 3653382 (fc0) vruntime = 820449693961 weight = 15360 => 15 },
            ]
          }
        },
        { cgroup /B/b
          vruntime = 27468255210769
          weight = 44057 => 43
          rq = {
            nr_running = 1
            min_vruntime = 563178567930
            avg_vruntime = 0
            avg_load = 15
            tasks_timeline = [
              { task = 3706454 (fc0) vruntime = 563178567930 weight = 15360 => 15 },
            ]
          }
        },
      ]
    }
  },
  { cgroup /C
    vruntime = 18445539757376619550
    weight = 477855 => 466
    rq = {
      nr_running = 0
      min_vruntime = 17163581720739
      avg_vruntime = 0
      avg_load = 0
      tasks_timeline = [ ]
    }
  },
]
On Fri, Apr 11, 2025 at 10:51:34AM -0400, Rik van Riel wrote:
On Wed, 9 Apr 2025 17:27:03 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
On Wed, Apr 09, 2025 at 10:29:43AM -0400, Rik van Riel wrote:
Our trouble workload still makes the scheduler crash with this patch.
I'll go put the debugging patch on our kernel.
Should I try to get debugging data with this patch part of the mix, or with the debugging patch just on top of what's in 6.13 already?
Whatever is more convenient I suppose.
If you can dump the full tree that would be useful. Typically the se::{vruntime,weight} and cfs_rq::{zero_vruntime,avg_vruntime,avg_load} such that we can do full manual validation of the numbers.
Here is a dump of the scheduler tree of the crashing CPU.
Unfortunately the CPU crashed in pick_next_entity, and not in your debugging code. I'll add two more calls to avg_vruntime_validate(): one from avg_vruntime_update(), and one from __update_min_vruntime() when we skip the call to avg_vruntime_update(). The line numbers in the backtrace could be a clue.
I have edited the cgroup names to make things more readable, but everything else is untouched.
Hmm, I didn't think you guys used the cgroup stuff.
Anyway, given cgroups, which group pick is the one that went boom? Also, what is curr (for that cgroup).
curr lives outside of the tree, but is included in the eligibility consideration (when still on_rq and all that).
nr_running = 3
min_vruntime = 107772371139014
avg_vruntime = -1277161882867784752
avg_load = 786
tasks_timeline = [
  { cgroup /A
    weight = 10230 => 9
No vruntime, I'll assume !on_rq, but that makes avg_load above not match :/ So something is off here.
    rq = {
      nr_running = 0
      min_vruntime = 458975898004
      avg_vruntime = 0
      avg_load = 0
      tasks_timeline = [ ]
    }
  },
  { cgroup /B
    vruntime = 18445226958208703357
    weight = 319394 => 311
    rq = {
      nr_running = 2
      min_vruntime = 27468255210769
      avg_vruntime = 0
      avg_load = 93
      tasks_timeline = [
        { cgroup /B/a
          vruntime = 27468255210769
          weight = 51569 => 50
          rq = {
            nr_running = 1
            min_vruntime = 820449693961
            avg_vruntime = 0
            avg_load = 15
            tasks_timeline = [
              { task = 3653382 (fc0) vruntime = 820449693961 weight = 15360 => 15 },
            ]
          }
        },
        { cgroup /B/b
          vruntime = 27468255210769
          weight = 44057 => 43
          rq = {
            nr_running = 1
            min_vruntime = 563178567930
            avg_vruntime = 0
            avg_load = 15
            tasks_timeline = [
              { task = 3706454 (fc0) vruntime = 563178567930 weight = 15360 => 15 },
            ]
          }
        },
      ]
    }
  },
  { cgroup /C
    vruntime = 18445539757376619550
    weight = 477855 => 466
    rq = {
      nr_running = 0
      min_vruntime = 17163581720739
      avg_vruntime = 0
      avg_load = 0
      tasks_timeline = [ ]
    }
  },
]
So given the above, I've created the below files, and that gives:
$ ./vruntime < root.txt
 k: 0 w: 311 k*w: 0
 k: 312799167916193 w: 466 k*w: 145764412248945938
 v': 107772371139014 = v: 18445226958208703357 + d: 1624887871987273
 V': -1116773464285165183 = V: 145764412248945938 - d: 1624887871987273 * W: 777
min_vruntime: 107772371139014
avg_vruntime: -1116773464285165183
avg_load: 777
One thing that stands out to me is how the vruntime of each of the entities on the CPU's cfs_rq are really large negative numbers.
vruntime = 18429030910682621789 equals 0xffc111f8d9ee675d
I do not know how those se->vruntime numbers got to that point, but they are a suggestive cause of the overflow.
I'll go comb through the se->vruntime updating code to see how those large numbers could end up as the vruntime for these sched entities.
As you can see from the output here, the large negative is the result of min_vruntime being significantly ahead of the actual entities.
This can happen due to that monotonicity filter the thing has -- it doesn't want to go backwards. Whereas the 0-lag point can move backwards, seeing how it is the weighted average, and inserting a task with positive lag will insert a task left of middle, moving the middle left.
The zero_vruntime patch I gave earlier should avoid this particular issue.
$ ./vruntime < B.txt
 k: 0 w: 50 k*w: 0
 k: 0 w: 43 k*w: 0
 v': 27468255210769 = v: 27468255210769 + d: 0
 V': 0 = V: 0 - d: 0 * W: 93
min_vruntime: 27468255210769
avg_vruntime: 0
avg_load: 93
C, B/a and B/b are not really interesting, they're single entries where min_vruntime == vruntime and boring.
---8<---(root.txt)---8<---
entity	18445226958208703357	319394
entity	18445539757376619550	477855
group	107772371139014
---8<---(B.txt)---8<---
entity	27468255210769	51569
entity	27468255210769	44057
group	27468255210769
---8<---(vruntime.c)---8<---
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>

int main(int argc, char **argv)
{
	unsigned long V, W;
	unsigned long V0;
	bool init = false;

	for (;;) {
		unsigned long vruntime, weight;
		char type[32];

		/* note: pass the array itself, not &type, for %s */
		int r = scanf("%31s\t%lu\t%lu\n", type, &vruntime, &weight);
		if (r == EOF)
			break;

		if (!strcmp(type, "entity")) {
			if (!init) {
				V = W = 0;
				V0 = vruntime;
				init = true;
			}

			unsigned long k = vruntime - V0;
			unsigned long w = weight / 1024;

			V += k * w;
			W += w;

			printf(" k: %ld w: %lu k*w: %ld\n", k, w, k * w);
		}

		if (!strcmp(type, "group")) {
			unsigned long d = vruntime - V0;

			printf(" v': %lu = v: %lu + d: %lu\n", V0 + d, V0, d);
			printf(" V': %ld = V: %ld - d: %ld * W: %lu\n",
			       V - d * W, V, d, W);

			V0 += d;
			V -= d * W;
		}
	}

	printf("min_vruntime: %lu\n", V0);
	printf("avg_vruntime: %ld\n", V);
	printf("avg_load: %lu\n", W);

	return 0;
}
On 4/14/25 5:08 AM, Peter Zijlstra wrote:
[ math and such ]
The zero_vruntime patch I gave earlier should avoid this particular issue.
Here's a crash with the zero runtime patch. I'm trying to reproduce this outside of prod so we can crank up the iteration speed a bit.
-chris
nr_running = 2
zero_vruntime = 19194347104893960
avg_vruntime = 6044054790
avg_load = 2
curr = { cgroup urgent
  vruntime = 24498183812106172
  weight = 3561684 => 3478
  rq = {
    nr_running = 0
    zero_vruntime = 505000008215941
    avg_vruntime = 0
    avg_load = 0
    curr = { cgroup urgent/-610604968056586240
      vruntime = 505000008302509
      weight = 455902 => 445
      rq = {
        nr_running = 0
        zero_vruntime = 12234709899
        avg_vruntime = 0
        avg_load = 0
        curr = { task = 279047 (fc0) vruntime = 12237278090 weight = 15360 => 15 }
        tasks_timeline = [ ]
      }
    }
    tasks_timeline = [ ]
  }
}
tasks_timeline = [
  { cgroup optional
    vruntime = 19194350126921355
    weight = 1168 => 2
    rq = {
      nr_running = 2
      zero_vruntime = 440280059357029
      avg_vruntime = 476
      avg_load = 688
      tasks_timeline = [
        { cgroup optional/-610613050111295488
          vruntime = 440280059333960
          weight = 291271 => 284
          rq = {
            nr_running = 5
            zero_vruntime = 65179829005
            avg_vruntime = 0
            avg_load = 75
            tasks_timeline = [
              { task = 261672 (fc0) vruntime = 65189926507 weight = 15360 => 15 },
              { task = 261332 (fc0) vruntime = 65189480962 weight = 15360 => 15 },
              { task = 261329 (enc1:0:vp9_fbv) vruntime = 65165843516 weight = 15360 => 15 },
              { task = 261334 (dec0:0:hevc_fbv) vruntime = 65174065035 weight = 15360 => 15 },
              { task = 261868 (fc0) vruntime = 65179829005 weight = 15360 => 15 },
            ]
          }
        },
        { cgroup optional/-610609318858457088
          vruntime = 440280059373247
          weight = 413911 => 404
          rq = {
            nr_running = 1
            zero_vruntime = 22819875784
            avg_vruntime = 0
            avg_load = 15
            tasks_timeline = [
              { task = 273291 (fc0) vruntime = 22819875784 weight = 15360 => 15 },
            ]
          }
        },
      ]
    }
  },
]
On Mon, Apr 14, 2025 at 11:38:15AM -0400, Chris Mason wrote:
On 4/14/25 5:08 AM, Peter Zijlstra wrote:
[ math and such ]
The zero_vruntime patch I gave earlier should avoid this particular issue.
Here's a crash with the zero runtime patch.
And indeed it doesn't have these massive (negative) avg_vruntime values.
I'm trying to reproduce this outside of prod so we can crank up the iteration speed a bit.
Thanks.
Could you add which pick went boom for the next dump?
I am however, slightly confused by this output format.
It looks like it dumps the cfs_rq the first time it encounters it, either through curr or through the tree.
So if I read this correct the root is something like:
nr_running = 2
zero_vruntime = 19194347104893960
avg_vruntime = 6044054790
avg_load = 2
curr = { cgroup urgent
  vruntime = 24498183812106172
  weight = 3561684 => 3478
}
tasks_timeline = [
  { cgroup optional
    vruntime = 19194350126921355
    weight = 1168 => 2
  },
]
group	19194347104893960
curr	24498183812106172	3561684
entity	19194350126921355	1168
But if I run those numbers, I get avg_load == 1, seeing how 1168/1024 = 1. But the thing says it should be 2.
Similarly, my avg_vruntime is exactly half of what it says.
avg_vruntime: 3022027395
avg_load: 1
(seeing how 19194350126921355-19194347104893960 = 3022027395)
Anyway, with curr being significantly to the right of that, the 0-lag point is well right of where optional sits. So this pick should be fine, and result in 'optional' getting selected (curr is no longer eligible).
All the urgent/* groups have nr_running == 0, so are not interesting, we'll not pick there.
NOTE: I'm inferring curr is on_rq, because nr_running == 2 and the tree only has 1 entity in it.
NOTE: if we ignore curr, then optional sits at exactly the 0-lag point, with either sets of numbers and so should be eligible.
This then leaves us the optional/* groups.
cgroup optional
rq = {
  nr_running = 2
  zero_vruntime = 440280059357029
  avg_vruntime = 476
  avg_load = 688
  tasks_timeline = [
    { cgroup optional/-610613050111295488
      vruntime = 440280059333960
      weight = 291271 => 284 },
    { cgroup optional/-610609318858457088
      vruntime = 440280059373247
      weight = 413911 => 404 },
group	440280059357029
entity	440280059333960	291271
entity	440280059373247	413911
Which gives:
avg_vruntime: 476
avg_load: 688
And that matches.
Next we have:
cgroup optional/-610613050111295488
rq = {
  nr_running = 5
  zero_vruntime = 65179829005
  avg_vruntime = 0
  avg_load = 75
  tasks_timeline = [
    { task = 261672 (fc0) vruntime = 65189926507 weight = 15360 => 15 },
    { task = 261332 (fc0) vruntime = 65189480962 weight = 15360 => 15 },
    { task = 261329 (enc1:0:vp9_fbv) vruntime = 65165843516 weight = 15360 => 15 },
    { task = 261334 (dec0:0:hevc_fbv) vruntime = 65174065035 weight = 15360 => 15 },
    { task = 261868 (fc0) vruntime = 65179829005 weight = 15360 => 15 },
  ]
}
avg_vruntime: 0
avg_load: 75
This again matches, leaving the bottom 3 tasks eligible.
And finally:
cgroup optional/-610609318858457088
rq = {
  nr_running = 1
  zero_vruntime = 22819875784
  avg_vruntime = 0
  avg_load = 15
  tasks_timeline = [
    { task = 273291 (fc0) vruntime = 22819875784 weight = 15360 => 15 },
  ]
}
Rather boring indeed, but the numbers appear correct.
So I'm not immediately seeing where it would go boom, but seeing how the root group is the one with dodgy numbers, I would suspect that -- but I'm not immediately seeing how... :-(