On Wed, May 07, 2025 at 10:43:05AM -0700, Omar Sandoval wrote:
On Tue, Apr 29, 2025 at 06:38:59PM +0200, Greg Kroah-Hartman wrote:
6.14-stable review patch. If anyone has any objections, please let me know.
From: Omar Sandoval <osandov@fb.com>
[ Upstream commit bbce3de72be56e4b5f68924b7da9630cc89aa1a8 ]
There is a code path in dequeue_entities() that can set the slice of a sched_entity to U64_MAX, which sometimes results in a crash.
The offending case is when dequeue_entities() is called to dequeue a delayed group entity, and then the entity's parent's dequeue is delayed. In that case:
- In the if (entity_is_task(se)) else block at the beginning of dequeue_entities(), slice is set to cfs_rq_min_slice(group_cfs_rq(se)). If the entity was delayed, then it has no queued tasks, so cfs_rq_min_slice() returns U64_MAX.
- The first for_each_sched_entity() loop dequeues the entity.
- If the entity was its parent's only child, then the next iteration tries to dequeue the parent.
- If the parent's dequeue needs to be delayed, then it breaks from the first for_each_sched_entity() loop _without updating slice_.
- The second for_each_sched_entity() loop sets the parent's ->slice to the saved slice, which is still U64_MAX.
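The steps above can be modeled in user space. This is only a sketch under stated assumptions: struct toy_entity, its fields, and both functions are hypothetical stand-ins invented for illustration, not the code in kernel/sched/fair.c, and the model keeps just the control flow that matters for the stale slice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a group entity; NOT the kernel's struct sched_entity. */
struct toy_entity {
	struct toy_entity *parent;
	uint64_t slice;           /* value the second walk overwrites */
	uint64_t own_min_slice;   /* min slice of the queue this group owns; UINT64_MAX if empty */
	uint64_t level_min_slice; /* min slice reported by the queue this entity sits on */
	bool has_siblings;        /* other entities remain on that queue */
	bool dequeue_delayed;     /* a dequeue attempt on this entity gets delayed */
};

/* Models the two walks described above, including the missing slice update. */
void toy_dequeue_walk(struct toy_entity *se)
{
	assert(se != NULL);

	/* A delayed group entity owns an empty queue, so this starts as UINT64_MAX. */
	uint64_t slice = se->own_min_slice;

	/* First walk: dequeue upward until a level stays populated or is delayed. */
	for (; se != NULL; se = se->parent) {
		if (se->dequeue_delayed)
			break; /* modeled bug: leaves without refreshing slice */
		if (se->has_siblings) {
			slice = se->level_min_slice;
			se = se->parent;
			break;
		}
	}

	/* Second walk: propagate the saved slice to the remaining ancestors. */
	for (; se != NULL; se = se->parent) {
		se->slice = slice;
		slice = se->level_min_slice;
	}
}

/* Scenario from the list: only child is dequeued, parent's dequeue is delayed. */
uint64_t toy_parent_slice_after_walk(void)
{
	struct toy_entity parent = {
		.parent = NULL, .slice = 3000000, .own_min_slice = 3000000,
		.level_min_slice = 3000000, .has_siblings = true,
		.dequeue_delayed = true,
	};
	struct toy_entity child = {
		.parent = &parent, .slice = 3000000, .own_min_slice = UINT64_MAX,
		.level_min_slice = UINT64_MAX, .has_siblings = false,
		.dequeue_delayed = false,
	};

	toy_dequeue_walk(&child);
	return parent.slice;
}
```

In this model, toy_parent_slice_after_walk() returns UINT64_MAX: the first walk breaks at the parent with the initial value still saved, and the second walk stores it.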
This throws off subsequent calculations with potentially catastrophic results. A manifestation we saw in production was:
- In update_entity_lag(), se->slice is used to calculate limit, which ends up as a huge negative number.
- limit is used in se->vlag = clamp(vlag, -limit, limit). Because limit is negative, vlag > limit, so se->vlag is set to the same huge negative number.
- In place_entity(), se->vlag is scaled, which overflows and results in another huge (positive or negative) number.
- The adjusted lag is subtracted from se->vruntime, which increases or decreases se->vruntime by a huge number.
- pick_eevdf() calls entity_eligible()/vruntime_eligible(), which incorrectly returns false because the vruntime is so far from the other vruntimes on the queue, causing the (vruntime - cfs_rq->min_vruntime) * load calculation to overflow.
- Nothing appears to be eligible, so pick_eevdf() returns NULL.
- pick_next_entity() tries to dereference the return value of pick_eevdf() and crashes.
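The clamp step in this chain is easy to check in isolation. A minimal sketch, assuming a clamp written as min(max(val, lo), hi); toy_clamp() and toy_clamp_lag() are hypothetical helpers for illustration, and the kernel's own clamp() macro may be written differently. Once limit is negative, the lower bound -limit is above the upper bound limit, so the result collapses to limit for any vlag above it.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for a clamp: min(max(val, lo), hi). Assumes lo <= hi. */
int64_t toy_clamp(int64_t val, int64_t lo, int64_t hi)
{
	int64_t v = val > lo ? val : lo; /* max(val, lo) */

	return v < hi ? v : hi;          /* min(v, hi) */
}

/* Mirrors the shape of "vlag = clamp(vlag, -limit, limit)". */
int64_t toy_clamp_lag(int64_t vlag, int64_t limit)
{
	assert(limit != INT64_MIN); /* -limit would overflow in this sketch */
	return toy_clamp(vlag, -limit, limit);
}
```

With a sane limit of 100, toy_clamp_lag(500, 100) gives 100 and toy_clamp_lag(42, 100) gives 42. With a negative limit such as -(1LL << 62), toy_clamp_lag(0, limit) gives limit itself, which is the behavior described for se->vlag above.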
Dumping the cfs_rq states from the core dumps with drgn showed tell-tale huge vruntime ranges and bogus vlag values, and I also traced se->slice being set to U64_MAX on live systems (which was usually "benign" since the rest of the runqueue needed to be in a particular state to crash).
Fix it in dequeue_entities() by always setting slice from the first non-empty cfs_rq.
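One way to picture "the first non-empty cfs_rq" is a walk up the hierarchy that skips empty queues. This is a sketch of the idea only, not the actual patch: struct toy_queue and toy_first_nonempty_min_slice() are hypothetical names, and the assumption that a non-empty queue never reports UINT64_MAX is made explicit with an assert.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one level of the hierarchy; NOT struct cfs_rq. */
struct toy_queue {
	struct toy_queue *parent;
	unsigned int nr_queued; /* entities currently queued at this level */
	uint64_t min_slice;     /* UINT64_MAX when nothing is queued */
};

/*
 * Return the min slice of the first non-empty queue at or above q.
 * Returns UINT64_MAX only if every level is empty, in which case the
 * caller has nothing valid to propagate.
 */
uint64_t toy_first_nonempty_min_slice(const struct toy_queue *q)
{
	for (; q != NULL; q = q->parent) {
		if (q->nr_queued > 0) {
			/* Assumption of this sketch: non-empty queues have a real slice. */
			assert(q->min_slice != UINT64_MAX);
			return q->min_slice;
		}
	}
	return UINT64_MAX;
}
```

For a chain of two empty levels under a root with two queued entities and a min slice of 3000000, the walk returns 3000000 rather than the empty levels' UINT64_MAX.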
Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/f0c2d1072be229e1bdddc73c0703919a8b00c652.174557099...
Signed-off-by: Sasha Levin <sashal@kernel.org>

 kernel/sched/fair.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)
Hi,
I believe this fix should go in 6.12, too.
Great, can you submit a version that applies to 6.12.y?
thanks,
greg k-h