The patch below does not apply to the 6.6-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to <stable@vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 76f970ce51c80f625eb6ddbb24e9cb51b977b598
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable@vger.kernel.org>' --in-reply-to '2025032422-giblet-smelting-2a62@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 76f970ce51c80f625eb6ddbb24e9cb51b977b598 Mon Sep 17 00:00:00 2001
From: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date: Fri, 14 Mar 2025 16:13:45 +0100
Subject: [PATCH] Revert "sched/core: Reduce cost of sched_move_task when
 config autogroup"
This reverts commit eff6c8ce8d4d7faef75f66614dd20bb50595d261.
Hazem reported a 30% drop in UnixBench spawn test with commit eff6c8ce8d4d ("sched/core: Reduce cost of sched_move_task when config autogroup") on a m6g.xlarge AWS EC2 instance with 4 vCPUs and 16 GiB RAM (aarch64) (single level MC sched domain):
https://lkml.kernel.org/r/20250205151026.13061-1-hagarhem@amazon.com
There is an early bail from sched_move_task() if p->sched_task_group is equal to p's 'cpu cgroup' (sched_get_task_group()). E.g. both are pointing to taskgroup '/user.slice/user-1000.slice/session-1.scope' (Ubuntu '22.04.5 LTS').
So in:
  do_exit()

    sched_autogroup_exit_task()

      sched_move_task()

        if sched_get_task_group(p) == p->sched_task_group
          return

        /* p is enqueued */
        dequeue_task()              \
        sched_change_group()        |
          task_change_group_fair()  |
            detach_task_cfs_rq()    |  (1)
            set_task_rq()           |
            attach_task_cfs_rq()    |
        enqueue_task()              /
(1) isn't called for p anymore.
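For reference, a minimal sketch of the reverted fast path. 'struct task_sketch' and 'cpu_cgroup_tg' are illustrative stand-ins, not kernel types; the authoritative removed lines are in the diff at the end of this mail:

  struct task_group { int id; };

  struct task_sketch {
          struct task_group *sched_task_group;
          struct task_group *cpu_cgroup_tg; /* result of the cpu cgroup (+ autogroup) lookup */
  };

  /* stand-in for sched_get_task_group() */
  static struct task_group *get_task_group(struct task_sketch *p)
  {
          return p->cpu_cgroup_tg;
  }

  static void move_task(struct task_sketch *p)
  {
          /* the reverted early bail: skip the move if the group is unchanged */
          if (get_task_group(p) == p->sched_task_group)
                  return;

          /* otherwise dequeue_task(), sched_change_group() incl. (1), enqueue_task() */
          p->sched_task_group = get_task_group(p);
  }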
Turns out that the regression is related to sgs->group_util in group_is_overloaded() and group_has_capacity(). If (1) isn't called for all the 'spawn' tasks then sgs->group_util is ~900 and sgs->group_capacity = 1024 (single CPU sched domain) and this leads to group_is_overloaded() returning true (2) and group_has_capacity() false (3) much more often compared to the case when (1) is called.
I.e. there are many more cases of 'group_is_overloaded' and 'group_fully_busy' in the WF_FORK wakeup path sched_balance_find_dst_cpu(), which then much more often returns a CPU != smp_processor_id() (5).

This isn't good for these extremely short-running tasks (FORK + EXIT) and also involves calling sched_balance_find_dst_group_cpu() unnecessarily (single CPU sched domain).

Instead, if (1) is called for tasks with 'p->flags & PF_EXITING', the path (4),(6) is taken much more often (see the worked example after the call tree below).
  select_task_rq_fair(..., wake_flags = WF_FORK)

    cpu = smp_processor_id()

    new_cpu = sched_balance_find_dst_cpu(..., cpu, ...)

      group = sched_balance_find_dst_group(..., cpu)

        do {

          update_sg_wakeup_stats()

            sgs->group_type = group_classify()

              if group_is_overloaded()            (2)
                return group_overloaded

              if !group_has_capacity()            (3)
                return group_fully_busy

              return group_has_spare              (4)

        } while group

        if local_sgs.group_type > idlest_sgs.group_type
          return idlest                           (5)

        case group_has_spare:

          if local_sgs.idle_cpus >= idlest_sgs.idle_cpus
            return NULL                           (6)
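To see how the reported numbers trip (2) and (3), here is a standalone sketch of the util-vs-capacity term of group_is_overloaded()/group_has_capacity() from kernel/sched/fair.c. It is simplified (the real predicates also consider nr_running, group_weight and group_runnable), and the imbalance_pct of 117 is the usual sched-domain default, assumed here like the sample values:

  #include <stdbool.h>
  #include <stdio.h>

  /* Only the util-vs-capacity term of the real predicates is modelled. */
  static bool is_overloaded(unsigned long cap, unsigned long util, unsigned int pct)
  {
          return cap * 100 < util * pct;          /* true -> (2) group_overloaded */
  }

  static bool has_capacity(unsigned long cap, unsigned long util, unsigned int pct)
  {
          return cap * 100 > util * pct;          /* false -> (3) group_fully_busy */
  }

  int main(void)
  {
          unsigned long cap = 1024, util = 900;   /* numbers from the report above */
          unsigned int pct = 117;                 /* assumed default imbalance_pct */

          /* 1024 * 100 = 102400 < 900 * 117 = 105300 -> overloaded, no capacity */
          printf("overloaded=%d has_capacity=%d\n",
                 is_overloaded(cap, util, pct), has_capacity(cap, util, pct));
          return 0;
  }

With the exiting tasks' util never detached, the single-CPU group keeps classifying as group_overloaded/group_fully_busy; once (1) runs, group_util drops and the group_has_spare path (4),(6) keeps the fork on smp_processor_id().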
Unixbench Tests './Run -c 4 spawn' on:
(a) VM AWS instance (m7gd.16xlarge) with v6.13 ('maxcpus=4 nr_cpus=4') and Ubuntu 22.04.5 LTS (aarch64).
Shell & test run in '/user.slice/user-1000.slice/session-1.scope'.
  w/o patch   w/ patch
  21005       27120
(b) i7-13700K with tip/sched/core ('nosmt maxcpus=8 nr_cpus=8') and Ubuntu 22.04.5 LTS (x86_64).
Shell & test run in '/A'.
  w/o patch   w/ patch
  67675       88806
CONFIG_SCHED_AUTOGROUP=y & /proc/sys/kernel/sched_autogroup_enabled equal to 0 or 1.
Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Hagar Hemdan <hagarhem@amazon.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250314151345.275739-1-dietmar.eggemann@arm.com
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67189907214d..042351c7afce 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9016,7 +9016,7 @@ void sched_release_group(struct task_group *tg)
 	spin_unlock_irqrestore(&task_group_lock, flags);
 }
 
-static struct task_group *sched_get_task_group(struct task_struct *tsk)
+static void sched_change_group(struct task_struct *tsk)
 {
 	struct task_group *tg;
 
@@ -9028,13 +9028,7 @@ static struct task_group *sched_get_task_group(struct task_struct *tsk)
 	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
 			  struct task_group, css);
 	tg = autogroup_task_group(tsk, tg);
-
-	return tg;
-}
-
-static void sched_change_group(struct task_struct *tsk, struct task_group *group)
-{
-	tsk->sched_task_group = group;
+	tsk->sched_task_group = tg;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	if (tsk->sched_class->task_change_group)
@@ -9055,20 +9049,11 @@ void sched_move_task(struct task_struct *tsk, bool for_autogroup)
 {
 	int queued, running, queue_flags =
 		DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
-	struct task_group *group;
 	struct rq *rq;
 
 	CLASS(task_rq_lock, rq_guard)(tsk);
 	rq = rq_guard.rq;
 
-	/*
-	 * Esp. with SCHED_AUTOGROUP enabled it is possible to get superfluous
-	 * group changes.
-	 */
-	group = sched_get_task_group(tsk);
-	if (group == tsk->sched_task_group)
-		return;
-
 	update_rq_clock(rq);
 
 	running = task_current_donor(rq, tsk);
@@ -9079,7 +9064,7 @@ void sched_move_task(struct task_struct *tsk, bool for_autogroup)
 	if (running)
 		put_prev_task(rq, tsk);
 
-	sched_change_group(tsk, group);
+	sched_change_group(tsk);
 	if (!for_autogroup)
 		scx_cgroup_move_task(tsk);
 