Turns out hotplugging CPUs that are in exclusive cpusets can lead to the cpuset code feeding empty cpumasks to the sched domain rebuild machinery. This leads to the following splat:
[ 30.618174] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 30.623697] Modules linked in:
[ 30.626731] CPU: 0 PID: 235 Comm: kworker/5:2 Not tainted 5.4.0-rc1-00005-g8d495477d62e #23
[ 30.635003] Hardware name: ARM Juno development board (r0) (DT)
[ 30.640877] Workqueue: events cpuset_hotplug_workfn
[ 30.645713] pstate: 60000005 (nZCv daif -PAN -UAO)
[ 30.650464] pc : build_sched_domains (./include/linux/arch_topology.h:23 kernel/sched/topology.c:1898 kernel/sched/topology.c:1969)
[ 30.655126] lr : build_sched_domains (kernel/sched/topology.c:1966)
[...]
[ 30.742047] Call trace:
[ 30.744474]  build_sched_domains (./include/linux/arch_topology.h:23 kernel/sched/topology.c:1898 kernel/sched/topology.c:1969)
[ 30.748793]  partition_sched_domains_locked (kernel/sched/topology.c:2250)
[ 30.753971]  rebuild_sched_domains_locked (./include/linux/bitmap.h:370 ./include/linux/cpumask.h:538 kernel/cgroup/cpuset.c:955 kernel/cgroup/cpuset.c:978 kernel/cgroup/cpuset.c:1019)
[ 30.758977]  rebuild_sched_domains (kernel/cgroup/cpuset.c:1032)
[ 30.763209]  cpuset_hotplug_workfn (kernel/cgroup/cpuset.c:3205 (discriminator 2))
[ 30.767613]  process_one_work (./arch/arm64/include/asm/jump_label.h:21 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:114 kernel/workqueue.c:2274)
[ 30.771586]  worker_thread (./include/linux/compiler.h:199 ./include/linux/list.h:268 kernel/workqueue.c:2416)
[ 30.775217]  kthread (kernel/kthread.c:255)
[ 30.778418]  ret_from_fork (arch/arm64/kernel/entry.S:1167)
[ 30.781965] Code: f860dae2 912802d6 aa1603e1 12800000 (f8616853)
The faulty line in question is
cap = arch_scale_cpu_capacity(cpumask_first(cpu_map));
and, since cpu_map is empty here, cpumask_first() returns nr_cpu_ids; we're not checking the return value against nr_cpu_ids (we shouldn't have to!), which leads to the above.
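For illustration, a minimal standalone C sketch of the semantics (not kernel code; NR_CPUS and the capacity values here are made up): cpumask_first() is find_first_bit() underneath, which returns the bitmap size - i.e. nr_cpu_ids - when no bit is set, so blindly using the result as a CPU index walks off the end of per-CPU data:

#include <stdio.h>

#define NR_CPUS 8

/* Mimics find_first_bit(): returns 'size' when no bit is set. */
static unsigned int first_set_bit(unsigned long mask, unsigned int size)
{
	unsigned int i;

	for (i = 0; i < size; i++)
		if (mask & (1UL << i))
			return i;
	return size;	/* nr_cpu_ids in the kernel */
}

int main(void)
{
	/* Stand-in for the per-CPU cpu_scale data (big.LITTLE-ish values). */
	unsigned long capacity[NR_CPUS] = { 446, 446, 446, 446, 1024, 1024, 1024, 1024 };
	unsigned long empty_cpu_map = 0;
	unsigned int cpu = first_set_bit(empty_cpu_map, NR_CPUS);

	if (cpu >= NR_CPUS) {
		/* The kernel has no such check here: it just indexes away. */
		printf("cpu == %u: capacity[cpu] would be out of bounds\n", cpu);
		return 1;
	}
	printf("cap = %lu\n", capacity[cpu]);
	return 0;
}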
Prevent generate_sched_domains() from returning empty cpumasks, and add an assertion in build_sched_domains() to scream bloody murder if it happens again.
The above splat was obtained on my Juno r0 with:
cgcreate -g cpuset:asym
cgset -r cpuset.cpus=0-3 asym
cgset -r cpuset.mems=0 asym
cgset -r cpuset.cpu_exclusive=1 asym

cgcreate -g cpuset:smp
cgset -r cpuset.cpus=4-5 smp
cgset -r cpuset.mems=0 smp
cgset -r cpuset.cpu_exclusive=1 smp

cgset -r cpuset.sched_load_balance=0 .

echo 0 > /sys/devices/system/cpu/cpu4/online
echo 0 > /sys/devices/system/cpu/cpu5/online
Cc: stable@vger.kernel.org
Fixes: 05484e098448 ("sched/topology: Add SD_ASYM_CPUCAPACITY flag detection")
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/cgroup/cpuset.c  | 3 ++-
 kernel/sched/topology.c | 5 ++++-
 2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c52bc91f882b..c87ee6412b36 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -798,7 +798,8 @@ static int generate_sched_domains(cpumask_var_t **domains,
 		    cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus))
 			continue;
 
-		if (is_sched_load_balance(cp))
+		if (is_sched_load_balance(cp) &&
+		    !cpumask_empty(cp->effective_cpus))
 			csa[csn++] = cp;
 
 		/* skip @cp's subtree if not a partition root */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 3623ffe85d18..2e7af755e17a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1945,7 +1945,7 @@ static struct sched_domain_topology_level
 static int
 build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr)
 {
-	enum s_alloc alloc_state;
+	enum s_alloc alloc_state = sa_none;
 	struct sched_domain *sd;
 	struct s_data d;
 	struct rq *rq = NULL;
@@ -1953,6 +1953,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr)
 	struct sched_domain_topology_level *tl_asym;
 	bool has_asym = false;
 
+	if (WARN_ON(cpumask_empty(cpu_map)))
+		goto error;
+
 	alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
 	if (alloc_state != sa_rootdomain)
 		goto error;
On 23/10/2019 17:37, Valentin Schneider wrote:
> Turns out hotplugging CPUs that are in exclusive cpusets can lead to the cpuset code feeding empty cpumasks to the sched domain rebuild machinery. This leads to the following splat:
>
> [...]
>
> The faulty line in question is
>
>   cap = arch_scale_cpu_capacity(cpumask_first(cpu_map));
>
> and, since cpu_map is empty here, cpumask_first() returns nr_cpu_ids; we're not checking the return value against nr_cpu_ids (we shouldn't have to!), which leads to the above.
>
> Prevent generate_sched_domains() from returning empty cpumasks, and add an assertion in build_sched_domains() to scream bloody murder if it happens again.
>
> The above splat was obtained on my Juno r0 with:
>
>   cgcreate -g cpuset:asym
>   cgset -r cpuset.cpus=0-3 asym
>   cgset -r cpuset.mems=0 asym
>   cgset -r cpuset.cpu_exclusive=1 asym
>
>   cgcreate -g cpuset:smp
>   cgset -r cpuset.cpus=4-5 smp
>   cgset -r cpuset.mems=0 smp
>   cgset -r cpuset.cpu_exclusive=1 smp
>
>   cgset -r cpuset.sched_load_balance=0 .
>
>   echo 0 > /sys/devices/system/cpu/cpu4/online
>   echo 0 > /sys/devices/system/cpu/cpu5/online
>
> Cc: stable@vger.kernel.org
> Fixes: 05484e098448 ("sched/topology: Add SD_ASYM_CPUCAPACITY flag detection")
Sorry for being picky but IMHO you should also mention that it fixes
f9a25f776d78 ("cpusets: Rebuild root domain deadline accounting information")
Tested it on a hikey620 (8 CPUs SMP) with v5.4-rc4 and a local fix for asym_cpu_capacity_level(): 2 exclusive cpusets [0-3] and [4-7], hotplugging out [0-3] and then hotplugging [0] back in.
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5a174ae6ecf3..8f83e8e3ea9a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2203,8 +2203,19 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
 	for (i = 0; i < ndoms_cur; i++) {
 		for (j = 0; j < n && !new_topology; j++) {
 			if (cpumask_equal(doms_cur[i], doms_new[j]) &&
-			    dattrs_equal(dattr_cur, i, dattr_new, j))
+			    dattrs_equal(dattr_cur, i, dattr_new, j)) {
+				struct root_domain *rd;
+
+				/*
+				 * This domain won't be destroyed and as such
+				 * its dl_bw->total_bw needs to be cleared. It
+				 * will be recomputed in function
+				 * update_tasks_root_domain().
+				 */
+				rd = cpu_rq(cpumask_any(doms_cur[i]))->rd;
We have an issue here if doms_cur[i] is empty.
+				dl_clear_root_domain(rd);
 				goto match1;
There is yet another similar issue behind the first one (asym_cpu_capacity_level()).
342 static bool build_perf_domains(const struct cpumask *cpu_map)
343 {
344         int i, nr_pd = 0, nr_cs = 0, nr_cpus = cpumask_weight(cpu_map);
345         struct perf_domain *pd = NULL, *tmp;
346         int cpu = cpumask_first(cpu_map);           <--- !!!
347         struct root_domain *rd = cpu_rq(cpu)->rd;   <--- !!!
348         struct cpufreq_policy *policy;
349         struct cpufreq_governor *gov;
...
406         tmp = rd->pd;                               <--- !!!
Caught when running on hikey620 (8 CPUs SMP) with v5.4-rc4, a local fix for asym_cpu_capacity_level(), and CONFIG_ENERGY_MODEL=y.
There might be other places in build_sched_domains() suffering from the same issue, so it seems wise not to call it with an empty cpu_map at all, and to warn if that happens.
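The same failure mode, modeled standalone (plain C; the struct layouts are simplified stand-ins, not the real kernel definitions): an empty cpu_map makes the CPU index equal NR_CPUS, the computed runqueue pointer lands past the array, and the later rd->pd access (line 406 above) would chase a garbage pointer:

#include <stdio.h>

#define NR_CPUS 8

/* Simplified stand-ins for the kernel structures. */
struct perf_domain { int placeholder; };
struct root_domain { struct perf_domain *pd; };
struct rq { struct root_domain *rd; };

/* Models the per-CPU runqueue storage behind cpu_rq(). */
static struct rq runqueues[NR_CPUS];

int main(void)
{
	/* On an empty cpu_map, cpumask_first() returns NR_CPUS (nr_cpu_ids)... */
	unsigned int cpu = NR_CPUS;

	/* ...so the cpu_rq(cpu) equivalent points one element past the array. */
	struct rq *rq = &runqueues[cpu];	/* legal to form, invalid to read */

	printf("cpu index %u, but the last valid index is %u:\n", cpu, NR_CPUS - 1);
	printf("rq->rd would load garbage, and 'tmp = rd->pd' would dereference it\n");
	(void)rq;
	return 0;
}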
[...]
On 24/10/2019 17:19, Dietmar Eggemann wrote:
> Sorry for being picky but IMHO you should also mention that it fixes
>
> f9a25f776d78 ("cpusets: Rebuild root domain deadline accounting information")
I can append the following to the changelog, although I'd like some feedback from the cgroup folks before doing a respin:
""" Note that commit
f9a25f776d78 ("cpusets: Rebuild root domain deadline accounting information")
introduced a similar issue. Since doms_new is assigned to doms_cur without any filtering, we can end up with an empty cpumask in the doms_cur array.
The next time we go through a rebuild, this will break on:
rd = cpu_rq(cpumask_any(doms_cur[i]))->rd;
As if there wasn't enough already, this is yet another argument for *not* handing over empty cpumasks to the sched domain rebuild.
"""
I tagged the commit that introduces the static key with Fixes: because it was introduced earlier - I don't think it would make sense to have two "Fixes:" lines? In any case, it'll now be listed in the changelog.
On Wed, Oct 23, 2019 at 04:37:44PM +0100, Valentin Schneider <valentin.schneider@arm.com> wrote:
> Prevent generate_sched_domains() from returning empty cpumasks, and add an assertion in build_sched_domains() to scream bloody murder if it happens again.
Good catch. It makes sense to prune the empty domains in generate_sched_domains already.
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index c52bc91f882b..c87ee6412b36 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -798,7 +798,8 @@ static int generate_sched_domains(cpumask_var_t **domains,
>  		    cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus))
>  			continue;
> 
> -		if (is_sched_load_balance(cp))
> +		if (is_sched_load_balance(cp) &&
> +		    !cpumask_empty(cp->effective_cpus))
>  			csa[csn++] = cp;
If I didn't overlook anything, cp->effective_cpus can contain CPUs excluded by housekeeping_cpumask(HK_FLAG_DOMAIN) later, i.e. possibly still returning domains with empty cpumasks.
I'd suggest moving the emptiness check down into the loop where domain cpumasks are ultimately constructed.
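A toy illustration of that scenario (plain C bitmasks standing in for cpumasks; the isolcpus setup is a made-up example): a non-empty effective_cpus can still produce an empty domain once the housekeeping mask is applied:

#include <stdio.h>

int main(void)
{
	/*
	 * 8 CPUs booted with e.g. isolcpus=4,5: CPUs 4-5 are then absent
	 * from the HK_FLAG_DOMAIN housekeeping mask.
	 */
	unsigned long housekeeping   = 0xcfUL;	/* 0b11001111: CPUs 0-3, 6-7 */

	/* A cpuset whose effective_cpus spans only the isolated CPUs. */
	unsigned long effective_cpus = 0x30UL;	/* 0b00110000: CPUs 4-5 */

	/* The patch's !cpumask_empty(cp->effective_cpus) check passes... */
	printf("effective_cpus empty?    %s\n", effective_cpus ? "no" : "yes");

	/* ...yet the domain mask left after housekeeping filtering is empty. */
	printf("intersects housekeeping? %s\n",
	       (effective_cpus & housekeeping) ? "yes" : "no");
	return 0;
}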
Michal
Hi Michal,
On 31/10/2019 17:23, Michal Koutný wrote:
> On Wed, Oct 23, 2019 at 04:37:44PM +0100, Valentin Schneider <valentin.schneider@arm.com> wrote:
>> Prevent generate_sched_domains() from returning empty cpumasks, and add an assertion in build_sched_domains() to scream bloody murder if it happens again.
>
> Good catch. It makes sense to prune the empty domains in generate_sched_domains already.
>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index c52bc91f882b..c87ee6412b36 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -798,7 +798,8 @@ static int generate_sched_domains(cpumask_var_t **domains,
>>  		    cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus))
>>  			continue;
>> 
>> -		if (is_sched_load_balance(cp))
>> +		if (is_sched_load_balance(cp) &&
>> +		    !cpumask_empty(cp->effective_cpus))
>>  			csa[csn++] = cp;
>
> If I didn't overlook anything, cp->effective_cpus can contain CPUs excluded by housekeeping_cpumask(HK_FLAG_DOMAIN) later, i.e. possibly still returning domains with empty cpumasks.
>
> I'd suggest moving the emptiness check down into the loop where domain cpumasks are ultimately constructed.
Ah, wasn't aware of this - thanks for having a look!
I think I need to have the check before the final cpumask gets built, because at this point the cpumask array is already built and it's handed off directly to the sched domain rebuild.
Do you reckon the following would work?
----8<----
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c87ee6412b36..e4c10785dc7c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -798,8 +798,14 @@ static int generate_sched_domains(cpumask_var_t **domains,
 		    cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus))
 			continue;
 
+		/*
+		 * Skip cpusets that would lead to an empty sched domain.
+		 * That could be because effective_cpus is empty, or because
+		 * it's only spanning CPUs outside the housekeeping mask.
+		 */
 		if (is_sched_load_balance(cp) &&
-		    !cpumask_empty(cp->effective_cpus))
+		    cpumask_intersects(cp->effective_cpus,
+				       housekeeping_cpumask(HK_FLAG_DOMAIN)))
 			csa[csn++] = cp;
 
 		/* skip @cp's subtree if not a partition root */
On Thu, Oct 31, 2019 at 06:23:12PM +0100, Valentin Schneider <valentin.schneider@arm.com> wrote:
> Do you reckon the following would work?
LGTM (i.e. the cpuset will be skipped if no CPUs taking part in load balancing remain in it after a hot(un)plug event).
Michal