Hi,
When using KVM on systems that require the iTLB multihit mitigation to be enabled[1], we're observing very high latency (70ms+) in the KVM_CREATE_VM ioctl() on the 6.1 kernel compared to older stable kernels such as 5.10. This is true even when using the favordynmods mount option.
We debugged this down to the cpuset controller trying to acquire cpuset_rwsem in cpuset_can_attach(). This happens because KVM creates a worker thread which calls cgroup_attach_task_all() during KVM_CREATE_VM. I don't know if favordynmods is supposed to cover this case or not, but removing cpuset_rwsem certainly solves the issue.
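For reference, the measurement boils down to timing a single KVM_CREATE_VM call against /dev/kvm. Below is a minimal sketch of that measurement (illustrative only, not our actual harness; the file name and the bare-bones error handling are just for the example):

/* time_kvm_create_vm.c - build with: gcc -O2 time_kvm_create_vm.c -o time_kvm_create_vm */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <time.h>
#include <unistd.h>
#include <linux/kvm.h>

static double ms_since(const struct timespec *t0, const struct timespec *t1)
{
	return (t1->tv_sec - t0->tv_sec) * 1e3 + (t1->tv_nsec - t0->tv_nsec) / 1e6;
}

int main(void)
{
	struct timespec t0, t1;
	int kvm, vm;

	kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
	if (kvm < 0) {
		perror("open /dev/kvm");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	vm = ioctl(kvm, KVM_CREATE_VM, 0);	/* the call that regressed */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	if (vm < 0) {
		perror("KVM_CREATE_VM");
		return 1;
	}

	printf("KVM_CREATE_VM took %.3f ms\n", ms_since(&t0, &t1));
	close(vm);
	close(kvm);
	return 0;
}

On the affected 6.1 kernels this reports the 70ms+ mentioned above; with the series applied it is back to under 1ms.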
For the backport I tried to pick as many dependent commits as required to avoid conflicts. I would highly appreciate review from cgroup people.
Tests performed:
- Measured latency in KVM_CREATE_VM ioctl(); it goes down to less than 1ms
- Ran the cgroup kselftest tests, got the same results with or without this series
- However, some tests such as test_memcontrol and test_kmem are failing in 6.1. This probably needs to be looked at
- To make test_cpuset_prs.sh work, I had to increase the timeout on line 592 to 1 second. With this change, the test runs and passes
- I ran our downstream test suite against our downstream 6.1 kernel with this series applied; it passed
[1] For the case where the CPU is not vulnerable to iTLB multihit, we can simply disable the iTLB multihit mitigation in KVM, which avoids this whole situation. Disabling the mitigation has been possible since upstream commit 0b210faf337, which I plan to backport soon.
Daniel Vacek (1):
  cgroup/cpuset: no need to explicitly init a global static variable

Juri Lelli (1):
  sched/cpuset: Bring back cpuset_mutex

Waiman Long (3):
  cgroup/cpuset: Optimize cpuset_attach() on v2
  cgroup/cpuset: Skip task update if hotplug doesn't affect current cpuset
  cgroup/cpuset: Include offline CPUs when tasks' cpumasks in top_cpuset are updated
 include/linux/cpuset.h |   8 +-
 kernel/cgroup/cpuset.c | 211 +++++++++++++++++++++++------------------
 kernel/sched/core.c    |  22 +++--
 3 files changed, 139 insertions(+), 102 deletions(-)
From: Waiman Long <longman@redhat.com>
Commit 7fd4da9c1584be97ffbc40e600a19cb469fd4e78 upstream.
It was found that with the default hierarchy, enabling cpuset in the child cgroups can trigger a cpuset_attach() call in each of the child cgroups that have tasks with no change in effective cpus and mems. If there are many processes in those child cgroups, it will burn quite a lot of cpu cycles iterating all the tasks without doing useful work.
Optimizing this case by comparing between the old and new cpusets and skip useless update if there is no change in effective cpus and mems. Also mems_allowed are less likely to be changed than cpus_allowed. So skip changing mm if there is no change in effective_mems and CS_MEMORY_MIGRATE is not set.
By inserting some instrumentation code and running a simple command in a container 200 times in a cgroup v2 system, it was found that all the cpuset_attach() calls are skipped (401 times in total) as there was no change in effective cpus and mems.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Luiz Capitulino <luizcap@amazon.com>
---
 kernel/cgroup/cpuset.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e276db722845..0496b88fcd63 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2553,12 +2553,28 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	struct cgroup_subsys_state *css;
 	struct cpuset *cs;
 	struct cpuset *oldcs = cpuset_attach_old_cs;
+	bool cpus_updated, mems_updated;
 
 	cgroup_taskset_first(tset, &css);
 	cs = css_cs(css);
 
 	lockdep_assert_cpus_held();	/* see cgroup_attach_lock() */
 	percpu_down_write(&cpuset_rwsem);
+	cpus_updated = !cpumask_equal(cs->effective_cpus,
+				      oldcs->effective_cpus);
+	mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+
+	/*
+	 * In the default hierarchy, enabling cpuset in the child cgroups
+	 * will trigger a number of cpuset_attach() calls with no change
+	 * in effective cpus and mems. In that case, we can optimize out
+	 * by skipping the task iteration and update.
+	 */
+	if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
+	    !cpus_updated && !mems_updated) {
+		cpuset_attach_nodemask_to = cs->effective_mems;
+		goto out;
+	}
 
 	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
 
@@ -2567,9 +2583,14 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 
 	/*
 	 * Change mm for all threadgroup leaders. This is expensive and may
-	 * sleep and should be moved outside migration path proper.
+	 * sleep and should be moved outside migration path proper. Skip it
+	 * if there is no change in effective_mems and CS_MEMORY_MIGRATE is
+	 * not set.
 	 */
 	cpuset_attach_nodemask_to = cs->effective_mems;
+	if (!is_memory_migrate(cs) && !mems_updated)
+		goto out;
+
 	cgroup_taskset_for_each_leader(leader, css, tset) {
 		struct mm_struct *mm = get_task_mm(leader);
 
@@ -2592,6 +2613,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 		}
 	}
 
+out:
 	cs->old_mems_allowed = cpuset_attach_nodemask_to;
 
 	cs->attach_in_progress--;
From: Daniel Vacek <neelx@redhat.com>
Commit 21786e5cb375a1e58a9175fee423e1d7f892d965 upstream.
cpuset_rwsem is a static variable defined with DEFINE_STATIC_PERCPU_RWSEM(). It's initialized at build time and so there's no need for explicit runtime init leaking one percpu int.
Signed-off-by: Daniel Vacek <neelx@redhat.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Acked-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Luiz Capitulino <luizcap@amazon.com>
---
 kernel/cgroup/cpuset.c | 2 --
 1 file changed, 2 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 0496b88fcd63..4baae4b3c9a0 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3391,8 +3391,6 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
 
 int __init cpuset_init(void)
 {
-	BUG_ON(percpu_init_rwsem(&cpuset_rwsem));
-
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.cpus_allowed, GFP_KERNEL));
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.effective_cpus, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&top_cpuset.subparts_cpus, GFP_KERNEL));
From: Waiman Long <longman@redhat.com>
Commit df59b72cd8fb8ca00301f47e65853efed195d23f upstream.
If a hotplug event doesn't affect the current cpuset, there is no point to call hotplug_update_tasks() or hotplug_update_tasks_legacy(). So just skip it.
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Luiz Capitulino <luizcap@amazon.com>
---
 kernel/cgroup/cpuset.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 4baae4b3c9a0..8664b9c1edc8 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3613,6 +3613,8 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 update_tasks:
 	cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus);
 	mems_updated = !nodes_equal(new_mems, cs->effective_mems);
+	if (!cpus_updated && !mems_updated)
+		goto unlock;	/* Hotplug doesn't affect this cpuset */
 
 	if (mems_updated)
 		check_insane_mems_config(&new_mems);
@@ -3624,6 +3626,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 		hotplug_update_tasks_legacy(cs, &new_cpus, &new_mems,
 					    cpus_updated, mems_updated);
 
+unlock:
 	percpu_up_write(&cpuset_rwsem);
 }
 
From: Waiman Long <longman@redhat.com>
Commit 6667439f51c446fead5d991ff49b842a811a6195 upstream.
Similar to commit 3fb906e7fabb ("cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks"), the whole set of possible CPUs including offline ones should be used for setting cpumasks for tasks in the top cpuset when a cpuset partition is modified as the hotplug code won't update cpumasks for tasks in the top cpuset when CPUs become online or offline.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Luiz Capitulino <luizcap@amazon.com>
---
 kernel/cgroup/cpuset.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 8664b9c1edc8..91c4256a3da4 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1209,7 +1209,9 @@ void rebuild_sched_domains(void)
  *
  * Iterate through each task of @cs updating its cpus_allowed to the
  * effective cpuset's. As this function is called with cpuset_rwsem held,
- * cpuset membership stays stable.
+ * cpuset membership stays stable. For top_cpuset, task_cpu_possible_mask()
+ * is used instead of effective_cpus to make sure all offline CPUs are also
+ * included as hotplug code won't update cpumasks for tasks in top_cpuset.
  */
 static void update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
 {
@@ -1219,15 +1221,18 @@ static void update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
 
 	css_task_iter_start(&cs->css, 0, &it);
 	while ((task = css_task_iter_next(&it))) {
-		/*
-		 * Percpu kthreads in top_cpuset are ignored
-		 */
-		if (top_cs && (task->flags & PF_KTHREAD) &&
-		    kthread_is_per_cpu(task))
-			continue;
+		const struct cpumask *possible_mask = task_cpu_possible_mask(task);
 
-		cpumask_and(new_cpus, cs->effective_cpus,
-			    task_cpu_possible_mask(task));
+		if (top_cs) {
+			/*
+			 * Percpu kthreads in top_cpuset are ignored
+			 */
+			if ((task->flags & PF_KTHREAD) && kthread_is_per_cpu(task))
+				continue;
+			cpumask_andnot(new_cpus, possible_mask, cs->subparts_cpus);
+		} else {
+			cpumask_and(new_cpus, possible_mask, cs->effective_cpus);
+		}
 		set_cpus_allowed_ptr(task, new_cpus);
 	}
 	css_task_iter_end(&it);
From: Juri Lelli <juri.lelli@redhat.com>
Commit 111cd11bbc54850f24191c52ff217da88a5e639b upstream.
Turns out percpu_cpuset_rwsem - commit 1243dc518c9d ("cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem") - wasn't such a brilliant idea, as it has been reported to cause slowdowns in workloads that need to change cpuset configuration frequently and it is also not implementing priority inheritance (which causes troubles with realtime workloads).
Convert percpu_cpuset_rwsem back to regular cpuset_mutex. Also grab it only for SCHED_DEADLINE tasks (other policies don't care about stable cpusets anyway).
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Luiz Capitulino <luizcap@amazon.com>
---
 include/linux/cpuset.h |   8 +--
 kernel/cgroup/cpuset.c | 159 +++++++++++++++++++++--------------------
 kernel/sched/core.c    |  22 ++++--
 3 files changed, 99 insertions(+), 90 deletions(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index d58e0476ee8e..355f796c5f07 100644 --- a/include/linux/cpuset.h +++ b/include/linux/cpuset.h @@ -71,8 +71,8 @@ extern void cpuset_init_smp(void); extern void cpuset_force_rebuild(void); extern void cpuset_update_active_cpus(void); extern void cpuset_wait_for_hotplug(void); -extern void cpuset_read_lock(void); -extern void cpuset_read_unlock(void); +extern void cpuset_lock(void); +extern void cpuset_unlock(void); extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask); extern bool cpuset_cpus_allowed_fallback(struct task_struct *p); extern nodemask_t cpuset_mems_allowed(struct task_struct *p); @@ -196,8 +196,8 @@ static inline void cpuset_update_active_cpus(void)
static inline void cpuset_wait_for_hotplug(void) { }
-static inline void cpuset_read_lock(void) { } -static inline void cpuset_read_unlock(void) { } +static inline void cpuset_lock(void) { } +static inline void cpuset_unlock(void) { }
static inline void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask) diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 91c4256a3da4..58fa0a50bd11 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -366,22 +366,23 @@ static struct cpuset top_cpuset = { if (is_cpuset_online(((des_cs) = css_cs((pos_css)))))
/* - * There are two global locks guarding cpuset structures - cpuset_rwsem and + * There are two global locks guarding cpuset structures - cpuset_mutex and * callback_lock. We also require taking task_lock() when dereferencing a * task's cpuset pointer. See "The task_lock() exception", at the end of this - * comment. The cpuset code uses only cpuset_rwsem write lock. Other - * kernel subsystems can use cpuset_read_lock()/cpuset_read_unlock() to - * prevent change to cpuset structures. + * comment. The cpuset code uses only cpuset_mutex. Other kernel subsystems + * can use cpuset_lock()/cpuset_unlock() to prevent change to cpuset + * structures. Note that cpuset_mutex needs to be a mutex as it is used in + * paths that rely on priority inheritance (e.g. scheduler - on RT) for + * correctness. * * A task must hold both locks to modify cpusets. If a task holds - * cpuset_rwsem, it blocks others wanting that rwsem, ensuring that it - * is the only task able to also acquire callback_lock and be able to - * modify cpusets. It can perform various checks on the cpuset structure - * first, knowing nothing will change. It can also allocate memory while - * just holding cpuset_rwsem. While it is performing these checks, various - * callback routines can briefly acquire callback_lock to query cpusets. - * Once it is ready to make the changes, it takes callback_lock, blocking - * everyone else. + * cpuset_mutex, it blocks others, ensuring that it is the only task able to + * also acquire callback_lock and be able to modify cpusets. It can perform + * various checks on the cpuset structure first, knowing nothing will change. + * It can also allocate memory while just holding cpuset_mutex. While it is + * performing these checks, various callback routines can briefly acquire + * callback_lock to query cpusets. Once it is ready to make the changes, it + * takes callback_lock, blocking everyone else. * * Calls to the kernel memory allocator can not be made while holding * callback_lock, as that would risk double tripping on callback_lock @@ -403,16 +404,16 @@ static struct cpuset top_cpuset = { * guidelines for accessing subsystem state in kernel/cgroup.c */
-DEFINE_STATIC_PERCPU_RWSEM(cpuset_rwsem); +static DEFINE_MUTEX(cpuset_mutex);
-void cpuset_read_lock(void) +void cpuset_lock(void) { - percpu_down_read(&cpuset_rwsem); + mutex_lock(&cpuset_mutex); }
-void cpuset_read_unlock(void) +void cpuset_unlock(void) { - percpu_up_read(&cpuset_rwsem); + mutex_unlock(&cpuset_mutex); }
static DEFINE_SPINLOCK(callback_lock); @@ -496,7 +497,7 @@ static inline bool partition_is_populated(struct cpuset *cs, * One way or another, we guarantee to return some non-empty subset * of cpu_online_mask. * - * Call with callback_lock or cpuset_rwsem held. + * Call with callback_lock or cpuset_mutex held. */ static void guarantee_online_cpus(struct task_struct *tsk, struct cpumask *pmask) @@ -538,7 +539,7 @@ static void guarantee_online_cpus(struct task_struct *tsk, * One way or another, we guarantee to return some non-empty subset * of node_states[N_MEMORY]. * - * Call with callback_lock or cpuset_rwsem held. + * Call with callback_lock or cpuset_mutex held. */ static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask) { @@ -550,7 +551,7 @@ static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask) /* * update task's spread flag if cpuset's page/slab spread flag is set * - * Call with callback_lock or cpuset_rwsem held. The check can be skipped + * Call with callback_lock or cpuset_mutex held. The check can be skipped * if on default hierarchy. */ static void cpuset_update_task_spread_flags(struct cpuset *cs, @@ -575,7 +576,7 @@ static void cpuset_update_task_spread_flags(struct cpuset *cs, * * One cpuset is a subset of another if all its allowed CPUs and * Memory Nodes are a subset of the other, and its exclusive flags - * are only set if the other's are set. Call holding cpuset_rwsem. + * are only set if the other's are set. Call holding cpuset_mutex. */
static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q) @@ -713,7 +714,7 @@ static int validate_change_legacy(struct cpuset *cur, struct cpuset *trial) * If we replaced the flag and mask values of the current cpuset * (cur) with those values in the trial cpuset (trial), would * our various subset and exclusive rules still be valid? Presumes - * cpuset_rwsem held. + * cpuset_mutex held. * * 'cur' is the address of an actual, in-use cpuset. Operations * such as list traversal that depend on the actual address of the @@ -829,7 +830,7 @@ static void update_domain_attr_tree(struct sched_domain_attr *dattr, rcu_read_unlock(); }
-/* Must be called with cpuset_rwsem held. */ +/* Must be called with cpuset_mutex held. */ static inline int nr_cpusets(void) { /* jump label reference count + the top-level cpuset */ @@ -855,7 +856,7 @@ static inline int nr_cpusets(void) * domains when operating in the severe memory shortage situations * that could cause allocation failures below. * - * Must be called with cpuset_rwsem held. + * Must be called with cpuset_mutex held. * * The three key local variables below are: * cp - cpuset pointer, used (together with pos_css) to perform a @@ -1084,7 +1085,7 @@ static void rebuild_root_domains(void) struct cpuset *cs = NULL; struct cgroup_subsys_state *pos_css;
- percpu_rwsem_assert_held(&cpuset_rwsem); + lockdep_assert_held(&cpuset_mutex); lockdep_assert_cpus_held(); lockdep_assert_held(&sched_domains_mutex);
@@ -1134,7 +1135,7 @@ partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[], * 'cpus' is removed, then call this routine to rebuild the * scheduler's dynamic sched domains. * - * Call with cpuset_rwsem held. Takes cpus_read_lock(). + * Call with cpuset_mutex held. Takes cpus_read_lock(). */ static void rebuild_sched_domains_locked(void) { @@ -1145,7 +1146,7 @@ static void rebuild_sched_domains_locked(void) int ndoms;
lockdep_assert_cpus_held(); - percpu_rwsem_assert_held(&cpuset_rwsem); + lockdep_assert_held(&cpuset_mutex);
/* * If we have raced with CPU hotplug, return early to avoid @@ -1196,9 +1197,9 @@ static void rebuild_sched_domains_locked(void) void rebuild_sched_domains(void) { cpus_read_lock(); - percpu_down_write(&cpuset_rwsem); + mutex_lock(&cpuset_mutex); rebuild_sched_domains_locked(); - percpu_up_write(&cpuset_rwsem); + mutex_unlock(&cpuset_mutex); cpus_read_unlock(); }
@@ -1208,7 +1209,7 @@ void rebuild_sched_domains(void) * @new_cpus: the temp variable for the new effective_cpus mask * * Iterate through each task of @cs updating its cpus_allowed to the - * effective cpuset's. As this function is called with cpuset_rwsem held, + * effective cpuset's. As this function is called with cpuset_mutex held, * cpuset membership stays stable. For top_cpuset, task_cpu_possible_mask() * is used instead of effective_cpus to make sure all offline CPUs are also * included as hotplug code won't update cpumasks for tasks in top_cpuset. @@ -1322,7 +1323,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd, int old_prs, new_prs; int part_error = PERR_NONE; /* Partition error? */
- percpu_rwsem_assert_held(&cpuset_rwsem); + lockdep_assert_held(&cpuset_mutex);
/* * The parent must be a partition root. @@ -1545,7 +1546,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd, * * On legacy hierarchy, effective_cpus will be the same with cpu_allowed. * - * Called with cpuset_rwsem held + * Called with cpuset_mutex held */ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp, bool force) @@ -1705,7 +1706,7 @@ static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs, struct cpuset *sibling; struct cgroup_subsys_state *pos_css;
- percpu_rwsem_assert_held(&cpuset_rwsem); + lockdep_assert_held(&cpuset_mutex);
/* * Check all its siblings and call update_cpumasks_hier() @@ -1955,12 +1956,12 @@ static void *cpuset_being_rebound; * @cs: the cpuset in which each task's mems_allowed mask needs to be changed * * Iterate through each task of @cs updating its mems_allowed to the - * effective cpuset's. As this function is called with cpuset_rwsem held, + * effective cpuset's. As this function is called with cpuset_mutex held, * cpuset membership stays stable. */ static void update_tasks_nodemask(struct cpuset *cs) { - static nodemask_t newmems; /* protected by cpuset_rwsem */ + static nodemask_t newmems; /* protected by cpuset_mutex */ struct css_task_iter it; struct task_struct *task;
@@ -1973,7 +1974,7 @@ static void update_tasks_nodemask(struct cpuset *cs) * take while holding tasklist_lock. Forks can happen - the * mpol_dup() cpuset_being_rebound check will catch such forks, * and rebind their vma mempolicies too. Because we still hold - * the global cpuset_rwsem, we know that no other rebind effort + * the global cpuset_mutex, we know that no other rebind effort * will be contending for the global variable cpuset_being_rebound. * It's ok if we rebind the same mm twice; mpol_rebind_mm() * is idempotent. Also migrate pages in each mm to new nodes. @@ -2019,7 +2020,7 @@ static void update_tasks_nodemask(struct cpuset *cs) * * On legacy hierarchy, effective_mems will be the same with mems_allowed. * - * Called with cpuset_rwsem held + * Called with cpuset_mutex held */ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems) { @@ -2072,7 +2073,7 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems) * mempolicies and if the cpuset is marked 'memory_migrate', * migrate the tasks pages to the new memory. * - * Call with cpuset_rwsem held. May take callback_lock during call. + * Call with cpuset_mutex held. May take callback_lock during call. * Will take tasklist_lock, scan tasklist for tasks in cpuset cs, * lock each such tasks mm->mmap_lock, scan its vma's and rebind * their mempolicies to the cpusets new mems_allowed. @@ -2164,7 +2165,7 @@ static int update_relax_domain_level(struct cpuset *cs, s64 val) * @cs: the cpuset in which each task's spread flags needs to be changed * * Iterate through each task of @cs updating its spread flags. As this - * function is called with cpuset_rwsem held, cpuset membership stays + * function is called with cpuset_mutex held, cpuset membership stays * stable. */ static void update_tasks_flags(struct cpuset *cs) @@ -2184,7 +2185,7 @@ static void update_tasks_flags(struct cpuset *cs) * cs: the cpuset to update * turning_on: whether the flag is being set or cleared * - * Call with cpuset_rwsem held. + * Call with cpuset_mutex held. */
static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs, @@ -2234,7 +2235,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs, * @new_prs: new partition root state * Return: 0 if successful, != 0 if error * - * Call with cpuset_rwsem held. + * Call with cpuset_mutex held. */ static int update_prstate(struct cpuset *cs, int new_prs) { @@ -2472,7 +2473,7 @@ static int cpuset_can_attach_check(struct cpuset *cs) return 0; }
-/* Called by cgroups to determine if a cpuset is usable; cpuset_rwsem held */ +/* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */ static int cpuset_can_attach(struct cgroup_taskset *tset) { struct cgroup_subsys_state *css; @@ -2484,7 +2485,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset) cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css)); cs = css_cs(css);
- percpu_down_write(&cpuset_rwsem); + mutex_lock(&cpuset_mutex);
/* Check to see if task is allowed in the cpuset */ ret = cpuset_can_attach_check(cs); @@ -2506,7 +2507,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset) */ cs->attach_in_progress++; out_unlock: - percpu_up_write(&cpuset_rwsem); + mutex_unlock(&cpuset_mutex); return ret; }
@@ -2518,15 +2519,15 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset) cgroup_taskset_first(tset, &css); cs = css_cs(css);
- percpu_down_write(&cpuset_rwsem); + mutex_lock(&cpuset_mutex); cs->attach_in_progress--; if (!cs->attach_in_progress) wake_up(&cpuset_attach_wq); - percpu_up_write(&cpuset_rwsem); + mutex_unlock(&cpuset_mutex); }
/* - * Protected by cpuset_rwsem. cpus_attach is used only by cpuset_attach_task() + * Protected by cpuset_mutex. cpus_attach is used only by cpuset_attach_task() * but we can't allocate it dynamically there. Define it global and * allocate from cpuset_init(). */ @@ -2535,7 +2536,7 @@ static nodemask_t cpuset_attach_nodemask_to;
static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task) { - percpu_rwsem_assert_held(&cpuset_rwsem); + lockdep_assert_held(&cpuset_mutex);
if (cs != &top_cpuset) guarantee_online_cpus(task, cpus_attach); @@ -2564,7 +2565,7 @@ static void cpuset_attach(struct cgroup_taskset *tset) cs = css_cs(css);
lockdep_assert_cpus_held(); /* see cgroup_attach_lock() */ - percpu_down_write(&cpuset_rwsem); + mutex_lock(&cpuset_mutex); cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus); mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems); @@ -2625,7 +2626,7 @@ static void cpuset_attach(struct cgroup_taskset *tset) if (!cs->attach_in_progress) wake_up(&cpuset_attach_wq);
- percpu_up_write(&cpuset_rwsem); + mutex_unlock(&cpuset_mutex); }
/* The various types of files and directories in a cpuset file system */ @@ -2657,7 +2658,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, int retval = 0;
cpus_read_lock(); - percpu_down_write(&cpuset_rwsem); + mutex_lock(&cpuset_mutex); if (!is_cpuset_online(cs)) { retval = -ENODEV; goto out_unlock; @@ -2693,7 +2694,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, break; } out_unlock: - percpu_up_write(&cpuset_rwsem); + mutex_unlock(&cpuset_mutex); cpus_read_unlock(); return retval; } @@ -2706,7 +2707,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft, int retval = -ENODEV;
cpus_read_lock(); - percpu_down_write(&cpuset_rwsem); + mutex_lock(&cpuset_mutex); if (!is_cpuset_online(cs)) goto out_unlock;
@@ -2719,7 +2720,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft, break; } out_unlock: - percpu_up_write(&cpuset_rwsem); + mutex_unlock(&cpuset_mutex); cpus_read_unlock(); return retval; } @@ -2752,7 +2753,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of, * operation like this one can lead to a deadlock through kernfs * active_ref protection. Let's break the protection. Losing the * protection is okay as we check whether @cs is online after - * grabbing cpuset_rwsem anyway. This only happens on the legacy + * grabbing cpuset_mutex anyway. This only happens on the legacy * hierarchies. */ css_get(&cs->css); @@ -2760,7 +2761,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of, flush_work(&cpuset_hotplug_work);
cpus_read_lock(); - percpu_down_write(&cpuset_rwsem); + mutex_lock(&cpuset_mutex); if (!is_cpuset_online(cs)) goto out_unlock;
@@ -2784,7 +2785,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
free_cpuset(trialcs); out_unlock: - percpu_up_write(&cpuset_rwsem); + mutex_unlock(&cpuset_mutex); cpus_read_unlock(); kernfs_unbreak_active_protection(of->kn); css_put(&cs->css); @@ -2932,13 +2933,13 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
css_get(&cs->css); cpus_read_lock(); - percpu_down_write(&cpuset_rwsem); + mutex_lock(&cpuset_mutex); if (!is_cpuset_online(cs)) goto out_unlock;
retval = update_prstate(cs, val); out_unlock: - percpu_up_write(&cpuset_rwsem); + mutex_unlock(&cpuset_mutex); cpus_read_unlock(); css_put(&cs->css); return retval ?: nbytes; @@ -3151,7 +3152,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css) return 0;
cpus_read_lock(); - percpu_down_write(&cpuset_rwsem); + mutex_lock(&cpuset_mutex);
set_bit(CS_ONLINE, &cs->flags); if (is_spread_page(parent)) @@ -3202,7 +3203,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css) cpumask_copy(cs->effective_cpus, parent->cpus_allowed); spin_unlock_irq(&callback_lock); out_unlock: - percpu_up_write(&cpuset_rwsem); + mutex_unlock(&cpuset_mutex); cpus_read_unlock(); return 0; } @@ -3223,7 +3224,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css) struct cpuset *cs = css_cs(css);
cpus_read_lock(); - percpu_down_write(&cpuset_rwsem); + mutex_lock(&cpuset_mutex);
if (is_partition_valid(cs)) update_prstate(cs, 0); @@ -3242,7 +3243,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css) cpuset_dec(); clear_bit(CS_ONLINE, &cs->flags);
- percpu_up_write(&cpuset_rwsem); + mutex_unlock(&cpuset_mutex); cpus_read_unlock(); }
@@ -3255,7 +3256,7 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
static void cpuset_bind(struct cgroup_subsys_state *root_css) { - percpu_down_write(&cpuset_rwsem); + mutex_lock(&cpuset_mutex); spin_lock_irq(&callback_lock);
if (is_in_v2_mode()) { @@ -3268,7 +3269,7 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css) }
spin_unlock_irq(&callback_lock); - percpu_up_write(&cpuset_rwsem); + mutex_unlock(&cpuset_mutex); }
/* @@ -3289,7 +3290,7 @@ static int cpuset_can_fork(struct task_struct *task, struct css_set *cset) return 0;
lockdep_assert_held(&cgroup_mutex); - percpu_down_write(&cpuset_rwsem); + mutex_lock(&cpuset_mutex);
/* Check to see if task is allowed in the cpuset */ ret = cpuset_can_attach_check(cs); @@ -3310,7 +3311,7 @@ static int cpuset_can_fork(struct task_struct *task, struct css_set *cset) */ cs->attach_in_progress++; out_unlock: - percpu_up_write(&cpuset_rwsem); + mutex_unlock(&cpuset_mutex); return ret; }
@@ -3326,11 +3327,11 @@ static void cpuset_cancel_fork(struct task_struct *task, struct css_set *cset) if (same_cs) return;
- percpu_down_write(&cpuset_rwsem); + mutex_lock(&cpuset_mutex); cs->attach_in_progress--; if (!cs->attach_in_progress) wake_up(&cpuset_attach_wq); - percpu_up_write(&cpuset_rwsem); + mutex_unlock(&cpuset_mutex); }
/* @@ -3358,7 +3359,7 @@ static void cpuset_fork(struct task_struct *task) }
/* CLONE_INTO_CGROUP */ - percpu_down_write(&cpuset_rwsem); + mutex_lock(&cpuset_mutex); guarantee_online_mems(cs, &cpuset_attach_nodemask_to); cpuset_attach_task(cs, task);
@@ -3366,7 +3367,7 @@ static void cpuset_fork(struct task_struct *task) if (!cs->attach_in_progress) wake_up(&cpuset_attach_wq);
- percpu_up_write(&cpuset_rwsem); + mutex_unlock(&cpuset_mutex); }
struct cgroup_subsys cpuset_cgrp_subsys = { @@ -3467,7 +3468,7 @@ hotplug_update_tasks_legacy(struct cpuset *cs, is_empty = cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed);
- percpu_up_write(&cpuset_rwsem); + mutex_unlock(&cpuset_mutex);
/* * Move tasks to the nearest ancestor with execution resources, @@ -3477,7 +3478,7 @@ hotplug_update_tasks_legacy(struct cpuset *cs, if (is_empty) remove_tasks_in_empty_cpuset(cs);
- percpu_down_write(&cpuset_rwsem); + mutex_lock(&cpuset_mutex); }
static void @@ -3528,14 +3529,14 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp) retry: wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
- percpu_down_write(&cpuset_rwsem); + mutex_lock(&cpuset_mutex);
/* * We have raced with task attaching. We wait until attaching * is finished, so we won't attach a task to an empty cpuset. */ if (cs->attach_in_progress) { - percpu_up_write(&cpuset_rwsem); + mutex_unlock(&cpuset_mutex); goto retry; }
@@ -3632,7 +3633,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp) cpus_updated, mems_updated);
unlock: - percpu_up_write(&cpuset_rwsem); + mutex_unlock(&cpuset_mutex); }
/** @@ -3662,7 +3663,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work) if (on_dfl && !alloc_cpumasks(NULL, &tmp)) ptmp = &tmp;
- percpu_down_write(&cpuset_rwsem); + mutex_lock(&cpuset_mutex);
/* fetch the available cpus/mems and find out which changed how */ cpumask_copy(&new_cpus, cpu_active_mask); @@ -3719,7 +3720,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work) update_tasks_nodemask(&top_cpuset); }
- percpu_up_write(&cpuset_rwsem); + mutex_unlock(&cpuset_mutex);
/* if cpus or mems changed, we need to propagate to descendants */ if (cpus_updated || mems_updated) { @@ -4129,7 +4130,7 @@ void __cpuset_memory_pressure_bump(void) * - Used for /proc/<pid>/cpuset. * - No need to task_lock(tsk) on this tsk->cpuset reference, as it * doesn't really matter if tsk->cpuset changes after we read it, - * and we take cpuset_rwsem, keeping cpuset_attach() from changing it + * and we take cpuset_mutex, keeping cpuset_attach() from changing it * anyway. */ int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns, diff --git a/kernel/sched/core.c b/kernel/sched/core.c index b23dcbeacdf3..8a8dbb2dac03 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7475,6 +7475,7 @@ static int __sched_setscheduler(struct task_struct *p, int reset_on_fork; int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; struct rq *rq; + bool cpuset_locked = false;
/* The pi code expects interrupts enabled */ BUG_ON(pi && in_interrupt()); @@ -7524,8 +7525,14 @@ static int __sched_setscheduler(struct task_struct *p, return retval; }
- if (pi) - cpuset_read_lock(); + /* + * SCHED_DEADLINE bandwidth accounting relies on stable cpusets + * information. + */ + if (dl_policy(policy) || dl_policy(p->policy)) { + cpuset_locked = true; + cpuset_lock(); + }
/* * Make sure no PI-waiters arrive (or leave) while we are @@ -7601,8 +7608,8 @@ static int __sched_setscheduler(struct task_struct *p, if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) { policy = oldpolicy = -1; task_rq_unlock(rq, p, &rf); - if (pi) - cpuset_read_unlock(); + if (cpuset_locked) + cpuset_unlock(); goto recheck; }
@@ -7669,7 +7676,8 @@ static int __sched_setscheduler(struct task_struct *p, task_rq_unlock(rq, p, &rf);
if (pi) { - cpuset_read_unlock(); + if (cpuset_locked) + cpuset_unlock(); rt_mutex_adjust_pi(p); }
@@ -7681,8 +7689,8 @@ static int __sched_setscheduler(struct task_struct *p,
unlock: task_rq_unlock(rq, p, &rf); - if (pi) - cpuset_read_unlock(); + if (cpuset_locked) + cpuset_unlock(); return retval; }
On Thu, Aug 31, 2023 at 06:13:01PM +0000, Luiz Capitulino wrote:
Hi,
When using KVM on systems that require the iTLB multihit mitigation to be enabled[1], we're observing very high latency (70ms+) in the KVM_CREATE_VM ioctl() on the 6.1 kernel compared to older stable kernels such as 5.10. This is true even when using the favordynmods mount option.
We debugged this down to the cpuset controller trying to acquire cpuset_rwsem in cpuset_can_attach(). This happens because KVM creates a worker thread which calls cgroup_attach_task_all() during KVM_CREATE_VM. I don't know if favordynmods is supposed to cover this case or not, but removing cpuset_rwsem certainly solves the issue.
For the backport I tried to pick as many dependent commits as required to avoid conflicts. I would highly appreciate review from cgroup people.
Tests performed:
- Measured latency in KVM_CREATE_VM ioctl(); it goes down to less than 1ms
- Ran the cgroup kselftest tests, got the same results with or without this series
- However, some tests such as test_memcontrol and test_kmem are failing in 6.1. This probably needs to be looked at
- To make test_cpuset_prs.sh work, I had to increase the timeout on line 592 to 1 second. With this change, the test runs and passes
- I ran our downstream test suite against our downstream 6.1 kernel with this series applied; it passed
[1] For the case where the CPU is not vulnerable to iTLB multihit, we can simply disable the iTLB multihit mitigation in KVM, which avoids this whole situation. Disabling the mitigation has been possible since upstream commit 0b210faf337, which I plan to backport soon.
Please try 6.1.50, I think you will find that this issue is resolved there, right?
if not, please rebase your series on top of that, as obviously it no longer applies.
thanks,
greg k-h
On 2023-08-31 14:22, Greg KH wrote:
On Thu, Aug 31, 2023 at 06:13:01PM +0000, Luiz Capitulino wrote:
Hi,
When using KVM on systems that require the iTLB multihit mitigation to be enabled[1], we're observing very high latency (70ms+) in the KVM_CREATE_VM ioctl() on the 6.1 kernel compared to older stable kernels such as 5.10. This is true even when using the favordynmods mount option.
We debugged this down to the cpuset controller trying to acquire cpuset_rwsem in cpuset_can_attach(). This happens because KVM creates a worker thread which calls cgroup_attach_task_all() during KVM_CREATE_VM. I don't know if favordynmods is supposed to cover this case or not, but removing cpuset_rwsem certainly solves the issue.
For the backport I tried to pick as many dependent commits as required to avoid conflicts. I would highly appreciate review from cgroup people.
Tests performed:
- Measured latency in KVM_CREATE_VM ioctl(); it goes down to less than 1ms
- Ran the cgroup kselftest tests, got the same results with or without this series
- However, some tests such as test_memcontrol and test_kmem are failing in 6.1. This probably needs to be looked at
- To make test_cpuset_prs.sh work, I had to increase the timeout on line 592 to 1 second. With this change, the test runs and passes
- I ran our downstream test suite against our downstream 6.1 kernel with this series applied; it passed
[1] For the case where the CPU is not vulnerable to iTLB multihit, we can simply disable the iTLB multihit mitigation in KVM, which avoids this whole situation. Disabling the mitigation has been possible since upstream commit 0b210faf337, which I plan to backport soon.
Please try 6.1.50, I think you will find that this issue is resolved there, right?
It should, since the most important change is dropping cpuset_rwsem from cpuset. I'll double-check to be sure.
Thank you very much for the quick reply. I queued this series before my vacation and didn't check back before sending it today.
- Luiz
if not, please rebase your series on top of that, as obviously it no longer applies.
thanks,
greg k-h
On 2023-08-31 14:31, Luiz Capitulino wrote:
On 2023-08-31 14:22, Greg KH wrote:
On Thu, Aug 31, 2023 at 06:13:01PM +0000, Luiz Capitulino wrote:
Hi,
When using KVM on systems that require the iTLB multihit mitigation to be enabled[1], we're observing very high latency (70ms+) in the KVM_CREATE_VM ioctl() on the 6.1 kernel compared to older stable kernels such as 5.10. This is true even when using the favordynmods mount option.
We debugged this down to the cpuset controller trying to acquire cpuset_rwsem in cpuset_can_attach(). This happens because KVM creates a worker thread which calls cgroup_attach_task_all() during KVM_CREATE_VM. I don't know if favordynmods is supposed to cover this case or not, but removing cpuset_rwsem certainly solves the issue.
For the backport I tried to pick as many dependent commits as required to avoid conflicts. I would highly appreciate review from cgroup people.
Tests performed:
- Measured latency in KVM_CREATE_VM ioctl(); it goes down to less than 1ms
- Ran the cgroup kselftest tests, got the same results with or without this series
- However, some tests such as test_memcontrol and test_kmem are failing in 6.1. This probably needs to be looked at
- To make test_cpuset_prs.sh work, I had to increase the timeout on line 592 to 1 second. With this change, the test runs and passes
- I ran our downstream test suite against our downstream 6.1 kernel with this series applied; it passed
[1] For the case where the CPU is not vulnerable to iTLB multihit, we can simply disable the iTLB multihit mitigation in KVM, which avoids this whole situation. Disabling the mitigation has been possible since upstream commit 0b210faf337, which I plan to backport soon.
Please try 6.1.50, I think you will find that this issue is resolved there, right?
It should, since the most important change is dropping cpuset_rwsem from cpuset. I'll double-check to be sure.
Quick test passed. We'll be doing more testing over the next few days and I'll report if there are any issues.
- Luiz
Thank you very much for the quick reply. I queued this series before my vacation and didn't check back before sending it today.
- Luiz
if not, please rebase your series on top of that, as obviously it no longer applies.
thanks,
greg k-h