The "nohz_full" and "rcu_nocbs" boot command parameters can be used to remove a lot of kernel overhead on a specific set of isolated CPUs which can be used to run some latency/bandwidth sensitive workloads with as little kernel disturbance/noise as possible. The problem with this mode of operation is the fact that it is a static configuration which cannot be changed after boot to adjust for changes in application loading.
There has long been a desire to enable runtime modification of the number of isolated CPUs that can be dedicated to such demanding workloads. This patch series is an attempt to do just that, with a degree of CPU isolation close to what can be achieved with the nohz_full and rcu_nocbs boot kernel parameters.
This patch series provides the ability to change the set of housekeeping CPUs at run time via the cpuset isolated partition functionality. Currently, a cpuset isolated partition is able to disable scheduler load balancing and adjust the CPU affinity of unbound workqueues to avoid the isolated CPUs. This patch series extends that to the other sources of kernel noise associated with the nohz_full boot command line parameter, which fall into the following sub-categories:
 - tick
 - timer
 - RCU
 - MISC
 - WQ
 - kthread
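For illustration, this is roughly how an isolated partition is created through the cgroup v2 interface today (the cgroup name and CPU list below are arbitrary examples):

  # cd /sys/fs/cgroup
  # echo +cpuset > cgroup.subtree_control
  # mkdir rt-part
  # echo 2-5 > rt-part/cpuset.cpus
  # echo isolated > rt-part/cpuset.cpus.partition

With this series applied, a successful write of "isolated" would also remove those CPUs from the relevant housekeeping cpumasks, instead of only affecting sched domains and unbound workqueues.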
The rcu_nocbs parameter is actually a subset of nohz_full focusing just on the RCU part of the kernel noise. The WQ part is already handled by the current cpuset code.
This series focuses on the tick and RCU parts of the kernel noise by actively changing their internal data structures to track changes in the list of isolated CPUs used by cpuset isolated partitions.
The dynamic update of the housekeeping CPU lists will also affect the other noise sources that reference those lists at run time.
The pending patch series on timer migration[1], when properly integrated, will support the timer part too.
The CPU hotplug functionality of the Linux kernel is used to facilitate the runtime change of the nohz_full isolated CPUs with minimal code changes. The CPUs that need to be switched from non-isolated to isolated, or vice versa, are first brought offline, the necessary changes are made, and the CPUs are then brought back online.
The use of CPU hotplug, however, does have the drawback of freezing all the other CPUs during part of the offlining process via the stop machine feature of the kernel. That will cause noticeable latency spikes in other running applications, which may be significant to sensitive applications running on isolated CPUs in other isolated partitions at the time. Hopefully we can find a way to solve this problem in the future.
One possible workaround for this is to reserve a set of nohz_full isolated CPUs at boot time using the nohz_full boot command parameter. Bringing those reserved nohz_full CPUs into and out of isolated partitions will not invoke CPU hotplug and hence will not cause unexpected latency spikes. These reserved CPUs will only be needed if there are other existing isolated partitions running critical applications at the time a new isolated partition needs to be created.
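As an illustrative example (the CPU numbers are hypothetical), a system could reserve a pool of CPUs at boot with:

  nohz_full=8-15

CPUs 8-15 could then be moved into and out of isolated partitions later without going through the CPU hotplug offline/online cycle, while isolating any CPU outside that pool would still incur it.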
Patches 1-4 update the CPU isolation code in kernel/sched/isolation.c to enable dynamic update of the housekeeping cpumasks.
Patch 5 introduces a new cpuhp_offline_cb() API for shutting down a given set of CPUs, running the given callback function and then bringing those CPUs back online again. This new API blocks any incoming hotplug events from interfering with the operation.
Patches 6-9 update the cpuset partition code to use the new cpuhp API to shut down the affected CPUs, make changes to the housekeeping cpumasks and then bring those CPUs back online afterward.
Patch 10 works around an issue in the DL server code that blocks the hotplug operation under certain configurations.
Patches 11-14 update the timer tick and related code to enable proper updates to the set of CPUs requiring nohz_full dynticks support.
Patch 15 enables runtime modification of the set of isolated CPUs requiring RCU NO-CB CPU support with minor changes to the RCU code.
Patches 16-18 include other miscellaneous updates to the cpuset code and documentation.
This patch series is applied on top of some other cpuset patches[2] posted upstream recently.
[1] https://lore.kernel.org/lkml/20250806093855.86469-1-gmonaco@redhat.com/
[2] https://lore.kernel.org/lkml/20250806172430.1155133-1-longman@redhat.com/
Waiman Long (18):
  sched/isolation: Enable runtime update of housekeeping cpumasks
  sched/isolation: Call sched_tick_offload_init() when HK_FLAG_KERNEL_NOISE is first set
  sched/isolation: Use RCU to delay successive housekeeping cpumask updates
  sched/isolation: Add a debugfs file to dump housekeeping cpumasks
  cpu/hotplug: Add a new cpuhp_offline_cb() API
  cgroup/cpuset: Introduce a new top level isolcpus_update_mutex
  cgroup/cpuset: Allow overwriting HK_TYPE_DOMAIN housekeeping cpumask
  cgroup/cpuset: Use CPU hotplug to enable runtime nohz_full modification
  cgroup/cpuset: Revert "Include isolated cpuset CPUs in cpu_is_isolated() check"
  sched/core: Ignore DL BW deactivation error if in cpuhp_offline_cb_mode
  tick/nohz: Make nohz_full parameter optional
  tick/nohz: Introduce tick_nohz_full_update_cpus() to update tick_nohz_full_mask
  tick/nohz: Allow runtime changes in full dynticks CPUs
  tick: Pass timer tick job to an online HK CPU in tick_cpu_dying()
  cgroup/cpuset: Enable RCU NO-CB CPU offloading of newly isolated CPUs
  cgroup/cpuset: Don't set have_boot_nohz_full without any boot time nohz_full CPU
  cgroup/cpuset: Documentation updates & don't use CPU 0 for isolated partition
  cgroup/cpuset: Add pr_debug() statements for cpuhp_offline_cb() call
 Documentation/admin-guide/cgroup-v2.rst        |  33 +-
 .../admin-guide/kernel-parameters.txt          |  19 +-
 include/linux/context_tracking.h               |   8 +-
 include/linux/cpuhplock.h                      |   9 +
 include/linux/cpuset.h                         |   6 -
 include/linux/rcupdate.h                       |   2 +
 include/linux/sched/isolation.h                |   9 +-
 include/linux/tick.h                           |   2 +
 kernel/cgroup/cpuset.c                         | 344 ++++++++++++------
 kernel/context_tracking.c                      |  21 +-
 kernel/cpu.c                                   |  47 +++
 kernel/rcu/tree_nocb.h                         |   7 +-
 kernel/sched/core.c                            |   8 +-
 kernel/sched/debug.c                           |  32 ++
 kernel/sched/isolation.c                       | 151 +++++++-
 kernel/sched/sched.h                           |   2 +-
 kernel/time/tick-common.c                      |  15 +-
 kernel/time/tick-sched.c                       |  24 +-
 .../selftests/cgroup/test_cpuset_prs.sh        |  15 +-
 19 files changed, 583 insertions(+), 171 deletions(-)
The housekeeping CPU masks, set up by the "isolcpus" and "nohz_full" boot command line options, are used at boot time to exclude selected CPUs from running some kernel background processes so as to minimize disturbance to latency sensitive userspace applications. Some of the housekeeping CPU masks are also checked at run time to avoid using those isolated CPUs.
The cpuset subsystem is now able to dynamically create a set of isolated CPUs to be used in isolated cpuset partitions. The long term goal is to make the degree of isolation as close as possible to what can be done statically using those boot command line options.
This patch is a step in that direction by providing a new housekeeping_exclude_cpumask() API to exclude only the given cpumask from the housekeeping cpumasks. An existing boot time "isolcpus" or "nohz_full" cpumask setup, if present, can be overwritten.
Two sets of cpumasks are now kept internally. One set is used by the callers while the other set is being updated, before the new set is atomically switched in.
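As a rough usage sketch (the mask name below is illustrative and error handling is elided), a caller excludes a set of CPUs from selected housekeeping types, and can pass NULL to undo the exclusion:

	/* Exclude the CPUs in isolated_mask from the domain and
	 * kernel-noise housekeeping cpumasks.
	 */
	ret = housekeeping_exclude_cpumask(isolated_mask,
			BIT(HK_TYPE_DOMAIN) | BIT(HK_TYPE_KERNEL_NOISE));

	/* Later: no CPU excluded, all possible CPUs do housekeeping again */
	ret = housekeeping_exclude_cpumask(NULL,
			BIT(HK_TYPE_DOMAIN) | BIT(HK_TYPE_KERNEL_NOISE));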
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/sched/isolation.h |  6 +++
 kernel/sched/isolation.c        | 95 +++++++++++++++++++++++++++++----
 2 files changed, 91 insertions(+), 10 deletions(-)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index d8501f4709b5..af38d21d0d00 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -32,6 +32,7 @@ extern bool housekeeping_enabled(enum hk_type type);
 extern void housekeeping_affine(struct task_struct *t, enum hk_type type);
 extern bool housekeeping_test_cpu(int cpu, enum hk_type type);
 extern void __init housekeeping_init(void);
+extern int housekeeping_exclude_cpumask(struct cpumask *cpumask, unsigned long flags);

 #else

@@ -59,6 +60,11 @@ static inline bool housekeeping_test_cpu(int cpu, enum hk_type type)
 }
 static inline void housekeeping_init(void) { }
+
+static inline int housekeeping_exclude_cpumask(struct cpumask *cpumask, unsigned long flags)
+{
+	return -EOPNOTSUPP;
+}
 #endif /* CONFIG_CPU_ISOLATION */
 static inline bool housekeeping_cpu(int cpu, enum hk_type type)

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index a4cf17b1fab0..3fb0e8ccce26 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -19,8 +19,16 @@ enum hk_flags {
 DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
 EXPORT_SYMBOL_GPL(housekeeping_overridden);

+/*
+ * The housekeeping cpumasks can now be dynamically updated at run time.
+ * Two sets of cpumasks are kept. One set can be used while the other set is
+ * being updated concurrently.
+ */
+static DEFINE_RAW_SPINLOCK(cpumask_lock);
 struct housekeeping {
-	cpumask_var_t cpumasks[HK_TYPE_MAX];
+	struct cpumask *cpumask_ptrs[HK_TYPE_MAX];
+	cpumask_var_t cpumasks[HK_TYPE_MAX][2];
+	unsigned int seq_nrs[HK_TYPE_MAX];
 	unsigned long flags;
 };
@@ -38,11 +46,13 @@ int housekeeping_any_cpu(enum hk_type type)
 	if (static_branch_unlikely(&housekeeping_overridden)) {
 		if (housekeeping.flags & BIT(type)) {
-			cpu = sched_numa_find_closest(housekeeping.cpumasks[type], smp_processor_id());
+			struct cpumask *cpumask = READ_ONCE(housekeeping.cpumask_ptrs[type]);
+
+			cpu = sched_numa_find_closest(cpumask, smp_processor_id());
 			if (cpu < nr_cpu_ids)
 				return cpu;

-			cpu = cpumask_any_and_distribute(housekeeping.cpumasks[type], cpu_online_mask);
+			cpu = cpumask_any_and_distribute(cpumask, cpu_online_mask);
 			if (likely(cpu < nr_cpu_ids))
 				return cpu;
 			/*
@@ -62,7 +72,7 @@ const struct cpumask *housekeeping_cpumask(enum hk_type type)
 {
 	if (static_branch_unlikely(&housekeeping_overridden))
 		if (housekeeping.flags & BIT(type))
-			return housekeeping.cpumasks[type];
+			return READ_ONCE(housekeeping.cpumask_ptrs[type]);
 	return cpu_possible_mask;
 }
 EXPORT_SYMBOL_GPL(housekeeping_cpumask);

@@ -71,7 +81,7 @@ void housekeeping_affine(struct task_struct *t, enum hk_type type)
 {
 	if (static_branch_unlikely(&housekeeping_overridden))
 		if (housekeeping.flags & BIT(type))
-			set_cpus_allowed_ptr(t, housekeeping.cpumasks[type]);
+			set_cpus_allowed_ptr(t, READ_ONCE(housekeeping.cpumask_ptrs[type]));
 }
 EXPORT_SYMBOL_GPL(housekeeping_affine);

@@ -79,7 +89,7 @@ bool housekeeping_test_cpu(int cpu, enum hk_type type)
 {
 	if (static_branch_unlikely(&housekeeping_overridden))
 		if (housekeeping.flags & BIT(type))
-			return cpumask_test_cpu(cpu, housekeeping.cpumasks[type]);
+			return cpumask_test_cpu(cpu, READ_ONCE(housekeeping.cpumask_ptrs[type]));
 	return true;
 }
 EXPORT_SYMBOL_GPL(housekeeping_test_cpu);

@@ -98,7 +108,7 @@ void __init housekeeping_init(void)

 	for_each_set_bit(type, &housekeeping.flags, HK_TYPE_MAX) {
 		/* We need at least one CPU to handle housekeeping work */
-		WARN_ON_ONCE(cpumask_empty(housekeeping.cpumasks[type]));
+		WARN_ON_ONCE(cpumask_empty(housekeeping.cpumask_ptrs[type]));
 	}
 }

@@ -106,8 +116,10 @@ static void __init housekeeping_setup_type(enum hk_type type,
 					   cpumask_var_t housekeeping_staging)
 {
-	alloc_bootmem_cpumask_var(&housekeeping.cpumasks[type]);
-	cpumask_copy(housekeeping.cpumasks[type],
+	alloc_bootmem_cpumask_var(&housekeeping.cpumasks[type][0]);
+	alloc_bootmem_cpumask_var(&housekeeping.cpumasks[type][1]);
+	housekeeping.cpumask_ptrs[type] = housekeeping.cpumasks[type][0];
+	cpumask_copy(housekeeping.cpumask_ptrs[type],
 		     housekeeping_staging);
 }
@@ -161,7 +173,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
 	for_each_set_bit(type, &iter_flags, HK_TYPE_MAX) {
 		if (!cpumask_equal(housekeeping_staging,
-				   housekeeping.cpumasks[type])) {
+				   housekeeping.cpumask_ptrs[type])) {
 			pr_warn("Housekeeping: nohz_full= must match isolcpus=\n");
 			goto free_housekeeping_staging;
 		}
@@ -251,3 +263,66 @@ static int __init housekeeping_isolcpus_setup(char *str)
 	return housekeeping_setup(str, flags);
 }
 __setup("isolcpus=", housekeeping_isolcpus_setup);
+
+/**
+ * housekeeping_exclude_cpumask - Update housekeeping cpumasks to exclude only the given cpumask
+ * @cpumask: new cpumask to be excluded from housekeeping cpumasks
+ * @hk_flags: bit mask of housekeeping types to be excluded
+ * Return: 0 if successful, error code if an error happens.
+ *
+ * Exclude the given cpumask from the housekeeping cpumasks associated with
+ * the given hk_flags. If the given cpumask is NULL, no CPU will need to be
+ * excluded.
+ */
+int housekeeping_exclude_cpumask(struct cpumask *cpumask, unsigned long hk_flags)
+{
+	unsigned long type;
+
+#ifdef CONFIG_CPUMASK_OFFSTACK
+	/*
+	 * Pre-allocate cpumasks, if needed
+	 */
+	for_each_set_bit(type, &hk_flags, HK_TYPE_MAX) {
+		cpumask_var_t mask0, mask1;
+
+		if (housekeeping.cpumask_ptrs[type])
+			continue;
+		if (!zalloc_cpumask_var(&mask0, GFP_KERNEL) ||
+		    !zalloc_cpumask_var(&mask1, GFP_KERNEL))
+			return -ENOMEM;
+
+		/*
+		 * cpumasks[type][] should be NULL, still do a swap & free
+		 * dance just in case the cpumasks are allocated but
+		 * cpumask_ptrs not setup somehow.
+		 */
+		mask0 = xchg(&housekeeping.cpumasks[type][0], mask0);
+		mask1 = xchg(&housekeeping.cpumasks[type][1], mask1);
+		free_cpumask_var(mask0);
+		free_cpumask_var(mask1);
+	}
+#endif
+
+	raw_spin_lock(&cpumask_lock);
+
+	for_each_set_bit(type, &hk_flags, HK_TYPE_MAX) {
+		int idx = ++housekeeping.seq_nrs[type] & 1;
+		struct cpumask *dst_cpumask = housekeeping.cpumasks[type][idx];
+
+		if (!cpumask) {
+			cpumask_copy(dst_cpumask, cpu_possible_mask);
+			housekeeping.flags &= ~BIT(type);
+		} else {
+			cpumask_andnot(dst_cpumask, cpu_possible_mask, cpumask);
+			housekeeping.flags |= BIT(type);
+		}
+		WRITE_ONCE(housekeeping.cpumask_ptrs[type], dst_cpumask);
+	}
+	raw_spin_unlock(&cpumask_lock);
+
+	if (!housekeeping.flags && static_key_enabled(&housekeeping_overridden))
+		static_key_disable(&housekeeping_overridden.key);
+	else if (housekeeping.flags && !static_key_enabled(&housekeeping_overridden))
+		static_key_enable(&housekeeping_overridden.key);
+	return 0;
+}
The sched_tick_offload_init() function is called at boot time whenever "nohz_full" is set. Now that the housekeeping cpumasks can be updated at run time without the corresponding "nohz_full" kernel parameter, sched_tick_offload_init() must also be callable at run time to allow tick offloading. Remove the __init attribute from sched_tick_offload_init() and call it when the HK_FLAG_KERNEL_NOISE flag is first set.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/core.c      |  2 +-
 kernel/sched/isolation.c | 10 +++++++++-
 kernel/sched/sched.h     |  2 +-
 3 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index be00629f0ba4..9f02c047e25b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5783,7 +5783,7 @@ static void sched_tick_stop(int cpu)
 }
 #endif /* CONFIG_HOTPLUG_CPU */

-int __init sched_tick_offload_init(void)
+int sched_tick_offload_init(void)
 {
 	tick_work_cpu = alloc_percpu(struct tick_work);
 	BUG_ON(!tick_work_cpu);

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 3fb0e8ccce26..ee396ae13719 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -33,6 +33,7 @@ struct housekeeping {
 };

 static struct housekeeping housekeeping;
+static bool sched_tick_offload_inited;

 bool housekeeping_enabled(enum hk_type type)
 {
@@ -103,8 +104,10 @@ void __init housekeeping_init(void)
static_branch_enable(&housekeeping_overridden);
-	if (housekeeping.flags & HK_FLAG_KERNEL_NOISE)
+	if (housekeeping.flags & HK_FLAG_KERNEL_NOISE) {
 		sched_tick_offload_init();
+		sched_tick_offload_inited = true;
+	}

 	for_each_set_bit(type, &housekeeping.flags, HK_TYPE_MAX) {
 		/* We need at least one CPU to handle housekeeping work */
@@ -324,5 +327,10 @@ int housekeeping_exclude_cpumask(struct cpumask *cpumask, unsigned long hk_flags)
 		static_key_disable(&housekeeping_overridden.key);
 	else if (housekeeping.flags && !static_key_enabled(&housekeeping_overridden))
 		static_key_enable(&housekeeping_overridden.key);
+
+	if ((housekeeping.flags & HK_FLAG_KERNEL_NOISE) && !sched_tick_offload_inited) {
+		sched_tick_offload_init();
+		sched_tick_offload_inited = true;
+	}
 	return 0;
 }

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index be9745d104f7..d4676305e099 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2671,7 +2671,7 @@ extern void post_init_entity_util_avg(struct task_struct *p);

 #ifdef CONFIG_NO_HZ_FULL
 extern bool sched_can_stop_tick(struct rq *rq);
-extern int __init sched_tick_offload_init(void);
+extern int sched_tick_offload_init(void);

 /*
  * Tick may be needed by tasks in the runqueue depending on their policy and
Even though there are two separate sets of housekeeping cpumasks for access and update, it is possible that the set being updated is still in use by callers of the housekeeping functions, resulting in the use of an intermediate cpumask that is neither the old nor the new one.
To reduce the chance of this, introduce a delay between successive housekeeping cpumask updates. One simple way is to use an RCU grace period as the delay. Callers of the housekeeping APIs can optionally hold rcu_read_lock() to eliminate the chance of seeing an intermediate housekeeping cpumask.
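A reader that wants to be sure it never sees an intermediate cpumask could follow this pattern (a sketch only):

	const struct cpumask *mask;

	rcu_read_lock();	/* pin the currently published cpumask */
	mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
	/* use mask; the updater waits a grace period before reusing it */
	rcu_read_unlock();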
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/isolation.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index ee396ae13719..f26708667754 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -23,6 +23,9 @@ EXPORT_SYMBOL_GPL(housekeeping_overridden);
  * The housekeeping cpumasks can now be dynamically updated at run time.
  * Two sets of cpumasks are kept. One set can be used while the other set is
  * being updated concurrently.
+ *
+ * rcu_read_lock() can optionally be held by housekeeping API callers to
+ * ensure stability of the cpumasks.
  */
 static DEFINE_RAW_SPINLOCK(cpumask_lock);
 struct housekeeping {
@@ -34,6 +37,8 @@ struct housekeeping {

 static struct housekeeping housekeeping;
 static bool sched_tick_offload_inited;
+static struct rcu_head rcu_gp[HK_TYPE_MAX];
+static unsigned long update_flags;

 bool housekeeping_enabled(enum hk_type type)
 {
@@ -267,6 +272,18 @@ static int __init housekeeping_isolcpus_setup(char *str)
 }
 __setup("isolcpus=", housekeeping_isolcpus_setup);

+/*
+ * Bits in update_flags can only be turned on with cpumask_lock held and
+ * are cleared by this RCU callback function.
+ */
+static void rcu_gp_end(struct rcu_head *rcu)
+{
+	int type = rcu - rcu_gp;
+
+	/* Atomically clear the corresponding flag bit */
+	clear_bit(type, &update_flags);
+}
+
 /**
  * housekeeping_exclude_cpumask - Update housekeeping cpumasks to exclude only the given cpumask
  * @cpumask: new cpumask to be excluded from housekeeping cpumasks
@@ -306,8 +323,21 @@ int housekeeping_exclude_cpumask(struct cpumask *cpumask, unsigned long hk_flags)
 	}
 #endif

+retry:
+	/*
+	 * If the RCU grace period for the previous update with conflicting
+	 * flag bits hasn't been completed yet, we have to wait for it.
+	 */
+	while (READ_ONCE(update_flags) & hk_flags)
+		synchronize_rcu();
+
 	raw_spin_lock(&cpumask_lock);

+	if (READ_ONCE(update_flags) & hk_flags) {
+		raw_spin_unlock(&cpumask_lock);
+		goto retry;
+	}
+
 	for_each_set_bit(type, &hk_flags, HK_TYPE_MAX) {
 		int idx = ++housekeeping.seq_nrs[type] & 1;
 		struct cpumask *dst_cpumask = housekeeping.cpumasks[type][idx];
@@ -320,8 +350,11 @@ int housekeeping_exclude_cpumask(struct cpumask *cpumask, unsigned long hk_flags)
 			housekeeping.flags |= BIT(type);
 		}
 		WRITE_ONCE(housekeeping.cpumask_ptrs[type], dst_cpumask);
+		set_bit(type, &update_flags);
 	}
 	raw_spin_unlock(&cpumask_lock);
+	for_each_set_bit(type, &hk_flags, HK_TYPE_MAX)
+		call_rcu(&rcu_gp[type], rcu_gp_end);

 	if (!housekeeping.flags && static_key_enabled(&housekeeping_overridden))
 		static_key_disable(&housekeeping_overridden.key);
As housekeeping cpumasks can now be modified at run time, we need a way to examine their current values to see if they meet expectations. Add a new sched debugfs file, "housekeeping_cpumasks", to dump the current values.
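Reading the new file would then produce output along these lines (the CPU lists shown are illustrative):

  # cat /sys/kernel/debug/sched/housekeeping_cpumasks
  domain: 0-1,6-7
  managed_irq: 0-7
  nohz_full: 0-1,6-7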
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/debug.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 3f06ab84d53f..ba8f0334c15e 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -490,6 +490,35 @@ static void debugfs_fair_server_init(void)
 	}
 }

+#ifdef CONFIG_CPU_ISOLATION
+static int hk_cpumasks_show(struct seq_file *m, void *v)
+{
+	static const char * const hk_type_name[HK_TYPE_MAX] = {
+		[HK_TYPE_DOMAIN]	= "domain",
+		[HK_TYPE_MANAGED_IRQ]	= "managed_irq",
+		[HK_TYPE_KERNEL_NOISE]	= "nohz_full"
+	};
+	int type;
+
+	for (type = 0; type < HK_TYPE_MAX; type++)
+		seq_printf(m, "%s: %*pbl\n", hk_type_name[type],
+			   cpumask_pr_args(housekeeping_cpumask(type)));
+	return 0;
+}
+
+static int hk_cpumasks_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, hk_cpumasks_show, NULL);
+}
+
+static const struct file_operations hk_cpumasks_fops = {
+	.open		= hk_cpumasks_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+#endif
+
 static __init int sched_init_debug(void)
 {
 	struct dentry __maybe_unused *numa;
@@ -525,6 +554,9 @@ static __init int sched_init_debug(void)
 	debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
 #endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_CPU_ISOLATION
+	debugfs_create_file("housekeeping_cpumasks", 0444, debugfs_sched, NULL, &hk_cpumasks_fops);
+#endif
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
debugfs_fair_server_init();
Add a new cpuhp_offline_cb() API that allows us to offline a set of CPUs one-by-one, run the given callback function and then bring those CPUs back online again, while inhibiting any concurrent CPU hotplug operations.
This new API can be used to enable runtime adjustment of the nohz_full and isolcpus boot command line options. A new cpuhp_offline_cb_mode flag is also added to signal that the system is in this offline-callback transient state so that some hotplug operations can be optimized away if we choose to do so.
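A minimal usage sketch of the new API (the callback and mask names below are made up for illustration):

	/* Invoked while all CPUs in the mask are offline */
	static int isolation_update_cb(void *arg)
	{
		struct cpumask *mask = arg;

		/* ... update housekeeping state for the CPUs in mask ... */
		return 0;
	}

	/* Offline the CPUs, run the callback, then online them again */
	ret = cpuhp_offline_cb(update_mask, isolation_update_cb, update_mask);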
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/cpuhplock.h |  9 ++++++++
 kernel/cpu.c              | 47 +++++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+)
diff --git a/include/linux/cpuhplock.h b/include/linux/cpuhplock.h
index f7aa20f62b87..b42b81361abc 100644
--- a/include/linux/cpuhplock.h
+++ b/include/linux/cpuhplock.h
@@ -9,7 +9,9 @@

 #include <linux/cleanup.h>
 #include <linux/errno.h>
+#include <linux/cpumask_types.h>

+typedef int (*cpuhp_cb_t)(void *arg);
 struct device;

 extern int lockdep_is_cpus_held(void);
@@ -28,6 +30,8 @@ void clear_tasks_mm_cpumask(int cpu);
 int remove_cpu(unsigned int cpu);
 int cpu_device_down(struct device *dev);
 void smp_shutdown_nonboot_cpus(unsigned int primary_cpu);
+int cpuhp_offline_cb(struct cpumask *mask, cpuhp_cb_t func, void *arg);
+extern bool cpuhp_offline_cb_mode;
#else /* CONFIG_HOTPLUG_CPU */
@@ -42,6 +46,11 @@ static inline void cpu_hotplug_disable(void) { }
 static inline void cpu_hotplug_enable(void) { }
 static inline int remove_cpu(unsigned int cpu) { return -EPERM; }
 static inline void smp_shutdown_nonboot_cpus(unsigned int primary_cpu) { }
+static inline int cpuhp_offline_cb(struct cpumask *mask, cpuhp_cb_t func, void *arg)
+{
+	return -EPERM;
+}
+#define cpuhp_offline_cb_mode	false
 #endif /* !CONFIG_HOTPLUG_CPU */

 DEFINE_LOCK_GUARD_0(cpus_read_lock, cpus_read_lock(), cpus_read_unlock())

diff --git a/kernel/cpu.c b/kernel/cpu.c
index faf0f23fc5d8..b6364a1950b1 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1534,6 +1534,53 @@ int remove_cpu(unsigned int cpu)
 }
 EXPORT_SYMBOL_GPL(remove_cpu);

+bool cpuhp_offline_cb_mode;
+
+/**
+ * cpuhp_offline_cb - offline CPUs, invoke callback function & online CPUs afterward
+ * @mask: A mask of CPUs to be taken offline and then online
+ * @func: A callback function to be invoked while the given CPUs are offline
+ * @arg: Argument to be passed back to the callback function
+ * Return: 0 if successful, an error code otherwise
+ */
+int cpuhp_offline_cb(struct cpumask *mask, cpuhp_cb_t func, void *arg)
+{
+	int cpu, ret, ret2 = 0;
+
+	if (WARN_ON_ONCE(cpumask_empty(mask)))
+		return -EINVAL;
+
+	lock_device_hotplug();
+	cpuhp_offline_cb_mode = true;
+	for_each_cpu(cpu, mask) {
+		ret = device_offline(get_cpu_device(cpu));
+		if (unlikely(ret)) {
+			int cpu2;
+
+			/* Online the offlined CPUs before returning */
+			for_each_cpu(cpu2, mask) {
+				if (cpu2 == cpu)
+					break;
+				device_online(get_cpu_device(cpu2));
+			}
+			goto out;
+		}
+	}
+	ret = func(arg);
+
+	/* Bring CPUs back online */
+	for_each_cpu(cpu, mask) {
+		int ret3 = device_online(get_cpu_device(cpu));
+
+		if (ret3 && !ret2)
+			ret2 = ret3;
+	}
+out:
+	cpuhp_offline_cb_mode = false;
+	unlock_device_hotplug();
+	return ret ? ret : (ret2 ? ret2 : 0);
+}
+
 void smp_shutdown_nonboot_cpus(unsigned int primary_cpu)
 {
 	unsigned int cpu;
The current cpuset partition code is able to dynamically update the sched domains of a running system to perform what is essentially the "isolcpus=domain,..." boot command line feature at run time.
To enable runtime modification of nohz_full, we will have to make use of the CPU hotplug functionality to facilitate the proper addition or removal of nohz_full CPUs. In other words, we can't hold the cpu_hotplug_lock while doing so. Given the current lock ordering, we will need to introduce a new top level mutex to ensure proper mutual exclusion whenever cpuset states that may require the use of CPU hotplug are updated. This patch introduces a new top level isolcpus_update_mutex for that purpose. This new mutex is acquired whenever the cpuset partition states or the set of isolated CPUs may have to be changed.
update_unbound_workqueue_cpumask() is now renamed to update_isolation_cpumasks() and moved outside of the cpu_hotplug_lock critical sections to enable its future extension to invoke CPU hotplug.
A new global isolcpus_update_state structure is added to track whether update_isolation_cpumasks() needs to be invoked, so the existing partition_xcpus_add/del() functions and their callers can now be simplified.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 149 ++++++++++++++++++++++++-----------------
 1 file changed, 86 insertions(+), 63 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 27adb04df675..2190efd33efb 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -215,29 +215,39 @@ static struct cpuset top_cpuset = {
 };

 /*
- * There are two global locks guarding cpuset structures - cpuset_mutex and
- * callback_lock. The cpuset code uses only cpuset_mutex. Other kernel
- * subsystems can use cpuset_lock()/cpuset_unlock() to prevent change to cpuset
- * structures. Note that cpuset_mutex needs to be a mutex as it is used in
- * paths that rely on priority inheritance (e.g. scheduler - on RT) for
- * correctness.
+ * CPUSET Locking Convention
+ * -------------------------
  *
- * A task must hold both locks to modify cpusets. If a task holds
- * cpuset_mutex, it blocks others, ensuring that it is the only task able to
- * also acquire callback_lock and be able to modify cpusets. It can perform
- * various checks on the cpuset structure first, knowing nothing will change.
- * It can also allocate memory while just holding cpuset_mutex. While it is
- * performing these checks, various callback routines can briefly acquire
- * callback_lock to query cpusets. Once it is ready to make the changes, it
- * takes callback_lock, blocking everyone else.
+ * Below are the global locks guarding cpuset structures, in lock
+ * acquisition order:
+ *  - isolcpus_update_mutex
+ *  - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
+ *  - cpuset_mutex
+ *  - callback_lock (raw spinlock)
  *
- * Calls to the kernel memory allocator can not be made while holding
- * callback_lock, as that would risk double tripping on callback_lock
- * from one of the callbacks into the cpuset code from within
- * __alloc_pages().
+ * The first lock, isolcpus_update_mutex, should only be held if the existing
+ * set of isolated CPUs (in isolated partitions) or any of the partition
+ * states may be changed. Otherwise, it can be skipped. This is used to
+ * prevent concurrent updates to the set of isolated CPUs.
  *
- * If a task is only holding callback_lock, then it has read-only
- * access to cpusets.
+ * A task must hold all the remaining three locks to modify externally visible
+ * or used fields of cpusets, though some of the internally used cpuset fields
+ * can be modified by holding cpu_hotplug_lock and cpuset_mutex only. If only
+ * reliable read access of the externally used fields is needed, a task can
+ * hold either cpuset_mutex or callback_lock.
+ *
+ * If a task holds cpu_hotplug_lock and cpuset_mutex, it blocks others,
+ * ensuring that it is the only task able to also acquire callback_lock and
+ * be able to modify cpusets. It can perform various checks on the cpuset
+ * structure first, knowing nothing will change. It can also allocate memory
+ * without holding callback_lock. While it is performing these checks, various
+ * callback routines can briefly acquire callback_lock to query cpusets. Once
+ * it is ready to make the changes, it takes callback_lock, blocking everyone
+ * else.
+ *
+ * Calls to the kernel memory allocator cannot be made while holding
+ * callback_lock, which is a spinlock, as the memory allocator may sleep or
+ * call back into cpuset code and acquire callback_lock.
  *
  * Now, the task_struct fields mems_allowed and mempolicy may be changed
  * by other task, we use alloc_lock in the task_struct fields to protect
@@ -248,6 +258,7 @@ static struct cpuset top_cpuset = {
  * cpumasks and nodemasks.
  */

+static DEFINE_MUTEX(isolcpus_update_mutex);
 static DEFINE_MUTEX(cpuset_mutex);

 void cpuset_lock(void)
 {
@@ -272,6 +283,17 @@ void cpuset_callback_unlock_irq(void)
 	spin_unlock_irq(&callback_lock);
 }

+/*
+ * Isolcpus update state (protected by isolcpus_update_mutex)
+ *
+ * It contains data related to updating the isolated CPUs configuration in
+ * isolated partitions.
+ */
+static struct {
+	bool updating;		/* Isolcpus updating in progress */
+	cpumask_var_t cpus;	/* CPUs to be updated */
+} isolcpus_update_state;
+
 static struct workqueue_struct *cpuset_migrate_mm_wq;

 static DECLARE_WAIT_QUEUE_HEAD(cpuset_attach_wq);
@@ -1273,6 +1295,9 @@ static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus)
 		cpumask_or(isolated_cpus, isolated_cpus, xcpus);
 	else
 		cpumask_andnot(isolated_cpus, isolated_cpus, xcpus);
+
+	isolcpus_update_state.updating = true;
+	cpumask_or(isolcpus_update_state.cpus, isolcpus_update_state.cpus, xcpus);
 }
 /*
@@ -1280,31 +1305,26 @@ static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus)
  * @new_prs: new partition_root_state
  * @parent: parent cpuset
  * @xcpus: exclusive CPUs to be added
- * Return: true if isolated_cpus modified, false otherwise
  *
  * Remote partition if parent == NULL
  */
-static bool partition_xcpus_add(int new_prs, struct cpuset *parent,
+static void partition_xcpus_add(int new_prs, struct cpuset *parent,
 				struct cpumask *xcpus)
 {
-	bool isolcpus_updated;
-
 	WARN_ON_ONCE(new_prs < 0);
 	lockdep_assert_held(&callback_lock);
 	if (!parent)
 		parent = &top_cpuset;

-
 	if (parent == &top_cpuset)
 		cpumask_or(subpartitions_cpus, subpartitions_cpus, xcpus);

-	isolcpus_updated = (new_prs != parent->partition_root_state);
-	if (isolcpus_updated)
+	if (new_prs != parent->partition_root_state)
 		isolated_cpus_update(parent->partition_root_state, new_prs,
 				     xcpus);

 	cpumask_andnot(parent->effective_cpus, parent->effective_cpus, xcpus);
-	return isolcpus_updated;
+	return;
 }

 /*
@@ -1312,15 +1332,12 @@ static bool partition_xcpus_add(int new_prs, struct cpuset *parent,
  * @old_prs: old partition_root_state
  * @parent: parent cpuset
  * @xcpus: exclusive CPUs to be removed
- * Return: true if isolated_cpus modified, false otherwise
  *
  * Remote partition if parent == NULL
  */
-static bool partition_xcpus_del(int old_prs, struct cpuset *parent,
+static void partition_xcpus_del(int old_prs, struct cpuset *parent,
 				struct cpumask *xcpus)
 {
-	bool isolcpus_updated;
-
 	WARN_ON_ONCE(old_prs < 0);
 	lockdep_assert_held(&callback_lock);
 	if (!parent)
@@ -1329,27 +1346,33 @@ static bool partition_xcpus_del(int old_prs, struct cpuset *parent,
 	if (parent == &top_cpuset)
 		cpumask_andnot(subpartitions_cpus, subpartitions_cpus, xcpus);

-	isolcpus_updated = (old_prs != parent->partition_root_state);
-	if (isolcpus_updated)
+	if (old_prs != parent->partition_root_state)
 		isolated_cpus_update(old_prs, parent->partition_root_state,
 				     xcpus);

 	cpumask_and(xcpus, xcpus, cpu_active_mask);
 	cpumask_or(parent->effective_cpus, parent->effective_cpus, xcpus);
-	return isolcpus_updated;
+	return;
 }

-static void update_unbound_workqueue_cpumask(bool isolcpus_updated)
+/**
+ * update_isolation_cpumasks - Update external isolation CPU masks
+ *
+ * The following external CPU masks will be updated if necessary:
+ *  - workqueue unbound cpumask
+ */
+static void update_isolation_cpumasks(void)
 {
 	int ret;

-	lockdep_assert_cpus_held();
-
-	if (!isolcpus_updated)
+	if (!isolcpus_update_state.updating)
 		return;

 	ret = workqueue_unbound_exclude_cpumask(isolated_cpus);
 	WARN_ON_ONCE(ret < 0);
+
+	cpumask_clear(isolcpus_update_state.cpus);
+	isolcpus_update_state.updating = false;
 }
 /**
@@ -1441,8 +1464,6 @@ static inline bool is_local_partition(struct cpuset *cs)
 static int remote_partition_enable(struct cpuset *cs, int new_prs,
 				   struct tmpmasks *tmp)
 {
-	bool isolcpus_updated;
-
 	/*
 	 * The user must have sysadmin privilege.
 	 */
@@ -1466,11 +1487,10 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
 		return PERR_INVCPUS;

 	spin_lock_irq(&callback_lock);
-	isolcpus_updated = partition_xcpus_add(new_prs, NULL, tmp->new_cpus);
+	partition_xcpus_add(new_prs, NULL, tmp->new_cpus);
 	list_add(&cs->remote_sibling, &remote_children);
 	cpumask_copy(cs->effective_xcpus, tmp->new_cpus);
 	spin_unlock_irq(&callback_lock);
-	update_unbound_workqueue_cpumask(isolcpus_updated);
 	cpuset_force_rebuild();
 	cs->prs_err = 0;

@@ -1493,15 +1513,12 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
  */
 static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
 {
-	bool isolcpus_updated;
-
 	WARN_ON_ONCE(!is_remote_partition(cs));
 	WARN_ON_ONCE(!cpumask_subset(cs->effective_xcpus, subpartitions_cpus));

 	spin_lock_irq(&callback_lock);
 	list_del_init(&cs->remote_sibling);
-	isolcpus_updated = partition_xcpus_del(cs->partition_root_state,
-					       NULL, cs->effective_xcpus);
+	partition_xcpus_del(cs->partition_root_state, NULL, cs->effective_xcpus);
 	if (cs->prs_err)
 		cs->partition_root_state = -cs->partition_root_state;
 	else
@@ -1511,7 +1528,6 @@ static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
 	compute_effective_exclusive_cpumask(cs, NULL, NULL);
 	reset_partition_data(cs);
 	spin_unlock_irq(&callback_lock);
-	update_unbound_workqueue_cpumask(isolcpus_updated);
 	cpuset_force_rebuild();

 	/*
@@ -1536,7 +1552,6 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
 {
 	bool adding, deleting;
 	int prs = cs->partition_root_state;
-	int isolcpus_updated = 0;

 	if (WARN_ON_ONCE(!is_remote_partition(cs)))
 		return;
@@ -1569,9 +1584,9 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,

 	spin_lock_irq(&callback_lock);
 	if (adding)
-		isolcpus_updated += partition_xcpus_add(prs, NULL, tmp->addmask);
+		partition_xcpus_add(prs, NULL, tmp->addmask);
 	if (deleting)
-		isolcpus_updated += partition_xcpus_del(prs, NULL, tmp->delmask);
+		partition_xcpus_del(prs, NULL, tmp->delmask);
 	/*
 	 * Need to update effective_xcpus and exclusive_cpus now as
 	 * update_sibling_cpumasks() below may iterate back to the same cs.
@@ -1580,7 +1595,6 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
 	if (xcpus)
 		cpumask_copy(cs->exclusive_cpus, xcpus);
 	spin_unlock_irq(&callback_lock);
-	update_unbound_workqueue_cpumask(isolcpus_updated);
 	if (adding || deleting)
 		cpuset_force_rebuild();

@@ -1662,7 +1676,6 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 	int old_prs, new_prs;
 	int part_error = PERR_NONE;	/* Partition error? */
 	int subparts_delta = 0;
-	int isolcpus_updated = 0;
 	struct cpumask *xcpus = user_xcpus(cs);
 	bool nocpu;

@@ -1932,18 +1945,15 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 	 * and vice versa.
 	 */
 	if (adding)
-		isolcpus_updated += partition_xcpus_del(old_prs, parent,
-							tmp->addmask);
+		partition_xcpus_del(old_prs, parent, tmp->addmask);
 	if (deleting)
-		isolcpus_updated += partition_xcpus_add(new_prs, parent,
-							tmp->delmask);
+		partition_xcpus_add(new_prs, parent, tmp->delmask);

 	if (is_partition_valid(parent)) {
 		parent->nr_subparts += subparts_delta;
 		WARN_ON_ONCE(parent->nr_subparts < 0);
 	}
 	spin_unlock_irq(&callback_lock);
-	update_unbound_workqueue_cpumask(isolcpus_updated);

 	if ((old_prs != new_prs) && (cmd == partcmd_update))
 		update_partition_exclusive_flag(cs, new_prs);
@@ -2968,7 +2978,6 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	else if (isolcpus_updated)
 		isolated_cpus_update(old_prs, new_prs, cs->effective_xcpus);
 	spin_unlock_irq(&callback_lock);
-	update_unbound_workqueue_cpumask(isolcpus_updated);
 	/* Force update if switching back to member & update effective_xcpus */
 	update_cpumasks_hier(cs, &tmpmask, !new_prs);
@@ -3224,6 +3233,7 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	int retval = -ENODEV;

 	buf = strstrip(buf);
+	mutex_lock(&isolcpus_update_mutex);
 	cpus_read_lock();
 	mutex_lock(&cpuset_mutex);
 	if (!is_cpuset_online(cs))
@@ -3256,6 +3266,8 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 out_unlock:
 	mutex_unlock(&cpuset_mutex);
 	cpus_read_unlock();
+	update_isolation_cpumasks();
+	mutex_unlock(&isolcpus_update_mutex);
 	flush_workqueue(cpuset_migrate_mm_wq);
 	return retval ?: nbytes;
 }
@@ -3358,12 +3370,15 @@ static ssize_t cpuset_partition_write(struct kernfs_open_file *of, char *buf,
 	else
 		return -EINVAL;

+	mutex_lock(&isolcpus_update_mutex);
 	cpus_read_lock();
 	mutex_lock(&cpuset_mutex);
 	if (is_cpuset_online(cs))
 		retval = update_prstate(cs, val);
 	mutex_unlock(&cpuset_mutex);
 	cpus_read_unlock();
+	update_isolation_cpumasks();
+	mutex_unlock(&isolcpus_update_mutex);
 	return retval ?: nbytes;
 }

@@ -3586,15 +3601,22 @@ static void cpuset_css_killed(struct cgroup_subsys_state *css)
 {
 	struct cpuset *cs = css_cs(css);

+	mutex_lock(&isolcpus_update_mutex);
+	/*
+	 * Here the partition root state can't be changed by the user again.
+	 */
+	if (!is_partition_valid(cs))
+		goto out;
+
 	cpus_read_lock();
 	mutex_lock(&cpuset_mutex);
-	/* Reset valid partition back to member */
-	if (is_partition_valid(cs))
-		update_prstate(cs, PRS_MEMBER);
-
+	update_prstate(cs, PRS_MEMBER);
 	mutex_unlock(&cpuset_mutex);
 	cpus_read_unlock();
+	update_isolation_cpumasks();
+out:
+	mutex_unlock(&isolcpus_update_mutex);
}
@@ -3751,6 +3773,7 @@ int __init cpuset_init(void)
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.exclusive_cpus, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL));
+	BUG_ON(!zalloc_cpumask_var(&isolcpus_update_state.cpus, GFP_KERNEL));

 	cpumask_setall(top_cpuset.cpus_allowed);
 	nodes_setall(top_cpuset.mems_allowed);
Since housekeeping cpumasks were previously never modified when a cpuset partition was created, non-isolated partitions were disallowed from using any of the HK_TYPE_DOMAIN isolated CPUs. Now that housekeeping cpumasks are going to be modified at run time, allow the HK_TYPE_DOMAIN cpumask to be overwritten when an isolated partition is first created or when the creation of a non-isolated partition conflicts with the boot time HK_TYPE_DOMAIN isolated CPUs. The now-unnecessary checking code is removed. The doc file will be updated in a later patch.
On the other hand, there is still a latency spike problem when CPU hotplug is used to make the dynamically modified nohz_full HK_TYPE_KERNEL_NOISE cpumask function properly. So the cpuset code is modified to keep track of the boot-time enabled nohz_full cpumask and avoid using CPU hotplug if all the newly isolated/de-isolated CPUs are already in that cpumask. This code will be removed in the future when the latency spike problem is solved.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 45 ++++++++----------------------------------
 1 file changed, 8 insertions(+), 37 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2190efd33efb..87e9ee7922cd 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -59,7 +59,6 @@ static const char * const perr_strings[] = {
 	[PERR_NOCPUS]    = "Parent unable to distribute cpu downstream",
 	[PERR_HOTPLUG]   = "No cpu available due to hotplug",
 	[PERR_CPUSEMPTY] = "cpuset.cpus and cpuset.cpus.exclusive are empty",
-	[PERR_HKEEPING]  = "partition config conflicts with housekeeping setup",
 	[PERR_ACCESS]    = "Enable partition not permitted",
 	[PERR_REMOTE]    = "Have remote partition underneath",
 };
@@ -81,9 +80,10 @@ static cpumask_var_t subpartitions_cpus;
 static cpumask_var_t isolated_cpus;

 /*
- * Housekeeping (HK_TYPE_DOMAIN) CPUs at boot
+ * Housekeeping (nohz_full) CPUs at boot
  */
-static cpumask_var_t boot_hk_cpus;
+static cpumask_var_t boot_nohz_full_hk_cpus;
+static bool have_boot_nohz_full;
 static bool have_boot_isolcpus;

 /* List of remote partition root children */
@@ -1609,26 +1609,6 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
 	remote_partition_disable(cs, tmp);
 }

-/*
- * prstate_housekeeping_conflict - check for partition & housekeeping conflicts
- * @prstate: partition root state to be checked
- * @new_cpus: cpu mask
- * Return: true if there is conflict, false otherwise
- *
- * CPUs outside of boot_hk_cpus, if defined, can only be used in an
- * isolated partition.
- */
-static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
-{
-	if (!have_boot_isolcpus)
-		return false;
-
-	if ((prstate != PRS_ISOLATED) && !cpumask_subset(new_cpus, boot_hk_cpus))
-		return true;
-
-	return false;
-}
-
 /**
  * update_parent_effective_cpumask - update effective_cpus mask of parent cpuset
  * @cs: The cpuset that requests change in partition root state
@@ -1737,9 +1717,6 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 	if (cpumask_empty(xcpus))
 		return PERR_INVCPUS;

-	if (prstate_housekeeping_conflict(new_prs, xcpus))
-		return PERR_HKEEPING;
-
 	/*
 	 * A parent can be left with no CPU as long as there is no
 	 * task directly associated with the parent partition.
@@ -2356,9 +2333,6 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	    cpumask_empty(trialcs->effective_xcpus)) {
 		invalidate = true;
 		cs->prs_err = PERR_INVCPUS;
-	} else if (prstate_housekeeping_conflict(old_prs, trialcs->effective_xcpus)) {
-		invalidate = true;
-		cs->prs_err = PERR_HKEEPING;
 	} else if (tasks_nocpu_error(parent, cs, trialcs->effective_xcpus)) {
 		invalidate = true;
 		cs->prs_err = PERR_NOCPUS;
@@ -2499,9 +2473,6 @@ static int update_exclusive_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	if (cpumask_empty(trialcs->effective_xcpus)) {
 		invalidate = true;
 		cs->prs_err = PERR_INVCPUS;
-	} else if (prstate_housekeeping_conflict(old_prs, trialcs->effective_xcpus)) {
-		invalidate = true;
-		cs->prs_err = PERR_HKEEPING;
 	} else if (tasks_nocpu_error(parent, cs, trialcs->effective_xcpus)) {
 		invalidate = true;
 		cs->prs_err = PERR_NOCPUS;
@@ -3787,11 +3758,11 @@ int __init cpuset_init(void)
BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL));
-	have_boot_isolcpus = housekeeping_enabled(HK_TYPE_DOMAIN);
-	if (have_boot_isolcpus) {
-		BUG_ON(!alloc_cpumask_var(&boot_hk_cpus, GFP_KERNEL));
-		cpumask_copy(boot_hk_cpus, housekeeping_cpumask(HK_TYPE_DOMAIN));
-		cpumask_andnot(isolated_cpus, cpu_possible_mask, boot_hk_cpus);
+	have_boot_nohz_full = housekeeping_enabled(HK_TYPE_KERNEL_NOISE);
+	have_boot_isolcpus = housekeeping_enabled(HK_TYPE_DOMAIN);
+	if (have_boot_nohz_full) {
+		BUG_ON(!alloc_cpumask_var(&boot_nohz_full_hk_cpus, GFP_KERNEL));
+		cpumask_copy(boot_nohz_full_hk_cpus, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
 	}
return 0;
One relatively simple way to allow runtime modification of the nohz_full and rcu_nocbs CPUs is to use CPU hotplug to bring the affected CPUs offline first, make changes to the housekeeping cpumasks and then bring them back online. However, doing this is rather costly in terms of the number of CPU cycles needed. Still, it is the easiest way to achieve the desired result, and hopefully the overhead can be gradually reduced over time.
Use the newly introduced cpuhp_offline_cb() API to bring the affected CPUs offline, make the necessary housekeeping cpumask changes and then bring those CPUs back online again.
As the HK_TYPE_DOMAIN cpumask is going to be updated at run time, any boot time isolcpus domain setting will be reset if an isolated partition, or a conflicting non-isolated partition, is going to be created.
Since rebuild_sched_domains() will be called at the end of update_isolation_cpumasks(), earlier rebuild_sched_domains_locked() calls will be suppressed to avoid unneeded work.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 95 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 92 insertions(+), 3 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 87e9ee7922cd..60f336e50b05 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1355,11 +1355,57 @@ static void partition_xcpus_del(int old_prs, struct cpuset *parent,
 	return;
 }

+/*
+ * We are only updating the HK_TYPE_DOMAIN and HK_TYPE_KERNEL_NOISE
+ * housekeeping cpumasks for now. HK_TYPE_MANAGED_IRQ will be handled later.
+ */
+static int do_housekeeping_exclude_cpumask(void *arg __maybe_unused)
+{
+	int ret;
+	struct cpumask *icpus = isolated_cpus;
+	unsigned long flags = BIT(HK_TYPE_DOMAIN) | BIT(HK_TYPE_KERNEL_NOISE);
+
+	/*
+	 * The boot time isolcpus setting will be overwritten if set.
+	 */
+	have_boot_isolcpus = false;
+
+	if (have_boot_nohz_full) {
+		/*
+		 * Need to separate the handling of HK_TYPE_KERNEL_NOISE and
+		 * HK_TYPE_DOMAIN as different cpumasks will be used for each.
+		 */
+		ret = housekeeping_exclude_cpumask(icpus, BIT(HK_TYPE_DOMAIN));
+		WARN_ON_ONCE((ret < 0) && (ret != -EOPNOTSUPP));

+		if (cpumask_empty(isolcpus_update_state.cpus))
+			return ret;
+
+		flags = BIT(HK_TYPE_KERNEL_NOISE);
+		icpus = kmalloc(cpumask_size(), GFP_KERNEL);
+		if (WARN_ON_ONCE(!icpus))
+			return -ENOMEM;
+
+		/*
+		 * Add boot time nohz_full CPUs into the isolated CPUs list
+		 * for exclusion from HK_TYPE_KERNEL_NOISE CPUs.
+		 */
+		cpumask_andnot(icpus, cpu_possible_mask, boot_nohz_full_hk_cpus);
+		cpumask_or(icpus, icpus, isolated_cpus);
+	}
+	ret = housekeeping_exclude_cpumask(icpus, flags);
+	WARN_ON_ONCE((ret < 0) && (ret != -EOPNOTSUPP));
+
+	if (icpus != isolated_cpus)
+		kfree(icpus);
+	return ret;
+}
+
 /**
  * update_isolation_cpumasks - Update external isolation CPU masks
  *
  * The following external CPU masks will be updated if necessary:
  *  - workqueue unbound cpumask
+ *  - housekeeping cpumasks
  */
 static void update_isolation_cpumasks(void)
 {
@@ -1371,7 +1417,41 @@ static void update_isolation_cpumasks(void)
 	ret = workqueue_unbound_exclude_cpumask(isolated_cpus);
 	WARN_ON_ONCE(ret < 0);

+	/*
+	 * Mask out offline and boot-time nohz_full non-housekeeping
+	 * CPUs from isolcpus_update_state.cpus to compute the set
+	 * of CPUs that need to be brought offline before calling
+	 * do_housekeeping_exclude_cpumask().
+	 */
+	cpumask_and(isolcpus_update_state.cpus,
+		    isolcpus_update_state.cpus, cpu_active_mask);
+	if (have_boot_nohz_full)
+		cpumask_and(isolcpus_update_state.cpus,
+			    isolcpus_update_state.cpus, boot_nohz_full_hk_cpus);
+
+	/*
+	 * Without any change in the set of nohz_full CPUs, we don't really
+	 * need to use CPU hotplug for making changes to the HK cpumasks.
+	 */
+	if (cpumask_empty(isolcpus_update_state.cpus))
+		ret = do_housekeeping_exclude_cpumask(NULL);
+	else
+		ret = cpuhp_offline_cb(isolcpus_update_state.cpus,
+				       do_housekeeping_exclude_cpumask, NULL);
+
+	/*
+	 * An errno value of -EPERM may be returned from cpuhp_offline_cb() if
+	 * any one of the CPUs in isolcpus_update_state.cpus can't be brought
+	 * offline. This can happen for the boot CPU (normally CPU 0) which
+	 * cannot be shut down. This CPU should not be used for creating
+	 * an isolated partition.
+	 */
+	if (ret == -EPERM)
+		pr_warn_once("cpuset: The boot CPU shouldn't be used for isolated partition\n");
+	else
+		WARN_ON_ONCE(ret < 0);
+
+	cpumask_clear(isolcpus_update_state.cpus);
+	rebuild_sched_domains();
 	isolcpus_update_state.updating = false;
 }
@@ -2961,7 +3041,16 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	update_partition_sd_lb(cs, old_prs);

 	notify_partition_change(cs, old_prs);
-	if (force_sd_rebuild)
+
+	/*
+	 * If a boot time domain isolcpus setting exists and it conflicts
+	 * with the CPUs in the new partition, we will have to reset the
+	 * HK_TYPE_DOMAIN cpumask.
+	 */
+	if (have_boot_isolcpus && (new_prs > PRS_MEMBER) &&
+	    !cpumask_subset(cs->effective_xcpus, housekeeping_cpumask(HK_TYPE_DOMAIN)))
+		isolcpus_update_state.updating = true;
+
+	if (force_sd_rebuild && !isolcpus_update_state.updating)
 		rebuild_sched_domains_locked();
 	free_cpumasks(NULL, &tmpmask);
 	return 0;
@@ -3232,7 +3321,7 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 }

 	free_cpuset(trialcs);
-	if (force_sd_rebuild)
+	if (force_sd_rebuild && !isolcpus_update_state.updating)
 		rebuild_sched_domains_locked();
 out_unlock:
 	mutex_unlock(&cpuset_mutex);
@@ -3999,7 +4088,7 @@ static void cpuset_handle_hotplug(void)
 	}

 	/* rebuild sched domains if necessary */
-	if (force_sd_rebuild)
+	if (force_sd_rebuild && !isolcpus_update_state.updating)
 		rebuild_sched_domains_cpuslocked();
free_cpumasks(NULL, ptmp);
Now that the HK_TYPE_DOMAIN cpumask is updated at run time to reflect changes made in isolated cpuset partitions, we no longer need a separate cpuset_cpu_is_isolated() function to check for isolated CPUs generated by cpuset. Revert commit 3232e7aad11e ("cgroup/cpuset: Include isolated cpuset CPUs in cpu_is_isolated() check").
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/cpuset.h          |  6 ------
 include/linux/sched/isolation.h |  3 +--
 kernel/cgroup/cpuset.c          | 11 -----------
 3 files changed, 1 insertion(+), 19 deletions(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 2ddb256187b5..a2ea8efebf36 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -76,7 +76,6 @@ extern void cpuset_lock(void);
 extern void cpuset_unlock(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
-extern bool cpuset_cpu_is_isolated(int cpu);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
@@ -206,11 +205,6 @@ static inline bool cpuset_cpus_allowed_fallback(struct task_struct *p)
 	return false;
 }

-static inline bool cpuset_cpu_is_isolated(int cpu)
-{
-	return false;
-}
-
 static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
 {
 	return node_possible_map;
 }

diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index af38d21d0d00..0bc4b3368d39 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -79,8 +79,7 @@ static inline bool housekeeping_cpu(int cpu, enum hk_type type)
 static inline bool cpu_is_isolated(int cpu)
 {
 	return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN) ||
-	       !housekeeping_test_cpu(cpu, HK_TYPE_TICK) ||
-	       cpuset_cpu_is_isolated(cpu);
+	       !housekeeping_test_cpu(cpu, HK_TYPE_TICK);
 }
 #endif /* _LINUX_SCHED_ISOLATION_H */

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 60f336e50b05..6308bb14e018 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1455,17 +1455,6 @@ static void update_isolation_cpumasks(void)
 	isolcpus_update_state.updating = false;
 }

-/**
- * cpuset_cpu_is_isolated - Check if the given CPU is isolated
- * @cpu: the CPU number to be checked
- * Return: true if CPU is used in an isolated partition, false otherwise
- */
-bool cpuset_cpu_is_isolated(int cpu)
-{
-	return cpumask_test_cpu(cpu, isolated_cpus);
-}
-EXPORT_SYMBOL_GPL(cpuset_cpu_is_isolated);
-
 /*
  * compute_effective_exclusive_cpumask - compute effective exclusive CPUs
  * @cs: cpuset
With the new strategy of using CPU hotplug to improve CPU isolation and the optimization of delaying sched domain rebuild until the whole process completes, we can run into a problem when shutting down the last CPU of a partition, where a -EBUSY error may be returned. This -EBUSY error is caused by a failing DL BW check in dl_bw_deactivate().
As the CPU deactivation is only temporary and the CPU will be brought back up again shortly, there is no point in failing the operation because of a DL BW error during this transition period. Fix this problem by ignoring the error when in CPU hotplug offline callback mode (cpuhp_offline_cb_mode set).
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/core.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9f02c047e25b..78f4ba73a9f2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8469,7 +8469,11 @@ int sched_cpu_deactivate(unsigned int cpu)
ret = dl_bw_deactivate(cpu);
-	if (ret)
+	/*
+	 * Ignore DL BW error if in cpuhp offline callback mode as CPU
+	 * deactivation is only temporary.
+	 */
+	if (ret && !cpuhp_offline_cb_mode)
 		return ret;
/*
To provide nohz_full tick support, there is a set of tick dependency masks that need to be evaluated on every IRQ and context switch. Switching on nohz_full tick support at runtime would be problematic, as some of the tick dependency masks may not be properly set, causing problems down the road.
Allow the nohz_full boot option to be specified without a parameter to force-enable nohz_full tick support even without any CPU in tick_nohz_full_mask yet. The context_tracking_key and tick_nohz_full_running flag will be enabled in this case to make tick_nohz_full_enabled() return true.
There is still a small performance overhead when nohz_full is force-enabled this way, so it should only be used if there is a chance that some CPUs will become isolated later via the cpuset isolated partition functionality and CPU isolation close to that of nohz_full is desired.
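For example, the two forms of the parameter would be (the CPU list is illustrative):

	nohz_full	# force enable nohz_full with an empty CPU list for now
	nohz_full=2-7	# traditional boot-time static CPU list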
Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/kernel-parameters.txt | 19 ++++++++++++-------
 include/linux/context_tracking.h                |  7 ++++++-
 kernel/context_tracking.c                       |  4 +++-
 kernel/sched/isolation.c                        | 13 ++++++++++++-
 kernel/time/tick-sched.c                        | 11 +++++++++--
 5 files changed, 42 insertions(+), 12 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 747a55abf494..89a8161475b5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4260,15 +4260,20 @@
 			Valid arguments: on, off
 			Default: on

-	nohz_full=	[KNL,BOOT,SMP,ISOL]
-			The argument is a cpu list, as described above.
+	nohz_full[=cpu-list]
+			[KNL,BOOT,SMP,ISOL]
 			In kernels built with CONFIG_NO_HZ_FULL=y, set
-			the specified list of CPUs whose tick will be stopped
-			whenever possible. The boot CPU will be forced outside
-			the range to maintain the timekeeping. Any CPUs
-			in this list will have their RCU callbacks offloaded,
+			the specified list of CPUs whose tick will be
+			stopped whenever possible. If the argument is
+			not specified, nohz_full will be forced enabled
+			without any CPU in the nohz_full list yet.
+			The boot CPU will be forced outside the range
+			to maintain the timekeeping. Any CPUs in this
+			list will have their RCU callbacks offloaded,
 			just as if they had also been called out in the
-			rcu_nocbs= boot parameter.
+			rcu_nocbs= boot parameter. There is no need
+			to use the rcu_nocbs= boot parameter if nohz_full
+			has been set, which will override rcu_nocbs.

 			Note that this argument takes precedence over
 			the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL option.

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index af9fe87a0922..a3fea7f9fef6 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -9,8 +9,13 @@
#include <asm/ptrace.h>
- #ifdef CONFIG_CONTEXT_TRACKING_USER +/* + * Pass CONTEXT_TRACKING_FORCE_ENABLE to ct_cpu_track_user() to force enable + * user context tracking. + */ +#define CONTEXT_TRACKING_FORCE_ENABLE (-1) + extern void ct_cpu_track_user(int cpu);
/* Called with interrupts disabled. */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index fb5be6e9b423..734354bbfdbb 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -698,7 +698,9 @@ void __init ct_cpu_track_user(int cpu) { static __initdata bool initialized = false;
- if (!per_cpu(context_tracking.active, cpu)) { + if (cpu == CONTEXT_TRACKING_FORCE_ENABLE) { + static_branch_inc(&context_tracking_key); + } else if (!per_cpu(context_tracking.active, cpu)) { per_cpu(context_tracking.active, cpu) = true; static_branch_inc(&context_tracking_key); } diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c index f26708667754..2bed4b2f9ec5 100644 --- a/kernel/sched/isolation.c +++ b/kernel/sched/isolation.c @@ -146,6 +146,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags) }
alloc_bootmem_cpumask_var(&non_housekeeping_mask); + if (cpulist_parse(str, non_housekeeping_mask) < 0) { pr_warn("Housekeeping: nohz_full= or isolcpus= incorrect CPU range\n"); goto free_non_housekeeping_mask; @@ -155,6 +156,13 @@ static int __init housekeeping_setup(char *str, unsigned long flags) cpumask_andnot(housekeeping_staging, cpu_possible_mask, non_housekeeping_mask);
+ /* + * Allow "nohz_full" without parameter to force enable nohz_full + * at boot time without any CPUs in the nohz_full list yet. + */ + if ((flags & HK_FLAG_KERNEL_NOISE) && !*str) + goto setup_housekeeping_staging; + first_cpu = cpumask_first_and(cpu_present_mask, housekeeping_staging); if (first_cpu >= nr_cpu_ids || first_cpu >= setup_max_cpus) { __cpumask_set_cpu(smp_processor_id(), housekeeping_staging); @@ -168,6 +176,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags) if (cpumask_empty(non_housekeeping_mask)) goto free_housekeeping_staging;
+setup_housekeeping_staging: if (!housekeeping.flags) { /* First setup call ("nohz_full=" or "isolcpus=") */ enum hk_type type; @@ -212,10 +221,12 @@ static int __init housekeeping_nohz_full_setup(char *str) unsigned long flags;
flags = HK_FLAG_KERNEL_NOISE; + if (*str == '=') + str++;
return housekeeping_setup(str, flags); } -__setup("nohz_full=", housekeeping_nohz_full_setup); +__setup("nohz_full", housekeeping_nohz_full_setup);
static int __init housekeeping_isolcpus_setup(char *str) { diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index c527b421c865..87b26a4471e7 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -651,8 +651,15 @@ void __init tick_nohz_init(void) } }
- for_each_cpu(cpu, tick_nohz_full_mask) - ct_cpu_track_user(cpu); + /* + * Force enable context_tracking_key if tick_nohz_full_mask empty + */ + if (cpumask_empty(tick_nohz_full_mask)) { + ct_cpu_track_user(CONTEXT_TRACKING_FORCE_ENABLE); + } else { + for_each_cpu(cpu, tick_nohz_full_mask) + ct_cpu_track_user(cpu); + }
ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "kernel/nohz:predown", NULL,
When the list of HK_FLAG_KERNEL_NOISE housekeeping CPUs is changed, we need to update tick_nohz_full_mask so that dynticks can work correctly. Introduce a new tick_nohz_full_update_cpus() function that can be called at run time to update tick_nohz_full_mask.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/tick.h     | 2 ++
 kernel/time/tick-sched.c | 6 ++++++
 2 files changed, 8 insertions(+)
diff --git a/include/linux/tick.h b/include/linux/tick.h
index ac76ae9fa36d..34907c0b632c 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -272,6 +272,7 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void __tick_nohz_task_switch(void);
 extern void __init tick_nohz_full_setup(cpumask_var_t cpumask);
+extern void tick_nohz_full_update_cpus(cpumask_var_t cpumask);
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -297,6 +298,7 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
 static inline void tick_nohz_full_kick_cpu(int cpu) { }
 static inline void __tick_nohz_task_switch(void) { }
 static inline void tick_nohz_full_setup(cpumask_var_t cpumask) { }
+static inline void tick_nohz_full_update_cpus(cpumask_var_t cpumask) { }
 #endif
 static inline void tick_nohz_task_switch(void)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 87b26a4471e7..9204808b7a55 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -604,6 +604,12 @@ void __init tick_nohz_full_setup(cpumask_var_t cpumask)
         tick_nohz_full_running = true;
 }

+/* Get the new set of run-time nohz CPU list from cpuset */
+void tick_nohz_full_update_cpus(cpumask_var_t cpumask)
+{
+        cpumask_copy(tick_nohz_full_mask, cpumask);
+}
+
 bool tick_nohz_cpu_hotpluggable(unsigned int cpu)
 {
         /*
Full dynticks can only be enabled if the "nohz_full" boot option has been specified, with or without a parameter. Any change in the list of nohz_full CPUs has to be reflected in tick_nohz_full_mask, so the newly introduced tick_nohz_full_update_cpus() will be called to update the mask.
We also need to enable CPU context tracking for those CPUs that are in tick_nohz_full_mask. So remove __init from tick_nohz_init() and ct_cpu_track_user() so that they can be called later when an isolated cpuset partition is being created. The __ro_after_init attribute is taken away from context_tracking_key as well.
Also add a new ct_cpu_untrack_user() function to reverse the action of ct_cpu_track_user() in case we need to disable the nohz_full mode of a CPU.
With nohz_full enabled, the boot CPU (typically CPU 0) will be the tick CPU, which cannot easily be shut down. So the boot CPU should not be used in an isolated cpuset partition.
With runtime modification of nohz_full CPUs, tick_do_timer_cpu can become TICK_DO_TIMER_NONE. So remove the two TICK_DO_TIMER_NONE WARN_ON_ONCE() calls in tick-sched.c to avoid unnecessary warnings.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/context_tracking.h |  1 +
 kernel/cgroup/cpuset.c           | 23 ++++++++++++++++++++++-
 kernel/context_tracking.c        | 17 ++++++++++++++---
 kernel/time/tick-sched.c         |  7 -------
 4 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index a3fea7f9fef6..1a6b816f1ad6 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -17,6 +17,7 @@
 #define CONTEXT_TRACKING_FORCE_ENABLE   (-1)

 extern void ct_cpu_track_user(int cpu);
+extern void ct_cpu_untrack_user(int cpu);

 /* Called with interrupts disabled. */
 extern void __ct_user_enter(enum ctx_state state);
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 6308bb14e018..45c82c18bec4 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -23,6 +23,7 @@
  */
 #include "cpuset-internal.h"

+#include <linux/context_tracking.h>
 #include <linux/init.h>
 #include <linux/interrupt.h>
 #include <linux/kernel.h>
@@ -1361,7 +1362,7 @@ static void partition_xcpus_del(int old_prs, struct cpuset *parent,
  */
 static int do_housekeeping_exclude_cpumask(void *arg __maybe_unused)
 {
-        int ret;
+        int cpu, ret;
         struct cpumask *icpus = isolated_cpus;
         unsigned long flags = BIT(HK_TYPE_DOMAIN) | BIT(HK_TYPE_KERNEL_NOISE);
@@ -1395,6 +1396,26 @@ static int do_housekeeping_exclude_cpumask(void *arg __maybe_unused)
         ret = housekeeping_exclude_cpumask(icpus, flags);
         WARN_ON_ONCE((ret < 0) && (ret != -EOPNOTSUPP));

+#ifdef CONFIG_NO_HZ_FULL
+        /*
+         * To properly enable/disable nohz_full dynticks for the affected CPUs,
+         * the new nohz_full CPUs have to be copied to tick_nohz_full_mask and
+         * ct_cpu_track_user/ct_cpu_untrack_user() will have to be called
+         * for those CPUs that have their states changed.
+         */
+        if (tick_nohz_full_enabled()) {
+                tick_nohz_full_update_cpus(icpus);
+                for_each_cpu(cpu, isolcpus_update_state.cpus) {
+                        if (cpumask_test_cpu(cpu, icpus))
+                                ct_cpu_track_user(cpu);
+                        else
+                                ct_cpu_untrack_user(cpu);
+                }
+        } else {
+                pr_warn_once("Full dynticks cannot be enabled without the nohz_full kernel boot parameter!\n");
+        }
+#endif
+
         if (icpus != isolated_cpus)
                 kfree(icpus);
         return ret;
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 734354bbfdbb..ed5653a3d6f7 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -431,7 +431,7 @@ static __always_inline void ct_kernel_enter(bool user, int offset) { }
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>

-DEFINE_STATIC_KEY_FALSE_RO(context_tracking_key);
+DEFINE_STATIC_KEY_FALSE(context_tracking_key);
 EXPORT_SYMBOL_GPL(context_tracking_key);

 static noinstr bool context_tracking_recursion_enter(void)
@@ -694,9 +694,9 @@ void user_exit_callable(void)
 }
 NOKPROBE_SYMBOL(user_exit_callable);

-void __init ct_cpu_track_user(int cpu)
+void ct_cpu_track_user(int cpu)
 {
-        static __initdata bool initialized = false;
+        static bool initialized;

         if (cpu == CONTEXT_TRACKING_FORCE_ENABLE) {
                 static_branch_inc(&context_tracking_key);
@@ -720,6 +720,17 @@ void __init ct_cpu_track_user(int cpu)
         initialized = true;
 }

+void ct_cpu_untrack_user(int cpu)
+{
+#ifndef CONFIG_CONTEXT_TRACKING_USER_FORCE
+        if (!per_cpu(context_tracking.active, cpu))
+                return;
+
+        per_cpu(context_tracking.active, cpu) = false;
+        static_branch_dec(&context_tracking_key);
+#endif
+}
+
 #ifdef CONFIG_CONTEXT_TRACKING_USER_FORCE
 void __init context_tracking_init(void)
 {
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9204808b7a55..c16250c6a79f 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -220,9 +220,6 @@ static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
         tick_cpu = READ_ONCE(tick_do_timer_cpu);

         if (IS_ENABLED(CONFIG_NO_HZ_COMMON) &&
             unlikely(tick_cpu == TICK_DO_TIMER_NONE)) {
-#ifdef CONFIG_NO_HZ_FULL
-                WARN_ON_ONCE(tick_nohz_full_running);
-#endif
                 WRITE_ONCE(tick_do_timer_cpu, cpu);
                 tick_cpu = cpu;
         }
@@ -1201,10 +1198,6 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
                  */
                 if (tick_cpu == cpu)
                         return false;
-
-                /* Should not happen for nohz-full */
-                if (WARN_ON_ONCE(tick_cpu == TICK_DO_TIMER_NONE))
-                        return false;
         }

         return true;
In tick_cpu_dying(), if the dying CPU is the current timekeeper, it has to pass the job over to another CPU. The current code passes it to another online CPU. However, that CPU may not be a timer tick housekeeping CPU. If that happens, another CPU will have to manually take it over again later. Avoid this unnecessary work by directly assigning an online housekeeping CPU.
Use READ_ONCE()/WRITE_ONCE() to access tick_do_timer_cpu in case the non-housekeeping CPUs are no longer held in stop machine at some point in the future.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/time/tick-common.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index 9a3859443c04..6d5ff85281cc 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -17,6 +17,7 @@
 #include <linux/profile.h>
 #include <linux/sched.h>
 #include <linux/module.h>
+#include <linux/sched/isolation.h>
 #include <trace/events/power.h>

 #include <asm/irq_regs.h>
@@ -394,12 +395,18 @@ int tick_cpu_dying(unsigned int dying_cpu)
 {
         /*
          * If the current CPU is the timekeeper, it's the only one that can
-         * safely hand over its duty. Also all online CPUs are in stop
-         * machine, guaranteed not to be idle, therefore there is no
+         * safely hand over its duty. Also all online housekeeping CPUs are
+         * in stop machine, guaranteed not to be idle, therefore there is no
          * concurrency and it's safe to pick any online successor.
          */
-        if (tick_do_timer_cpu == dying_cpu)
-                tick_do_timer_cpu = cpumask_first(cpu_online_mask);
+        if (READ_ONCE(tick_do_timer_cpu) == dying_cpu) {
+                unsigned int new_cpu;
+
+                new_cpu = cpumask_first_and(cpu_online_mask,
+                                            housekeeping_cpumask(HK_TYPE_TICK));
+                if (WARN_ON_ONCE(new_cpu >= nr_cpu_ids))
+                        new_cpu = cpumask_first(cpu_online_mask);
+                WRITE_ONCE(tick_do_timer_cpu, new_cpu);
+        }

         /* Make sure the CPU won't try to retake the timekeeping duty */
         tick_sched_timer_dying(dying_cpu);
Make use of the provided rcu_nocb_cpu_offload()/rcu_nocb_cpu_deoffload() APIs to enable RCU NO-CB CPU offloading of newly isolated CPUs and deoffloading of de-isolated CPUs.
Also add a new rcu_nocbs_enabled() helper function to determine if RCU NO-CB CPU offloading can be done.
As nohz_full can now be specified without any CPU list, drop the test for cpumask_empty(tick_nohz_full_mask) in rcu_init_nohz().
The RCU NO-CB CPU offloading feature can only be used if either the "rcu_nocbs" or the "nohz_full" boot command parameter is used, so that the RCU NO-CB resources are properly initialized at boot time.
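For instance, either of the following illustrative boot command-line settings should set up the NO-CB resources that the runtime offloading/deoffloading depends on (the CPU list is hypothetical):

  rcu_nocbs=2-5   # boot-time NO-CB CPUs
  nohz_full       # empty nohz_full list; NO-CB resources still initialized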
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/rcupdate.h |  2 ++
 kernel/cgroup/cpuset.c   | 14 ++++++++++++++
 kernel/rcu/tree_nocb.h   |  7 ++++++-
 3 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 120536f4c6eb..642b80a4f071 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -140,6 +140,7 @@
 void rcu_init_nohz(void);
 int rcu_nocb_cpu_offload(int cpu);
 int rcu_nocb_cpu_deoffload(int cpu);
 void rcu_nocb_flush_deferred_wakeup(void);
+bool rcu_nocbs_enabled(void);

 #define RCU_NOCB_LOCKDEP_WARN(c, s)     RCU_LOCKDEP_WARN(c, s)
@@ -149,6 +150,7 @@
 static inline void rcu_init_nohz(void) { }
 static inline int rcu_nocb_cpu_offload(int cpu) { return -EINVAL; }
 static inline int rcu_nocb_cpu_deoffload(int cpu) { return 0; }
 static inline void rcu_nocb_flush_deferred_wakeup(void) { }
+static inline bool rcu_nocbs_enabled(void) { return false; }

 #define RCU_NOCB_LOCKDEP_WARN(c, s)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 45c82c18bec4..de9cb92a0fc7 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1416,6 +1416,20 @@ static int do_housekeeping_exclude_cpumask(void *arg __maybe_unused)
         }
 #endif

+        if (rcu_nocbs_enabled()) {
+                /*
+                 * Enable RCU NO-CB CPU offloading/deoffloading for the affected CPUs
+                 */
+                for_each_cpu(cpu, isolcpus_update_state.cpus) {
+                        if (cpumask_test_cpu(cpu, icpus))
+                                ret = rcu_nocb_cpu_offload(cpu);
+                        else
+                                ret = rcu_nocb_cpu_deoffload(cpu);
+                        if (WARN_ON_ONCE(ret))
+                                break;
+                }
+        }
+
         if (icpus != isolated_cpus)
                 kfree(icpus);
         return ret;
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index e6cd56603cad..4d49a745b871 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -1293,7 +1293,7 @@ void __init rcu_init_nohz(void)
         struct shrinker * __maybe_unused lazy_rcu_shrinker;

 #if defined(CONFIG_NO_HZ_FULL)
-        if (tick_nohz_full_running && !cpumask_empty(tick_nohz_full_mask))
+        if (tick_nohz_full_running)
                 cpumask = tick_nohz_full_mask;
 #endif
@@ -1365,6 +1365,11 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
         mutex_init(&rdp->nocb_gp_kthread_mutex);
 }

+bool rcu_nocbs_enabled(void)
+{
+        return !!rcu_state.nocb_is_setup;
+}
+
 /*
  * If the specified CPU is a no-CBs CPU that does not already have its
  * rcuo CB kthread, spawn it. Additionally, if the rcuo GP kthread
As the HK_TYPE_KERNEL_NOISE bit can now be set without any nohz_full CPU being specified at boot time, don't set have_boot_nohz_full in that case.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index de9cb92a0fc7..489708f4e096 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3871,7 +3871,12 @@ int __init cpuset_init(void)

         BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL));

-        have_boot_nohz_full = housekeeping_enabled(HK_TYPE_KERNEL_NOISE);
+        /*
+         * HK_TYPE_KERNEL_NOISE bit can be set without any nohz_full CPU
+         */
+        have_boot_nohz_full = housekeeping_enabled(HK_TYPE_KERNEL_NOISE) &&
+                              !cpumask_equal(cpu_possible_mask,
+                                             housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
         have_boot_isolcpus = housekeeping_enabled(HK_TYPE_DOMAIN);
         if (have_boot_nohz_full) {
                 BUG_ON(!alloc_cpumask_var(&boot_nohz_full_hk_cpus, GFP_KERNEL));
CPU hotplug is now used to improve the CPU isolation of CPUs in isolated partitions, but the boot CPU (typically CPU 0) cannot be put offline, which limits the amount of CPU isolation available there. So we have to advise users that the boot CPU should never be used for isolated partitions. A warning will be printed when the boot CPU is used, and cgroup-v2.rst is updated accordingly. The test_cpuset_prs.sh selftest is also updated to avoid CPU 0 when forming isolated partitions.
Also update the cgroup-v2.rst file to document the need to specify the "nohz_full" kernel boot parameter to enable better nohz_full-like behavior for the CPUs in isolated partitions, as well as the latency spike issue with using CPU hotplug.
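As a rough illustration, an isolated partition could be created like this (a minimal sketch assuming cgroup v2 is mounted at /sys/fs/cgroup and CPUs 2-3 are available; the exact steps depend on the hierarchy):

  cd /sys/fs/cgroup
  echo +cpuset > cgroup.subtree_control
  mkdir isol
  echo 2-3 > isol/cpuset.cpus                # avoid CPU 0, the boot CPU
  echo isolated > isol/cpuset.cpus.partition
  cat isol/cpuset.cpus.partition             # should report "isolated"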
Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst  | 33 +++++++++++++++----
 .../selftests/cgroup/test_cpuset_prs.sh  |  8 ++---
 2 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index d9d3cc7df348..26213383b34b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2556,11 +2556,12 @@ Cpuset Interface Files
         It accepts only the following input values when written to.

-          ==========    =====================================
+          ==========    ===============================================
           "member"      Non-root member of a partition
           "root"        Partition root
-          "isolated"    Partition root without load balancing
-          ==========    =====================================
+          "isolated"    Partition root without load balancing and other
+                        OS noises
+          ==========    ===============================================

         A cpuset partition is a collection of cpuset-enabled cgroups with
         a partition root at the top of the hierarchy and its descendants
@@ -2593,9 +2594,29 @@ Cpuset Interface Files
When set to "isolated", the CPUs in that partition will be in an isolated state without any load balancing from the scheduler - and excluded from the unbound workqueues. Tasks placed in such - a partition with multiple CPUs should be carefully distributed - and bound to each of the individual CPUs for optimal performance. + and excluded from the unbound workqueues as well as without + other OS noises. Tasks placed in such a partition with multiple + CPUs should be carefully distributed and bound to each of the + individual CPUs for optimal performance. + + As CPU hotplug, if supported, is used to improve the degree of + CPU isolation close to the "nohz_full" kernel boot parameter. + The boot CPU (typically CPU 0) cannot be brought offline, so the + boot CPU should not be used for forming isolated partitions. + The "nohz_full" kernel boot parameter needs to be present to + enable full dynticks support and RCU no-callback CPU mode for + CPUs in isolated partitions even if the optional cpu list + isn't provided. Without that, adding the "rcu_nocbs" boot + kernel parameter without the cpu list can be used to enable + RCU no-callback CPU mode without full dynticks. + + Using CPU hotplug for creating or destroying an isolated + partition can cause latency spike in applications running + in other isolated partitions. A reserved list of CPUs can + optionally be put in the "nohz_full" kernel boot parameter to + alleviate this problem. When these reserved CPUs are used for + isolated partitions, CPU hotplug won't need to be invoked and + so there won't be latency spike in other isolated partitions.
A partition root ("root" or "isolated") can be in one of the two possible states - valid or invalid. An invalid partition diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh index a17256d9f88a..f61369be8bf6 100755 --- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh +++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh @@ -318,8 +318,8 @@ TEST_MATRIX=( # Invalid to valid local partition direct transition tests " C1-3:S+:P2 X4:P2 . . . . . . 0 A1:1-3|XA1:1-3|A2:1-3:XA2: A1:P2|A2:P-2 1-3" " C1-3:S+:P2 X4:P2 . . . X3:P2 . . 0 A1:1-2|XA1:1-3|A2:3:XA2:3 A1:P2|A2:P2 1-3" - " C0-3:P2 . . C4-6 C0-4 . . . 0 A1:0-4|B1:4-6 A1:P-2|B1:P0" - " C0-3:P2 . . C4-6 C0-4:C0-3 . . . 0 A1:0-3|B1:4-6 A1:P2|B1:P0 0-3" + " C1-3:P2 . . C4-6 C1-4 . . . 0 A1:1-4|B1:4-6 A1:P-2|B1:P0" + " C1-3:P2 . . C4-6 C1-4:C1-3 . . . 0 A1:1-3|B1:4-6 A1:P2|B1:P0 1-3"
# Local partition invalidation tests " C0-3:X1-3:S+:P2 C1-3:X2-3:S+:P2 C2-3:X3:P2 \ @@ -329,8 +329,8 @@ TEST_MATRIX=( " C0-3:X1-3:S+:P2 C1-3:X2-3:S+:P2 C2-3:X3:P2 \ . . C4:X . . 0 A1:1-3|A2:1-3|A3:2-3|XA2:|XA3: A1:P2|A2:P-2|A3:P-2 1-3" # Local partition CPU change tests - " C0-5:S+:P2 C4-5:S+:P1 . . . C3-5 . . 0 A1:0-2|A2:3-5 A1:P2|A2:P1 0-2" - " C0-5:S+:P2 C4-5:S+:P1 . . C1-5 . . . 0 A1:1-3|A2:4-5 A1:P2|A2:P1 1-3" + " C1-5:S+:P2 C4-5:S+:P1 . . . C3-5 . . 0 A1:1-2|A2:3-5 A1:P2|A2:P1 1-2" + " C1-5:S+:P2 C4-5:S+:P1 . . C2-5 . . . 0 A1:2-3|A2:4-5 A1:P2|A2:P1 2-3"
# cpus_allowed/exclusive_cpus update tests " C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3 \
Add some pr_debug() statements for the actions performed around the cpuhp_offline_cb() call to aid debugging. Since rcu_nocb_cpu_offload() and rcu_nocb_cpu_deoffload() already print some informational text, there is no need to add pr_debug() statements for them.
Also update the test_cpuset_prs.sh test script to enable printing of dynamic debug messages for the kernel/cgroup/cpuset.c file when the loglevel is 7 (debug).
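The same messages can also be enabled by hand at run time, assuming CONFIG_DYNAMIC_DEBUG and a mounted debugfs (illustrative command, mirroring what the test script does below):

  echo "file kernel/cgroup/cpuset.c +p" > /sys/kernel/debug/dynamic_debug/control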
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c                   | 18 +++++++++++++-----
 .../selftests/cgroup/test_cpuset_prs.sh  |  7 +++++++
 2 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 489708f4e096..30632e4b5899 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -21,6 +21,7 @@
  * License. See the file COPYING in the main directory of the Linux
  * distribution for more details.
  */
+#define pr_fmt(fmt)     "cpuset: " fmt
 #include "cpuset-internal.h"

 #include <linux/context_tracking.h>
@@ -1406,10 +1407,13 @@ static int do_housekeeping_exclude_cpumask(void *arg __maybe_unused)
         if (tick_nohz_full_enabled()) {
                 tick_nohz_full_update_cpus(icpus);
                 for_each_cpu(cpu, isolcpus_update_state.cpus) {
-                        if (cpumask_test_cpu(cpu, icpus))
+                        if (cpumask_test_cpu(cpu, icpus)) {
+                                pr_debug("Add CPU %d to nohz_full\n", cpu);
                                 ct_cpu_track_user(cpu);
-                        else
+                        } else {
+                                pr_debug("Remove CPU %d from nohz_full\n", cpu);
                                 ct_cpu_untrack_user(cpu);
+                        }
                 }
         } else {
                 pr_warn_once("Full dynticks cannot be enabled without the nohz_full kernel boot parameter!\n");
@@ -1425,6 +1429,7 @@ static int do_housekeeping_exclude_cpumask(void *arg __maybe_unused)
                         ret = rcu_nocb_cpu_offload(cpu);
                 else
                         ret = rcu_nocb_cpu_deoffload(cpu);
+
                 if (WARN_ON_ONCE(ret))
                         break;
         }
@@ -1468,11 +1473,14 @@ static void update_isolation_cpumasks(void)
         * Without any change in the set of nohz_full CPUs, we don't really
         * need to use CPU hotplug for making change in HK cpumasks.
         */
-        if (cpumask_empty(isolcpus_update_state.cpus))
+        if (cpumask_empty(isolcpus_update_state.cpus)) {
                 ret = do_housekeeping_exclude_cpumask(NULL);
-        else
+        } else {
+                pr_debug("cpuhp_offline_cb() called for CPUs %*pbl\n",
+                         cpumask_pr_args(isolcpus_update_state.cpus));
                 ret = cpuhp_offline_cb(isolcpus_update_state.cpus,
                                        do_housekeeping_exclude_cpumask, NULL);
+        }
         /*
         * A errno value of -EPERM may be returned from cpuhp_offline_cb() if
         * any one of the CPUs in isolcpus_update_state.cpus can't be brought
@@ -1481,7 +1489,7 @@ static void update_isolation_cpumasks(void)
         * isolated partition.
         */
         if (ret == -EPERM)
-                pr_warn_once("cpuset: The boot CPU shouldn't be used for isolated partition\n");
+                pr_warn_once("The boot CPU shouldn't be used for isolated partition\n");
         else
                 WARN_ON_ONCE(ret < 0);
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index f61369be8bf6..43a12690775e 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -67,6 +67,13 @@
 then
         echo Y > /sys/kernel/debug/sched/verbose
 fi

+# Enable dynamic debug messages for cpuset only
+DYN_DEBUG=/sys/kernel/debug/dynamic_debug/control
+[[ -f $DYN_DEBUG ]] && {
+        echo "-p" > $DYN_DEBUG
+        echo "file kernel/cgroup/cpuset.c +p" > $DYN_DEBUG
+}
+
 cd $CGROUP2
 echo +cpuset > cgroup.subtree_control
On Fri, Aug 08, 2025 at 11:10:44AM -0400, Waiman Long wrote:
The "nohz_full" and "rcu_nocbs" boot command parameters can be used to remove a lot of kernel overhead on a specific set of isolated CPUs which can be used to run some latency/bandwidth sensitive workloads with as little kernel disturbance/noise as possible. The problem with this mode of operation is the fact that it is a static configuration which cannot be changed after boot to adjust for changes in application loading.
There is always a desire to enable runtime modification of the number of isolated CPUs that can be dedicated to this type of demanding workloads. This patchset is an attempt to do just that with an amount of CPU isolation close to what can be done with the nohz_full and rcu_nocbs boot kernel parameters.
This patch series provides the ability to change the set of housekeeping CPUs at run time via the cpuset isolated partition functionality. Currently, the cpuset isolated partition is able to disable scheduler load balancing and the CPU affinity of the unbound workqueue to avoid the isolated CPUs. This patch series will extend that with other kernel noises associated with the nohz_full boot command line parameter which has the following sub-categories:
- tick
- timer
- RCU
- MISC
- WQ
- kthread
Thanks for working on that, I'm about to leave for 2 weeks vacation so I won't have the time to check this until I'm back.
However this series is highly conflicting with mine (cpuset/isolation: Honour kthreads preferred affinity). Your patchset even redoes things I'm doing (housekeeping cpumask update, RCU synchronization, HK_TYPE_DOMAIN to include cpusets, etc...)
I have a v2 that is almost ready to post.
Wouldn't it be better to wait for it and its infrastructure changes before proceeding with nohz_full?
Thanks.
On 8/8/25 11:50 AM, Frederic Weisbecker wrote:
> On Fri, Aug 08, 2025 at 11:10:44AM -0400, Waiman Long wrote:
The "nohz_full" and "rcu_nocbs" boot command parameters can be used to remove a lot of kernel overhead on a specific set of isolated CPUs which can be used to run some latency/bandwidth sensitive workloads with as little kernel disturbance/noise as possible. The problem with this mode of operation is the fact that it is a static configuration which cannot be changed after boot to adjust for changes in application loading.
There is always a desire to enable runtime modification of the number of isolated CPUs that can be dedicated to this type of demanding workloads. This patchset is an attempt to do just that with an amount of CPU isolation close to what can be done with the nohz_full and rcu_nocbs boot kernel parameters.
This patch series provides the ability to change the set of housekeeping CPUs at run time via the cpuset isolated partition functionality. Currently, the cpuset isolated partition is able to disable scheduler load balancing and the CPU affinity of the unbound workqueue to avoid the isolated CPUs. This patch series will extend that with other kernel noises associated with the nohz_full boot command line parameter which has the following sub-categories:
- tick
- timer
- RCU
- MISC
- WQ
- kthread
> Thanks for working on that, I'm about to leave for 2 weeks vacation so I won't have the time to check this until I'm back.
> However this series is highly conflicting with mine (cpuset/isolation: Honour kthreads preferred affinity). Your patchset even redoes things I'm doing (housekeeping cpumask update, RCU synchronization, HK_TYPE_DOMAIN to include cpusets, etc...)
> I have a v2 that is almost ready to post.
> Wouldn't it be better to wait for it and its infrastructure changes before proceeding with nohz_full?
Sure. I am just posting this RFC patch series to show the current idea I have. I will wait for your v2 and integrate on top of it.
Looking forward to your upcoming v2 patch.
Cheers,
Longman