The "nohz_full" and "rcu_nocbs" boot command line parameters can be used to remove a lot of kernel overhead on a specific set of isolated CPUs, which can then run latency- or bandwidth-sensitive workloads with as little kernel disturbance/noise as possible. The problem with this mode of operation is that it is a static configuration which cannot be changed after boot to adjust for changes in application load.
There is always a desire to enable runtime modification of the number of isolated CPUs that can be dedicated to these types of demanding workloads. This patchset is an attempt to do just that, with an amount of CPU isolation close to what can be done with the nohz_full and rcu_nocbs kernel boot parameters.
This patch series provides the ability to change the set of housekeeping CPUs at run time via the cpuset isolated partition functionality. Currently, the cpuset isolated partition is able to disable scheduler load balancing and to adjust the CPU affinity of unbound workqueues so that they avoid the isolated CPUs. This patch series extends that to the other kernel noises associated with the nohz_full boot command line parameter, which has the following sub-categories:
  - tick
  - timer
  - RCU
  - MISC
  - WQ
  - kthread
The rcu_nocbs parameter is actually a subset of nohz_full, focusing just on the RCU part of the kernel noises. The WQ part has already been handled by the current cpuset code.
This series focuses on the tick and RCU parts of the kernel noises by actively changing their internal data structures to track changes in the list of isolated CPUs used by cpuset isolated partitions.
The dynamic update of the lists of housekeeping CPUs at run time will also have an impact on the other parts of the kernel noises that reference the lists of housekeeping CPUs at run time.
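As a rough illustration of why updating the housekeeping lists affects every consumer of them, a housekeeping mask can be viewed as the set of possible CPUs minus the isolated ones. The userspace sketch below uses a plain 64-bit bitmask in place of the kernel's cpumask. The type toy_cpumask_t and the helper hk_mask_from_isolated() are hypothetical names for illustration only, not code from this series.

```c
#include <stdint.h>

/* Toy stand-in for a kernel cpumask: bit N set means CPU N is in the set. */
typedef uint64_t toy_cpumask_t;

/*
 * Hypothetical helper (not part of this series): the housekeeping CPUs
 * are all possible CPUs that are not currently isolated.
 */
static toy_cpumask_t hk_mask_from_isolated(toy_cpumask_t possible,
					   toy_cpumask_t isolated)
{
	return possible & ~isolated;
}
```

With 8 possible CPUs (mask 0xFF) and CPUs 2-3 isolated (mask 0x0C), the helper leaves CPUs 0-1 and 4-7 as housekeeping CPUs. Any code that consults such a mask at run time sees the new value once it is updated.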
The pending patch series on timer migration[1], when properly integrated, will support the timer part too.
The CPU hotplug functionality of the Linux kernel is used to facilitate the runtime change of the nohz_full isolated CPUs with minimal code changes. The CPUs that need to be switched from non-isolated to isolated, or vice versa, are brought offline first, the necessary changes are made, and then the CPUs are brought back online.
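The offline, update, online sequence described above can be modeled in a few lines of userspace C. This is only a sketch under stated assumptions: struct toy_system, toy_offline_cb() and set_isolated() are hypothetical names, the real kernel state is far richer, and the model does not attempt to show how concurrent hotplug events are handled. It also assumes the CPUs are brought back online regardless of the callback's return value.

```c
#include <stdint.h>

/* Toy stand-in for a kernel cpumask: bit N set means CPU N is in the set. */
typedef uint64_t toy_cpumask_t;

/* Toy system state used only for this illustration. */
struct toy_system {
	toy_cpumask_t online;	/* CPUs currently online */
	toy_cpumask_t isolated;	/* CPUs currently isolated */
};

typedef int (*toy_cb_t)(struct toy_system *sys, void *arg);

/*
 * Hypothetical model of an "offline, run callback, online" cycle.
 * Only CPUs that were online beforehand are brought back afterward.
 */
static int toy_offline_cb(struct toy_system *sys, toy_cpumask_t cpus,
			  toy_cb_t cb, void *arg)
{
	toy_cpumask_t taken_down = sys->online & cpus;
	int ret;

	sys->online &= ~taken_down;	/* step 1: take the CPUs offline */
	ret = cb(sys, arg);		/* step 2: update state while they are down */
	sys->online |= taken_down;	/* step 3: bring them back online */
	return ret;
}

/* Example callback: replace the isolated set with the mask passed in arg. */
static int set_isolated(struct toy_system *sys, void *arg)
{
	sys->isolated = *(toy_cpumask_t *)arg;
	return 0;
}
```

A caller would pass the mask of CPUs whose isolation state is changing, together with a callback such as set_isolated(), so that the mask update happens only while those CPUs are offline.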
The use of CPU hotplug, however, does have a slight drawback of freezing all the other CPUs as part of the offlining process, using the stop machine feature of the kernel. That will cause noticeable latency spikes in other running applications, which may be significant for sensitive applications running on isolated CPUs in other isolated partitions at the time. Hopefully we can find a way to solve this problem in the future.
One possible workaround for this is to reserve a set of nohz_full isolated CPUs at boot time using the nohz_full boot command line parameter. Moving those nohz_full reserved CPUs into and out of isolated partitions will not invoke CPU hotplug and hence will not cause unexpected latency spikes. These reserved CPUs will only be needed if there are other existing isolated partitions running critical applications at the time when an isolated partition needs to be created.
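The effect of the workaround can be sketched as a simple mask computation: only CPUs that change isolation state and were not reserved via nohz_full at boot would need a hotplug cycle. This is a simplified userspace model, not the logic in the patches. The names toy_cpumask_t, cpus_needing_hotplug() and update_needs_hotplug() are hypothetical, and it assumes the decision depends only on these two masks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a kernel cpumask: bit N set means CPU N is in the set. */
typedef uint64_t toy_cpumask_t;

/*
 * Hypothetical helper: CPUs whose isolation state is changing and that
 * were not reserved with nohz_full at boot time.
 */
static toy_cpumask_t cpus_needing_hotplug(toy_cpumask_t changing,
					  toy_cpumask_t boot_nohz_full)
{
	return changing & ~boot_nohz_full;
}

/* True if at least one changing CPU falls outside the boot-time reserved set. */
static bool update_needs_hotplug(toy_cpumask_t changing,
				 toy_cpumask_t boot_nohz_full)
{
	return cpus_needing_hotplug(changing, boot_nohz_full) != 0;
}
```

For example, if CPUs 2-3 were reserved at boot (mask 0x0C) and a new isolated partition uses only CPU 3 (mask 0x08), no CPU falls outside the reserved set and the model reports that no hotplug cycle is needed.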
Patches 1-4 update the CPU isolation code at kernel/sched/isolation.c to enable dynamic update of the lists of housekeeping CPUs.
Patch 5 introduces a new cpuhp_offline_cb() API for shutting down the given set of CPUs, running the given callback method and then bringing those CPUs back online again. This new API will block any incoming hotplug events from interfering with this operation.
Patches 6-9 update the cpuset partition code to use the new cpuhp API to shut down the affected CPUs, make changes to the housekeeping cpumasks and then bring those CPUs back online afterward.
Patch 10 works around an issue in the DL server code that blocks the hotplug operation under certain configurations.
Patches 11-14 update the timer tick and related code to enable proper updates to the set of CPUs requiring nohz_full dynticks support.
Patch 15 enables runtime modification to the set of isolated CPUs requiring RCU NO-CB CPU support with minor changes to the RCU code.
Patches 16-18 include other miscellaneous updates to cpuset code and documentation.
This patch series is applied on top of some other cpuset patches[2] posted upstream recently.
[1] https://lore.kernel.org/lkml/20250806093855.86469-1-gmonaco@redhat.com/
[2] https://lore.kernel.org/lkml/20250806172430.1155133-1-longman@redhat.com/
Waiman Long (18):
  sched/isolation: Enable runtime update of housekeeping cpumasks
  sched/isolation: Call sched_tick_offload_init() when HK_FLAG_KERNEL_NOISE is first set
  sched/isolation: Use RCU to delay successive housekeeping cpumask updates
  sched/isolation: Add a debugfs file to dump housekeeping cpumasks
  cpu/hotplug: Add a new cpuhp_offline_cb() API
  cgroup/cpuset: Introduce a new top level isolcpus_update_mutex
  cgroup/cpuset: Allow overwriting HK_TYPE_DOMAIN housekeeping cpumask
  cgroup/cpuset: Use CPU hotplug to enable runtime nohz_full modification
  cgroup/cpuset: Revert "Include isolated cpuset CPUs in cpu_is_isolated() check"
  sched/core: Ignore DL BW deactivation error if in cpuhp_offline_cb_mode
  tick/nohz: Make nohz_full parameter optional
  tick/nohz: Introduce tick_nohz_full_update_cpus() to update tick_nohz_full_mask
  tick/nohz: Allow runtime changes in full dynticks CPUs
  tick: Pass timer tick job to an online HK CPU in tick_cpu_dying()
  cgroup/cpuset: Enable RCU NO-CB CPU offloading of newly isolated CPUs
  cgroup/cpuset: Don't set have_boot_nohz_full without any boot time nohz_full CPU
  cgroup/cpuset: Documentation updates & don't use CPU 0 for isolated partition
  cgroup/cpuset: Add pr_debug() statements for cpuhp_offline_cb() call
 Documentation/admin-guide/cgroup-v2.rst       |  33 +-
 .../admin-guide/kernel-parameters.txt         |  19 +-
 include/linux/context_tracking.h              |   8 +-
 include/linux/cpuhplock.h                     |   9 +
 include/linux/cpuset.h                        |   6 -
 include/linux/rcupdate.h                      |   2 +
 include/linux/sched/isolation.h               |   9 +-
 include/linux/tick.h                          |   2 +
 kernel/cgroup/cpuset.c                        | 344 ++++++++++++------
 kernel/context_tracking.c                     |  21 +-
 kernel/cpu.c                                  |  47 +++
 kernel/rcu/tree_nocb.h                        |   7 +-
 kernel/sched/core.c                           |   8 +-
 kernel/sched/debug.c                          |  32 ++
 kernel/sched/isolation.c                      | 151 +++++++-
 kernel/sched/sched.h                          |   2 +-
 kernel/time/tick-common.c                     |  15 +-
 kernel/time/tick-sched.c                      |  24 +-
 .../selftests/cgroup/test_cpuset_prs.sh       |  15 +-
 19 files changed, 583 insertions(+), 171 deletions(-)