The "nohz_full" and "rcu_nocbs" boot command parameters can be used to remove a lot of kernel overhead on a specific set of isolated CPUs which can be used to run some latency/bandwidth sensitive workloads with as little kernel disturbance/noise as possible. The problem with this mode of operation is the fact that it is a static configuration which cannot be changed after boot to adjust for changes in application loading.
There has long been a desire to enable runtime modification of the number of isolated CPUs that can be dedicated to such demanding workloads. This patch series is an attempt to do just that, with a degree of CPU isolation close to what can be achieved with the nohz_full and rcu_nocbs boot kernel parameters.
This patch series provides the ability to change the set of housekeeping CPUs at run time via the cpuset isolated partition functionality. Currently, a cpuset isolated partition is able to disable scheduler load balancing and adjust the CPU affinity of unbound workqueues to avoid the isolated CPUs. This patch series extends that to the other sources of kernel noise associated with the nohz_full boot command line parameter, which fall into the following sub-categories:
 - tick
 - timer
 - RCU
 - MISC
 - WQ
 - kthread
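For illustration, this is roughly how an isolated partition is created through the cgroup v2 interface today (the cgroup name and CPU list below are arbitrary examples):

  # cd /sys/fs/cgroup
  # echo +cpuset > cgroup.subtree_control
  # mkdir rt-part
  # echo 2-5 > rt-part/cpuset.cpus
  # echo isolated > rt-part/cpuset.cpus.partition

With this series applied, a successful write of "isolated" would also remove those CPUs from the relevant housekeeping cpumasks, instead of only affecting sched domains and unbound workqueues.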
The rcu_nocbs parameter is actually a subset of nohz_full focusing just on the RCU part of the kernel noise. The WQ part is already handled by the current cpuset code.
This series focuses on the tick and RCU parts of the kernel noise by actively changing their internal data structures to track changes in the list of isolated CPUs used by cpuset isolated partitions.
The dynamic update of the housekeeping CPU lists will also affect the other noise sources that reference those lists at run time.
The pending patch series on timer migration[1], when properly integrated, will support the timer part too.
The CPU hotplug functionality of the Linux kernel is used to facilitate the runtime change of the nohz_full isolated CPUs with minimal code changes. The CPUs that need to be switched from non-isolated to isolated, or vice versa, are first brought offline, the necessary changes are made, and the CPUs are then brought back online.
The use of CPU hotplug, however, does have the drawback of freezing all the other CPUs during part of the offlining process via the stop machine feature of the kernel. That will cause noticeable latency spikes in other running applications, which may be significant to sensitive applications running on isolated CPUs in other isolated partitions at the time. Hopefully we can find a way to solve this problem in the future.
One possible workaround for this is to reserve a set of nohz_full isolated CPUs at boot time using the nohz_full boot command parameter. Bringing those reserved nohz_full CPUs into and out of isolated partitions will not invoke CPU hotplug and hence will not cause unexpected latency spikes. These reserved CPUs will only be needed if there are other existing isolated partitions running critical applications at the time a new isolated partition needs to be created.
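As an illustrative example (the CPU numbers are hypothetical), a system could reserve a pool of CPUs at boot with:

  nohz_full=8-15

CPUs 8-15 could then be moved into and out of isolated partitions later without going through the CPU hotplug offline/online cycle, while isolating any CPU outside that pool would still incur it.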
Patches 1-4 update the CPU isolation code in kernel/sched/isolation.c to enable dynamic update of the housekeeping cpumasks.
Patch 5 introduces a new cpuhp_offline_cb() API for shutting down a given set of CPUs, running the given callback function and then bringing those CPUs back online again. This new API blocks any incoming hotplug events from interfering with the operation.
Patches 6-9 update the cpuset partition code to use the new cpuhp API to shut down the affected CPUs, make changes to the housekeeping cpumasks and then bring those CPUs back online afterward.
Patch 10 works around an issue in the DL server code that blocks the hotplug operation under certain configurations.
Patches 11-14 update the timer tick and related code to enable proper updates to the set of CPUs requiring nohz_full dynticks support.
Patch 15 enables runtime modification of the set of isolated CPUs requiring RCU NO-CB CPU support with minor changes to the RCU code.
Patches 16-18 include other miscellaneous updates to the cpuset code and documentation.
This patch series is applied on top of some other cpuset patches[2] posted upstream recently.
[1] https://lore.kernel.org/lkml/20250806093855.86469-1-gmonaco@redhat.com/
[2] https://lore.kernel.org/lkml/20250806172430.1155133-1-longman@redhat.com/
Waiman Long (18):
  sched/isolation: Enable runtime update of housekeeping cpumasks
  sched/isolation: Call sched_tick_offload_init() when HK_FLAG_KERNEL_NOISE is first set
  sched/isolation: Use RCU to delay successive housekeeping cpumask updates
  sched/isolation: Add a debugfs file to dump housekeeping cpumasks
  cpu/hotplug: Add a new cpuhp_offline_cb() API
  cgroup/cpuset: Introduce a new top level isolcpus_update_mutex
  cgroup/cpuset: Allow overwriting HK_TYPE_DOMAIN housekeeping cpumask
  cgroup/cpuset: Use CPU hotplug to enable runtime nohz_full modification
  cgroup/cpuset: Revert "Include isolated cpuset CPUs in cpu_is_isolated() check"
  sched/core: Ignore DL BW deactivation error if in cpuhp_offline_cb_mode
  tick/nohz: Make nohz_full parameter optional
  tick/nohz: Introduce tick_nohz_full_update_cpus() to update tick_nohz_full_mask
  tick/nohz: Allow runtime changes in full dynticks CPUs
  tick: Pass timer tick job to an online HK CPU in tick_cpu_dying()
  cgroup/cpuset: Enable RCU NO-CB CPU offloading of newly isolated CPUs
  cgroup/cpuset: Don't set have_boot_nohz_full without any boot time nohz_full CPU
  cgroup/cpuset: Documentation updates & don't use CPU 0 for isolated partition
  cgroup/cpuset: Add pr_debug() statements for cpuhp_offline_cb() call
 Documentation/admin-guide/cgroup-v2.rst        |  33 +-
 .../admin-guide/kernel-parameters.txt          |  19 +-
 include/linux/context_tracking.h               |   8 +-
 include/linux/cpuhplock.h                      |   9 +
 include/linux/cpuset.h                         |   6 -
 include/linux/rcupdate.h                       |   2 +
 include/linux/sched/isolation.h                |   9 +-
 include/linux/tick.h                           |   2 +
 kernel/cgroup/cpuset.c                         | 344 ++++++++++++------
 kernel/context_tracking.c                      |  21 +-
 kernel/cpu.c                                   |  47 +++
 kernel/rcu/tree_nocb.h                         |   7 +-
 kernel/sched/core.c                            |   8 +-
 kernel/sched/debug.c                           |  32 ++
 kernel/sched/isolation.c                       | 151 +++++++-
 kernel/sched/sched.h                           |   2 +-
 kernel/time/tick-common.c                      |  15 +-
 kernel/time/tick-sched.c                       |  24 +-
 .../selftests/cgroup/test_cpuset_prs.sh        |  15 +-
 19 files changed, 583 insertions(+), 171 deletions(-)
The housekeeping CPU masks, set up by the "isolcpus" and "nohz_full" boot command line options, are used at boot time to exclude selected CPUs from running some kernel background processes so as to minimize disturbance to latency sensitive userspace applications. Some of the housekeeping CPU masks are also checked at run time to avoid using those isolated CPUs.
The cpuset subsystem is now able to dynamically create a set of isolated CPUs to be used in isolated cpuset partitions. The long term goal is to make the degree of isolation as close as possible to what can be done statically using those boot command line options.
This patch is a step in that direction by providing a new housekeeping_exclude_cpumask() API to exclude only the given cpumask from the housekeeping cpumasks. An existing boot time "isolcpus" or "nohz_full" cpumask setup, if present, can be overwritten.
Two sets of cpumasks are now kept internally. One set is used by the callers while the other set is being updated, before the new set is atomically switched in.
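As a rough usage sketch (the mask name below is illustrative and error handling is elided), a caller excludes a set of CPUs from selected housekeeping types, and can pass NULL to undo the exclusion:

	/* Exclude the CPUs in isolated_mask from the domain and
	 * kernel-noise housekeeping cpumasks.
	 */
	ret = housekeeping_exclude_cpumask(isolated_mask,
			BIT(HK_TYPE_DOMAIN) | BIT(HK_TYPE_KERNEL_NOISE));

	/* Later: no CPU excluded, all possible CPUs do housekeeping again */
	ret = housekeeping_exclude_cpumask(NULL,
			BIT(HK_TYPE_DOMAIN) | BIT(HK_TYPE_KERNEL_NOISE));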
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/sched/isolation.h |  6 +++
 kernel/sched/isolation.c        | 95 +++++++++++++++++++++++++++++----
 2 files changed, 91 insertions(+), 10 deletions(-)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index d8501f4709b5..af38d21d0d00 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -32,6 +32,7 @@ extern bool housekeeping_enabled(enum hk_type type);
 extern void housekeeping_affine(struct task_struct *t, enum hk_type type);
 extern bool housekeeping_test_cpu(int cpu, enum hk_type type);
 extern void __init housekeeping_init(void);
+extern int housekeeping_exclude_cpumask(struct cpumask *cpumask, unsigned long flags);

 #else

@@ -59,6 +60,11 @@ static inline bool housekeeping_test_cpu(int cpu, enum hk_type type)
 }
 static inline void housekeeping_init(void) { }
+
+static inline int housekeeping_exclude_cpumask(struct cpumask *cpumask, unsigned long flags)
+{
+	return -EOPNOTSUPP;
+}
 #endif /* CONFIG_CPU_ISOLATION */
 static inline bool housekeeping_cpu(int cpu, enum hk_type type)

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index a4cf17b1fab0..3fb0e8ccce26 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -19,8 +19,16 @@ enum hk_flags {
 DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
 EXPORT_SYMBOL_GPL(housekeeping_overridden);

+/*
+ * The housekeeping cpumasks can now be dynamically updated at run time.
+ * Two sets of cpumasks are kept. One set can be used while the other set is
+ * being updated concurrently.
+ */
+static DEFINE_RAW_SPINLOCK(cpumask_lock);
 struct housekeeping {
-	cpumask_var_t cpumasks[HK_TYPE_MAX];
+	struct cpumask *cpumask_ptrs[HK_TYPE_MAX];
+	cpumask_var_t cpumasks[HK_TYPE_MAX][2];
+	unsigned int seq_nrs[HK_TYPE_MAX];
 	unsigned long flags;
 };
@@ -38,11 +46,13 @@ int housekeeping_any_cpu(enum hk_type type)
 	if (static_branch_unlikely(&housekeeping_overridden)) {
 		if (housekeeping.flags & BIT(type)) {
-			cpu = sched_numa_find_closest(housekeeping.cpumasks[type], smp_processor_id());
+			struct cpumask *cpumask = READ_ONCE(housekeeping.cpumask_ptrs[type]);
+
+			cpu = sched_numa_find_closest(cpumask, smp_processor_id());
 			if (cpu < nr_cpu_ids)
 				return cpu;

-			cpu = cpumask_any_and_distribute(housekeeping.cpumasks[type], cpu_online_mask);
+			cpu = cpumask_any_and_distribute(cpumask, cpu_online_mask);
 			if (likely(cpu < nr_cpu_ids))
 				return cpu;
 			/*
@@ -62,7 +72,7 @@ const struct cpumask *housekeeping_cpumask(enum hk_type type)
 {
 	if (static_branch_unlikely(&housekeeping_overridden))
 		if (housekeeping.flags & BIT(type))
-			return housekeeping.cpumasks[type];
+			return READ_ONCE(housekeeping.cpumask_ptrs[type]);
 	return cpu_possible_mask;
 }
 EXPORT_SYMBOL_GPL(housekeeping_cpumask);

@@ -71,7 +81,7 @@ void housekeeping_affine(struct task_struct *t, enum hk_type type)
 {
 	if (static_branch_unlikely(&housekeeping_overridden))
 		if (housekeeping.flags & BIT(type))
-			set_cpus_allowed_ptr(t, housekeeping.cpumasks[type]);
+			set_cpus_allowed_ptr(t, READ_ONCE(housekeeping.cpumask_ptrs[type]));
 }
 EXPORT_SYMBOL_GPL(housekeeping_affine);

@@ -79,7 +89,7 @@ bool housekeeping_test_cpu(int cpu, enum hk_type type)
 {
 	if (static_branch_unlikely(&housekeeping_overridden))
 		if (housekeeping.flags & BIT(type))
-			return cpumask_test_cpu(cpu, housekeeping.cpumasks[type]);
+			return cpumask_test_cpu(cpu, READ_ONCE(housekeeping.cpumask_ptrs[type]));
 	return true;
 }
 EXPORT_SYMBOL_GPL(housekeeping_test_cpu);

@@ -98,7 +108,7 @@ void __init housekeeping_init(void)

 	for_each_set_bit(type, &housekeeping.flags, HK_TYPE_MAX) {
 		/* We need at least one CPU to handle housekeeping work */
-		WARN_ON_ONCE(cpumask_empty(housekeeping.cpumasks[type]));
+		WARN_ON_ONCE(cpumask_empty(housekeeping.cpumask_ptrs[type]));
 	}
 }

@@ -106,8 +116,10 @@ static void __init housekeeping_setup_type(enum hk_type type,
 					   cpumask_var_t housekeeping_staging)
 {
-	alloc_bootmem_cpumask_var(&housekeeping.cpumasks[type]);
-	cpumask_copy(housekeeping.cpumasks[type],
+	alloc_bootmem_cpumask_var(&housekeeping.cpumasks[type][0]);
+	alloc_bootmem_cpumask_var(&housekeeping.cpumasks[type][1]);
+	housekeeping.cpumask_ptrs[type] = housekeeping.cpumasks[type][0];
+	cpumask_copy(housekeeping.cpumask_ptrs[type],
 		     housekeeping_staging);
 }
@@ -161,7 +173,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
 	for_each_set_bit(type, &iter_flags, HK_TYPE_MAX) {
 		if (!cpumask_equal(housekeeping_staging,
-				   housekeeping.cpumasks[type])) {
+				   housekeeping.cpumask_ptrs[type])) {
 			pr_warn("Housekeeping: nohz_full= must match isolcpus=\n");
 			goto free_housekeeping_staging;
 		}
@@ -251,3 +263,66 @@ static int __init housekeeping_isolcpus_setup(char *str)
 	return housekeeping_setup(str, flags);
 }
 __setup("isolcpus=", housekeeping_isolcpus_setup);
+
+/**
+ * housekeeping_exclude_cpumask - Update housekeeping cpumasks to exclude only the given cpumask
+ * @cpumask: new cpumask to be excluded from housekeeping cpumasks
+ * @hk_flags: bit mask of housekeeping types to be excluded
+ * Return: 0 if successful, error code if an error happens.
+ *
+ * Exclude the given cpumask from the housekeeping cpumasks associated with
+ * the given hk_flags. If the given cpumask is NULL, no CPU will need to be
+ * excluded.
+ */
+int housekeeping_exclude_cpumask(struct cpumask *cpumask, unsigned long hk_flags)
+{
+	unsigned long type;
+
+#ifdef CONFIG_CPUMASK_OFFSTACK
+	/*
+	 * Pre-allocate cpumasks, if needed
+	 */
+	for_each_set_bit(type, &hk_flags, HK_TYPE_MAX) {
+		cpumask_var_t mask0, mask1;
+
+		if (housekeeping.cpumask_ptrs[type])
+			continue;
+		if (!zalloc_cpumask_var(&mask0, GFP_KERNEL) ||
+		    !zalloc_cpumask_var(&mask1, GFP_KERNEL))
+			return -ENOMEM;
+
+		/*
+		 * cpumasks[type][] should be NULL, still do a swap & free
+		 * dance just in case the cpumasks are allocated but
+		 * cpumask_ptrs not setup somehow.
+		 */
+		mask0 = xchg(&housekeeping.cpumasks[type][0], mask0);
+		mask1 = xchg(&housekeeping.cpumasks[type][1], mask1);
+		free_cpumask_var(mask0);
+		free_cpumask_var(mask1);
+	}
+#endif
+
+	raw_spin_lock(&cpumask_lock);
+
+	for_each_set_bit(type, &hk_flags, HK_TYPE_MAX) {
+		int idx = ++housekeeping.seq_nrs[type] & 1;
+		struct cpumask *dst_cpumask = housekeeping.cpumasks[type][idx];
+
+		if (!cpumask) {
+			cpumask_copy(dst_cpumask, cpu_possible_mask);
+			housekeeping.flags &= ~BIT(type);
+		} else {
+			cpumask_andnot(dst_cpumask, cpu_possible_mask, cpumask);
+			housekeeping.flags |= BIT(type);
+		}
+		WRITE_ONCE(housekeeping.cpumask_ptrs[type], dst_cpumask);
+	}
+	raw_spin_unlock(&cpumask_lock);
+
+	if (!housekeeping.flags && static_key_enabled(&housekeeping_overridden))
+		static_key_disable(&housekeeping_overridden.key);
+	else if (housekeeping.flags && !static_key_enabled(&housekeeping_overridden))
+		static_key_enable(&housekeeping_overridden.key);
+	return 0;
+}
The sched_tick_offload_init() function is called at boot time whenever "nohz_full" is set. Now that the housekeeping cpumasks can be updated at run time without the corresponding "nohz_full" kernel parameter, sched_tick_offload_init() must also be callable at run time to allow tick offloading. Remove the __init attribute from sched_tick_offload_init() and call it when the HK_FLAG_KERNEL_NOISE flag is first set.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/core.c      |  2 +-
 kernel/sched/isolation.c | 10 +++++++++-
 kernel/sched/sched.h     |  2 +-
 3 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index be00629f0ba4..9f02c047e25b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5783,7 +5783,7 @@ static void sched_tick_stop(int cpu)
 }
 #endif /* CONFIG_HOTPLUG_CPU */

-int __init sched_tick_offload_init(void)
+int sched_tick_offload_init(void)
 {
 	tick_work_cpu = alloc_percpu(struct tick_work);
 	BUG_ON(!tick_work_cpu);

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 3fb0e8ccce26..ee396ae13719 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -33,6 +33,7 @@ struct housekeeping {
 };

 static struct housekeeping housekeeping;
+static bool sched_tick_offload_inited;

 bool housekeeping_enabled(enum hk_type type)
 {
@@ -103,8 +104,10 @@ void __init housekeeping_init(void)
static_branch_enable(&housekeeping_overridden);
-	if (housekeeping.flags & HK_FLAG_KERNEL_NOISE)
+	if (housekeeping.flags & HK_FLAG_KERNEL_NOISE) {
 		sched_tick_offload_init();
+		sched_tick_offload_inited = true;
+	}

 	for_each_set_bit(type, &housekeeping.flags, HK_TYPE_MAX) {
 		/* We need at least one CPU to handle housekeeping work */
@@ -324,5 +327,10 @@ int housekeeping_exclude_cpumask(struct cpumask *cpumask, unsigned long hk_flags)
 		static_key_disable(&housekeeping_overridden.key);
 	else if (housekeeping.flags && !static_key_enabled(&housekeeping_overridden))
 		static_key_enable(&housekeeping_overridden.key);
+
+	if ((housekeeping.flags & HK_FLAG_KERNEL_NOISE) && !sched_tick_offload_inited) {
+		sched_tick_offload_init();
+		sched_tick_offload_inited = true;
+	}
 	return 0;
 }

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index be9745d104f7..d4676305e099 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2671,7 +2671,7 @@ extern void post_init_entity_util_avg(struct task_struct *p);

 #ifdef CONFIG_NO_HZ_FULL
 extern bool sched_can_stop_tick(struct rq *rq);
-extern int __init sched_tick_offload_init(void);
+extern int sched_tick_offload_init(void);

 /*
  * Tick may be needed by tasks in the runqueue depending on their policy and
Even though there are two separate sets of housekeeping cpumasks for access and update, it is possible that the set being updated is still in use by callers of the housekeeping functions, resulting in the use of an intermediate cpumask that is neither the old nor the new one.
To reduce the chance of this, introduce a delay between successive housekeeping cpumask updates. One simple way is to use an RCU grace period as the delay. Callers of the housekeeping APIs can optionally hold rcu_read_lock() to eliminate the chance of seeing an intermediate housekeeping cpumask.
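A reader that wants to be sure it never sees an intermediate cpumask could follow this pattern (a sketch only):

	const struct cpumask *mask;

	rcu_read_lock();	/* pin the currently published cpumask */
	mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
	/* use mask; the updater waits a grace period before reusing it */
	rcu_read_unlock();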
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/isolation.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index ee396ae13719..f26708667754 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -23,6 +23,9 @@ EXPORT_SYMBOL_GPL(housekeeping_overridden);
  * The housekeeping cpumasks can now be dynamically updated at run time.
  * Two sets of cpumasks are kept. One set can be used while the other set is
  * being updated concurrently.
+ *
+ * rcu_read_lock() can optionally be held by housekeeping API callers to
+ * ensure stability of the cpumasks.
  */
 static DEFINE_RAW_SPINLOCK(cpumask_lock);
 struct housekeeping {
@@ -34,6 +37,8 @@ struct housekeeping {

 static struct housekeeping housekeeping;
 static bool sched_tick_offload_inited;
+static struct rcu_head rcu_gp[HK_TYPE_MAX];
+static unsigned long update_flags;

 bool housekeeping_enabled(enum hk_type type)
 {
@@ -267,6 +272,18 @@ static int __init housekeeping_isolcpus_setup(char *str)
 }
 __setup("isolcpus=", housekeeping_isolcpus_setup);

+/*
+ * Bits in update_flags can only be turned on with cpumask_lock held and
+ * are cleared by this RCU callback function.
+ */
+static void rcu_gp_end(struct rcu_head *rcu)
+{
+	int type = rcu - rcu_gp;
+
+	/* Atomically clear the corresponding flag bit */
+	clear_bit(type, &update_flags);
+}
+
 /**
  * housekeeping_exclude_cpumask - Update housekeeping cpumasks to exclude only the given cpumask
  * @cpumask: new cpumask to be excluded from housekeeping cpumasks
@@ -306,8 +323,21 @@ int housekeeping_exclude_cpumask(struct cpumask *cpumask, unsigned long hk_flags)
 	}
 #endif

+retry:
+	/*
+	 * If the RCU grace period for the previous update with conflicting
+	 * flag bits hasn't been completed yet, we have to wait for it.
+	 */
+	while (READ_ONCE(update_flags) & hk_flags)
+		synchronize_rcu();
+
 	raw_spin_lock(&cpumask_lock);

+	if (READ_ONCE(update_flags) & hk_flags) {
+		raw_spin_unlock(&cpumask_lock);
+		goto retry;
+	}
+
 	for_each_set_bit(type, &hk_flags, HK_TYPE_MAX) {
 		int idx = ++housekeeping.seq_nrs[type] & 1;
 		struct cpumask *dst_cpumask = housekeeping.cpumasks[type][idx];
@@ -320,8 +350,11 @@ int housekeeping_exclude_cpumask(struct cpumask *cpumask, unsigned long hk_flags)
 			housekeeping.flags |= BIT(type);
 		}
 		WRITE_ONCE(housekeeping.cpumask_ptrs[type], dst_cpumask);
+		set_bit(type, &update_flags);
 	}
 	raw_spin_unlock(&cpumask_lock);
+	for_each_set_bit(type, &hk_flags, HK_TYPE_MAX)
+		call_rcu(&rcu_gp[type], rcu_gp_end);

 	if (!housekeeping.flags && static_key_enabled(&housekeeping_overridden))
 		static_key_disable(&housekeeping_overridden.key);
As housekeeping cpumasks can now be modified at run time, we need a way to examine their current values to see if they meet expectations. Add a new sched debugfs file, "housekeeping_cpumasks", to dump the current values.
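Reading the new file would then produce output along these lines (the CPU lists shown are illustrative):

  # cat /sys/kernel/debug/sched/housekeeping_cpumasks
  domain: 0-1,6-7
  managed_irq: 0-7
  nohz_full: 0-1,6-7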
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/debug.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 3f06ab84d53f..ba8f0334c15e 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -490,6 +490,35 @@ static void debugfs_fair_server_init(void)
 	}
 }

+#ifdef CONFIG_CPU_ISOLATION
+static int hk_cpumasks_show(struct seq_file *m, void *v)
+{
+	static const char * const hk_type_name[HK_TYPE_MAX] = {
+		[HK_TYPE_DOMAIN]	= "domain",
+		[HK_TYPE_MANAGED_IRQ]	= "managed_irq",
+		[HK_TYPE_KERNEL_NOISE]	= "nohz_full"
+	};
+	int type;
+
+	for (type = 0; type < HK_TYPE_MAX; type++)
+		seq_printf(m, "%s: %*pbl\n", hk_type_name[type],
+			   cpumask_pr_args(housekeeping_cpumask(type)));
+	return 0;
+}
+
+static int hk_cpumasks_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, hk_cpumasks_show, NULL);
+}
+
+static const struct file_operations hk_cpumasks_fops = {
+	.open		= hk_cpumasks_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+#endif
+
 static __init int sched_init_debug(void)
 {
 	struct dentry __maybe_unused *numa;
@@ -525,6 +554,9 @@ static __init int sched_init_debug(void)
 	debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
 #endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_CPU_ISOLATION
+	debugfs_create_file("housekeeping_cpumasks", 0444, debugfs_sched, NULL, &hk_cpumasks_fops);
+#endif
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
debugfs_fair_server_init();
Add a new cpuhp_offline_cb() API that allows us to offline a set of CPUs one-by-one, run the given callback function and then bring those CPUs back online again, while inhibiting any concurrent CPU hotplug operations.
This new API can be used to enable runtime adjustment of the nohz_full and isolcpus boot command line options. A new cpuhp_offline_cb_mode flag is also added to signal that the system is in this offline-callback transient state so that some hotplug operations can be optimized away if we choose to do so.
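A minimal usage sketch of the new API (the callback and mask names below are made up for illustration):

	/* Invoked while all CPUs in the mask are offline */
	static int isolation_update_cb(void *arg)
	{
		struct cpumask *mask = arg;

		/* ... update housekeeping state for the CPUs in mask ... */
		return 0;
	}

	/* Offline the CPUs, run the callback, then online them again */
	ret = cpuhp_offline_cb(update_mask, isolation_update_cb, update_mask);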
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/cpuhplock.h |  9 ++++++++
 kernel/cpu.c              | 47 +++++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+)
diff --git a/include/linux/cpuhplock.h b/include/linux/cpuhplock.h
index f7aa20f62b87..b42b81361abc 100644
--- a/include/linux/cpuhplock.h
+++ b/include/linux/cpuhplock.h
@@ -9,7 +9,9 @@

 #include <linux/cleanup.h>
 #include <linux/errno.h>
+#include <linux/cpumask_types.h>

+typedef int (*cpuhp_cb_t)(void *arg);
 struct device;

 extern int lockdep_is_cpus_held(void);
@@ -28,6 +30,8 @@ void clear_tasks_mm_cpumask(int cpu);
 int remove_cpu(unsigned int cpu);
 int cpu_device_down(struct device *dev);
 void smp_shutdown_nonboot_cpus(unsigned int primary_cpu);
+int cpuhp_offline_cb(struct cpumask *mask, cpuhp_cb_t func, void *arg);
+extern bool cpuhp_offline_cb_mode;
#else /* CONFIG_HOTPLUG_CPU */
@@ -42,6 +46,11 @@ static inline void cpu_hotplug_disable(void) { }
 static inline void cpu_hotplug_enable(void) { }
 static inline int remove_cpu(unsigned int cpu) { return -EPERM; }
 static inline void smp_shutdown_nonboot_cpus(unsigned int primary_cpu) { }
+static inline int cpuhp_offline_cb(struct cpumask *mask, cpuhp_cb_t func, void *arg)
+{
+	return -EPERM;
+}
+#define cpuhp_offline_cb_mode	false
 #endif /* !CONFIG_HOTPLUG_CPU */

 DEFINE_LOCK_GUARD_0(cpus_read_lock, cpus_read_lock(), cpus_read_unlock())

diff --git a/kernel/cpu.c b/kernel/cpu.c
index faf0f23fc5d8..b6364a1950b1 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1534,6 +1534,53 @@ int remove_cpu(unsigned int cpu)
 }
 EXPORT_SYMBOL_GPL(remove_cpu);

+bool cpuhp_offline_cb_mode;
+
+/**
+ * cpuhp_offline_cb - offline CPUs, invoke callback function & online CPUs afterward
+ * @mask: A mask of CPUs to be taken offline and then online
+ * @func: A callback function to be invoked while the given CPUs are offline
+ * @arg: Argument to be passed back to the callback function
+ * Return: 0 if successful, an error code otherwise
+ */
+int cpuhp_offline_cb(struct cpumask *mask, cpuhp_cb_t func, void *arg)
+{
+	int cpu, ret, ret2 = 0;
+
+	if (WARN_ON_ONCE(cpumask_empty(mask)))
+		return -EINVAL;
+
+	lock_device_hotplug();
+	cpuhp_offline_cb_mode = true;
+	for_each_cpu(cpu, mask) {
+		ret = device_offline(get_cpu_device(cpu));
+		if (unlikely(ret)) {
+			int cpu2;
+
+			/* Online the offlined CPUs before returning */
+			for_each_cpu(cpu2, mask) {
+				if (cpu2 == cpu)
+					break;
+				device_online(get_cpu_device(cpu2));
+			}
+			goto out;
+		}
+	}
+	ret = func(arg);
+
+	/* Bring CPUs back online */
+	for_each_cpu(cpu, mask) {
+		int ret3 = device_online(get_cpu_device(cpu));
+
+		if (ret3 && !ret2)
+			ret2 = ret3;
+	}
+out:
+	cpuhp_offline_cb_mode = false;
+	unlock_device_hotplug();
+	return ret ? ret : (ret2 ? ret2 : 0);
+}
+
 void smp_shutdown_nonboot_cpus(unsigned int primary_cpu)
 {
 	unsigned int cpu;
The current cpuset partition code is able to dynamically update the sched domains of a running system to perform what is essentially the "isolcpus=domain,..." boot command line feature at run time.
To enable runtime modification of nohz_full, we will have to make use of the CPU hotplug functionality to facilitate the proper addition or removal of nohz_full CPUs. In other words, we can't hold the cpu_hotplug_lock while doing so. Given the current lock ordering, we will need to introduce a new top level mutex to ensure proper mutual exclusion whenever cpuset states that may require the use of CPU hotplug are updated. This patch introduces a new top level isolcpus_update_mutex for that purpose. This new mutex is acquired whenever the cpuset partition states or the set of isolated CPUs may have to be changed.
update_unbound_workqueue_cpumask() is now renamed to update_isolation_cpumasks() and moved outside of the cpu_hotplug_lock critical sections to enable its future extension to invoke CPU hotplug.
A new global isolcpus_update_state structure is added to track whether update_isolation_cpumasks() needs to be invoked, so the existing partition_xcpus_add/del() functions and their callers can now be simplified.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 149 ++++++++++++++++++++++++-----------------
 1 file changed, 86 insertions(+), 63 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 27adb04df675..2190efd33efb 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -215,29 +215,39 @@ static struct cpuset top_cpuset = {
 };

 /*
- * There are two global locks guarding cpuset structures - cpuset_mutex and
- * callback_lock. The cpuset code uses only cpuset_mutex. Other kernel
- * subsystems can use cpuset_lock()/cpuset_unlock() to prevent change to cpuset
- * structures. Note that cpuset_mutex needs to be a mutex as it is used in
- * paths that rely on priority inheritance (e.g. scheduler - on RT) for
- * correctness.
+ * CPUSET Locking Convention
+ * -------------------------
  *
- * A task must hold both locks to modify cpusets. If a task holds
- * cpuset_mutex, it blocks others, ensuring that it is the only task able to
- * also acquire callback_lock and be able to modify cpusets. It can perform
- * various checks on the cpuset structure first, knowing nothing will change.
- * It can also allocate memory while just holding cpuset_mutex. While it is
- * performing these checks, various callback routines can briefly acquire
- * callback_lock to query cpusets. Once it is ready to make the changes, it
- * takes callback_lock, blocking everyone else.
+ * Below are the global locks guarding cpuset structures, in lock
+ * acquisition order:
+ *  - isolcpus_update_mutex
+ *  - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
+ *  - cpuset_mutex
+ *  - callback_lock (raw spinlock)
  *
- * Calls to the kernel memory allocator can not be made while holding
- * callback_lock, as that would risk double tripping on callback_lock
- * from one of the callbacks into the cpuset code from within
- * __alloc_pages().
+ * The first lock, isolcpus_update_mutex, should only be held if the existing
+ * set of isolated CPUs (in isolated partitions) or any of the partition
+ * states may be changed. Otherwise, it can be skipped. This is used to
+ * prevent concurrent updates to the set of isolated CPUs.
  *
- * If a task is only holding callback_lock, then it has read-only
- * access to cpusets.
+ * A task must hold all the remaining three locks to modify externally visible
+ * or used fields of cpusets, though some of the internally used cpuset fields
+ * can be modified by holding cpu_hotplug_lock and cpuset_mutex only. If only
+ * reliable read access of the externally used fields is needed, a task can
+ * hold either cpuset_mutex or callback_lock.
+ *
+ * If a task holds cpu_hotplug_lock and cpuset_mutex, it blocks others,
+ * ensuring that it is the only task able to also acquire callback_lock and
+ * be able to modify cpusets. It can perform various checks on the cpuset
+ * structure first, knowing nothing will change. It can also allocate memory
+ * without holding callback_lock. While it is performing these checks, various
+ * callback routines can briefly acquire callback_lock to query cpusets. Once
+ * it is ready to make the changes, it takes callback_lock, blocking everyone
+ * else.
+ *
+ * Calls to the kernel memory allocator cannot be made while holding
+ * callback_lock, which is a spinlock, as the memory allocator may sleep or
+ * call back into cpuset code and acquire callback_lock.
  *
  * Now, the task_struct fields mems_allowed and mempolicy may be changed
  * by other task, we use alloc_lock in the task_struct fields to protect
@@ -248,6 +258,7 @@ static struct cpuset top_cpuset = {
  * cpumasks and nodemasks.
  */

+static DEFINE_MUTEX(isolcpus_update_mutex);
 static DEFINE_MUTEX(cpuset_mutex);

 void cpuset_lock(void)
 {
@@ -272,6 +283,17 @@ void cpuset_callback_unlock_irq(void)
 	spin_unlock_irq(&callback_lock);
 }

+/*
+ * Isolcpus update state (protected by isolcpus_update_mutex)
+ *
+ * It contains data related to updating the isolated CPUs configuration in
+ * isolated partitions.
+ */
+static struct {
+	bool updating;		/* Isolcpus updating in progress */
+	cpumask_var_t cpus;	/* CPUs to be updated */
+} isolcpus_update_state;
+
 static struct workqueue_struct *cpuset_migrate_mm_wq;

 static DECLARE_WAIT_QUEUE_HEAD(cpuset_attach_wq);
@@ -1273,6 +1295,9 @@ static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus)
 		cpumask_or(isolated_cpus, isolated_cpus, xcpus);
 	else
 		cpumask_andnot(isolated_cpus, isolated_cpus, xcpus);
+
+	isolcpus_update_state.updating = true;
+	cpumask_or(isolcpus_update_state.cpus, isolcpus_update_state.cpus, xcpus);
 }
 /*
@@ -1280,31 +1305,26 @@ static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus)
  * @new_prs: new partition_root_state
  * @parent: parent cpuset
  * @xcpus: exclusive CPUs to be added
- * Return: true if isolated_cpus modified, false otherwise
  *
  * Remote partition if parent == NULL
  */
-static bool partition_xcpus_add(int new_prs, struct cpuset *parent,
+static void partition_xcpus_add(int new_prs, struct cpuset *parent,
 				struct cpumask *xcpus)
 {
-	bool isolcpus_updated;
-
 	WARN_ON_ONCE(new_prs < 0);
 	lockdep_assert_held(&callback_lock);
 	if (!parent)
 		parent = &top_cpuset;

-
 	if (parent == &top_cpuset)
 		cpumask_or(subpartitions_cpus, subpartitions_cpus, xcpus);

-	isolcpus_updated = (new_prs != parent->partition_root_state);
-	if (isolcpus_updated)
+	if (new_prs != parent->partition_root_state)
 		isolated_cpus_update(parent->partition_root_state, new_prs,
 				     xcpus);

 	cpumask_andnot(parent->effective_cpus, parent->effective_cpus, xcpus);
-	return isolcpus_updated;
+	return;
 }

 /*
@@ -1312,15 +1332,12 @@ static bool partition_xcpus_add(int new_prs, struct cpuset *parent,
  * @old_prs: old partition_root_state
  * @parent: parent cpuset
  * @xcpus: exclusive CPUs to be removed
- * Return: true if isolated_cpus modified, false otherwise
  *
  * Remote partition if parent == NULL
  */
-static bool partition_xcpus_del(int old_prs, struct cpuset *parent,
+static void partition_xcpus_del(int old_prs, struct cpuset *parent,
 				struct cpumask *xcpus)
 {
-	bool isolcpus_updated;
-
 	WARN_ON_ONCE(old_prs < 0);
 	lockdep_assert_held(&callback_lock);
 	if (!parent)
@@ -1329,27 +1346,33 @@ static bool partition_xcpus_del(int old_prs, struct cpuset *parent,
 	if (parent == &top_cpuset)
 		cpumask_andnot(subpartitions_cpus, subpartitions_cpus, xcpus);

-	isolcpus_updated = (old_prs != parent->partition_root_state);
-	if (isolcpus_updated)
+	if (old_prs != parent->partition_root_state)
 		isolated_cpus_update(old_prs, parent->partition_root_state,
 				     xcpus);

 	cpumask_and(xcpus, xcpus, cpu_active_mask);
 	cpumask_or(parent->effective_cpus, parent->effective_cpus, xcpus);
-	return isolcpus_updated;
+	return;
 }

-static void update_unbound_workqueue_cpumask(bool isolcpus_updated)
+/**
+ * update_isolation_cpumasks - Update external isolation CPU masks
+ *
+ * The following external CPU masks will be updated if necessary:
+ *  - workqueue unbound cpumask
+ */
+static void update_isolation_cpumasks(void)
 {
 	int ret;

-	lockdep_assert_cpus_held();
-
-	if (!isolcpus_updated)
+	if (!isolcpus_update_state.updating)
 		return;

 	ret = workqueue_unbound_exclude_cpumask(isolated_cpus);
 	WARN_ON_ONCE(ret < 0);
+
+	cpumask_clear(isolcpus_update_state.cpus);
+	isolcpus_update_state.updating = false;
 }
 /**
@@ -1441,8 +1464,6 @@ static inline bool is_local_partition(struct cpuset *cs)
 static int remote_partition_enable(struct cpuset *cs, int new_prs,
 				   struct tmpmasks *tmp)
 {
-	bool isolcpus_updated;
-
 	/*
 	 * The user must have sysadmin privilege.
 	 */
@@ -1466,11 +1487,10 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
 		return PERR_INVCPUS;

 	spin_lock_irq(&callback_lock);
-	isolcpus_updated = partition_xcpus_add(new_prs, NULL, tmp->new_cpus);
+	partition_xcpus_add(new_prs, NULL, tmp->new_cpus);
 	list_add(&cs->remote_sibling, &remote_children);
 	cpumask_copy(cs->effective_xcpus, tmp->new_cpus);
 	spin_unlock_irq(&callback_lock);
-	update_unbound_workqueue_cpumask(isolcpus_updated);
 	cpuset_force_rebuild();
 	cs->prs_err = 0;

@@ -1493,15 +1513,12 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
  */
 static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
 {
-	bool isolcpus_updated;
-
 	WARN_ON_ONCE(!is_remote_partition(cs));
 	WARN_ON_ONCE(!cpumask_subset(cs->effective_xcpus, subpartitions_cpus));

 	spin_lock_irq(&callback_lock);
 	list_del_init(&cs->remote_sibling);
-	isolcpus_updated = partition_xcpus_del(cs->partition_root_state,
-					       NULL, cs->effective_xcpus);
+	partition_xcpus_del(cs->partition_root_state, NULL, cs->effective_xcpus);
 	if (cs->prs_err)
 		cs->partition_root_state = -cs->partition_root_state;
 	else
@@ -1511,7 +1528,6 @@ static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
 	compute_effective_exclusive_cpumask(cs, NULL, NULL);
 	reset_partition_data(cs);
 	spin_unlock_irq(&callback_lock);
-	update_unbound_workqueue_cpumask(isolcpus_updated);
 	cpuset_force_rebuild();

 	/*
@@ -1536,7 +1552,6 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
 {
 	bool adding, deleting;
 	int prs = cs->partition_root_state;
-	int isolcpus_updated = 0;

 	if (WARN_ON_ONCE(!is_remote_partition(cs)))
 		return;
@@ -1569,9 +1584,9 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,

 	spin_lock_irq(&callback_lock);
 	if (adding)
-		isolcpus_updated += partition_xcpus_add(prs, NULL, tmp->addmask);
+		partition_xcpus_add(prs, NULL, tmp->addmask);
 	if (deleting)
-		isolcpus_updated += partition_xcpus_del(prs, NULL, tmp->delmask);
+		partition_xcpus_del(prs, NULL, tmp->delmask);
 	/*
 	 * Need to update effective_xcpus and exclusive_cpus now as
 	 * update_sibling_cpumasks() below may iterate back to the same cs.
@@ -1580,7 +1595,6 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
 	if (xcpus)
 		cpumask_copy(cs->exclusive_cpus, xcpus);
 	spin_unlock_irq(&callback_lock);
-	update_unbound_workqueue_cpumask(isolcpus_updated);
 	if (adding || deleting)
 		cpuset_force_rebuild();

@@ -1662,7 +1676,6 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 	int old_prs, new_prs;
 	int part_error = PERR_NONE;	/* Partition error? */
 	int subparts_delta = 0;
-	int isolcpus_updated = 0;
 	struct cpumask *xcpus = user_xcpus(cs);
 	bool nocpu;

@@ -1932,18 +1945,15 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 	 * and vice versa.
 	 */
 	if (adding)
-		isolcpus_updated += partition_xcpus_del(old_prs, parent,
-							tmp->addmask);
+		partition_xcpus_del(old_prs, parent, tmp->addmask);
 	if (deleting)
-		isolcpus_updated += partition_xcpus_add(new_prs, parent,
-							tmp->delmask);
+		partition_xcpus_add(new_prs, parent, tmp->delmask);

 	if (is_partition_valid(parent)) {
 		parent->nr_subparts += subparts_delta;
 		WARN_ON_ONCE(parent->nr_subparts < 0);
 	}
 	spin_unlock_irq(&callback_lock);
-	update_unbound_workqueue_cpumask(isolcpus_updated);

 	if ((old_prs != new_prs) && (cmd == partcmd_update))
 		update_partition_exclusive_flag(cs, new_prs);
@@ -2968,7 +2978,6 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	else if (isolcpus_updated)
 		isolated_cpus_update(old_prs, new_prs, cs->effective_xcpus);
 	spin_unlock_irq(&callback_lock);
-	update_unbound_workqueue_cpumask(isolcpus_updated);
 	/* Force update if switching back to member & update effective_xcpus */
 	update_cpumasks_hier(cs, &tmpmask, !new_prs);
@@ -3224,6 +3233,7 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	int retval = -ENODEV;

 	buf = strstrip(buf);
+	mutex_lock(&isolcpus_update_mutex);
 	cpus_read_lock();
 	mutex_lock(&cpuset_mutex);
 	if (!is_cpuset_online(cs))
@@ -3256,6 +3266,8 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 out_unlock:
 	mutex_unlock(&cpuset_mutex);
 	cpus_read_unlock();
+	update_isolation_cpumasks();
+	mutex_unlock(&isolcpus_update_mutex);
 	flush_workqueue(cpuset_migrate_mm_wq);
 	return retval ?: nbytes;
 }
@@ -3358,12 +3370,15 @@ static ssize_t cpuset_partition_write(struct kernfs_open_file *of, char *buf,
 	else
 		return -EINVAL;

+	mutex_lock(&isolcpus_update_mutex);
 	cpus_read_lock();
 	mutex_lock(&cpuset_mutex);
 	if (is_cpuset_online(cs))
 		retval = update_prstate(cs, val);
 	mutex_unlock(&cpuset_mutex);
 	cpus_read_unlock();
+	update_isolation_cpumasks();
+	mutex_unlock(&isolcpus_update_mutex);
 	return retval ?: nbytes;
 }

@@ -3586,15 +3601,22 @@ static void cpuset_css_killed(struct cgroup_subsys_state *css)
 {
 	struct cpuset *cs = css_cs(css);

+	mutex_lock(&isolcpus_update_mutex);
+	/*
+	 * Here the partition root state can't be changed by the user again.
+	 */
+	if (!is_partition_valid(cs))
+		goto out;
+
 	cpus_read_lock();
 	mutex_lock(&cpuset_mutex);
-	/* Reset valid partition back to member */
-	if (is_partition_valid(cs))
-		update_prstate(cs, PRS_MEMBER);
-
+	update_prstate(cs, PRS_MEMBER);
 	mutex_unlock(&cpuset_mutex);
 	cpus_read_unlock();
+	update_isolation_cpumasks();
+out:
+	mutex_unlock(&isolcpus_update_mutex);
}
@@ -3751,6 +3773,7 @@ int __init cpuset_init(void)
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.exclusive_cpus, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL));
+	BUG_ON(!zalloc_cpumask_var(&isolcpus_update_state.cpus, GFP_KERNEL));

 	cpumask_setall(top_cpuset.cpus_allowed);
 	nodes_setall(top_cpuset.mems_allowed);
Since housekeeping cpumasks were previously never modified when a cpuset partition was created, non-isolated partitions were disallowed from using any of the HK_TYPE_DOMAIN isolated CPUs. Now that housekeeping cpumasks are going to be modified at run time, allow the HK_TYPE_DOMAIN cpumask to be overwritten when an isolated partition is first created or when the creation of a non-isolated partition conflicts with the boot time HK_TYPE_DOMAIN isolated CPUs. The now-unnecessary checking code is removed. The doc file will be updated in a later patch.
On the other hand, there is still a latency spike problem when CPU hotplug is used to make the dynamically modified nohz_full HK_TYPE_KERNEL_NOISE cpumask function properly. So the cpuset code is modified to keep track of the boot-time enabled nohz_full cpumask and avoid using CPU hotplug if all the newly isolated/de-isolated CPUs are already in that cpumask. This code will be removed in the future when the latency spike problem is solved.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 45 ++++++++----------------------------------
 1 file changed, 8 insertions(+), 37 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2190efd33efb..87e9ee7922cd 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -59,7 +59,6 @@ static const char * const perr_strings[] = {
 	[PERR_NOCPUS]    = "Parent unable to distribute cpu downstream",
 	[PERR_HOTPLUG]   = "No cpu available due to hotplug",
 	[PERR_CPUSEMPTY] = "cpuset.cpus and cpuset.cpus.exclusive are empty",
-	[PERR_HKEEPING]  = "partition config conflicts with housekeeping setup",
 	[PERR_ACCESS]    = "Enable partition not permitted",
 	[PERR_REMOTE]    = "Have remote partition underneath",
 };
@@ -81,9 +80,10 @@ static cpumask_var_t subpartitions_cpus;
 static cpumask_var_t isolated_cpus;

 /*
- * Housekeeping (HK_TYPE_DOMAIN) CPUs at boot
+ * Housekeeping (nohz_full) CPUs at boot
  */
-static cpumask_var_t boot_hk_cpus;
+static cpumask_var_t boot_nohz_full_hk_cpus;
+static bool have_boot_nohz_full;
 static bool have_boot_isolcpus;

 /* List of remote partition root children */
@@ -1609,26 +1609,6 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
 	remote_partition_disable(cs, tmp);
 }

-/*
- * prstate_housekeeping_conflict - check for partition & housekeeping conflicts
- * @prstate: partition root state to be checked
- * @new_cpus: cpu mask
- * Return: true if there is conflict, false otherwise
- *
- * CPUs outside of boot_hk_cpus, if defined, can only be used in an
- * isolated partition.
- */
-static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
-{
-	if (!have_boot_isolcpus)
-		return false;
-
-	if ((prstate != PRS_ISOLATED) && !cpumask_subset(new_cpus, boot_hk_cpus))
-		return true;
-
-	return false;
-}
-
 /**
  * update_parent_effective_cpumask - update effective_cpus mask of parent cpuset
  * @cs: The cpuset that requests change in partition root state
@@ -1737,9 +1717,6 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 	if (cpumask_empty(xcpus))
 		return PERR_INVCPUS;

-	if (prstate_housekeeping_conflict(new_prs, xcpus))
-		return PERR_HKEEPING;
-
 	/*
 	 * A parent can be left with no CPU as long as there is no
 	 * task directly associated with the parent partition.
@@ -2356,9 +2333,6 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	    cpumask_empty(trialcs->effective_xcpus)) {
 		invalidate = true;
 		cs->prs_err = PERR_INVCPUS;
-	} else if (prstate_housekeeping_conflict(old_prs, trialcs->effective_xcpus)) {
-		invalidate = true;
-		cs->prs_err = PERR_HKEEPING;
 	} else if (tasks_nocpu_error(parent, cs, trialcs->effective_xcpus)) {
 		invalidate = true;
 		cs->prs_err = PERR_NOCPUS;
@@ -2499,9 +2473,6 @@ static int update_exclusive_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	if (cpumask_empty(trialcs->effective_xcpus)) {
 		invalidate = true;
 		cs->prs_err = PERR_INVCPUS;
-	} else if (prstate_housekeeping_conflict(old_prs, trialcs->effective_xcpus)) {
-		invalidate = true;
-		cs->prs_err = PERR_HKEEPING;
 	} else if (tasks_nocpu_error(parent, cs, trialcs->effective_xcpus)) {
 		invalidate = true;
 		cs->prs_err = PERR_NOCPUS;
@@ -3787,11 +3758,11 @@ int __init cpuset_init(void)
BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL));
-	have_boot_isolcpus = housekeeping_enabled(HK_TYPE_DOMAIN);
-	if (have_boot_isolcpus) {
-		BUG_ON(!alloc_cpumask_var(&boot_hk_cpus, GFP_KERNEL));
-		cpumask_copy(boot_hk_cpus, housekeeping_cpumask(HK_TYPE_DOMAIN));
-		cpumask_andnot(isolated_cpus, cpu_possible_mask, boot_hk_cpus);
+	have_boot_nohz_full = housekeeping_enabled(HK_TYPE_KERNEL_NOISE);
+	have_boot_isolcpus = housekeeping_enabled(HK_TYPE_DOMAIN);
+	if (have_boot_nohz_full) {
+		BUG_ON(!alloc_cpumask_var(&boot_nohz_full_hk_cpus, GFP_KERNEL));
+		cpumask_copy(boot_nohz_full_hk_cpus, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
 	}
return 0;
One relatively simple way to allow runtime modification of the nohz_full and rcu_nocbs CPUs is to use CPU hotplug to bring the affected CPUs offline first, make changes to the housekeeping cpumasks and then bring them back online. However, doing this is rather costly in terms of the number of CPU cycles needed. Still, it is the easiest way to achieve the desired result, and hopefully the overhead can be gradually reduced over time.
Use the newly introduced cpuhp_offline_cb() API to bring the affected CPUs offline, make the necessary housekeeping cpumask changes and then bring those CPUs back online again.
As the HK_TYPE_DOMAIN cpumask is going to be updated at run time, any boot time isolcpus domain setting will be reset if an isolated partition, or a conflicting non-isolated partition, is going to be created.
Since rebuild_sched_domains() will be called at the end of update_isolation_cpumasks(), earlier rebuild_sched_domains_locked() calls will be suppressed to avoid unneeded work.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 95 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 92 insertions(+), 3 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 87e9ee7922cd..60f336e50b05 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1355,11 +1355,57 @@ static void partition_xcpus_del(int old_prs, struct cpuset *parent,
 	return;
 }

+/*
+ * We are only updating the HK_TYPE_DOMAIN and HK_TYPE_KERNEL_NOISE
+ * housekeeping cpumasks for now. HK_TYPE_MANAGED_IRQ will be handled later.
+ */
+static int do_housekeeping_exclude_cpumask(void *arg __maybe_unused)
+{
+	int ret;
+	struct cpumask *icpus = isolated_cpus;
+	unsigned long flags = BIT(HK_TYPE_DOMAIN) | BIT(HK_TYPE_KERNEL_NOISE);
+
+	/*
+	 * The boot time isolcpus setting will be overwritten if set.
+	 */
+	have_boot_isolcpus = false;
+
+	if (have_boot_nohz_full) {
+		/*
+		 * Need to separate the handling of HK_TYPE_KERNEL_NOISE and
+		 * HK_TYPE_DOMAIN as different cpumasks will be used for each.
+		 */
+		ret = housekeeping_exclude_cpumask(icpus, BIT(HK_TYPE_DOMAIN));
+		WARN_ON_ONCE((ret < 0) && (ret != -EOPNOTSUPP));

+		if (cpumask_empty(isolcpus_update_state.cpus))
+			return ret;
+
+		flags = BIT(HK_TYPE_KERNEL_NOISE);
+		icpus = kmalloc(cpumask_size(), GFP_KERNEL);
+		if (WARN_ON_ONCE(!icpus))
+			return -ENOMEM;
+
+		/*
+		 * Add boot time nohz_full CPUs into the isolated CPUs list
+		 * for exclusion from HK_TYPE_KERNEL_NOISE CPUs.
+		 */
+		cpumask_andnot(icpus, cpu_possible_mask, boot_nohz_full_hk_cpus);
+		cpumask_or(icpus, icpus, isolated_cpus);
+	}
+	ret = housekeeping_exclude_cpumask(icpus, flags);
+	WARN_ON_ONCE((ret < 0) && (ret != -EOPNOTSUPP));
+
+	if (icpus != isolated_cpus)
+		kfree(icpus);
+	return ret;
+}
+
 /**
  * update_isolation_cpumasks - Update external isolation CPU masks
  *
  * The following external CPU masks will be updated if necessary:
  *  - workqueue unbound cpumask
+ *  - housekeeping cpumasks
  */
 static void update_isolation_cpumasks(void)
 {
@@ -1371,7 +1417,41 @@ static void update_isolation_cpumasks(void)
 	ret = workqueue_unbound_exclude_cpumask(isolated_cpus);
 	WARN_ON_ONCE(ret < 0);

+	/*
+	 * Mask out offline and boot-time nohz_full non-housekeeping
+	 * CPUs from isolcpus_update_state.cpus to compute the set
+	 * of CPUs that need to be brought offline before calling
+	 * do_housekeeping_exclude_cpumask().
+	 */
+	cpumask_and(isolcpus_update_state.cpus,
+		    isolcpus_update_state.cpus, cpu_active_mask);
+	if (have_boot_nohz_full)
+		cpumask_and(isolcpus_update_state.cpus,
+			    isolcpus_update_state.cpus, boot_nohz_full_hk_cpus);
+
+	/*
+	 * Without any change in the set of nohz_full CPUs, we don't really
+	 * need to use CPU hotplug for making changes to the HK cpumasks.
+	 */
+	if (cpumask_empty(isolcpus_update_state.cpus))
+		ret = do_housekeeping_exclude_cpumask(NULL);
+	else
+		ret = cpuhp_offline_cb(isolcpus_update_state.cpus,
+				       do_housekeeping_exclude_cpumask, NULL);
+
+	/*
+	 * An errno value of -EPERM may be returned from cpuhp_offline_cb() if
+	 * any one of the CPUs in isolcpus_update_state.cpus can't be brought
+	 * offline. This can happen for the boot CPU (normally CPU 0) which
+	 * cannot be shut down. This CPU should not be used for creating
+	 * an isolated partition.
+	 */
+	if (ret == -EPERM)
+		pr_warn_once("cpuset: The boot CPU shouldn't be used for isolated partition\n");
+	else
+		WARN_ON_ONCE(ret < 0);
+
+	cpumask_clear(isolcpus_update_state.cpus);
+	rebuild_sched_domains();
 	isolcpus_update_state.updating = false;
 }
@@ -2961,7 +3041,16 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	update_partition_sd_lb(cs, old_prs);

 	notify_partition_change(cs, old_prs);
-	if (force_sd_rebuild)
+
+	/*
+	 * If a boot time domain isolcpus setting exists and it conflicts
+	 * with the CPUs in the new partition, we will have to reset the
+	 * HK_TYPE_DOMAIN cpumask.
+	 */
+	if (have_boot_isolcpus && (new_prs > PRS_MEMBER) &&
+	    !cpumask_subset(cs->effective_xcpus, housekeeping_cpumask(HK_TYPE_DOMAIN)))
+		isolcpus_update_state.updating = true;
+
+	if (force_sd_rebuild && !isolcpus_update_state.updating)
 		rebuild_sched_domains_locked();
 	free_cpumasks(NULL, &tmpmask);
 	return 0;
@@ -3232,7 +3321,7 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 }

 	free_cpuset(trialcs);
-	if (force_sd_rebuild)
+	if (force_sd_rebuild && !isolcpus_update_state.updating)
 		rebuild_sched_domains_locked();
 out_unlock:
 	mutex_unlock(&cpuset_mutex);
@@ -3999,7 +4088,7 @@ static void cpuset_handle_hotplug(void)
 	}

 	/* rebuild sched domains if necessary */
-	if (force_sd_rebuild)
+	if (force_sd_rebuild && !isolcpus_update_state.updating)
 		rebuild_sched_domains_cpuslocked();
free_cpumasks(NULL, ptmp);
Now that the HK_TYPE_DOMAIN cpumask is updated at run time to reflect changes made in isolated cpuset partitions, we no longer need a separate cpuset_cpu_is_isolated() function to check for isolated CPUs generated by cpuset. Revert commit 3232e7aad11e ("cgroup/cpuset: Include isolated cpuset CPUs in cpu_is_isolated() check").
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/cpuset.h          |  6 ------
 include/linux/sched/isolation.h |  3 +--
 kernel/cgroup/cpuset.c          | 11 -----------
 3 files changed, 1 insertion(+), 19 deletions(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 2ddb256187b5..a2ea8efebf36 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -76,7 +76,6 @@ extern void cpuset_lock(void);
 extern void cpuset_unlock(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
-extern bool cpuset_cpu_is_isolated(int cpu);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
@@ -206,11 +205,6 @@ static inline bool cpuset_cpus_allowed_fallback(struct task_struct *p)
 	return false;
 }

-static inline bool cpuset_cpu_is_isolated(int cpu)
-{
-	return false;
-}
-
 static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
 {
 	return node_possible_map;
 }

diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index af38d21d0d00..0bc4b3368d39 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -79,8 +79,7 @@ static inline bool housekeeping_cpu(int cpu, enum hk_type type)
 static inline bool cpu_is_isolated(int cpu)
 {
 	return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN) ||
-	       !housekeeping_test_cpu(cpu, HK_TYPE_TICK) ||
-	       cpuset_cpu_is_isolated(cpu);
+	       !housekeeping_test_cpu(cpu, HK_TYPE_TICK);
 }
 #endif /* _LINUX_SCHED_ISOLATION_H */

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 60f336e50b05..6308bb14e018 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1455,17 +1455,6 @@ static void update_isolation_cpumasks(void)
 	isolcpus_update_state.updating = false;
 }

-/**
- * cpuset_cpu_is_isolated - Check if the given CPU is isolated
- * @cpu: the CPU number to be checked
- * Return: true if CPU is used in an isolated partition, false otherwise
- */
-bool cpuset_cpu_is_isolated(int cpu)
-{
-	return cpumask_test_cpu(cpu, isolated_cpus);
-}
-EXPORT_SYMBOL_GPL(cpuset_cpu_is_isolated);
-
 /*
  * compute_effective_exclusive_cpumask - compute effective exclusive CPUs
  * @cs: cpuset
With the new strategy of using CPU hotplug to improve CPU isolation and the optimization of delaying sched domain rebuild until the whole process completes, we can run into a problem when shutting down the last CPU of a partition, where a -EBUSY error may be returned. This -EBUSY error is caused by a failing DL BW check in dl_bw_deactivate().
As the CPU deactivation is only temporary and the CPU will be brought back up again shortly, there is no point in failing the operation because of a DL BW error during this transition period. Fix this problem by ignoring the error when in CPU hotplug offline callback mode (cpuhp_offline_cb_mode set).
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/core.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9f02c047e25b..78f4ba73a9f2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8469,7 +8469,11 @@ int sched_cpu_deactivate(unsigned int cpu)
ret = dl_bw_deactivate(cpu);
-	if (ret)
+	/*
+	 * Ignore DL BW error if in cpuhp offline callback mode as CPU
+	 * deactivation is only temporary.
+	 */
+	if (ret && !cpuhp_offline_cb_mode)
 		return ret;
/*
To provide nohz_full tick support, there is a set of tick dependency masks that need to be evaluated on every IRQ and context switch. Switching on nohz_full tick support at runtime would be problematic, as some of the tick dependency masks may not be properly set, causing problems down the road.
Allow the nohz_full boot option to be specified without a parameter to force-enable nohz_full tick support even without any CPU in tick_nohz_full_mask yet. The context_tracking_key and tick_nohz_full_running flag will be enabled in this case to make tick_nohz_full_enabled() return true.
There is still a small performance overhead when nohz_full is force-enabled this way, so it should only be used if there is a chance that some CPUs will become isolated later via the cpuset isolated partition functionality and CPU isolation close to that of nohz_full is desired.
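For example, the two forms of the parameter would be (the CPU list is illustrative):

	nohz_full	# force enable nohz_full with an empty CPU list for now
	nohz_full=2-7	# traditional boot-time static CPU list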
Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/kernel-parameters.txt | 19 ++++++++++++-------
 include/linux/context_tracking.h                |  7 ++++++-
 kernel/context_tracking.c                       |  4 +++-
 kernel/sched/isolation.c                        | 13 ++++++++++++-
 kernel/time/tick-sched.c                        | 11 +++++++++--
 5 files changed, 42 insertions(+), 12 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 747a55abf494..89a8161475b5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4260,15 +4260,20 @@
 			Valid arguments: on, off
 			Default: on

-	nohz_full=	[KNL,BOOT,SMP,ISOL]
-			The argument is a cpu list, as described above.
+	nohz_full[=cpu-list]
+			[KNL,BOOT,SMP,ISOL]
 			In kernels built with CONFIG_NO_HZ_FULL=y, set
-			the specified list of CPUs whose tick will be stopped
-			whenever possible. The boot CPU will be forced outside
-			the range to maintain the timekeeping. Any CPUs
-			in this list will have their RCU callbacks offloaded,
+			the specified list of CPUs whose tick will be
+			stopped whenever possible. If the argument is
+			not specified, nohz_full will be forced enabled
+			without any CPU in the nohz_full list yet.
+			The boot CPU will be forced outside the range
+			to maintain the timekeeping. Any CPUs in this
+			list will have their RCU callbacks offloaded,
 			just as if they had also been called out in the
-			rcu_nocbs= boot parameter.
+			rcu_nocbs= boot parameter. There is no need
+			to use the rcu_nocbs= boot parameter if nohz_full
+			has been set, which will override rcu_nocbs.

 			Note that this argument takes precedence over
 			the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL option.

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index af9fe87a0922..a3fea7f9fef6 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -9,8 +9,13 @@
#include <asm/ptrace.h>
- #ifdef CONFIG_CONTEXT_TRACKING_USER +/* + * Pass CONTEXT_TRACKING_FORCE_ENABLE to ct_cpu_track_user() to force enable + * user context tracking. + */ +#define CONTEXT_TRACKING_FORCE_ENABLE (-1) + extern void ct_cpu_track_user(int cpu);
/* Called with interrupts disabled. */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index fb5be6e9b423..734354bbfdbb 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -698,7 +698,9 @@ void __init ct_cpu_track_user(int cpu) { static __initdata bool initialized = false;
- if (!per_cpu(context_tracking.active, cpu)) { + if (cpu == CONTEXT_TRACKING_FORCE_ENABLE) { + static_branch_inc(&context_tracking_key); + } else if (!per_cpu(context_tracking.active, cpu)) { per_cpu(context_tracking.active, cpu) = true; static_branch_inc(&context_tracking_key); } diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c index f26708667754..2bed4b2f9ec5 100644 --- a/kernel/sched/isolation.c +++ b/kernel/sched/isolation.c @@ -146,6 +146,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags) }
alloc_bootmem_cpumask_var(&non_housekeeping_mask); + if (cpulist_parse(str, non_housekeeping_mask) < 0) { pr_warn("Housekeeping: nohz_full= or isolcpus= incorrect CPU range\n"); goto free_non_housekeeping_mask; @@ -155,6 +156,13 @@ static int __init housekeeping_setup(char *str, unsigned long flags) cpumask_andnot(housekeeping_staging, cpu_possible_mask, non_housekeeping_mask);
+ /* + * Allow "nohz_full" without parameter to force enable nohz_full + * at boot time without any CPUs in the nohz_full list yet. + */ + if ((flags & HK_FLAG_KERNEL_NOISE) && !*str) + goto setup_housekeeping_staging; + first_cpu = cpumask_first_and(cpu_present_mask, housekeeping_staging); if (first_cpu >= nr_cpu_ids || first_cpu >= setup_max_cpus) { __cpumask_set_cpu(smp_processor_id(), housekeeping_staging); @@ -168,6 +176,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags) if (cpumask_empty(non_housekeeping_mask)) goto free_housekeeping_staging;
+setup_housekeeping_staging: if (!housekeeping.flags) { /* First setup call ("nohz_full=" or "isolcpus=") */ enum hk_type type; @@ -212,10 +221,12 @@ static int __init housekeeping_nohz_full_setup(char *str) unsigned long flags;
flags = HK_FLAG_KERNEL_NOISE; + if (*str == '=') + str++;
return housekeeping_setup(str, flags); } -__setup("nohz_full=", housekeeping_nohz_full_setup); +__setup("nohz_full", housekeeping_nohz_full_setup);
static int __init housekeeping_isolcpus_setup(char *str) { diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index c527b421c865..87b26a4471e7 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -651,8 +651,15 @@ void __init tick_nohz_init(void) } }
- for_each_cpu(cpu, tick_nohz_full_mask) - ct_cpu_track_user(cpu); + /* + * Force enable context_tracking_key if tick_nohz_full_mask empty + */ + if (cpumask_empty(tick_nohz_full_mask)) { + ct_cpu_track_user(CONTEXT_TRACKING_FORCE_ENABLE); + } else { + for_each_cpu(cpu, tick_nohz_full_mask) + ct_cpu_track_user(cpu); + }
ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "kernel/nohz:predown", NULL,
When the list of HK_FLAG_KERNEL_NOISE housekeeping CPUs is changed, we need to update tick_nohz_full_mask so that dynticks can work correctly. Introduce a new tick_nohz_full_update_cpus() function that can be called at run time to update tick_nohz_full_mask.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/tick.h     | 2 ++
 kernel/time/tick-sched.c | 6 ++++++
 2 files changed, 8 insertions(+)
diff --git a/include/linux/tick.h b/include/linux/tick.h
index ac76ae9fa36d..34907c0b632c 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -272,6 +272,7 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void __tick_nohz_task_switch(void);
 extern void __init tick_nohz_full_setup(cpumask_var_t cpumask);
+extern void tick_nohz_full_update_cpus(cpumask_var_t cpumask);
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -297,6 +298,7 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
 static inline void tick_nohz_full_kick_cpu(int cpu) { }
 static inline void __tick_nohz_task_switch(void) { }
 static inline void tick_nohz_full_setup(cpumask_var_t cpumask) { }
+static inline void tick_nohz_full_update_cpus(cpumask_var_t cpumask) { }
 #endif
 static inline void tick_nohz_task_switch(void)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 87b26a4471e7..9204808b7a55 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -604,6 +604,12 @@ void __init tick_nohz_full_setup(cpumask_var_t cpumask)
         tick_nohz_full_running = true;
 }

+/* Get the new set of run-time nohz CPU list from cpuset */
+void tick_nohz_full_update_cpus(cpumask_var_t cpumask)
+{
+        cpumask_copy(tick_nohz_full_mask, cpumask);
+}
+
 bool tick_nohz_cpu_hotpluggable(unsigned int cpu)
 {
         /*
Full dynticks can only be enabled if the "nohz_full" boot option has been specified, with or without a parameter. Any change in the list of nohz_full CPUs has to be reflected in tick_nohz_full_mask, so the newly introduced tick_nohz_full_update_cpus() will be called to update the mask.
We also need to enable CPU context tracking for those CPUs that are in tick_nohz_full_mask. So remove __init from tick_nohz_init() and ct_cpu_track_user() so that they can be called later when an isolated cpuset partition is being created. The __ro_after_init attribute is taken away from context_tracking_key as well.
Also add a new ct_cpu_untrack_user() function to reverse the action of ct_cpu_track_user() in case we need to disable the nohz_full mode of a CPU.
With nohz_full enabled, the boot CPU (typically CPU 0) will be the tick CPU, which cannot easily be shut down. So the boot CPU should not be used in an isolated cpuset partition.
With runtime modification of nohz_full CPUs, tick_do_timer_cpu can become TICK_DO_TIMER_NONE. So remove the two TICK_DO_TIMER_NONE WARN_ON_ONCE() calls in tick-sched.c to avoid unnecessary warnings.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/context_tracking.h |  1 +
 kernel/cgroup/cpuset.c           | 23 ++++++++++++++++++++++-
 kernel/context_tracking.c        | 17 ++++++++++++++---
 kernel/time/tick-sched.c         |  7 -------
 4 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index a3fea7f9fef6..1a6b816f1ad6 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -17,6 +17,7 @@
 #define CONTEXT_TRACKING_FORCE_ENABLE   (-1)

 extern void ct_cpu_track_user(int cpu);
+extern void ct_cpu_untrack_user(int cpu);

 /* Called with interrupts disabled. */
 extern void __ct_user_enter(enum ctx_state state);
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 6308bb14e018..45c82c18bec4 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -23,6 +23,7 @@
  */
 #include "cpuset-internal.h"

+#include <linux/context_tracking.h>
 #include <linux/init.h>
 #include <linux/interrupt.h>
 #include <linux/kernel.h>
@@ -1361,7 +1362,7 @@ static void partition_xcpus_del(int old_prs, struct cpuset *parent,
  */
 static int do_housekeeping_exclude_cpumask(void *arg __maybe_unused)
 {
-        int ret;
+        int cpu, ret;
         struct cpumask *icpus = isolated_cpus;
         unsigned long flags = BIT(HK_TYPE_DOMAIN) | BIT(HK_TYPE_KERNEL_NOISE);
@@ -1395,6 +1396,26 @@ static int do_housekeeping_exclude_cpumask(void *arg __maybe_unused)
         ret = housekeeping_exclude_cpumask(icpus, flags);
         WARN_ON_ONCE((ret < 0) && (ret != -EOPNOTSUPP));

+#ifdef CONFIG_NO_HZ_FULL
+        /*
+         * To properly enable/disable nohz_full dynticks for the affected CPUs,
+         * the new nohz_full CPUs have to be copied to tick_nohz_full_mask and
+         * ct_cpu_track_user/ct_cpu_untrack_user() will have to be called
+         * for those CPUs that have their states changed.
+         */
+        if (tick_nohz_full_enabled()) {
+                tick_nohz_full_update_cpus(icpus);
+                for_each_cpu(cpu, isolcpus_update_state.cpus) {
+                        if (cpumask_test_cpu(cpu, icpus))
+                                ct_cpu_track_user(cpu);
+                        else
+                                ct_cpu_untrack_user(cpu);
+                }
+        } else {
+                pr_warn_once("Full dynticks cannot be enabled without the nohz_full kernel boot parameter!\n");
+        }
+#endif
+
         if (icpus != isolated_cpus)
                 kfree(icpus);
         return ret;
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 734354bbfdbb..ed5653a3d6f7 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -431,7 +431,7 @@ static __always_inline void ct_kernel_enter(bool user, int offset) { }
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>

-DEFINE_STATIC_KEY_FALSE_RO(context_tracking_key);
+DEFINE_STATIC_KEY_FALSE(context_tracking_key);
 EXPORT_SYMBOL_GPL(context_tracking_key);

 static noinstr bool context_tracking_recursion_enter(void)
@@ -694,9 +694,9 @@ void user_exit_callable(void)
 }
 NOKPROBE_SYMBOL(user_exit_callable);

-void __init ct_cpu_track_user(int cpu)
+void ct_cpu_track_user(int cpu)
 {
-        static __initdata bool initialized = false;
+        static bool initialized;

         if (cpu == CONTEXT_TRACKING_FORCE_ENABLE) {
                 static_branch_inc(&context_tracking_key);
@@ -720,6 +720,17 @@ void __init ct_cpu_track_user(int cpu)
         initialized = true;
 }

+void ct_cpu_untrack_user(int cpu)
+{
+#ifndef CONFIG_CONTEXT_TRACKING_USER_FORCE
+        if (!per_cpu(context_tracking.active, cpu))
+                return;
+
+        per_cpu(context_tracking.active, cpu) = false;
+        static_branch_dec(&context_tracking_key);
+#endif
+}
+
 #ifdef CONFIG_CONTEXT_TRACKING_USER_FORCE
 void __init context_tracking_init(void)
 {
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9204808b7a55..c16250c6a79f 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -220,9 +220,6 @@ static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
         tick_cpu = READ_ONCE(tick_do_timer_cpu);

         if (IS_ENABLED(CONFIG_NO_HZ_COMMON) &&
             unlikely(tick_cpu == TICK_DO_TIMER_NONE)) {
-#ifdef CONFIG_NO_HZ_FULL
-                WARN_ON_ONCE(tick_nohz_full_running);
-#endif
                 WRITE_ONCE(tick_do_timer_cpu, cpu);
                 tick_cpu = cpu;
         }
@@ -1201,10 +1198,6 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
                  */
                 if (tick_cpu == cpu)
                         return false;
-
-                /* Should not happen for nohz-full */
-                if (WARN_ON_ONCE(tick_cpu == TICK_DO_TIMER_NONE))
-                        return false;
         }

         return true;
In tick_cpu_dying(), if the dying CPU is the current timekeeper, it has to pass the job over to another CPU. The current code passes it to another online CPU. However, that CPU may not be a timer tick housekeeping CPU. If that happens, another CPU will have to manually take it over again later. Avoid this unnecessary work by directly assigning an online housekeeping CPU.
Use READ_ONCE()/WRITE_ONCE() to access tick_do_timer_cpu in case the non-housekeeping CPUs are no longer held in stop machine at some point in the future.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/time/tick-common.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index 9a3859443c04..6d5ff85281cc 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -17,6 +17,7 @@
 #include <linux/profile.h>
 #include <linux/sched.h>
 #include <linux/module.h>
+#include <linux/sched/isolation.h>
 #include <trace/events/power.h>

 #include <asm/irq_regs.h>
@@ -394,12 +395,18 @@ int tick_cpu_dying(unsigned int dying_cpu)
 {
         /*
          * If the current CPU is the timekeeper, it's the only one that can
-         * safely hand over its duty. Also all online CPUs are in stop
-         * machine, guaranteed not to be idle, therefore there is no
+         * safely hand over its duty. Also all online housekeeping CPUs are
+         * in stop machine, guaranteed not to be idle, therefore there is no
          * concurrency and it's safe to pick any online successor.
          */
-        if (tick_do_timer_cpu == dying_cpu)
-                tick_do_timer_cpu = cpumask_first(cpu_online_mask);
+        if (READ_ONCE(tick_do_timer_cpu) == dying_cpu) {
+                unsigned int new_cpu;
+
+                new_cpu = cpumask_first_and(cpu_online_mask,
+                                            housekeeping_cpumask(HK_TYPE_TICK));
+                if (WARN_ON_ONCE(new_cpu >= nr_cpu_ids))
+                        new_cpu = cpumask_first(cpu_online_mask);
+                WRITE_ONCE(tick_do_timer_cpu, new_cpu);
+        }

         /* Make sure the CPU won't try to retake the timekeeping duty */
         tick_sched_timer_dying(dying_cpu);
Make use of the provided rcu_nocb_cpu_offload()/rcu_nocb_cpu_deoffload() APIs to enable RCU NO-CB CPU offloading of newly isolated CPUs and deoffloading of de-isolated CPUs.
Also add a new rcu_nocbs_enabled() helper function to determine if RCU NO-CB CPU offloading can be done.
As nohz_full can now be specified without any CPU list, drop the test for cpumask_empty(tick_nohz_full_mask) in rcu_init_nohz().
The RCU NO-CB CPU offloading feature can only be used if either the "rcu_nocbs" or the "nohz_full" boot command parameter is used, so that the RCU NO-CB resources are properly initialized at boot time.
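For instance, either of the following illustrative boot command-line settings should set up the NO-CB resources that the runtime offloading/deoffloading depends on (the CPU list is hypothetical):

  rcu_nocbs=2-5   # boot-time NO-CB CPUs
  nohz_full       # empty nohz_full list; NO-CB resources still initialized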
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/rcupdate.h |  2 ++
 kernel/cgroup/cpuset.c   | 14 ++++++++++++++
 kernel/rcu/tree_nocb.h   |  7 ++++++-
 3 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 120536f4c6eb..642b80a4f071 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -140,6 +140,7 @@
 void rcu_init_nohz(void);
 int rcu_nocb_cpu_offload(int cpu);
 int rcu_nocb_cpu_deoffload(int cpu);
 void rcu_nocb_flush_deferred_wakeup(void);
+bool rcu_nocbs_enabled(void);

 #define RCU_NOCB_LOCKDEP_WARN(c, s)     RCU_LOCKDEP_WARN(c, s)
@@ -149,6 +150,7 @@
 static inline void rcu_init_nohz(void) { }
 static inline int rcu_nocb_cpu_offload(int cpu) { return -EINVAL; }
 static inline int rcu_nocb_cpu_deoffload(int cpu) { return 0; }
 static inline void rcu_nocb_flush_deferred_wakeup(void) { }
+static inline bool rcu_nocbs_enabled(void) { return false; }

 #define RCU_NOCB_LOCKDEP_WARN(c, s)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 45c82c18bec4..de9cb92a0fc7 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1416,6 +1416,20 @@ static int do_housekeeping_exclude_cpumask(void *arg __maybe_unused)
         }
 #endif

+        if (rcu_nocbs_enabled()) {
+                /*
+                 * Enable RCU NO-CB CPU offloading/deoffloading for the affected CPUs
+                 */
+                for_each_cpu(cpu, isolcpus_update_state.cpus) {
+                        if (cpumask_test_cpu(cpu, icpus))
+                                ret = rcu_nocb_cpu_offload(cpu);
+                        else
+                                ret = rcu_nocb_cpu_deoffload(cpu);
+                        if (WARN_ON_ONCE(ret))
+                                break;
+                }
+        }
+
         if (icpus != isolated_cpus)
                 kfree(icpus);
         return ret;
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index e6cd56603cad..4d49a745b871 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -1293,7 +1293,7 @@ void __init rcu_init_nohz(void)
         struct shrinker * __maybe_unused lazy_rcu_shrinker;

 #if defined(CONFIG_NO_HZ_FULL)
-        if (tick_nohz_full_running && !cpumask_empty(tick_nohz_full_mask))
+        if (tick_nohz_full_running)
                 cpumask = tick_nohz_full_mask;
 #endif
@@ -1365,6 +1365,11 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
         mutex_init(&rdp->nocb_gp_kthread_mutex);
 }

+bool rcu_nocbs_enabled(void)
+{
+        return !!rcu_state.nocb_is_setup;
+}
+
 /*
  * If the specified CPU is a no-CBs CPU that does not already have its
  * rcuo CB kthread, spawn it. Additionally, if the rcuo GP kthread
As the HK_TYPE_KERNEL_NOISE bit can now be set without any nohz_full CPU being specified at boot time, don't set have_boot_nohz_full in that case.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index de9cb92a0fc7..489708f4e096 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3871,7 +3871,12 @@ int __init cpuset_init(void)

         BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL));

-        have_boot_nohz_full = housekeeping_enabled(HK_TYPE_KERNEL_NOISE);
+        /*
+         * HK_TYPE_KERNEL_NOISE bit can be set without any nohz_full CPU
+         */
+        have_boot_nohz_full = housekeeping_enabled(HK_TYPE_KERNEL_NOISE) &&
+                              !cpumask_equal(cpu_possible_mask,
+                                             housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
         have_boot_isolcpus = housekeeping_enabled(HK_TYPE_DOMAIN);
         if (have_boot_nohz_full) {
                 BUG_ON(!alloc_cpumask_var(&boot_nohz_full_hk_cpus, GFP_KERNEL));
CPU hotplug is now used to improve the CPU isolation of CPUs in isolated partitions, but the boot CPU (typically CPU 0) cannot be put offline, which limits the amount of CPU isolation available there. So we have to advise users that the boot CPU should never be used for isolated partitions. A warning will be printed when the boot CPU is used, and cgroup-v2.rst is updated accordingly. The test_cpuset_prs.sh selftest is also updated to avoid CPU 0 when forming isolated partitions.
Also update the cgroup-v2.rst file to document the need to specify the "nohz_full" kernel boot parameter to enable better nohz_full-like behavior for the CPUs in isolated partitions, as well as the latency spike issue with using CPU hotplug.
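As a rough illustration, an isolated partition could be created like this (a minimal sketch assuming cgroup v2 is mounted at /sys/fs/cgroup and CPUs 2-3 are available; the exact steps depend on the hierarchy):

  cd /sys/fs/cgroup
  echo +cpuset > cgroup.subtree_control
  mkdir isol
  echo 2-3 > isol/cpuset.cpus                # avoid CPU 0, the boot CPU
  echo isolated > isol/cpuset.cpus.partition
  cat isol/cpuset.cpus.partition             # should report "isolated"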
Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst  | 33 +++++++++++++++----
 .../selftests/cgroup/test_cpuset_prs.sh  |  8 ++---
 2 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index d9d3cc7df348..26213383b34b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2556,11 +2556,12 @@ Cpuset Interface Files
         It accepts only the following input values when written to.

-          ==========    =====================================
+          ==========    ===============================================
           "member"      Non-root member of a partition
           "root"        Partition root
-          "isolated"    Partition root without load balancing
-          ==========    =====================================
+          "isolated"    Partition root without load balancing and other
+                        OS noises
+          ==========    ===============================================

         A cpuset partition is a collection of cpuset-enabled cgroups with
         a partition root at the top of the hierarchy and its descendants
@@ -2593,9 +2594,29 @@ Cpuset Interface Files
When set to "isolated", the CPUs in that partition will be in an isolated state without any load balancing from the scheduler - and excluded from the unbound workqueues. Tasks placed in such - a partition with multiple CPUs should be carefully distributed - and bound to each of the individual CPUs for optimal performance. + and excluded from the unbound workqueues as well as without + other OS noises. Tasks placed in such a partition with multiple + CPUs should be carefully distributed and bound to each of the + individual CPUs for optimal performance. + + As CPU hotplug, if supported, is used to improve the degree of + CPU isolation close to the "nohz_full" kernel boot parameter. + The boot CPU (typically CPU 0) cannot be brought offline, so the + boot CPU should not be used for forming isolated partitions. + The "nohz_full" kernel boot parameter needs to be present to + enable full dynticks support and RCU no-callback CPU mode for + CPUs in isolated partitions even if the optional cpu list + isn't provided. Without that, adding the "rcu_nocbs" boot + kernel parameter without the cpu list can be used to enable + RCU no-callback CPU mode without full dynticks. + + Using CPU hotplug for creating or destroying an isolated + partition can cause latency spike in applications running + in other isolated partitions. A reserved list of CPUs can + optionally be put in the "nohz_full" kernel boot parameter to + alleviate this problem. When these reserved CPUs are used for + isolated partitions, CPU hotplug won't need to be invoked and + so there won't be latency spike in other isolated partitions.
A partition root ("root" or "isolated") can be in one of the two possible states - valid or invalid. An invalid partition diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh index a17256d9f88a..f61369be8bf6 100755 --- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh +++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh @@ -318,8 +318,8 @@ TEST_MATRIX=( # Invalid to valid local partition direct transition tests " C1-3:S+:P2 X4:P2 . . . . . . 0 A1:1-3|XA1:1-3|A2:1-3:XA2: A1:P2|A2:P-2 1-3" " C1-3:S+:P2 X4:P2 . . . X3:P2 . . 0 A1:1-2|XA1:1-3|A2:3:XA2:3 A1:P2|A2:P2 1-3" - " C0-3:P2 . . C4-6 C0-4 . . . 0 A1:0-4|B1:4-6 A1:P-2|B1:P0" - " C0-3:P2 . . C4-6 C0-4:C0-3 . . . 0 A1:0-3|B1:4-6 A1:P2|B1:P0 0-3" + " C1-3:P2 . . C4-6 C1-4 . . . 0 A1:1-4|B1:4-6 A1:P-2|B1:P0" + " C1-3:P2 . . C4-6 C1-4:C1-3 . . . 0 A1:1-3|B1:4-6 A1:P2|B1:P0 1-3"
# Local partition invalidation tests " C0-3:X1-3:S+:P2 C1-3:X2-3:S+:P2 C2-3:X3:P2 \ @@ -329,8 +329,8 @@ TEST_MATRIX=( " C0-3:X1-3:S+:P2 C1-3:X2-3:S+:P2 C2-3:X3:P2 \ . . C4:X . . 0 A1:1-3|A2:1-3|A3:2-3|XA2:|XA3: A1:P2|A2:P-2|A3:P-2 1-3" # Local partition CPU change tests - " C0-5:S+:P2 C4-5:S+:P1 . . . C3-5 . . 0 A1:0-2|A2:3-5 A1:P2|A2:P1 0-2" - " C0-5:S+:P2 C4-5:S+:P1 . . C1-5 . . . 0 A1:1-3|A2:4-5 A1:P2|A2:P1 1-3" + " C1-5:S+:P2 C4-5:S+:P1 . . . C3-5 . . 0 A1:1-2|A2:3-5 A1:P2|A2:P1 1-2" + " C1-5:S+:P2 C4-5:S+:P1 . . C2-5 . . . 0 A1:2-3|A2:4-5 A1:P2|A2:P1 2-3"
# cpus_allowed/exclusive_cpus update tests " C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3 \
Add some pr_debug() statements for the actions performed around the cpuhp_offline_cb() call to aid debugging. Since rcu_nocb_cpu_offload() and rcu_nocb_cpu_deoffload() already print some informational text, there is no need to add pr_debug() statements for them.
Also update the test_cpuset_prs.sh test script to enable printing of dynamic debug messages for the kernel/cgroup/cpuset.c file when the loglevel is 7 (debug).
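The same messages can also be enabled by hand at run time, assuming CONFIG_DYNAMIC_DEBUG and a mounted debugfs (illustrative command, mirroring what the test script does below):

  echo "file kernel/cgroup/cpuset.c +p" > /sys/kernel/debug/dynamic_debug/control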
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c                   | 18 +++++++++++++-----
 .../selftests/cgroup/test_cpuset_prs.sh  |  7 +++++++
 2 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 489708f4e096..30632e4b5899 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -21,6 +21,7 @@
  * License. See the file COPYING in the main directory of the Linux
  * distribution for more details.
  */
+#define pr_fmt(fmt)     "cpuset: " fmt
 #include "cpuset-internal.h"

 #include <linux/context_tracking.h>
@@ -1406,10 +1407,13 @@ static int do_housekeeping_exclude_cpumask(void *arg __maybe_unused)
         if (tick_nohz_full_enabled()) {
                 tick_nohz_full_update_cpus(icpus);
                 for_each_cpu(cpu, isolcpus_update_state.cpus) {
-                        if (cpumask_test_cpu(cpu, icpus))
+                        if (cpumask_test_cpu(cpu, icpus)) {
+                                pr_debug("Add CPU %d to nohz_full\n", cpu);
                                 ct_cpu_track_user(cpu);
-                        else
+                        } else {
+                                pr_debug("Remove CPU %d from nohz_full\n", cpu);
                                 ct_cpu_untrack_user(cpu);
+                        }
                 }
         } else {
                 pr_warn_once("Full dynticks cannot be enabled without the nohz_full kernel boot parameter!\n");
@@ -1425,6 +1429,7 @@ static int do_housekeeping_exclude_cpumask(void *arg __maybe_unused)
                         ret = rcu_nocb_cpu_offload(cpu);
                 else
                         ret = rcu_nocb_cpu_deoffload(cpu);
+
                 if (WARN_ON_ONCE(ret))
                         break;
         }
@@ -1468,11 +1473,14 @@ static void update_isolation_cpumasks(void)
         * Without any change in the set of nohz_full CPUs, we don't really
         * need to use CPU hotplug for making change in HK cpumasks.
         */
-        if (cpumask_empty(isolcpus_update_state.cpus))
+        if (cpumask_empty(isolcpus_update_state.cpus)) {
                 ret = do_housekeeping_exclude_cpumask(NULL);
-        else
+        } else {
+                pr_debug("cpuhp_offline_cb() called for CPUs %*pbl\n",
+                         cpumask_pr_args(isolcpus_update_state.cpus));
                 ret = cpuhp_offline_cb(isolcpus_update_state.cpus,
                                        do_housekeeping_exclude_cpumask, NULL);
+        }
         /*
         * A errno value of -EPERM may be returned from cpuhp_offline_cb() if
         * any one of the CPUs in isolcpus_update_state.cpus can't be brought
@@ -1481,7 +1489,7 @@ static void update_isolation_cpumasks(void)
         * isolated partition.
         */
         if (ret == -EPERM)
-                pr_warn_once("cpuset: The boot CPU shouldn't be used for isolated partition\n");
+                pr_warn_once("The boot CPU shouldn't be used for isolated partition\n");
         else
                 WARN_ON_ONCE(ret < 0);
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index f61369be8bf6..43a12690775e 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -67,6 +67,13 @@
 then
         echo Y > /sys/kernel/debug/sched/verbose
 fi

+# Enable dynamic debug messages for cpuset only
+DYN_DEBUG=/sys/kernel/debug/dynamic_debug/control
+[[ -f $DYN_DEBUG ]] && {
+        echo "-p" > $DYN_DEBUG
+        echo "file kernel/cgroup/cpuset.c +p" > $DYN_DEBUG
+}
+
 cd $CGROUP2
 echo +cpuset > cgroup.subtree_control
On Fri, Aug 08, 2025 at 11:10:44AM -0400, Waiman Long wrote:
The "nohz_full" and "rcu_nocbs" boot command parameters can be used to remove a lot of kernel overhead on a specific set of isolated CPUs which can be used to run some latency/bandwidth sensitive workloads with as little kernel disturbance/noise as possible. The problem with this mode of operation is the fact that it is a static configuration which cannot be changed after boot to adjust for changes in application loading.
There is always a desire to enable runtime modification of the number of isolated CPUs that can be dedicated to this type of demanding workloads. This patchset is an attempt to do just that with an amount of CPU isolation close to what can be done with the nohz_full and rcu_nocbs boot kernel parameters.
This patch series provides the ability to change the set of housekeeping CPUs at run time via the cpuset isolated partition functionality. Currently, the cpuset isolated partition is able to disable scheduler load balancing and the CPU affinity of the unbound workqueue to avoid the isolated CPUs. This patch series will extend that with other kernel noises associated with the nohz_full boot command line parameter which has the following sub-categories:
- tick
- timer
- RCU
- MISC
- WQ
- kthread
Thanks for working on that, I'm about to leave for 2 weeks vacation so I won't have the time to check this until I'm back.
However this series is highly conflicting with mine (cpuset/isolation: Honour kthreads preferred affinity). Your patchset even redoes things I'm doing (housekeeping cpumask update, RCU synchronization, HK_TYPE_DOMAIN to include cpusets, etc...)
I have a v2 that is almost ready to post.
Wouldn't it be better to wait for it and its infrastructure changes before proceeding with nohz_full?
Thanks.
On 8/8/25 11:50 AM, Frederic Weisbecker wrote:
> On Fri, Aug 08, 2025 at 11:10:44AM -0400, Waiman Long wrote:
The "nohz_full" and "rcu_nocbs" boot command parameters can be used to remove a lot of kernel overhead on a specific set of isolated CPUs which can be used to run some latency/bandwidth sensitive workloads with as little kernel disturbance/noise as possible. The problem with this mode of operation is the fact that it is a static configuration which cannot be changed after boot to adjust for changes in application loading.
There is always a desire to enable runtime modification of the number of isolated CPUs that can be dedicated to this type of demanding workloads. This patchset is an attempt to do just that with an amount of CPU isolation close to what can be done with the nohz_full and rcu_nocbs boot kernel parameters.
This patch series provides the ability to change the set of housekeeping CPUs at run time via the cpuset isolated partition functionality. Currently, the cpuset isolated partition is able to disable scheduler load balancing and the CPU affinity of the unbound workqueue to avoid the isolated CPUs. This patch series will extend that with other kernel noises associated with the nohz_full boot command line parameter which has the following sub-categories:
- tick
- timer
- RCU
- MISC
- WQ
- kthread
> Thanks for working on that, I'm about to leave for 2 weeks vacation so I won't have the time to check this until I'm back.
> However this series is highly conflicting with mine (cpuset/isolation: Honour kthreads preferred affinity). Your patchset even redoes things I'm doing (housekeeping cpumask update, RCU synchronization, HK_TYPE_DOMAIN to include cpusets, etc...)
> I have a v2 that is almost ready to post.
> Wouldn't it be better to wait for it and its infrastructure changes before proceeding with nohz_full?
Sure. I am just posting this RFC patch series to show the current idea I have. I will wait for your v2 and integrate on top of it.
Looking forward to your upcoming v2 patch.
Cheers,
Longman