v2:
 - [v1] https://lore.kernel.org/lkml/20230412153758.3088111-1-longman@redhat.com/
 - Dropped the special "isolcpus" partition in v1
 - Add the root only "cpuset.cpus.reserve" control file for reserving
   CPUs used for remote isolated partitions.
 - Update the test_cpuset_prs.sh test script and documentation
   accordingly.
This patch series introduces a new category of cpuset partition called remote partitions. The existing partition category where the partition roots have to be clustered around the root cgroup in a hierarchical way is now referred to as adjacent partitions.
A remote partition can be formed far from the root cgroup with no partition root parent. The only commonality is that the CPUs that are used in the partition as specified in "cpuset.cpus" have to be present in the "cpuset.cpus" of all its ancestors.
It is relatively rare to have applications that require creation of a separate scheduling domain (root). However, it is more common to have applications that require the use of isolated CPUs (isolated), e.g. DPDK. One can use the "isolcpus" or "nohz_full" boot command line options to get that statically. The "isolated" partition is another way to achieve that, but dynamically.
Modern container orchestration tools like Kubernetes use the cgroup hierarchy to manage different containers, and they rely on other middleware like systemd to help manage it. If a container needs to use isolated CPUs, it is hard to get those with adjacent partitions, as that requires the administrative parent cgroup to be a partition root too, which tools like systemd may not be ready to manage.
With this patch series, a new root cgroup only "cpuset.cpus.reserve" file is added to specify the set of CPUs that can be used in partitions (whether remote or adjacent). To create a remote partition, the set of CPUs to be used in that partition (the "cpuset.cpus" file of the partition root) has to be reserved by manually adding them to that control file first. Then that partition can be activated by writing "isolated" into its "cpuset.cpus.partition". CPU reservation of adjacent partitions is done automatically without touching "cpuset.cpus.reserve" at all.
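The sequence described above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the control file names come from this cover letter, while the function name, its arguments and the CGROUP_ROOT override are hypothetical. It also assumes that writing a plain CPU list to "cpuset.cpus.reserve" sets the reserved set, that the target cgroup already exists with the cpuset controller enabled, and that the CPUs are present in the "cpuset.cpus" of all its ancestors.

```shell
#!/bin/sh
# Sketch only: the helper name, its argument order and the CGROUP_ROOT
# override are hypothetical and not part of the patch series.

# create_remote_isolated <cgroup path relative to root> <cpu list>
create_remote_isolated() {
    root="${CGROUP_ROOT:-/sys/fs/cgroup}"
    cg="$root/$1"
    cpus="$2"

    # Step 1: reserve the CPUs in the root-only control file.
    # Assumption: a plain CPU list written here sets the reserved set.
    printf '%s\n' "$cpus" > "$root/cpuset.cpus.reserve" || return 1

    # Step 2: make the same CPUs the partition root's "cpuset.cpus".
    printf '%s\n' "$cpus" > "$cg/cpuset.cpus" || return 1

    # Step 3: activate the remote partition.
    printf '%s\n' isolated > "$cg/cpuset.cpus.partition" || return 1
}
```

For example, `create_remote_isolated kubepods/pod1 2-3` (a made-up cgroup path) would reserve CPUs 2-3 and then try to turn that cgroup into a remote isolated partition. Reading "cpuset.cpus.partition" back afterwards is the usual way to see whether the partition actually became valid.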
Currently only remote isolated partitions are supported. We could support a remote scheduling partition ("root") in the future if the need arises. Additional isolation attributes like those with the "isolcpus" or "nohz" boot command line options may be supported in the isolated partitions in the future.
Waiman Long (6):
  cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE & CS_SCHED_LOAD_BALANCE
    handling
  cgroup/cpuset: Improve temporary cpumasks handling
  cgroup/cpuset: Add cpuset.cpus.reserve for top cpuset
  cgroup/cpuset: Introduce remote isolated partition
  cgroup/cpuset: Documentation update for partition
  cgroup/cpuset: Extend test_cpuset_prs.sh to test remote partition
 Documentation/admin-guide/cgroup-v2.rst       |  92 ++-
 kernel/cgroup/cpuset.c                        | 749 +++++++++++++++---
 .../selftests/cgroup/test_cpuset_prs.sh       | 403 ++++++----
 3 files changed, 988 insertions(+), 256 deletions(-)