Hello,
On Mon, Jun 05, 2023 at 04:00:39PM -0400, Waiman Long wrote: ...
file seems hacky to me. E.g., how would it interact with namespacing? Are there reasons why this can't be properly hierarchical other than the amount of work needed? For example:
cpuset.cpus.exclusive is a per-cgroup file and represents the mask of CPUs that the cgroup holds exclusively. The mask is always a subset of cpuset.cpus. The parent loses access to a CPU when the CPU is given to a child by setting the CPU in the child's cpus.exclusive and the CPU can't be given to more than one child. IOW, exclusive CPUs are available only to the leaf cgroups that have them set in their .exclusive file.
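To make the three rules above concrete, here is a toy user-space model in Python. It is only a sketch of the proposal as worded here, not kernel code: the Cgroup class and its set_exclusive() and usable_cpus() methods are hypothetical names, and it models just one parent/child level.

```python
class Cgroup:
    """Toy model of one cpuset cgroup (hypothetical, not kernel code)."""

    def __init__(self, name, cpus, parent=None):
        self.name = name
        self.cpus = set(cpus)      # models cpuset.cpus
        self.exclusive = set()     # models the proposed cpuset.cpus.exclusive
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def set_exclusive(self, cpus):
        cpus = set(cpus)
        # Rule 1: the exclusive mask is always a subset of cpuset.cpus.
        if not cpus <= self.cpus:
            raise ValueError(f"{self.name}: exclusive CPUs must be in cpuset.cpus")
        # Rule 2: a CPU can't be given to more than one child of the same parent.
        if self.parent is not None:
            for sibling in self.parent.children:
                if sibling is not self and sibling.exclusive & cpus:
                    raise ValueError(f"{self.name}: CPU already exclusive to {sibling.name}")
        self.exclusive = cpus

    def usable_cpus(self):
        # Rule 3: the parent loses access to CPUs its children hold exclusively.
        taken = set()
        for child in self.children:
            taken |= child.exclusive
        return self.cpus - taken


root = Cgroup("root", range(8))
a = Cgroup("a", range(0, 4), parent=root)
b = Cgroup("b", range(2, 8), parent=root)
a.set_exclusive({2, 3})
print(sorted(root.usable_cpus()))  # [0, 1, 4, 5, 6, 7]
```

In this model, a later b.set_exclusive({3, 4}) raises because CPU 3 is already held by a. Whether a real write should fail or be accepted in some degraded state is a design question the sketch does not settle.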
When a cgroup is turned into a partition, its cpuset.cpus and cpuset.cpus.exclusive should be the same. For backward compatibility, if the cgroup's parent is already a partition, cpuset will automatically attempt to add all cpus in cpuset.cpus into cpuset.cpus.exclusive.
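The partition transition described above can be sketched the same way. This is a hedged illustration only: enable_partition() and its grantable argument (the CPUs the parent could still hand out exclusively) are hypothetical, and raising an error when the masks cannot be made equal is an assumption, since the text does not say what happens on failure.

```python
def enable_partition(cpus, exclusive, parent_is_partition, grantable):
    """Return the exclusive mask a cgroup ends up with as a partition root.

    cpus      -- models cpuset.cpus
    exclusive -- models the proposed cpuset.cpus.exclusive
    grantable -- CPUs the parent could still grant exclusively (assumed input)
    """
    cpus, exclusive = set(cpus), set(exclusive)
    if parent_is_partition and exclusive != cpus:
        # Backward-compatible path: attempt to add every CPU in
        # cpuset.cpus to the exclusive mask.
        missing = cpus - exclusive
        if missing <= set(grantable):
            exclusive = set(cpus)
    # A partition needs cpuset.cpus and cpuset.cpus.exclusive to match.
    # Failing here is an assumption; the proposal leaves this open.
    if exclusive != cpus:
        raise ValueError("cpuset.cpus and cpuset.cpus.exclusive differ")
    return exclusive


# Parent is already a partition, so CPUs 1 and 2 are pulled in automatically.
print(sorted(enable_partition({0, 1, 2}, {0}, True, {1, 2, 3})))  # [0, 1, 2]
```

With parent_is_partition set to False the same call raises, because nothing fills in the missing CPUs automatically.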
I could well be missing something important but I'd really like to see something like the above where the reservation feature blends in with the rest of cpuset.
It can certainly be made hierarchical as you suggest. It does increase complexity from both user and kernel point of view.
From the user point of view, there is one more knob to manage hierarchically which is not used that often.
From user pov, this only affects them when they want to create partitions down the tree, right?
From the kernel point of view, we may need to have one more cpumask per cpuset as the current subparts_cpus is used to track automatic reservation. We need another cpumask to contain extra exclusive CPUs not allocated through automatic reservation. You describe this new control file as a list of exclusively owned CPUs for this cgroup. Creating a partition is in fact allocating exclusive CPUs to a cgroup. So it kind of overlaps with the cpuset.cpus.partition file. Can we fail a write to
Yes, it substitutes and expands on cpuset.cpus.partition behavior.
cpuset.cpus.exclusive if those exclusive CPUs cannot be granted, or will this exclusive list be valid only if a valid partition can be formed? So we need to properly manage the dependency between these 2 control files.
So, I think cpus.exclusive can become the sole mechanism to arbitrate exclusive ownership of CPUs and .partition can depend on .exclusive.
Alternatively, I have no problem exposing cpuset.cpus.exclusive as a read-only file. It is a bit problematic if we need to make it writable.
I don't follow. How would remote partitions work then?
As for namespacing, you do raise a good point. I was thinking mostly from a whole system point of view as the use case that I am aware of does not need that. To allow delegation of exclusive CPUs to a child cgroup, that cgroup has to be a partition root itself. One compromise that I can think of is to allow only automatic reservation in such a scenario. In that case, I need to support a remote load balanced partition as well and hierarchical sub-partitions underneath it. That can be done with some extra code to the existing v2 patchset without introducing too much complexity.
IOW, the use of remote partitions is only allowed at the whole system level where one has access to the cgroup root. Exclusive CPU distribution within a container can only be done via the use of adjacent partitions with automatic reservation. Will that be a good enough compromise from your point of view?
It seems too twisted to me. I'd much prefer it to be better integrated with the rest of cpuset.
Thanks.