On 11/15/21 14:31, Michal Koutný wrote:
Hello.
On Mon, Oct 18, 2021 at 10:36:18AM -0400, Waiman Long longman@redhat.com wrote:
- When set to "isolated", the CPUs in that partition root will
- be in an isolated state without any load balancing from the
- scheduler. Tasks in such a partition must be explicitly bound
- to each individual CPU.
This sounds reasonable but it seems to have some usability issues as was raised in another thread [1]. (I could only think of the workaround of single-cpu cgroup leaves + CLONE_INTO_CGROUP.)
It can be a problem when one is trying to move from one cgroup to another cgroup with non-overlapping cpus laterally. However, if a task is initially from a parent cgroup with affinity mask that include cpus in the isolated child cgroup, I believe it should be able to move to the isolated child cgroup without problem. Otherwise, it is a bug that needs to be fixed.
TL;DR Do whatever you find suitable but (re)consider sticking to the delegation principle (making hotplug and ancestor changes equal).
Now to the constraints and partition setups. I think it's useful to have a model with which the implementation can be compared with. I tried to condense some "simple rules" from the descriptions you posted in v8 plus your response to my remarks in v7 [2]. These should only be the "validity conditions", not "transition conditions".
## Validity conditions
For simplification, there's a condition called 'degraded' that tells whether a cpuset can host tasks (with the given config) that expands to two predicates:
degraded := cpus.internal_effective == ø && has_tasks valid_root := !degraded && cpus_exclusive && parent.valid_root (valid_member := !degraded)
with a helping predicate cpus_exclusive := cpus not shared by a sibling
The effective CPUs basically combine configured+available CPUs
cpus.internal_effective := (cpus ∩ parent.cpus ∩ online_cpus) - passed
where passed := union of children cpus whose partition is not member
Finally, to handle the degraded cpusets gracefully, we define
if (!degraded) cpus.effective := cpus.internal_effective else cpus.effective := parent.cpus.effective
(In cases when there's no parent, we replace its cpus with online_cpus.)
I'll try applying these conditions to your description.
- "cpuset.cpus" must always be set up first before enabling
- partition.
This is just a transition condition.
Unlike "member" whose "cpuset.cpus.effective" can
- contain CPUs not in "cpuset.cpus", this can never happen with a
- valid partition root. In other words, "cpuset.cpus.effective"
- is always a subset of "cpuset.cpus" for a valid partition root.
IIUC this refers to the cgroup that is 'degraded'. (The consequences for a valid partition root follow from valid_root definition above.)
- When a parent partition root cannot exclusively grant any of
- the CPUs specified in "cpuset.cpus", "cpuset.cpus.effective"
- becomes empty.
This sounds too strict to me, perhaps you meant 'cannot grant _all_ of the CPUs'?
I think the wording may be confusing. What I meant is none of the requested cpu can be granted. So if there is at least one granted, the effective cpus won't be empty.
If there are tasks in the partition root, the
- partition root becomes invalid and "cpuset.cpus.effective"
- is reset to that of the nearest non-empty ancestor.
This is captured in the definition of 'degraded'.
Note that a task cannot be moved to a croup with empty
"cpuset.cpus.effective".
A transition condition. (Makes sense.)
[With the validity conditions above, it's possible to have 'valid_root' with empty cpus (hence also empty cpus.internal_effective) if there are no tasks in there. The transition conditions so far prevented this corner case.]
- There are additional constraints on where a partition root can
- be enabled ("root" or "isolated"). It can only be enabled in
- a cgroup if all the following conditions are met.
I think the enablement (aka rewriting cpuset.cpus.partition) could be always possible but it'd result in "root invalid (...)" if the resulting config doesn't meet the validity condition.
- The "cpuset.cpus" is non-empty and exclusive, i.e. they are
not shared by any of its siblings.
The emptiness here is a judgement call (in my formulation of the conditions it seemed simpler to allow empty cpus.internal_effective with no tasks).
There are more constraints in enabling a partition. Once it is enabled, there will be less constraints to maintain its validity.
- The parent cgroup is a valid partition root.
Captured in the valid_root definition.
- The "cpuset.cpus" is a subset of parent's "cpuset.cpus".
This is unnecessary strictness. Allow such config, cpus.internal_effective still can't be more than parent's cpuset.cpus. (Or do you have a reason to discard such configs?)
- There is no child cgroups with cpuset enabled. This avoids
cpu migrations of multiple cgroups simultaneously which can
be problematic.
A transition condition (i.e. not relevant to validity conditions).
- Once becoming a partition root, changes to "cpuset.cpus"
- is generally allowed as long as the cpu list is exclusive,
- non-empty and is a superset of children's cpu lists.
Any changes should be allowed otherwise it denies the delegation principle of v2 (IOW a parent should be able to preempt CPUs given to chilren previously and not be denied because of them).
(If the change results in failed validity condition the cgroup of course cannot be be a valid_root anymore.)
The constraints of a valid partition root are as follows:
1) The parent cgroup is a valid partition root.
2) "cpuset.cpus.effective" is a subset of "cpuset.cpus"
3) "cpuset.cpus.effective" is non-empty when there are tasks
in the partition.
(This seem to miss the sibling exclusivity condition.) Here I'd simply paste the "Validity conditions" specified above instead.
You currently cannot make change to cpuset.cpus that violates the cpu exclusivity rule. The above constraints will not disallow you to make the change. They just affect the validity of the partition root.
Changing a partition root to "member" is always allowed.
If there are child partition roots underneath it, however,
they will be forced to be switched back to "member" too and
lose their partitions. So care must be taken to double check
for this condition before disabling a partition root.
(Or is this how delegation is intended?) However, AFAICS, parent still can't remove cpuset.cpus even when the child is a "member". Otherwise, I agree with the back-switch.
There are only 2 possibilities here. Either we force the child partitions to be become members or invalid partition root. The purpose of invalid partition root is actually a transient state which can be recovered in some way to make the partition again. However, changing a parent partition root to member breaks the possibility of recovering later. That is why I think it is more sensible to force those child partitions to become members.
- Setting a cgroup to a valid partition root will take the CPUs
- away from the effective CPUs of the parent partition.
Captured in the definition of cpus.internal_effective.
- A valid parent partition may distribute out all its CPUs to
- its child partitions as long as it is not the root cgroup as
- we need some house-keeping CPUs in the root cgroup.
This actually applies to any root partition that's supposed to host tasks. (IOW, 'valid_root' cannot be 'degraded'.)
- An invalid partition is not a real partition even though some
- internal states may still be kept.
Tautology? (Or new definition of "real".)
- An invalid partition root can be reverted back to a real
- partition root if none of the constraints of a valid partition
root are violated.
Yes. (Also tautological.)
Anyway, as I said above, I just tried to formulate the model for clearer understanding and the implementation may introduce transition constraints but it'd be good to always have the simple rules to tell what's a valid root in the tree and what's not.
Thanks for analyzing each statements for their validity. I will try to improve it to make it easier to understand.
Cheers, Longman