On 8/20/25 00:31, Djalal Harouni wrote:
On 8/18/25 18:32, Tejun Heo wrote:
On Mon, Aug 18, 2025 at 10:04:21AM +0100, Djalal Harouni wrote:
This patch series add support to write cgroup interfaces from BPF.
It is useful to freeze a cgroup hierarchy on suspicious activity for a more thorough analysis before killing it. Planned users of this feature are: systemd and BPF tools where the cgroup hierarchy could be a system service, user session, k8s pod or a container.
The writing happens via kernfs nodes and the cgroup must be on the default hierarchy. It implements the requests and feedback from v1 [1] where now we use a unified path for cgroup user space and BPF writing.
So I want to validate that this is the right approach first.
I don't see any reason to object to the feature but the way it's constructed seems rather odd to me. If it's going to need per-feature code, might as well bypass the write part and implement a simpler interface - ie. bpf_cgroup_freeze().
Approach 1: First RFC months ago was something like that "bpf_task_freeze_cgroup" [1], can make it bpf_cgroup_freeze() as a proper kfunc, so resurrect approach 1?
Internally it used an ugly path to workaround kernfs active reference since we don't hold a kernfs_open_file coming from userspace kernfs->write path.
I can improve it, but let's discuss please approach (2) since you suggested it ;-)
Approach 2: Per the old suggestions from you and Alexei [2] [3] you wanted something like:
s32 bpf_kernfs_knob_write(struct kernfs_node *dir, const char *knob, char *buf);
I didn't make it generic for kernfs, since don't know yet about sysfs use cases and named it "bpf_cgroup_write_interface" to focus on cgroup base interfaces. Doing something that generic now including sysfs without a proper valid use cases seems a bit too much. Also we have some cgroup kfunc to acquire and release that integrate well, so I kept it focused.
Alexei suggested to refactor the cgroup_base_file[] [4][5] to take "kernfs_node" as argument instead of "kernfs_open_file", which will open other possibilities for BPF.
However, instead of going full change on cgroup_base_files[], I added a minimalist: cgroup_kn_cftype kn_cfts[] that for now hold only "cgroup.freeze".
I see three possibilities here:
A. Minimal change with approach presented here: add dedicated array cgroup_kn_cftype kn_cfts[] with only "cgroup.freeze" and later try to unify it inside cgroup_base_file[]. B. Add "->bpf_write()" handler to cgroup_base_file[] and start only with "cgroup.freeze". C. Refactor all cgroup_base_file[] to take a kernfs_node directly instead of kernfs_open_file as suggested.
I took (C) the simple one since I wanted to do cgroup freeze first. You
Sorry here I meant (A).
Thanks.
also suggested maybe in future "cgroup.kill" well if we have it, we definitely will start using it. Not sure if we are allowed to BPF sleep that path, however we can also start doing "cgroup.freeze" from BPF and kill from user space as a first step. But we definitly want more BPF operations on cgroup interfaces, I can think of a companion cgroup where we migrate tasks there on specific signals...
So more or less current proposed approach (2) followed the suggestions, but focused only on writing cgroup kernfs knobs.
Thoughts, did I miss something?
Otherwise, can't it actually write to kernfs files so that we don't need to add code per enabled feature?
I'm not sure how we would write to kernfs files? As pointed by Alexei [6] it is more involved if we want to open files...
About "that we don't need to add code per enabled feature?" well if we go the path of "bpf_cgroup_write_interface" or "bpf_cgroup_write_knob" adding a new write interface will involve theoretically:
- Check if program can sleep or/and the calling context.
- Add the new "cgroup.x" either in cgroup_base_file[] or in the new
array. 3. New handlers or refactor old ones to take a "kernfs_node" instead of "kernfs_open_file".
Compared to having multiple bpf_cgroup_(freeze|kill|...) kfunc seems fair too, and not that much code from BPF part.
BTW current patches contain a bug, after testing. In normal writes from user context we break kernfs active protection to avoid nesting cgroup locking under. Forget this part. In next series since we don't grab the kernfs_node active protection like userspace kernfs_write, then no need to break it... will add a parameter to check like the revalidate one that checks things is cgroup on dfl and not root, things that are automatically handled from normal userspace ->write.
Thank you!
[1] https://lore.kernel.org/bpf/20240327225334.58474-3-tixxdz@gmail.com/ [2] https://lore.kernel.org/bpf/ZgXMww9kJiKi4Vmd@slm.duckdns.org/ [3] https://lore.kernel.org/bpf/Zgc1BZnYCS9OSSTw@slm.duckdns.org/ [4] https://lore.kernel.org/bpf/ CAADnVQK970_Nx3918V41ue031RkGs+WsteOAm6EJOY7oSwzS1A@mail.gmail.com/ [5] https://lore.kernel.org/bpf/ CAADnVQ+WmaPG1WOaSDbjxNPVzVape_JfG_CNSRy188ni076Mog@mail.gmail.com/ [6] https://lore.kernel.org/bpf/CAADnVQLhWDcX-7XCdo- W=jthU=9iPqODwrE6c9fvU8sfAJ5ARg@mail.gmail.com/
Thanks.