Hello, Yafang!
Thank you for continuing working on this!
On Sat, Mar 19, 2022 at 05:30:22PM +0000, Yafang Shao wrote:
After switching to memcg-based bpf memory accounting, the bpf memory is charged to the loader's memcg by defaut, that causes unexpected issues for us. For instance, the container of the loader-which loads the bpf programs and pins them on bpffs-may restart after pinning the progs and maps. After the restart, the pinned progs and maps won't belong to the new container any more, while they actually belong to an offline memcg left by the previous generation. That inconsistent behavior will make trouble for the memory resource management for this container.
I'm looking at this text and increasingly feeling that it's not a bpf-specific problem and it shouldn't be solved as one.
Is there any significant reason why the loader can't temporarily enter the root cgroup before creating bpf maps/progs? I know it will require some changes in the userspace code, but so do new bpf flags.
The reason why these progs and maps have to be persistent across multiple generations is that these progs and maps are also used by other processes which are not in this container. IOW, they can't be removed when this container is restarted. Take a specific example, bpf program for clsact qdisc is loaded by a agent running in a container, which not only loads bpf program but also processes the data generated by this program and do some other maintainace things.
In order to keep the charging behavior consistent, we used to consider a way to recharge these pinned maps and progs again after the container is restarted, but after the discussion[1] with Roman, we decided to go another direction that don't charge them to the container in the first place. TL;DR about the mentioned disccussion: recharging is not a generic solution and it may take too much risk.
This patchset is the solution of no charge. Two flags are introduced in union bpf_attr, one for bpf map and another for bpf prog. The user who doesn't want to charge to current memcg can use these two flags. These two flags are only permitted for sys admin as these memory will be accounted to the root memcg only.
If we're going with bpf-specific flags (which I'd prefer not to), let's define them as the way to create system-wide bpf objects which are expected to outlive the original cgroup. With expectations that they will be treated as belonging to the root cgroup by any sort of existing or future resource accounting (e.g. if we'll start accounting CPU used by bpf prgrams).
But then again: why just not create them in the root cgroup?
Thanks!