On Thu, Apr 4, 2024 at 10:56 AM Benjamin Tissoires benjamin.tissoires@redhat.com wrote:
On Thu, Apr 4, 2024 at 6:41 PM Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
On Thu, Apr 4, 2024 at 8:27 AM Benjamin Tissoires benjamin.tissoires@redhat.com wrote:
So we need something like:
struct bpf_hrtimer { union { struct hrtimer timer;
- struct work_struct work; }; struct bpf_map *map; struct bpf_prog *prog; void __rcu *callback_fn; void *value; union {
Are you sure we need an union here? If we get to call kfree_rcu() we need to have both struct rcu_head and sync_work, not one or the other.
why? with an extra flag it's one or the other. In bpf_timer_cancel_and_free() if (flag & SLEEPABLE) { schedule_work() to cancel_work_sync + kfree_rcu } else { hrtimer_cancel kfree_rcu }
I thought kfree_rcu required struct rcu_head, and given that we need to initialize sync_work it will be poisoned...
yes. It needs rcu_head. But where do you see a conflict? INIT_WORK + schedule_work() will use that space, then cancel_work_sync() will wait on a different work_struct, then kfree_rcu() will reuse that space.
In case of hrtimers none of the work_structs will be used.
struct rcu_head rcu;
- struct work_struct sync_work; };
- u64 flags; // bpf_timer_init() will require BPF_F_TIMER_SLEEPABLE
If I understand, you want BPF_F_TIMER_SLEEPABLE in bpf_timer_init() (like in my v2 or v3 IIRC). But that means that once a timer is initialized it needs to be of one or the other type (especially true with the first union in this struct).
yes. That's an idea. The code to support wq vs timer seems to be diverging more than what we expected initially. It seems cleaner to set it as init time and enforce in other helpers.
OK, works for me.
Also with two work_struct-s we're pushing the sizeof(bpf_hrtimer) too far. It's already at 112 bytes and some people use bpf_timer per flow. So potentially millions of such timers. Adding extra sizeof(struct work_struct)=32 * 2 that won't be used is too much. Note that sizeof(struct hrtimer)=64, so unions make everything fit nicely.
Maybe we should do union { struct hrtimer timer; struct { struct work_struct work; struct work_struct sync_work; } }
It's also ok, but sharing rcu_head and work_struct seems cleaner, since it highlights that they're exclusive.
(not nice to read but at least we don't change the size at the beginning)
So should we reject during run time bpf_timer_set_callback() for sleepable timers and only allow bpf_timer_set_sleepable_cb() for those? (and the invert in the other case).
yes.
This version of the patch allows for one timer to be used as softIRQ or WQ, depending on the timer_set_callback that is used. But it might be simpler for the kfree_rcu race to define the bpf_timer to be of one kind, so we are sure to call the correct kfree method.
I think one or another simplifies the code and makes it easier to think through combinations.
I'm still contemplating adding new "struct bpf_wq" and new kfuncs to completely separate wq vs timer. The code reuse seems to be relatively small.
There is some code reuse in the verifier, but it can be factored out I think.
Though the biggest reuse might be in the map portion of bpf_timer, which I haven't looked much TBH.
Right. It's all the 'case BPF_TIMER:' in various places. New 'struct bpf_wq' would need another entry in btf_field_type. But that should be a straightforward addition.
We can potentially factor out internals of bpf_timer_* into smaller helpers and use them from bpf_timer_* and from new bpf_wq_* kfuncs.
Yeah, also, given that we are going to enforce delay == 0 for sleepable timers (wq), the user api would be much cleaner if we can have a dedicated bpf_wq (and it would make the flags of bpf_timer_init easier to deal with).
It seems so. Kinda hard to judge one way or the other without looking at the final code, but it seems separation is worth attempting, at least.
Also if we ever do hrtimer+wq we probably will be using 'struct delayed_work' instead of rolling our own 'struct hrtimer' + 'struct work_struct' combo.
It seems wq logic already made such a combination special enough and thought through the races, so we better just follow that path. In that case it might be yet another 'struct bpf_delayed_wq' and another set of kfuncs. Considering that cancel_work() and cancel_delayed_work() are separate api in the kernel. Multiplexing all of them under bpf_timer_cancel() seems wrong. In the past we were somewhat limited in terms of helpers. We tried not to add them unless absolutely necessary because of uapi considerations. Now with kfuncs we can add/tweak/remove them at will.
One more thing. bpf_timer_cancel() api turned out to be troublesome. Not only it cancels the timer, but drops callback too. It was a surprising behavior for people familiar with kernel timer api-s. We should not repeat this mistake with wq.
We can try to fix bpf_timer_cancel() too. If we drop drop_prog_refcnt() from it it shouldn't affect existing bpf_timer users who are forced to do: bpf_timer_cancel() bpf_timer_set_callback() bpf_timer_start() all the time. If/when bpf_timer_cancel() stops dropping the callback such bpf prog won't be affected. So low chance of breaking any prog. wdyt?
How would a program know set_callback() is not required after a cancel() because the kernel kept it around? It seems that it's going to be hard for them to know that (unless by trying first a start()), and it will add more code.
timer_cancel() would be hard to change but we can always do the change and add a new kfunc timer_cancel_no_drop() which would clearly allow
that works too.
for new programs to know that set_callback() is not required to be called. In a few kernel releases we could remove it and say that timer_cancel() is the same (and replaced by a #define)
#define won't work, since mechanics of detecting and calling helpers vs kfuncs is quite different.
Anyway, the more I think of it, the more I think the best API would be a dedicated wq API. It's probably going to need a little bit more work, but it'll be more or less this work plus the new bpf_wq type in the map.
It seems to me as well.
Thanks for brainstorming.