On Thu, Apr 4, 2024 at 6:41 PM Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
On Thu, Apr 4, 2024 at 8:27 AM Benjamin Tissoires benjamin.tissoires@redhat.com wrote:
So we need something like:
struct bpf_hrtimer { union { struct hrtimer timer;
- struct work_struct work; }; struct bpf_map *map; struct bpf_prog *prog; void __rcu *callback_fn; void *value; union {
Are you sure we need an union here? If we get to call kfree_rcu() we need to have both struct rcu_head and sync_work, not one or the other.
why? with an extra flag it's one or the other. In bpf_timer_cancel_and_free() if (flag & SLEEPABLE) { schedule_work() to cancel_work_sync + kfree_rcu } else { hrtimer_cancel kfree_rcu }
I thought kfree_rcu required struct rcu_head, and given that we need to initialize sync_work it will be poisoned...
struct rcu_head rcu;
- struct work_struct sync_work; };
- u64 flags; // bpf_timer_init() will require BPF_F_TIMER_SLEEPABLE
If I understand, you want BPF_F_TIMER_SLEEPABLE in bpf_timer_init() (like in my v2 or v3 IIRC). But that means that once a timer is initialized it needs to be of one or the other type (especially true with the first union in this struct).
yes. That's an idea. The code to support wq vs timer seems to be diverging more than what we expected initially. It seems cleaner to set it as init time and enforce in other helpers.
OK, works for me.
Also with two work_struct-s we're pushing the sizeof(bpf_hrtimer) too far. It's already at 112 bytes and some people use bpf_timer per flow. So potentially millions of such timers. Adding extra sizeof(struct work_struct)=32 * 2 that won't be used is too much. Note that sizeof(struct hrtimer)=64, so unions make everything fit nicely.
Maybe we should do union { struct hrtimer timer; struct { struct work_struct work; struct work_struct sync_work; } }
(not nice to read but at least we don't change the size at the beginning)
So should we reject during run time bpf_timer_set_callback() for sleepable timers and only allow bpf_timer_set_sleepable_cb() for those? (and the invert in the other case).
yes.
This version of the patch allows for one timer to be used as softIRQ or WQ, depending on the timer_set_callback that is used. But it might be simpler for the kfree_rcu race to define the bpf_timer to be of one kind, so we are sure to call the correct kfree method.
I think one or another simplifies the code and makes it easier to think through combinations.
I'm still contemplating adding new "struct bpf_wq" and new kfuncs to completely separate wq vs timer. The code reuse seems to be relatively small.
There is some code reuse in the verifier, but it can be factored out I think.
Though the biggest reuse might be in the map portion of bpf_timer, which I haven't looked much TBH.
We can potentially factor out internals of bpf_timer_* into smaller helpers and use them from bpf_timer_* and from new bpf_wq_* kfuncs.
Yeah, also, given that we are going to enforce delay == 0 for sleepable timers (wq), the user api would be much cleaner if we can have a dedicated bpf_wq (and it would make the flags of bpf_timer_init easier to deal with).
One more thing. bpf_timer_cancel() api turned out to be troublesome. Not only it cancels the timer, but drops callback too. It was a surprising behavior for people familiar with kernel timer api-s. We should not repeat this mistake with wq.
We can try to fix bpf_timer_cancel() too. If we drop drop_prog_refcnt() from it it shouldn't affect existing bpf_timer users who are forced to do: bpf_timer_cancel() bpf_timer_set_callback() bpf_timer_start() all the time. If/when bpf_timer_cancel() stops dropping the callback such bpf prog won't be affected. So low chance of breaking any prog. wdyt?
How would a program know set_callback() is not required after a cancel() because the kernel kept it around? It seems that it's going to be hard for them to know that (unless by trying first a start()), and it will add more code.
timer_cancel() would be hard to change but we can always do the change and add a new kfunc timer_cancel_no_drop() which would clearly allow for new programs to know that set_callback() is not required to be called. In a few kernel releases we could remove it and say that timer_cancel() is the same (and replaced by a #define)
Anyway, the more I think of it, the more I think the best API would be a dedicated wq API. It's probably going to need a little bit more work, but it'll be more or less this work plus the new bpf_wq type in the map.
Cheers, Benjamin