On Wed, Apr 3, 2024 at 6:01 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
On Wed, Apr 3, 2024 at 11:50 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
On Wed, Mar 27, 2024 at 10:02 AM Benjamin Tissoires <benjamin.tissoires@redhat.com> wrote:
		goto out;
	}
	spin_lock(&t->sleepable_lock);
	drop_prog_refcnt(t);
	spin_unlock(&t->sleepable_lock);
This also looks odd.
I basically need to protect "t->prog = NULL;" from happening while bpf_timer_work_cb is setting up the bpf program to be run.
Ok. I think I understand the race you're trying to fix. The bpf_timer_cancel_and_free() is doing cancel_work() and proceeds with kfree_rcu(t, rcu);
That's the only race and these extra locks don't help.
The t->prog = NULL is nothing to worry about. bpf_timer_work_cb() might still see callback_fn == NULL "when it's being set up", and that's ok. These locks don't help with that either.
I suggest dropping sleepable_lock everywhere. READ_ONCE() of callback_fn in bpf_timer_work_cb() is enough. Add rcu_read_lock_trace() before calling the bpf prog.
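Something along these lines (an untested sketch against your patch, not final code; the map key computation that bpf_timer_cb() does is elided and 0 is passed instead):

static void bpf_timer_work_cb(struct work_struct *work)
{
	struct bpf_hrtimer *t = container_of(work, struct bpf_hrtimer, work);
	bpf_callback_t callback_fn;

	/* Sleepable progs are freed via call_rcu_tasks_trace(), so a
	 * tracing-RCU read section keeps the prog alive while it runs.
	 */
	rcu_read_lock_trace();

	/* No sleepable_lock. If bpf_timer_cancel_and_free() already
	 * cleared callback_fn, simply skip running the prog.
	 */
	callback_fn = READ_ONCE(t->callback_fn);
	if (callback_fn)
		callback_fn((u64)(long)t->map, 0, (u64)(long)t->value, 0, 0);

	rcu_read_unlock_trace();
}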
The race to fix is the above 'cancel_work + kfree_rcu' sequence: kfree_rcu() might free 'struct bpf_hrtimer *t' while the work is still pending, and the workqueue internals might then use-after-free 'struct work_struct work'. Even if execution luckily makes it into bpf_timer_work_cb(), it's too late: the 'struct work_struct *work' argument might already be freed.
To fix this problem, how about the following: don't call kfree_rcu and instead queue the work to free it. After cancel_work(&t->work); the work_struct can be reused. So set it up to call a "freeing callback" and do schedule_work(&t->work);
There is a big assumption here that new work won't be executed before the cancelled work completes. Need to check with wq experts.
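IOW, a rough sketch of the idea (bpf_timer_free_cb is a made-up name here; and per the correction further down, this ordering turns out not to be guaranteed):

static void bpf_timer_free_cb(struct work_struct *work)
{
	struct bpf_hrtimer *t = container_of(work, struct bpf_hrtimer, work);

	kfree_rcu(t, rcu);
}

/* in bpf_timer_cancel_and_free(): */
	cancel_work(&t->work);
	/* reuse the same work_struct to run the freeing callback */
	INIT_WORK(&t->work, bpf_timer_free_cb);
	schedule_work(&t->work);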
Another approach is to do something smart with the cancel_work() return code. If it returns true, set a flag inside bpf_hrtimer and make bpf_timer_work_cb() free 't' after the bpf prog finishes.
Looking through wq code... I think I have to correct myself. cancel_work and immediate free is probably fine from the wq pov. It has this comment:

	worker->current_func(work);
	/*
	 * While we must be careful to not use "work" after this, the trace
	 * point will only record its address.
	 */
	trace_workqueue_execute_end(work, worker->current_func);
But bpf_timer_work_cb() might still be running the bpf prog, so it shouldn't touch 'struct bpf_hrtimer *t' after the bpf prog returns, since kfree_rcu(t, rcu); could have freed it by then. There is also this code in net/rxrpc/rxperf.c:

	cancel_work(&call->work);
	kfree(call);
Correction to correction. The above piece in rxrpc is buggy. The following race is possible:

cpu 0                                          cpu 1
process_one_work()
  set_work_pool_and_clear_pending(work, pool->id, 0);
                                               cancel_work()
                                               kfree_rcu(work)
  worker->current_func(work);
Here 'work' is a pointer to freed memory. Though the wq code will not be touching it, the callback will UAF.
Also what I proposed earlier as:

	INIT_WORK(A); schedule_work();
	cancel_work();
	INIT_WORK(B); schedule_work();

won't guarantee the ordering. Since the callback function is different, find_worker_executing_work() will consider it a separate work item.
Another option is to keep the bpf_timer_work_cb callback and add a 'bool free_me;' to struct bpf_hrtimer and let the callback free it. But it's also racy: cancel_work() may return false even though worker->current_func(work) hasn't been called yet, so we cannot set 'free_me' in bpf_timer_cancel_and_free() in a race-free manner.
After brainstorming with Tejun, it seems the best approach is to use another work_struct to call a different callback and do cancel_work_sync() there.
So we need something like:
struct bpf_hrtimer {
	union {
		struct hrtimer timer;
+		struct work_struct work;
	};
	struct bpf_map *map;
	struct bpf_prog *prog;
	void __rcu *callback_fn;
	void *value;

	union {
		struct rcu_head rcu;
+		struct work_struct sync_work;
	};
+	u64 flags; // bpf_timer_init() will require BPF_F_TIMER_SLEEPABLE
};
'work' will be used to call bpf_timer_work_cb. 'sync_work' will be used to call cancel_work_sync() + kfree_rcu().
And, of course, schedule_work(&t->sync_work); from bpf_timer_cancel_and_free() instead of kfree_rcu.
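Roughly, an untested sketch of the whole scheme (bpf_timer_sync_work_cb is a made-up name):

static void bpf_timer_sync_work_cb(struct work_struct *work)
{
	struct bpf_hrtimer *t = container_of(work, struct bpf_hrtimer, sync_work);

	/* Runs in wq context, so sleeping is fine: wait for a possibly
	 * running bpf_timer_work_cb() to finish, then free 't'.
	 * kfree_rcu() reuses the memory of sync_work for the rcu_head,
	 * which is fine since wq won't touch the work after this point.
	 */
	cancel_work_sync(&t->work);
	kfree_rcu(t, rcu);
}

/* in bpf_timer_init() for BPF_F_TIMER_SLEEPABLE timers: */
	INIT_WORK(&t->work, bpf_timer_work_cb);
	INIT_WORK(&t->sync_work, bpf_timer_sync_work_cb);

/* in bpf_timer_cancel_and_free(): */
	if (t->flags & BPF_F_TIMER_SLEEPABLE)
		schedule_work(&t->sync_work);
	else
		kfree_rcu(t, rcu);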