Syzbot found a corrupted list bug scenario that can be triggered from cgroup css_create(). The reproducer writes to the cgroup.subtree_control file, which invokes cgroup_apply_control_enable(), css_create(), and css_populate_dir(), which then randomly fails with a fault-injected -ENOMEM. In such a scenario the css_create() error path enqueues the css_free_rwork_fn work via RCU for a css->refcnt initialized with the css_release() destructor, and there is a chance that css_release() will be invoked for a cgroup_subsys_state for which a destroy_work has already been queued via the css_create() error path. This causes list_add corruption, as can be seen in the syzkaller report [1]. This can be avoided by adding a check to css_release() that verifies the work has not already been enqueued.
[1] https://syzkaller.appspot.com/bug?id=e26e54d6eac9d9fb50b221ec3e4627b327465db...
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: cgroups@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: bpf@vger.kernel.org
Cc: stable@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reported-by: syzbot+e42ae441c3b10acf9e9d@syzkaller.appspotmail.com
Fixes: 8f36aaec9c92 ("cgroup: Use rcu_work instead of explicit rcu and work item")
Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
---
 kernel/cgroup/cgroup.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index adb820e98f24..9ae2de29f8c9 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5210,8 +5210,11 @@ static void css_release(struct percpu_ref *ref)
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
-	INIT_WORK(&css->destroy_work, css_release_work_fn);
-	queue_work(cgroup_destroy_wq, &css->destroy_work);
+	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT,
+			      work_data_bits(&css->destroy_work))) {
+		INIT_WORK(&css->destroy_work, css_release_work_fn);
+		queue_work(cgroup_destroy_wq, &css->destroy_work);
+	}
 }
 
 static void init_and_link_css(struct cgroup_subsys_state *css,
Hello Tadeusz.
Thanks for analyzing this syzbot report. Let me provide my understanding of the test case and an explanation of why I think your patch fixes it but is not fully correct.
On Tue, Apr 12, 2022 at 12:24:59PM -0700, Tadeusz Struk <tadeusz.struk@linaro.org> wrote:
Syzbot found a corrupted list bug scenario that can be triggered from cgroup css_create(). The reproducer writes to the cgroup.subtree_control file, which invokes cgroup_apply_control_enable(), css_create(), and css_populate_dir(), which then randomly fails with a fault-injected -ENOMEM.
The reproducer code makes it hard for me to understand which function fails with ENOMEM. But I can see that your patch fixes the reproducer, and your additional debug patch proves that css->destroy_work is re-queued.
In such a scenario the css_create() error path enqueues the css_free_rwork_fn work via RCU for a css->refcnt initialized with the css_release() destructor,
Note that css_free_rwork_fn() utilizes css->destroy_*r*work. The error path in css_create() open codes relevant parts of css_release_work_fn() so that css_release() can be skipped and the refcnt is eventually just percpu_ref_exit()'d.
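(For reference, the tail of that error path looks roughly like this in v5.10, abridged:)

	err_list_del:
		list_del_rcu(&css->sibling);
	err_free_css:
		list_del_rcu(&css->rstat_css_node);
		/* bypass css_release(): free after an RCU grace period and
		 * just percpu_ref_exit() the refcnt in css_free_rwork_fn() */
		INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
		queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
		return ERR_PTR(err);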
and there is a chance that css_release() will be invoked for a cgroup_subsys_state for which a destroy_work has already been queued via the css_create() error path.
But I think the problem is css_populate_dir() failing in cgroup_apply_control_enable(). (Is this what you actually meant? css_create() error path is then irrelevant, no?)
The already created csses should then be rolled back via

	cgroup_restore_control(cgrp);
	cgroup_apply_control_disable(cgrp);
	  ...
	    kill_css(css)
I suspect the double-queuing is a result of the fact that there exists only the single reference to the css->refcnt. I.e. it's percpu_ref_kill_and_confirm()'d and released both at the same time.
(Normally (when not killing the last reference), css->destroy_work reuse is not a problem because of the sequenced chain css_killed_work_fn()->css_put()->css_release().)
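(For illustration, css_killed_work_fn() in v5.10, slightly abridged; the css_put() at the end is what eventually leads to css_release():)

	static void css_killed_work_fn(struct work_struct *work)
	{
		struct cgroup_subsys_state *css =
			container_of(work, struct cgroup_subsys_state, destroy_work);

		mutex_lock(&cgroup_mutex);

		do {
			offline_css(css);
			/* puts the extra ref taken in kill_css(); if this is
			 * the last ref, css_release() re-queues
			 * css->destroy_work -- safe, because this work item
			 * is already running, not pending */
			css_put(css);
			css = css->parent;
		} while (css && atomic_dec_and_test(&css->online_cnt));

		mutex_unlock(&cgroup_mutex);
	}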
This can be avoided by adding a check to css_release() that verifies the work has not already been enqueued.
If that's what's happening, then your patch omits the final css_release_work_fn() in favor of css_killed_work_fn() but both should be run during the rollback upon css_populate_dir() failure.
So an alternative approach to tackle this situation would be to split css->destroy_work into two separate work_structs (one for killing, one for releasing) at the cost of inflating cgroup_subsys_state.
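(A sketch of what I mean; field names are illustrative only:)

	struct cgroup_subsys_state {
		...
		/* separate items, so a pending kill work and a pending
		 * release work can never collide on the same work_struct */
		struct work_struct killed_work;		/* css_killed_ref_fn() */
		struct work_struct release_work;	/* css_release() */
		struct rcu_work destroy_rwork;		/* RCU-delayed free */
		...
	};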
Take my hypothesis with a grain of salt; maybe the assumption (last reference == initial reference) is no different from normal operation.
Regards, Michal
Hi Michal, Thanks for your analysis.
On 4/14/22 09:44, Michal Koutný wrote:
Hello Tadeusz.
Thanks for analyzing this syzbot report. Let me provide my understanding of the test case and an explanation of why I think your patch fixes it but is not fully correct.
On Tue, Apr 12, 2022 at 12:24:59PM -0700, Tadeusz Struk <tadeusz.struk@linaro.org> wrote:
Syzbot found a corrupted list bug scenario that can be triggered from cgroup css_create(). The reproducer writes to the cgroup.subtree_control file, which invokes cgroup_apply_control_enable(), css_create(), and css_populate_dir(), which then randomly fails with a fault-injected -ENOMEM.
The reproducer code makes it hard for me to understand which function fails with ENOMEM. But I can see that your patch fixes the reproducer, and your additional debug patch proves that css->destroy_work is re-queued.
Yes, it is hard to see the actual failing point because I think it is randomly failing in different places. I believe the failure that actually causes the list corruption is in fact in css_create(). It is the css_create() error path that does the first rcu enqueue in:
https://elixir.bootlin.com/linux/v5.10.109/source/kernel/cgroup/cgroup.c#L52...
and the second is triggered by the css->refcnt calling css_release().
The reason why we don't see it actually failing in css_create() in the trace dump is that the fail_dump() is rate-limited, see: https://elixir.bootlin.com/linux/v5.18-rc2/source/lib/fault-inject.c#L44
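(The relevant check in fail_dump(), roughly:)

	static void fail_dump(struct fault_attr *attr)
	{
		if (attr->verbose > 0 && __ratelimit(&attr->ratelimit_state)) {
			/* only then is the "FAULT_INJECTION: forcing a
			 * failure" splat printed; subsequent injected
			 * faults can pass silently */
			...
		}
	}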
I was confused as well, so I put additional debug prints in every place where css_create() can fail, and it was actually css_create()->cgroup_idr_alloc() that failed in my case.
What happened was, the write triggered: cgroup_subtree_control_write()->cgroup_apply_control()->cgroup_apply_control_enable()->css_create()
which allocates and initializes the css, then fails in cgroup_idr_alloc(), bails out, and calls queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
then cgroup_subtree_control_write() bails out to out_unlock:, which then goes:
cgroup_kn_unlock()->cgroup_put()->css_put()->percpu_ref_put(&css->refcnt)->percpu_ref_put_many(ref)
which then calls ref->data->release(ref) and enqueues the same &css->destroy_rwork on cgroup_destroy_wq, causing list corruption in insert_work.
In such a scenario the css_create() error path enqueues the css_free_rwork_fn work via RCU for a css->refcnt initialized with the css_release() destructor,
Note that css_free_rwork_fn() utilizes css->destroy_*r*work. The error path in css_create() open codes relevant parts of css_release_work_fn() so that css_release() can be skipped and the refcnt is eventually just percpu_ref_exit()'d.
and there is a chance that css_release() will be invoked for a cgroup_subsys_state for which a destroy_work has already been queued via the css_create() error path.
But I think the problem is css_populate_dir() failing in cgroup_apply_control_enable(). (Is this what you actually meant? css_create() error path is then irrelevant, no?)
I thought so too at first, as the crash dump shows that this is failing in css_populate_dir(), but this is not the failure that causes the list corruption. The code can recover from the failure in css_populate_dir(). The failure that causes trouble is in css_create(), which makes it go to its error path. I can dig out the patch with my debug prints and ask syzbot to run it if you want.
The already created csses should then be rolled back via

	cgroup_restore_control(cgrp);
	cgroup_apply_control_disable(cgrp);
	  ...
	    kill_css(css)
I suspect the double-queuing is a result of the fact that there exists only the single reference to the css->refcnt. I.e. it's percpu_ref_kill_and_confirm()'d and released both at the same time.
(Normally (when not killing the last reference), css->destroy_work reuse is not a problem because of the sequenced chain css_killed_work_fn()->css_put()->css_release().)
This can be avoided by adding a check to css_release() that verifies the work has not already been enqueued.
If that's what's happening, then your patch omits the final css_release_work_fn() in favor of css_killed_work_fn() but both should be run during the rollback upon css_populate_dir() failure.
This change only prevents the double enqueue:
queue_[rcu]_work(cgroup_destroy_wq, &css->destroy_rwork);
I don't see how it affects the css_killed_work_fn() cleanup path. I didn't look at it, since I thought it was irrelevant in this case.
Hello,
On Thu, Apr 14, 2022 at 10:51:18AM -0700, Tadeusz Struk wrote:
What happened was, the write triggered: cgroup_subtree_control_write()->cgroup_apply_control()->cgroup_apply_control_enable()->css_create()
which allocates and initializes the css, then fails in cgroup_idr_alloc(), bails out, and calls queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
Yes, but this css hasn't been installed yet.
then cgroup_subtree_control_write() bails out to out_unlock:, which then goes:
cgroup_kn_unlock()->cgroup_put()->css_put()->percpu_ref_put(&css->refcnt)->percpu_ref_put_many(ref)
And this is a different css. It's cgroup->self, which isn't connected to the half-built css that got destroyed in css_create().
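For reference, cgroup_put() is roughly:

	static inline void cgroup_put(struct cgroup *cgrp)
	{
		css_put(&cgrp->self);	/* cgrp->self, not the failed css */
	}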
So, I have a bit of difficulty following this scenario. The way that the current code uses destroy_work is definitely nasty and it'd probably be a good idea to separate out the different use cases, but let's first understand what's failing.
Thanks.
On Thu, Apr 14, 2022 at 06:44:09PM +0200, Michal Koutný wrote:
I suspect the double-queuing is a result of the fact that there exists only the single reference to the css->refcnt. I.e. it's percpu_ref_kill_and_confirm()'d and released both at the same time.
(Normally (when not killing the last reference), css->destroy_work reuse is not a problem because of the sequenced chain css_killed_work_fn()->css_put()->css_release().)
If this is the case, we need to hold an extra reference to be put by the css_killed_work_fn(), right?
Thanks.
On Thu, Apr 21, 2022 at 02:00:56PM -1000, Tejun Heo <tj@kernel.org> wrote:
If this is the case, we need to hold an extra reference to be put by the css_killed_work_fn(), right?
I looked into it a bit more lately and found that there already is such a fuse in kill_css() [1].
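(Abridged from v5.10; the fuse is the extra reference taken before the kill:)

	static void kill_css(struct cgroup_subsys_state *css)
	{
		...
		css_clear_dir(css);

		/*
		 * Killing would put the base ref, but we need to keep it
		 * alive until after ->css_offline(); the extra reference
		 * is put again in css_killed_work_fn().
		 */
		css_get(css);

		percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn);
	}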
At the same time, syzbot's stack trace demonstrates the fuse is ineffective:
 css_release+0xae/0xc0 kernel/cgroup/cgroup.c:5146                    (**)
 percpu_ref_put_many include/linux/percpu-refcount.h:322 [inline]
 percpu_ref_put include/linux/percpu-refcount.h:338 [inline]
 percpu_ref_call_confirm_rcu lib/percpu-refcount.c:162 [inline]       (*)
 percpu_ref_switch_to_atomic_rcu+0x5a2/0x5b0 lib/percpu-refcount.c:199
 rcu_do_batch+0x4f8/0xbc0 kernel/rcu/tree.c:2485
 rcu_core+0x59b/0xe30 kernel/rcu/tree.c:2722
 rcu_core_si+0x9/0x10 kernel/rcu/tree.c:2735
 __do_softirq+0x27e/0x596 kernel/softirq.c:305
(*)  this calls the css_killed_ref_fn confirm_switch
(**) zero references after confirmed kill?
So, I was also looking at the possible race with css_free_rwork_fn() (from failed css_create()) but that would likely emit a warning from __percpu_ref_exit().
So, I still think there's something fishy (so far possible only via artificial ENOMEM injection) that needs an explanation...
Michal
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kern...
On 4/22/22 04:05, Michal Koutný wrote:
On Thu, Apr 21, 2022 at 02:00:56PM -1000, Tejun Heo <tj@kernel.org> wrote:
If this is the case, we need to hold an extra reference to be put by the css_killed_work_fn(), right?
I looked into it a bit more lately and found that there already is such a fuse in kill_css() [1].
At the same time, syzbot's stack trace demonstrates the fuse is ineffective:
 css_release+0xae/0xc0 kernel/cgroup/cgroup.c:5146                    (**)
 percpu_ref_put_many include/linux/percpu-refcount.h:322 [inline]
 percpu_ref_put include/linux/percpu-refcount.h:338 [inline]
 percpu_ref_call_confirm_rcu lib/percpu-refcount.c:162 [inline]       (*)
 percpu_ref_switch_to_atomic_rcu+0x5a2/0x5b0 lib/percpu-refcount.c:199
 rcu_do_batch+0x4f8/0xbc0 kernel/rcu/tree.c:2485
 rcu_core+0x59b/0xe30 kernel/rcu/tree.c:2722
 rcu_core_si+0x9/0x10 kernel/rcu/tree.c:2735
 __do_softirq+0x27e/0x596 kernel/softirq.c:305
(*)  this calls the css_killed_ref_fn confirm_switch
(**) zero references after confirmed kill?
So, I was also looking at the possible race with css_free_rwork_fn() (from failed css_create()) but that would likely emit a warning from __percpu_ref_exit().
So, I still think there's something fishy (so far possible only via artificial ENOMEM injection) that needs an explanation...
I can't reliably reproduce this issue on either mainline or v5.10, where syzbot originally found it. It still triggers for syzbot, though.
Syzbot found a corrupted list bug scenario that can be triggered from cgroup css_create(). The reproducer writes to the cgroup.subtree_control file, which invokes cgroup_apply_control_enable(), css_create(), and css_populate_dir(), which then randomly fails with a fault-injected -ENOMEM. In such a scenario the css_create() error path enqueues the css_free_rwork_fn work via RCU for a css->refcnt initialized with the css_release() destructor, and there is a chance that css_release() will be invoked for a cgroup_subsys_state for which a destroy_work has already been queued via the css_create() error path. This causes list_add corruption, as can be seen in the syzkaller report [1]. This can be fixed by separating the css_release and ref_kill paths so that they work with two separate work_structs.
[1] https://syzkaller.appspot.com/bug?id=e26e54d6eac9d9fb50b221ec3e4627b327465db...
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: cgroups@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: bpf@vger.kernel.org
Cc: stable@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reported-and-tested-by: syzbot+e42ae441c3b10acf9e9d@syzkaller.appspotmail.com
Fixes: 8f36aaec9c92 ("cgroup: Use rcu_work instead of explicit rcu and work item")
Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
---
v2: Add a separate work_struct for the css_ref_kill path instead of
    checking if a work has already been enqueued.
---
 include/linux/cgroup-defs.h |  5 +++--
 kernel/cgroup/cgroup.c      | 14 +++++++-------
 2 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1bfcfb1af352..92b0c5e8c472 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -178,8 +178,9 @@ struct cgroup_subsys_state {
 	 */
 	atomic_t online_cnt;
 
-	/* percpu_ref killing and RCU release */
-	struct work_struct destroy_work;
+	/* percpu_ref killing, css release, and RCU release work structs */
+	struct work_struct release_work;
+	struct work_struct killed_ref_work;
 	struct rcu_work destroy_rwork;
 
 	/*
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index adb820e98f24..3e00a793e15d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5099,7 +5099,7 @@ static struct cftype cgroup_base_files[] = {
  * css_free_work_fn().
  *
  * It is actually hairier because both step 2 and 4 require process context
- * and thus involve punting to css->destroy_work adding two additional
+ * and thus involve punting to css->release_work adding two additional
  * steps to the already complex sequence.
  */
 static void css_free_rwork_fn(struct work_struct *work)
@@ -5154,7 +5154,7 @@ static void css_free_rwork_fn(struct work_struct *work)
 static void css_release_work_fn(struct work_struct *work)
 {
 	struct cgroup_subsys_state *css =
-		container_of(work, struct cgroup_subsys_state, destroy_work);
+		container_of(work, struct cgroup_subsys_state, release_work);
 	struct cgroup_subsys *ss = css->ss;
 	struct cgroup *cgrp = css->cgroup;
 
@@ -5210,8 +5210,8 @@ static void css_release(struct percpu_ref *ref)
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
-	INIT_WORK(&css->destroy_work, css_release_work_fn);
-	queue_work(cgroup_destroy_wq, &css->destroy_work);
+	INIT_WORK(&css->release_work, css_release_work_fn);
+	queue_work(cgroup_destroy_wq, &css->release_work);
 }
 
 static void init_and_link_css(struct cgroup_subsys_state *css,
@@ -5546,7 +5546,7 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
 static void css_killed_work_fn(struct work_struct *work)
 {
 	struct cgroup_subsys_state *css =
-		container_of(work, struct cgroup_subsys_state, destroy_work);
+		container_of(work, struct cgroup_subsys_state, killed_ref_work);
 
 	mutex_lock(&cgroup_mutex);
 
@@ -5567,8 +5567,8 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
 	if (atomic_dec_and_test(&css->online_cnt)) {
-		INIT_WORK(&css->destroy_work, css_killed_work_fn);
-		queue_work(cgroup_destroy_wq, &css->destroy_work);
+		INIT_WORK(&css->killed_ref_work, css_killed_work_fn);
+		queue_work(cgroup_destroy_wq, &css->killed_ref_work);
 	}
 }