[PATCH] sched/psi: fix use-after-free in ep_remove_wait_queue()

List overview All Threads
Download

newer

older

stable-rc/queue/5.10 build: 184...

Linux 5.15.94

Munehisa Kamata

2 Feb 2023 2 Feb '23

3 a.m.

If a non-root cgroup gets removed when there is a thread that registered trigger and is polling on a pressure file within the cgroup, the polling waitqueue gets freed without clearing the queue and reference, then it can result in use-after-free as below. Use wake_up_pollfree() instead to ensure that the queue and reference are cleared before freeing.

BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0 Write of size 4 at addr ffff88810e625328 by task a.out/4404

CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38 Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017 Call Trace: <TASK> dump_stack_lvl+0x73/0xa0 print_report+0x16c/0x4e0 ? _printk+0x59/0x80 ? __virt_addr_valid+0xb8/0x130 ? _raw_spin_lock_irqsave+0x60/0xc0 kasan_report+0xc3/0xf0 ? _raw_spin_lock_irqsave+0x60/0xc0 kasan_check_range+0x2d2/0x310 _raw_spin_lock_irqsave+0x60/0xc0 remove_wait_queue+0x1a/0xa0 ep_free+0x12c/0x170 ep_eventpoll_release+0x26/0x30 __fput+0x202/0x400 task_work_run+0x11d/0x170 do_exit+0x495/0x1130 ? update_cfs_rq_load_avg+0x2c2/0x2e0 do_group_exit+0x100/0x100 get_signal+0xd67/0xde0 ? finish_task_switch+0x15f/0x3a0 arch_do_signal_or_restart+0x2a/0x2b0 exit_to_user_mode_prepare+0x94/0x100 syscall_exit_to_user_mode+0x20/0x40 do_syscall_64+0x52/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f8e392bfb91 Code: Unable to access opcode bytes at 0x7f8e392bfb67. RSP: 002b:00007fff261e08d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022 RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007f8e392bfb91 RDX: 0000000000000001 RSI: 00007fff261e08e8 RDI: 0000000000000004 RBP: 00007fff261e0920 R08: 0000000000400780 R09: 00007f8e3960f240 R10: 00000000000003df R11: 0000000000000246 R12: 00000000004005a0 R13: 00007fff261e0a00 R14: 0000000000000000 R15: 0000000000000000 </TASK>

Allocated by task 4404: kasan_set_track+0x3d/0x60 __kasan_kmalloc+0x85/0x90 psi_trigger_create+0x113/0x3e0 pressure_write+0x146/0x2e0 cgroup_file_write+0x11c/0x250 kernfs_fop_write_iter+0x186/0x220 vfs_write+0x3d8/0x5c0 ksys_write+0x90/0x110 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Freed by task 4407: kasan_set_track+0x3d/0x60 kasan_save_free_info+0x27/0x40 ____kasan_slab_free+0x11d/0x170 slab_free_freelist_hook+0x87/0x150 __kmem_cache_free+0xcb/0x180 psi_trigger_destroy+0x2e8/0x310 cgroup_file_release+0x4f/0xb0 kernfs_drain_open_files+0x165/0x1f0 kernfs_drain+0x162/0x1a0 __kernfs_remove+0x1fb/0x310 kernfs_remove_by_name_ns+0x95/0xe0 cgroup_addrm_files+0x67f/0x700 cgroup_destroy_locked+0x283/0x3c0 cgroup_rmdir+0x29/0x100 kernfs_iop_rmdir+0xd1/0x140 vfs_rmdir+0xfe/0x240 do_rmdir+0x13d/0x280 __x64_sys_rmdir+0x2c/0x30 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Link: https://lore.kernel.org/lkml/20230106224859.4123476-1-kamatam@amazon.com/ Fixes: 0e94682b73bf ("psi: introduce psi monitor") Cc: stable@vger.kernel.org Signed-off-by: Munehisa Kamata kamatam@amazon.com Signed-off-by: Mengchi Cheng mengcc@amazon.com --- kernel/sched/psi.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 8ac8b81bfee6..6e66c15f6450 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)

group = t->group; /* - * Wakeup waiters to stop polling. Can happen if cgroup is deleted - * from under a polling process. + * Wakeup waiters to stop polling and clear the queue to prevent it from + * being accessed later. Can happen if cgroup is deleted from under a + * polling process otherwise. */ - wake_up_interruptible(&t->event_wait); + wake_up_pollfree(&t->event_wait);

mutex_lock(&group->trigger_lock);

-- 2.38.1

Show replies by date

Eric Biggers

2 Feb 2 Feb

4:56 a.m.

On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:

...

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 8ac8b81bfee6..6e66c15f6450 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t) group = t->group; /*
* Wakeup waiters to stop polling. Can happen if cgroup is deleted
* from under a polling process.
* Wakeup waiters to stop polling and clear the queue to prevent it from
* being accessed later. Can happen if cgroup is deleted from under a
* polling process otherwise.
*/
wake_up_interruptible(&t->event_wait);

wake_up_pollfree(&t->event_wait);

mutex_lock(&group->trigger_lock);

wake_up_pollfree() should only be used in extremely rare cases. Why can't the lifetime of the waitqueue be fixed instead?

- Eric

Suren Baghdasaryan

9:11 p.m.

On Wed, Feb 1, 2023 at 8:56 PM Eric Biggers ebiggers@kernel.org wrote:

...

On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:

...
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 8ac8b81bfee6..6e66c15f6450 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
  group = t->group;
  /*
 * Wakeup waiters to stop polling. Can happen if cgroup is deleted
 * from under a polling process.
 * Wakeup waiters to stop polling and clear the queue to prevent it from
 * being accessed later. Can happen if cgroup is deleted from under a
 * polling process otherwise.
 */
wake_up_interruptible(&t->event_wait);
wake_up_pollfree(&t->event_wait);

mutex_lock(&group->trigger_lock);
wake_up_pollfree() should only be used in extremely rare cases. Why can't the lifetime of the waitqueue be fixed instead?

waitqueue lifetime in this case is linked to cgroup_file_release(), which seems appropriate to me here. Unfortunately cgroup_file_release() is not directly linked to the file's lifetime. For more details see: https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdE... . So, if we want to fix the lifetime of the waitqueue, we would have to tie cgroup_file_release() to the fput() somehow. IOW, the fix would have to be done at the cgroups or higher (kernfs?) layer. Thanks, Suren.

...

Eric

Suren Baghdasaryan

9 Feb 9 Feb

5:09 p.m.

On Thu, Feb 2, 2023 at 1:11 PM Suren Baghdasaryan surenb@google.com wrote:

...

On Wed, Feb 1, 2023 at 8:56 PM Eric Biggers ebiggers@kernel.org wrote:

...
On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:

...
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 8ac8b81bfee6..6e66c15f6450 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
  group = t->group;
  /*
 * Wakeup waiters to stop polling. Can happen if cgroup is deleted
 * from under a polling process.
 * Wakeup waiters to stop polling and clear the queue to prevent it from
 * being accessed later. Can happen if cgroup is deleted from under a
 * polling process otherwise.
 */
wake_up_interruptible(&t->event_wait);
wake_up_pollfree(&t->event_wait);

mutex_lock(&group->trigger_lock);
wake_up_pollfree() should only be used in extremely rare cases. Why can't the lifetime of the waitqueue be fixed instead?
waitqueue lifetime in this case is linked to cgroup_file_release(), which seems appropriate to me here. Unfortunately cgroup_file_release() is not directly linked to the file's lifetime. For more details see: https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdE... . So, if we want to fix the lifetime of the waitqueue, we would have to tie cgroup_file_release() to the fput() somehow. IOW, the fix would have to be done at the cgroups or higher (kernfs?) layer.

Hi Eric, Do you still object to using wake_up_pollfree() for this case? Changing higher levels to make cgroup_file_release() be tied to fput() would be ideal but I think that would be a big change for this one case. If you agree I'll Ack this patch. Thanks, Suren.

...

Thanks, Suren.

...

Eric

Eric Biggers

6:46 p.m.

On Thu, Feb 09, 2023 at 09:09:03AM -0800, Suren Baghdasaryan wrote:

...

On Thu, Feb 2, 2023 at 1:11 PM Suren Baghdasaryan surenb@google.com wrote:

...
On Wed, Feb 1, 2023 at 8:56 PM Eric Biggers ebiggers@kernel.org wrote:

...
On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:

...
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 8ac8b81bfee6..6e66c15f6450 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
  group = t->group;
  /*
 * Wakeup waiters to stop polling. Can happen if cgroup is deleted
 * from under a polling process.
 * Wakeup waiters to stop polling and clear the queue to prevent it from
 * being accessed later. Can happen if cgroup is deleted from under a
 * polling process otherwise.
 */
wake_up_interruptible(&t->event_wait);
wake_up_pollfree(&t->event_wait);

mutex_lock(&group->trigger_lock);
wake_up_pollfree() should only be used in extremely rare cases. Why can't the lifetime of the waitqueue be fixed instead?
waitqueue lifetime in this case is linked to cgroup_file_release(), which seems appropriate to me here. Unfortunately cgroup_file_release() is not directly linked to the file's lifetime. For more details see: https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdE... . So, if we want to fix the lifetime of the waitqueue, we would have to tie cgroup_file_release() to the fput() somehow. IOW, the fix would have to be done at the cgroups or higher (kernfs?) layer.
Hi Eric, Do you still object to using wake_up_pollfree() for this case? Changing higher levels to make cgroup_file_release() be tied to fput() would be ideal but I think that would be a big change for this one case. If you agree I'll Ack this patch. Thanks, Suren.

I haven't read the code closely in this case. I'm just letting you know that wake_up_pollfree() is very much a last-resort option for when the waitqueue lifetime can't be fixed. So if you want to use wake_up_pollfree(), you need to explain why no other fix is possible. For example maybe the UAPI depends on the waitqueue having a nonstandard lifetime.

- Eric

Suren Baghdasaryan

7:13 p.m.

On Thu, Feb 9, 2023 at 10:46 AM Eric Biggers ebiggers@kernel.org wrote:

...

On Thu, Feb 09, 2023 at 09:09:03AM -0800, Suren Baghdasaryan wrote:

...
On Thu, Feb 2, 2023 at 1:11 PM Suren Baghdasaryan surenb@google.com wrote:

...
On Wed, Feb 1, 2023 at 8:56 PM Eric Biggers ebiggers@kernel.org wrote:

...
On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:

...
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 8ac8b81bfee6..6e66c15f6450 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
  group = t->group;
  /*
 * Wakeup waiters to stop polling. Can happen if cgroup is deleted
 * from under a polling process.
 * Wakeup waiters to stop polling and clear the queue to prevent it from
 * being accessed later. Can happen if cgroup is deleted from under a
 * polling process otherwise.
 */
wake_up_interruptible(&t->event_wait);
wake_up_pollfree(&t->event_wait);

mutex_lock(&group->trigger_lock);
wake_up_pollfree() should only be used in extremely rare cases. Why can't the lifetime of the waitqueue be fixed instead?
waitqueue lifetime in this case is linked to cgroup_file_release(), which seems appropriate to me here. Unfortunately cgroup_file_release() is not directly linked to the file's lifetime. For more details see: https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdE... . So, if we want to fix the lifetime of the waitqueue, we would have to tie cgroup_file_release() to the fput() somehow. IOW, the fix would have to be done at the cgroups or higher (kernfs?) layer.
Hi Eric, Do you still object to using wake_up_pollfree() for this case? Changing higher levels to make cgroup_file_release() be tied to fput() would be ideal but I think that would be a big change for this one case. If you agree I'll Ack this patch. Thanks, Suren.
I haven't read the code closely in this case. I'm just letting you know that wake_up_pollfree() is very much a last-resort option for when the waitqueue lifetime can't be fixed.

Got it. Thanks for the warning. I think it can be fixed but the right fix would require a sizable higher level refactoring which might be more justifiable if we have more such cases in the future.

...

So if you want to use wake_up_pollfree(), you need to explain why no other fix is possible. For example maybe the UAPI depends on the waitqueue having a nonstandard lifetime.

I think the changelog should explain that the waitqueue lifetime in cases of non-root cgroups is tied to cgroup_file_release() callback, which in turn is not tied to file's lifetime. That's the reason for waitqueue and the file having different lifecycles. Would that suffice as the justification? Again, I'm not saying that no other fix is possible, but that the right fix would be much more complex. Thanks, Suren.

...

Eric

Suren Baghdasaryan

13 Feb 13 Feb

11:50 p.m.

On Thu, Feb 9, 2023 at 11:13 AM Suren Baghdasaryan surenb@google.com wrote:

...

On Thu, Feb 9, 2023 at 10:46 AM Eric Biggers ebiggers@kernel.org wrote:

...
On Thu, Feb 09, 2023 at 09:09:03AM -0800, Suren Baghdasaryan wrote:

...
On Thu, Feb 2, 2023 at 1:11 PM Suren Baghdasaryan surenb@google.com wrote:

...
On Wed, Feb 1, 2023 at 8:56 PM Eric Biggers ebiggers@kernel.org wrote:

...
On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:

...
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 8ac8b81bfee6..6e66c15f6450 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
  group = t->group;
  /*
 * Wakeup waiters to stop polling. Can happen if cgroup is deleted
 * from under a polling process.
 * Wakeup waiters to stop polling and clear the queue to prevent it from
 * being accessed later. Can happen if cgroup is deleted from under a
 * polling process otherwise.
 */
wake_up_interruptible(&t->event_wait);
wake_up_pollfree(&t->event_wait);

mutex_lock(&group->trigger_lock);
wake_up_pollfree() should only be used in extremely rare cases. Why can't the lifetime of the waitqueue be fixed instead?
waitqueue lifetime in this case is linked to cgroup_file_release(), which seems appropriate to me here. Unfortunately cgroup_file_release() is not directly linked to the file's lifetime. For more details see: https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdE... . So, if we want to fix the lifetime of the waitqueue, we would have to tie cgroup_file_release() to the fput() somehow. IOW, the fix would have to be done at the cgroups or higher (kernfs?) layer.
Hi Eric, Do you still object to using wake_up_pollfree() for this case? Changing higher levels to make cgroup_file_release() be tied to fput() would be ideal but I think that would be a big change for this one case. If you agree I'll Ack this patch. Thanks, Suren.
I haven't read the code closely in this case. I'm just letting you know that wake_up_pollfree() is very much a last-resort option for when the waitqueue lifetime can't be fixed.
Got it. Thanks for the warning. I think it can be fixed but the right fix would require a sizable higher level refactoring which might be more justifiable if we have more such cases in the future.

...
So if you want to use wake_up_pollfree(), you need to explain why no other fix is possible. For example maybe the UAPI depends on the waitqueue having a nonstandard lifetime.

I think the changelog should explain that the waitqueue lifetime in cases of non-root cgroups is tied to cgroup_file_release() callback, which in turn is not tied to file's lifetime. That's the reason for waitqueue and the file having different lifecycles. Would that suffice as the justification?

Ok, in the absence of objections, I would suggest resending this patch with the changelog including details about waitqueue lifetime and reasons wake_up_pollfree() is required here. Munehisa, feel free to reuse https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdE... if you find it useful. Thanks, Suren.

...

Again, I'm not saying that no other fix is possible, but that the right fix would be much more complex. Thanks, Suren.

...

Eric

Munehisa Kamata

14 Feb 14 Feb

7:04 a.m.

New subject: [PATCH v2] sched/psi: fix use-after-free in ep_remove_wait_queue()

do_rmdir cgroup_rmdir kernfs_drain_open_files cgroup_file_release cgroup_pressure_release psi_trigger_destroy

However, the polling thread can keep having the last reference to the pressure file that is tied to the freed waitqueue until explicit close or exit later.

fput ep_eventpoll_release ep_free ep_remove_wait_queue remove_wait_queue

Then, the thread accesses to the already-freed waitqueue when dropping the reference and results in use-after-free as pasted below.

The fundamental problem here is that the lifetime of the waitqueue is not tied to the file's real lifetime as shown above. Using wake_up_pollfree() here might be less than ideal, but it also is not fully contradicting the comment at commit 42288cb44c4b ("wait: add wake_up_pollfree()") since the waitqueue's lifetime is not tied to file's one and can be considered as another special case. While this would be fixable by somehow making cgroup_file_release() be tied to the fput(), it would require sizable refactoring at cgroups or higher layer which might be more justifiable if we identify more cases like this.

BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0 Write of size 4 at addr ffff88810e625328 by task a.out/4404

v2: updated commit message

mutex_lock(&group->trigger_lock);

-- 2.38.1

Suren Baghdasaryan

5:10 p.m.

New subject: [PATCH v2] sched/psi: fix use-after-free in ep_remove_wait_queue()

Thanks! Overall LGTM, just a couple of nits (simplifications):

On Mon, Feb 13, 2023 at 11:04 PM Munehisa Kamata kamatam@amazon.com wrote:

...

If a non-root cgroup gets removed when there is a thread that registered trigger and is polling on a pressure file within the cgroup, the polling waitqueue gets freed without clearing the queue and reference in the following path.

Let's remove "without clearing the queue and reference" in the above sentence. The next section explains why this is problematic, therefore mentioning that here is unnecessary IMHO.

...

do_rmdir cgroup_rmdir kernfs_drain_open_files cgroup_file_release cgroup_pressure_release psi_trigger_destroy

However, the polling thread can keep having the last reference to the pressure file that is tied to the freed waitqueue until explicit close or exit later.

Suggest replacing: However, the polling thread still has a reference to the pressure file it is polling and will access the freed waitqueue when file is closed or upon exit:

...

fput ep_eventpoll_release ep_free ep_remove_wait_queue remove_wait_queue

Then, the thread accesses to the already-freed waitqueue when dropping the reference and results in use-after-free as pasted below.

Suggest replacing: This results is use-after-free as pasted below.

...

The fundamental problem here is that the lifetime of the waitqueue is not tied to the file's real lifetime as shown above.

The fundamental problem here is that cgroup_file_release() (and consequently waitqueue's lifetime) is not tied to the file's real lifetime.

...

Using wake_up_pollfree() here might be less than ideal, but it also is not fully contradicting the comment at commit 42288cb44c4b ("wait: add wake_up_pollfree()") since the waitqueue's lifetime is not tied to file's one and can be considered as another special case. While this would be fixable by somehow making cgroup_file_release() be tied to the fput(), it would require sizable refactoring at cgroups or higher layer which might be more justifiable if we identify more cases like this.

BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0 Write of size 4 at addr ffff88810e625328 by task a.out/4404

CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38 Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017 Call Trace:

<TASK> dump_stack_lvl+0x73/0xa0 print_report+0x16c/0x4e0 ? _printk+0x59/0x80 ? __virt_addr_valid+0xb8/0x130 ? _raw_spin_lock_irqsave+0x60/0xc0 kasan_report+0xc3/0xf0 ? _raw_spin_lock_irqsave+0x60/0xc0 kasan_check_range+0x2d2/0x310 _raw_spin_lock_irqsave+0x60/0xc0 remove_wait_queue+0x1a/0xa0 ep_free+0x12c/0x170 ep_eventpoll_release+0x26/0x30 __fput+0x202/0x400 task_work_run+0x11d/0x170 do_exit+0x495/0x1130 ? update_cfs_rq_load_avg+0x2c2/0x2e0 do_group_exit+0x100/0x100 get_signal+0xd67/0xde0 ? finish_task_switch+0x15f/0x3a0 arch_do_signal_or_restart+0x2a/0x2b0 exit_to_user_mode_prepare+0x94/0x100 syscall_exit_to_user_mode+0x20/0x40 do_syscall_64+0x52/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f8e392bfb91 Code: Unable to access opcode bytes at 0x7f8e392bfb67. RSP: 002b:00007fff261e08d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022 RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007f8e392bfb91 RDX: 0000000000000001 RSI: 00007fff261e08e8 RDI: 0000000000000004 RBP: 00007fff261e0920 R08: 0000000000400780 R09: 00007f8e3960f240 R10: 00000000000003df R11: 0000000000000246 R12: 00000000004005a0 R13: 00007fff261e0a00 R14: 0000000000000000 R15: 0000000000000000 </TASK>

Allocated by task 4404: kasan_set_track+0x3d/0x60 __kasan_kmalloc+0x85/0x90 psi_trigger_create+0x113/0x3e0 pressure_write+0x146/0x2e0 cgroup_file_write+0x11c/0x250 kernfs_fop_write_iter+0x186/0x220 vfs_write+0x3d8/0x5c0 ksys_write+0x90/0x110 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Freed by task 4407: kasan_set_track+0x3d/0x60 kasan_save_free_info+0x27/0x40 ____kasan_slab_free+0x11d/0x170 slab_free_freelist_hook+0x87/0x150 __kmem_cache_free+0xcb/0x180 psi_trigger_destroy+0x2e8/0x310 cgroup_file_release+0x4f/0xb0 kernfs_drain_open_files+0x165/0x1f0 kernfs_drain+0x162/0x1a0 __kernfs_remove+0x1fb/0x310 kernfs_remove_by_name_ns+0x95/0xe0 cgroup_addrm_files+0x67f/0x700 cgroup_destroy_locked+0x283/0x3c0 cgroup_rmdir+0x29/0x100 kernfs_iop_rmdir+0xd1/0x140 vfs_rmdir+0xfe/0x240 do_rmdir+0x13d/0x280 __x64_sys_rmdir+0x2c/0x30 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd

v2: updated commit message

Link: https://lore.kernel.org/lkml/20230106224859.4123476-1-kamatam@amazon.com/ Fixes: 0e94682b73bf ("psi: introduce psi monitor") Cc: stable@vger.kernel.org Signed-off-by: Munehisa Kamata kamatam@amazon.com Signed-off-by: Mengchi Cheng mengcc@amazon.com

kernel/sched/psi.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 8ac8b81bfee6..6e66c15f6450 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
    group = t->group;
    /*
   * Wakeup waiters to stop polling. Can happen if cgroup is deleted
   * from under a polling process.
   * Wakeup waiters to stop polling and clear the queue to prevent it from
   * being accessed later. Can happen if cgroup is deleted from under a
   * polling process otherwise.

This "otherwise" at the end seems extra. Was there a continuation of this comment which was removed without removing this "otherwise" ?

...

*/

  wake_up_interruptible(&t->event_wait);

  wake_up_pollfree(&t->event_wait);

  mutex_lock(&group->trigger_lock);

-- 2.38.1

Munehisa Kamata

6:13 p.m.

New subject: [PATCH v3] sched/psi: fix use-after-free in ep_remove_wait_queue()

If a non-root cgroup gets removed when there is a thread that registered trigger and is polling on a pressure file within the cgroup, the polling waitqueue gets freed in the following path.

do_rmdir cgroup_rmdir kernfs_drain_open_files cgroup_file_release cgroup_pressure_release psi_trigger_destroy

However, the polling thread still has a reference to the pressure file and will access the freed waitqueue when the file is closed or upon exit.

fput ep_eventpoll_release ep_free ep_remove_wait_queue remove_wait_queue

This results in use-after-free as pasted below.

The fundamental problem here is that cgroup_file_release() (and consequently waitqueue's lifetime) is not tied to the file's real lifetime. Using wake_up_pollfree() here might be less than ideal, but it also is not fully contradicting the comment at commit 42288cb44c4b ("wait: add wake_up_pollfree()") since the waitqueue's lifetime is not tied to file's one and can be considered as another special case. While this would be fixable by somehow making cgroup_file_release() be tied to the fput(), it would require sizable refactoring at cgroups or higher layer which might be more justifiable if we identify more cases like this.

BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0 Write of size 4 at addr ffff88810e625328 by task a.out/4404

v3: updated commit message and the comment in the code v2: updated commit message

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 8ac8b81bfee6..02e011cabe91 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)

group = t->group; /* - * Wakeup waiters to stop polling. Can happen if cgroup is deleted - * from under a polling process. + * Wakeup waiters to stop polling and clear the queue to prevent it from + * being accessed later. Can happen if cgroup is deleted from under a + * polling process. */ - wake_up_interruptible(&t->event_wait); + wake_up_pollfree(&t->event_wait);

mutex_lock(&group->trigger_lock);

-- 2.38.1

Suren Baghdasaryan

6:28 p.m.

New subject: [PATCH v3] sched/psi: fix use-after-free in ep_remove_wait_queue()

On Tue, Feb 14, 2023 at 10:13 AM Munehisa Kamata kamatam@amazon.com wrote:

...

If a non-root cgroup gets removed when there is a thread that registered trigger and is polling on a pressure file within the cgroup, the polling waitqueue gets freed in the following path.

do_rmdir cgroup_rmdir kernfs_drain_open_files cgroup_file_release cgroup_pressure_release psi_trigger_destroy

However, the polling thread still has a reference to the pressure file and will access the freed waitqueue when the file is closed or upon exit.

fput ep_eventpoll_release ep_free ep_remove_wait_queue remove_wait_queue

This results in use-after-free as pasted below.

The fundamental problem here is that cgroup_file_release() (and consequently waitqueue's lifetime) is not tied to the file's real lifetime. Using wake_up_pollfree() here might be less than ideal, but it also is not fully contradicting the comment at commit 42288cb44c4b ("wait: add wake_up_pollfree()") since the waitqueue's lifetime is not tied to file's one and can be considered as another special case. While this would be fixable by somehow making cgroup_file_release() be tied to the fput(), it would require sizable refactoring at cgroups or higher layer which might be more justifiable if we identify more cases like this.

BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0 Write of size 4 at addr ffff88810e625328 by task a.out/4404

CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38 Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017 Call Trace:

<TASK> dump_stack_lvl+0x73/0xa0 print_report+0x16c/0x4e0 ? _printk+0x59/0x80 ? __virt_addr_valid+0xb8/0x130 ? _raw_spin_lock_irqsave+0x60/0xc0 kasan_report+0xc3/0xf0 ? _raw_spin_lock_irqsave+0x60/0xc0 kasan_check_range+0x2d2/0x310 _raw_spin_lock_irqsave+0x60/0xc0 remove_wait_queue+0x1a/0xa0 ep_free+0x12c/0x170 ep_eventpoll_release+0x26/0x30 __fput+0x202/0x400 task_work_run+0x11d/0x170 do_exit+0x495/0x1130 ? update_cfs_rq_load_avg+0x2c2/0x2e0 do_group_exit+0x100/0x100 get_signal+0xd67/0xde0 ? finish_task_switch+0x15f/0x3a0 arch_do_signal_or_restart+0x2a/0x2b0 exit_to_user_mode_prepare+0x94/0x100 syscall_exit_to_user_mode+0x20/0x40 do_syscall_64+0x52/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f8e392bfb91 Code: Unable to access opcode bytes at 0x7f8e392bfb67. RSP: 002b:00007fff261e08d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022 RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007f8e392bfb91 RDX: 0000000000000001 RSI: 00007fff261e08e8 RDI: 0000000000000004 RBP: 00007fff261e0920 R08: 0000000000400780 R09: 00007f8e3960f240 R10: 00000000000003df R11: 0000000000000246 R12: 00000000004005a0 R13: 00007fff261e0a00 R14: 0000000000000000 R15: 0000000000000000 </TASK>

Allocated by task 4404: kasan_set_track+0x3d/0x60 __kasan_kmalloc+0x85/0x90 psi_trigger_create+0x113/0x3e0 pressure_write+0x146/0x2e0 cgroup_file_write+0x11c/0x250 kernfs_fop_write_iter+0x186/0x220 vfs_write+0x3d8/0x5c0 ksys_write+0x90/0x110 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Freed by task 4407: kasan_set_track+0x3d/0x60 kasan_save_free_info+0x27/0x40 ____kasan_slab_free+0x11d/0x170 slab_free_freelist_hook+0x87/0x150 __kmem_cache_free+0xcb/0x180 psi_trigger_destroy+0x2e8/0x310 cgroup_file_release+0x4f/0xb0 kernfs_drain_open_files+0x165/0x1f0 kernfs_drain+0x162/0x1a0 __kernfs_remove+0x1fb/0x310 kernfs_remove_by_name_ns+0x95/0xe0 cgroup_addrm_files+0x67f/0x700 cgroup_destroy_locked+0x283/0x3c0 cgroup_rmdir+0x29/0x100 kernfs_iop_rmdir+0xd1/0x140 vfs_rmdir+0xfe/0x240 do_rmdir+0x13d/0x280 __x64_sys_rmdir+0x2c/0x30 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd

v3: updated commit message and the comment in the code v2: updated commit message

Link: https://lore.kernel.org/lkml/20230106224859.4123476-1-kamatam@amazon.com/ Fixes: 0e94682b73bf ("psi: introduce psi monitor") Cc: stable@vger.kernel.org Signed-off-by: Munehisa Kamata kamatam@amazon.com Signed-off-by: Mengchi Cheng mengcc@amazon.com

Acked-by: Suren Baghdasaryan surenb@google.com

...

kernel/sched/psi.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 8ac8b81bfee6..02e011cabe91 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
    group = t->group;
    /*
   * Wakeup waiters to stop polling. Can happen if cgroup is deleted
   * from under a polling process.
   * Wakeup waiters to stop polling and clear the queue to prevent it from
   * being accessed later. Can happen if cgroup is deleted from under a
   * polling process.
   */
  wake_up_interruptible(&t->event_wait);
  wake_up_pollfree(&t->event_wait);

  mutex_lock(&group->trigger_lock);
-- 2.38.1

Suren Baghdasaryan

6:29 p.m.

New subject: [PATCH v3] sched/psi: fix use-after-free in ep_remove_wait_queue()

On Tue, Feb 14, 2023 at 10:28 AM Suren Baghdasaryan surenb@google.com wrote:

...

On Tue, Feb 14, 2023 at 10:13 AM Munehisa Kamata kamatam@amazon.com wrote:

...
If a non-root cgroup gets removed when there is a thread that registered trigger and is polling on a pressure file within the cgroup, the polling waitqueue gets freed in the following path.

do_rmdir cgroup_rmdir kernfs_drain_open_files cgroup_file_release cgroup_pressure_release psi_trigger_destroy

However, the polling thread still has a reference to the pressure file and will access the freed waitqueue when the file is closed or upon exit.

fput ep_eventpoll_release ep_free ep_remove_wait_queue remove_wait_queue

This results in use-after-free as pasted below.

The fundamental problem here is that cgroup_file_release() (and consequently waitqueue's lifetime) is not tied to the file's real lifetime. Using wake_up_pollfree() here might be less than ideal, but it also is not fully contradicting the comment at commit 42288cb44c4b ("wait: add wake_up_pollfree()") since the waitqueue's lifetime is not tied to file's one and can be considered as another special case. While this would be fixable by somehow making cgroup_file_release() be tied to the fput(), it would require sizable refactoring at cgroups or higher layer which might be more justifiable if we identify more cases like this.

BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0 Write of size 4 at addr ffff88810e625328 by task a.out/4404

CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38 Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017 Call Trace:

<TASK> dump_stack_lvl+0x73/0xa0 print_report+0x16c/0x4e0 ? _printk+0x59/0x80 ? __virt_addr_valid+0xb8/0x130 ? _raw_spin_lock_irqsave+0x60/0xc0 kasan_report+0xc3/0xf0 ? _raw_spin_lock_irqsave+0x60/0xc0 kasan_check_range+0x2d2/0x310 _raw_spin_lock_irqsave+0x60/0xc0 remove_wait_queue+0x1a/0xa0 ep_free+0x12c/0x170 ep_eventpoll_release+0x26/0x30 __fput+0x202/0x400 task_work_run+0x11d/0x170 do_exit+0x495/0x1130 ? update_cfs_rq_load_avg+0x2c2/0x2e0 do_group_exit+0x100/0x100 get_signal+0xd67/0xde0 ? finish_task_switch+0x15f/0x3a0 arch_do_signal_or_restart+0x2a/0x2b0 exit_to_user_mode_prepare+0x94/0x100 syscall_exit_to_user_mode+0x20/0x40 do_syscall_64+0x52/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f8e392bfb91 Code: Unable to access opcode bytes at 0x7f8e392bfb67. RSP: 002b:00007fff261e08d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022 RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007f8e392bfb91 RDX: 0000000000000001 RSI: 00007fff261e08e8 RDI: 0000000000000004 RBP: 00007fff261e0920 R08: 0000000000400780 R09: 00007f8e3960f240 R10: 00000000000003df R11: 0000000000000246 R12: 00000000004005a0 R13: 00007fff261e0a00 R14: 0000000000000000 R15: 0000000000000000 </TASK>

Allocated by task 4404: kasan_set_track+0x3d/0x60 __kasan_kmalloc+0x85/0x90 psi_trigger_create+0x113/0x3e0 pressure_write+0x146/0x2e0 cgroup_file_write+0x11c/0x250 kernfs_fop_write_iter+0x186/0x220 vfs_write+0x3d8/0x5c0 ksys_write+0x90/0x110 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Freed by task 4407: kasan_set_track+0x3d/0x60 kasan_save_free_info+0x27/0x40 ____kasan_slab_free+0x11d/0x170 slab_free_freelist_hook+0x87/0x150 __kmem_cache_free+0xcb/0x180 psi_trigger_destroy+0x2e8/0x310 cgroup_file_release+0x4f/0xb0 kernfs_drain_open_files+0x165/0x1f0 kernfs_drain+0x162/0x1a0 __kernfs_remove+0x1fb/0x310 kernfs_remove_by_name_ns+0x95/0xe0 cgroup_addrm_files+0x67f/0x700 cgroup_destroy_locked+0x283/0x3c0 cgroup_rmdir+0x29/0x100 kernfs_iop_rmdir+0xd1/0x140 vfs_rmdir+0xfe/0x240 do_rmdir+0x13d/0x280 __x64_sys_rmdir+0x2c/0x30 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd

v3: updated commit message and the comment in the code v2: updated commit message

Link: https://lore.kernel.org/lkml/20230106224859.4123476-1-kamatam@amazon.com/ Fixes: 0e94682b73bf ("psi: introduce psi monitor") Cc: stable@vger.kernel.org Signed-off-by: Munehisa Kamata kamatam@amazon.com Signed-off-by: Mengchi Cheng mengcc@amazon.com

Acked-by: Suren Baghdasaryan surenb@google.com

CC'ing PeterZ directly for inclusion into his tree.

...

...
kernel/sched/psi.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 8ac8b81bfee6..02e011cabe91 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
    group = t->group;
    /*
   * Wakeup waiters to stop polling. Can happen if cgroup is deleted
   * from under a polling process.
   * Wakeup waiters to stop polling and clear the queue to prevent it from
   * being accessed later. Can happen if cgroup is deleted from under a
   * polling process.
   */
  wake_up_interruptible(&t->event_wait);
  wake_up_pollfree(&t->event_wait);

  mutex_lock(&group->trigger_lock);
-- 2.38.1

Eric Biggers

6:55 p.m.

New subject: [PATCH v3] sched/psi: fix use-after-free in ep_remove_wait_queue()

On Tue, Feb 14, 2023 at 10:13:35AM -0800, Munehisa Kamata wrote:

...

Using wake_up_pollfree() here might be less than ideal, but it also is not fully contradicting the comment at commit 42288cb44c4b ("wait: add wake_up_pollfree()") since the waitqueue's lifetime is not tied to file's one and can be considered as another special case.

If that comment is outdated now, then please update the comment too. We can do better than "not fully contradicting".

- Eric

Suren Baghdasaryan

7:13 p.m.

New subject: [PATCH v3] sched/psi: fix use-after-free in ep_remove_wait_queue()

On Tue, Feb 14, 2023 at 10:55 AM Eric Biggers ebiggers@kernel.org wrote:

...

On Tue, Feb 14, 2023 at 10:13:35AM -0800, Munehisa Kamata wrote:

...
Using wake_up_pollfree() here might be less than ideal, but it also is not fully contradicting the comment at commit 42288cb44c4b ("wait: add wake_up_pollfree()") since the waitqueue's lifetime is not tied to file's one and can be considered as another special case.

If that comment is outdated now, then please update the comment too. We can do better than "not fully contradicting".

Agree. I don't see it contradicting that comment.

...

Eric

1041

days inactive

1053

days old

linux-stable-mirror@lists.linaro.org

13 comments

participants

tags (0)

participants (3)

Eric Biggers
Munehisa Kamata
Suren Baghdasaryan