In our production environment, we found many hung tasks which are blocked for more than 18 hours. Their call traces are like this:

[346278.191038] __schedule+0x2d8/0x890
[346278.191046] schedule+0x4e/0xb0
[346278.191049] perf_event_free_task+0x220/0x270
[346278.191056] ? init_wait_var_entry+0x50/0x50
[346278.191060] copy_process+0x663/0x18d0
[346278.191068] kernel_clone+0x9d/0x3d0
[346278.191072] __do_sys_clone+0x5d/0x80
[346278.191076] __x64_sys_clone+0x25/0x30
[346278.191079] do_syscall_64+0x5c/0xc0
[346278.191083] ? syscall_exit_to_user_mode+0x27/0x50
[346278.191086] ? do_syscall_64+0x69/0xc0
[346278.191088] ? irqentry_exit_to_user_mode+0x9/0x20
[346278.191092] ? irqentry_exit+0x19/0x30
[346278.191095] ? exc_page_fault+0x89/0x160
[346278.191097] ? asm_exc_page_fault+0x8/0x30
[346278.191102] entry_SYSCALL_64_after_hwframe+0x44/0xae

The task was waiting for the refcount to become 1, but the vmcore shows that the refcount was already 1. It seems that the task didn't get woken up by perf_event_release_kernel() and got stuck forever. The scenario below may cause the problem.

Thread A                                        Thread B
...                                             ...
perf_event_free_task                            perf_event_release_kernel
                                                   ...
                                                   acquire event->child_mutex
                                                   ...
                                                   get_ctx
   ...                                             release event->child_mutex
   acquire ctx->mutex
   ...
   perf_free_event (acquire/release event->child_mutex)
   ...
   release ctx->mutex
   wait_var_event
                                                   acquire ctx->mutex
                                                   acquire event->child_mutex
                                                   # move existing events to free_list
                                                   release event->child_mutex
                                                   release ctx->mutex
                                                   put_ctx
...                                             ...

In this case, all events of the ctx have been freed, so we couldn't find the ctx in free_list and Thread A will miss the wakeup. It's thus necessary to add a wakeup after dropping the reference.

Fixes: 1cf8dfe8a661 ("perf/core: Fix race between close() and fork()")
Cc: stable@vger.kernel.org
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
---
Changes since v1:
- Add the Fixes tag.
- Simplify v1's patch. (Frederic)

Changes since v2:
- Use Reviewed-by tag instead of Signed-off-by tag.

Changes since v3:
- Add Acked-by tag.
- Cc stable@vger.kernel.org. (Mark)
---
 kernel/events/core.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4f0c45ab8d7d..15c35070db6a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5340,6 +5340,7 @@ int perf_event_release_kernel(struct perf_event *event)
 again:
 	mutex_lock(&event->child_mutex);
 	list_for_each_entry(child, &event->child_list, child_list) {
+		void *var = NULL;
 
 		/*
 		 * Cannot change, child events are not migrated, see the
@@ -5380,11 +5381,23 @@ int perf_event_release_kernel(struct perf_event *event)
 			 * this can't be the last reference.
 			 */
 			put_event(event);
+		} else {
+			var = &ctx->refcount;
 		}
 
 		mutex_unlock(&event->child_mutex);
 		mutex_unlock(&ctx->mutex);
 		put_ctx(ctx);
+
+		if (var) {
+			/*
+			 * If perf_event_free_task() has deleted all events from the
+			 * ctx while the child_mutex got released above, make sure to
+			 * notify about the preceding put_ctx().
+			 */
+			smp_mb(); /* pairs with wait_var_event() */
+			wake_up_var(var);
+		}
 		goto again;
 	}
 	mutex_unlock(&event->child_mutex);
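
The race from the commit message can be modeled in user space. The following is only a minimal sketch, assuming a pthread mutex and condition variable as stand-ins for wait_var_event() and wake_up_var(); struct ctx_model and all function names here are hypothetical and do not exist in the kernel. It shows why the side that drops the reference also has to notify the waiter.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for perf_event_context: only the refcount matters. */
struct ctx_model {
	pthread_mutex_t lock;
	pthread_cond_t cond;	/* plays the role of the wait_var_event() queue */
	int refcount;
};

void ctx_model_init(struct ctx_model *ctx, int refs)
{
	pthread_mutex_init(&ctx->lock, NULL);
	pthread_cond_init(&ctx->cond, NULL);
	ctx->refcount = refs;
}

/* Drop one reference and, if asked, notify waiters (put_ctx + wake_up_var). */
void ctx_model_put(struct ctx_model *ctx, bool wake)
{
	pthread_mutex_lock(&ctx->lock);
	assert(ctx->refcount > 1);	/* the waiter always keeps its own reference */
	ctx->refcount--;
	if (wake)
		pthread_cond_broadcast(&ctx->cond);
	pthread_mutex_unlock(&ctx->lock);
}

/* Block until only the caller's reference remains (Thread A in the diagram). */
void ctx_model_wait_last_ref(struct ctx_model *ctx)
{
	pthread_mutex_lock(&ctx->lock);
	while (ctx->refcount != 1)
		pthread_cond_wait(&ctx->cond, &ctx->lock);
	pthread_mutex_unlock(&ctx->lock);
}

/* Thread B in the diagram: drops its temporary reference and wakes the waiter. */
static void *releaser(void *arg)
{
	ctx_model_put(arg, true);
	return NULL;
}

/* Runs waiter and releaser once; returns the refcount seen after both finish. */
int run_model(void)
{
	struct ctx_model ctx;
	pthread_t t;
	int refs;

	ctx_model_init(&ctx, 2);	/* one ref for the waiter, one for the releaser */
	pthread_create(&t, NULL, releaser, &ctx);
	ctx_model_wait_last_ref(&ctx);
	pthread_join(t, NULL);
	refs = ctx.refcount;
	pthread_cond_destroy(&ctx.cond);
	pthread_mutex_destroy(&ctx.lock);
	return refs;
}
```

Calling ctx_model_put() with wake set to false while the waiter is already blocked corresponds to the unpatched path: the refcount reaches 1, but the waiter can stay asleep because nothing notifies it.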
On Mon, May 13, 2024 at 10:39:48AM +0000, Haifeng Xu wrote:
> In our production environment, we found many hung tasks which are blocked for more than 18 hours. Their call traces are like this:
>
> [346278.191038] __schedule+0x2d8/0x890
> [346278.191046] schedule+0x4e/0xb0
> [346278.191049] perf_event_free_task+0x220/0x270
> [346278.191056] ? init_wait_var_entry+0x50/0x50
> [346278.191060] copy_process+0x663/0x18d0
> [346278.191068] kernel_clone+0x9d/0x3d0
> [346278.191072] __do_sys_clone+0x5d/0x80
> [346278.191076] __x64_sys_clone+0x25/0x30
> [346278.191079] do_syscall_64+0x5c/0xc0
> [346278.191083] ? syscall_exit_to_user_mode+0x27/0x50
> [346278.191086] ? do_syscall_64+0x69/0xc0
> [346278.191088] ? irqentry_exit_to_user_mode+0x9/0x20
> [346278.191092] ? irqentry_exit+0x19/0x30
> [346278.191095] ? exc_page_fault+0x89/0x160
> [346278.191097] ? asm_exc_page_fault+0x8/0x30
> [346278.191102] entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> The task was waiting for the refcount to become 1, but the vmcore shows that the refcount was already 1. It seems that the task didn't get woken up by perf_event_release_kernel() and got stuck forever. The scenario below may cause the problem.
>
> Thread A                                        Thread B
> ...                                             ...
> perf_event_free_task                            perf_event_release_kernel
>                                                    ...
>                                                    acquire event->child_mutex
>                                                    ...
>                                                    get_ctx
>    ...                                             release event->child_mutex
>    acquire ctx->mutex
>    ...
>    perf_free_event (acquire/release event->child_mutex)
>    ...
>    release ctx->mutex
>    wait_var_event
>                                                    acquire ctx->mutex
>                                                    acquire event->child_mutex
>                                                    # move existing events to free_list
>                                                    release event->child_mutex
>                                                    release ctx->mutex
>                                                    put_ctx
> ...                                             ...
>
> In this case, all events of the ctx have been freed, so we couldn't find the ctx in free_list and Thread A will miss the wakeup. It's thus necessary to add a wakeup after dropping the reference.
>
> Fixes: 1cf8dfe8a661 ("perf/core: Fix race between close() and fork()")
> Cc: stable@vger.kernel.org
> Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
> Acked-by: Mark Rutland <mark.rutland@arm.com>

Thanks!, I'll hang onto this until after the merge window and then stick it in tip/perf/urgent or somesuch.
On Thu, May 16, 2024 at 10:51:06AM +0200, Peter Zijlstra wrote:
> Thanks!, I'll hang onto this until after the merge window and then stick it in tip/perf/urgent or somesuch.

Just to check -- I couldn't spot this in tip/perf/urgent just now; are you still happy to pick this up?
Mark.
The following commit has been merged into the perf/urgent branch of tip:

Commit-ID:     74751ef5c1912ebd3e65c3b65f45587e05ce5d36
Gitweb:        https://git.kernel.org/tip/74751ef5c1912ebd3e65c3b65f45587e05ce5d36
Author:        Haifeng Xu <haifeng.xu@shopee.com>
AuthorDate:    Mon, 13 May 2024 10:39:48
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 05 Jun 2024 15:52:33 +02:00

perf/core: Fix missing wakeup when waiting for context reference
In our production environment, we found many hung tasks which are blocked for more than 18 hours. Their call traces are like this:

[346278.191038] __schedule+0x2d8/0x890
[346278.191046] schedule+0x4e/0xb0
[346278.191049] perf_event_free_task+0x220/0x270
[346278.191056] ? init_wait_var_entry+0x50/0x50
[346278.191060] copy_process+0x663/0x18d0
[346278.191068] kernel_clone+0x9d/0x3d0
[346278.191072] __do_sys_clone+0x5d/0x80
[346278.191076] __x64_sys_clone+0x25/0x30
[346278.191079] do_syscall_64+0x5c/0xc0
[346278.191083] ? syscall_exit_to_user_mode+0x27/0x50
[346278.191086] ? do_syscall_64+0x69/0xc0
[346278.191088] ? irqentry_exit_to_user_mode+0x9/0x20
[346278.191092] ? irqentry_exit+0x19/0x30
[346278.191095] ? exc_page_fault+0x89/0x160
[346278.191097] ? asm_exc_page_fault+0x8/0x30
[346278.191102] entry_SYSCALL_64_after_hwframe+0x44/0xae

The task was waiting for the refcount to become 1, but the vmcore shows that the refcount was already 1. It seems that the task didn't get woken up by perf_event_release_kernel() and got stuck forever. The scenario below may cause the problem.

Thread A                                        Thread B
...                                             ...
perf_event_free_task                            perf_event_release_kernel
                                                   ...
                                                   acquire event->child_mutex
                                                   ...
                                                   get_ctx
   ...                                             release event->child_mutex
   acquire ctx->mutex
   ...
   perf_free_event (acquire/release event->child_mutex)
   ...
   release ctx->mutex
   wait_var_event
                                                   acquire ctx->mutex
                                                   acquire event->child_mutex
                                                   # move existing events to free_list
                                                   release event->child_mutex
                                                   release ctx->mutex
                                                   put_ctx
...                                             ...

In this case, all events of the ctx have been freed, so we couldn't find the ctx in free_list and Thread A will miss the wakeup. It's thus necessary to add a wakeup after dropping the reference.

Fixes: 1cf8dfe8a661 ("perf/core: Fix race between close() and fork()")
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20240513103948.33570-1-haifeng.xu@shopee.com
---
 kernel/events/core.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index f0128c5..8f908f0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5384,6 +5384,7 @@ int perf_event_release_kernel(struct perf_event *event)
 again:
 	mutex_lock(&event->child_mutex);
 	list_for_each_entry(child, &event->child_list, child_list) {
+		void *var = NULL;
 
 		/*
 		 * Cannot change, child events are not migrated, see the
@@ -5424,11 +5425,23 @@ again:
 			 * this can't be the last reference.
 			 */
 			put_event(event);
+		} else {
+			var = &ctx->refcount;
 		}
 
 		mutex_unlock(&event->child_mutex);
 		mutex_unlock(&ctx->mutex);
 		put_ctx(ctx);
+
+		if (var) {
+			/*
+			 * If perf_event_free_task() has deleted all events from the
+			 * ctx while the child_mutex got released above, make sure to
+			 * notify about the preceding put_ctx().
+			 */
+			smp_mb(); /* pairs with wait_var_event() */
+			wake_up_var(var);
+		}
 		goto again;
 	}
 	mutex_unlock(&event->child_mutex);
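
The "pairs with wait_var_event()" comment in the diff refers to a store-then-load ordering on both sides. The sketch below is a user-space illustration using C11 atomics, not kernel code: struct var_model and its functions are hypothetical names, and it assumes (as the comment in the patch suggests) that the waker must issue a full barrier between dropping the reference and checking for a queued waiter. With a full fence on both sides, it should not be possible for the waiter to decide to sleep while the waker simultaneously sees no waiter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: a refcount plus a flag standing in for the wait queue. */
struct var_model {
	atomic_int refcount;
	atomic_bool waiter_queued;
};

void var_model_init(struct var_model *v, int refs)
{
	assert(refs >= 1);
	atomic_init(&v->refcount, refs);
	atomic_init(&v->waiter_queued, false);
}

/*
 * Waiter side (models wait_var_event()): queue first, full fence, then
 * re-check the condition. Returns true if the waiter would go to sleep.
 */
bool waiter_should_sleep(struct var_model *v)
{
	atomic_store_explicit(&v->waiter_queued, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&v->refcount, memory_order_relaxed) != 1;
}

/*
 * Waker side (models put_ctx(); smp_mb(); wake_up_var()): drop the
 * reference, full fence, then look for a queued waiter. Returns true if a
 * wakeup would be delivered.
 */
bool put_and_check_waiter(struct var_model *v)
{
	atomic_fetch_sub_explicit(&v->refcount, 1, memory_order_release);
	atomic_thread_fence(memory_order_seq_cst);	/* models smp_mb() */
	return atomic_load_explicit(&v->waiter_queued, memory_order_relaxed);
}
```

If the waiter runs first it sleeps and the waker then finds it queued; if the waker runs first the waiter sees the refcount at 1 and never sleeps. Without the waker-side fence, the load of the waiter flag could be reordered before the decrement becomes visible, which is the case the smp_mb() in the patch is there to rule out.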