On Thu, Sep 18, 2025 at 5:58 PM Leon Romanovsky leon@kernel.org wrote:
On Wed, Sep 17, 2025 at 12:30:56PM -0700, yanjun.zhu wrote:
On 9/17/25 3:06 AM, Gui-Dong Han wrote:
When do_task() exhausts its RXE_MAX_ITERATIONS budget, it unconditionally
From the source code, it will check ret value, then set it to TASK_STATE_IDLE, not unconditionally.
sets the task state to TASK_STATE_IDLE to reschedule. This overwrites the TASK_STATE_DRAINING state that may have been concurrently set by rxe_cleanup_task() or rxe_disable_task().
From the source code, there is a spin lock to protect the state. It will not make race condition.
This race condition breaks the cleanup and disable logic, which expects the task to stop processing new work. The cleanup code may proceed while do_task() reschedules itself, leading to a potential use-after-free.
Can you post the call trace when this problem occurred?
Hi, Jason && Leon
Please comment on this problem.
The idea to recheck task->state looks correct to me, otherwise we overwrite it unconditionally. However I would write this patch slightly different (without cont = 1):
diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c index 6f8f353e95838..2ff5d7cc0a933 100644 --- a/drivers/infiniband/sw/rxe/rxe_task.c +++ b/drivers/infiniband/sw/rxe/rxe_task.c @@ -132,8 +132,10 @@ static void do_task(struct rxe_task *task) * yield the cpu and reschedule the task */ if (!ret) {
task->state = TASK_STATE_IDLE;
resched = 1;
if (task->state != TASK_STATE_DRAINING) {
task->state = TASK_STATE_IDLE;
resched = 1;
} goto exit; }
@@ -151,7 +153,6 @@ static void do_task(struct rxe_task *task) break;
case TASK_STATE_DRAINING:
task->state = TASK_STATE_DRAINED; break; default:
(END)
Hi Leon,
Thanks for your review and for confirming the need for a fix.
Regarding your suggested patch, I believe removing the transition to TASK_STATE_DRAINED would cause an issue. As seen in the code and comments for rxe_cleanup_task() and is_done(), the cleanup process waits for the final TASK_STATE_DRAINED state. If the task remains stuck in DRAINING, the cleanup loop will never terminate.
My use of cont = 1 was intended as a minimal change. Since this regression was introduced during the migration from tasklets, restoring the pre-migration logic seemed like a reasonable approach. An alternative could be to set the state to TASK_STATE_DRAINED directly inside the if (!ret) block, and I am open to discussing the best fix.
Regards, Han