 
            On 8/10/20 2:35 PM, Jann Horn wrote:
On Mon, Aug 10, 2020 at 10:25 PM Jens Axboe axboe@kernel.dk wrote:
On 8/10/20 2:13 PM, Jens Axboe wrote:
Would it be clearer to write it like so perhaps?
/* * Optimization; when the task is RUNNING we can do with a * cheaper TWA_RESUME notification because,... <reason goes * here>. Otherwise do the more expensive, but always correct * TWA_SIGNAL. */ if (READ_ONCE(tsk->state) == TASK_RUNNING) { __task_work_notify(tsk, TWA_RESUME); if (READ_ONCE(tsk->state) == TASK_RUNNING) return; } __task_work_notify(tsk, TWA_SIGNAL); wake_up_process(tsk);Yeah that is easier to read, wasn't a huge fan of the loop since it's only a single retry kind of condition. I'll adopt this suggestion, thanks!
Re-write it a bit on top of that, just turning it into two separate READ_ONCE, and added appropriate comments. For the SQPOLL case, the wake_up_process() is enough, so we can clean up that if/else.
https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=49bc5c1...
I think I'm starting to understand the overall picture here, and I think if my understanding is correct, your solution isn't going to work properly.
My understanding of the scenario you're trying to address is:
- task A starts up io_uring
- task A tells io_uring to bump the counter of an eventfd E when work
has been completed
- task A submits some work ("read a byte from file descriptor X", or
something like that)
- io_uring internally starts an asynchronous I/O operation, with a callback C
- task A calls read(E, &counter, sizeof(counter)) to wait for events
to be processed
- the async I/O operation finishes, C is invoked, and C schedules
task_work for task A
And here you run into a deadlock, because the task_work will only run when task A returns from the syscall, but the syscall will only return once the task_work is executing and has finished the I/O operation.
If that is the scenario you're trying to solve here (where you're trying to force a task that's in the middle of some syscall that's completely unrelated to io_uring to return back to syscall context), I don't think this will work: It might well be that the task has e.g. just started entering the read() syscall, and is *about to* block, but is currently still running.
Your understanding of the scenario appears to be correct, and as far as I can tell, also your analysis of why the existing approach doesn't fully close it. You're right in that the task could currently be on its way to blocking, but still running. And for that case, TWA_RESUME isn't going to cut it.
Ugh. This basically means I have to use TWA_SIGNAL regardless of state, unless SQPOLL is used. That's not optimal.
Alternatively:
if (tsk->state != TASK_RUNNING || task_in_kernel(tsk)) notify = TWA_SIGNAL; else notify = TWA_RESUME;
should work as far as I can tell, but I don't even know if there's a reliable way to do task_in_kernel(). But I suspect this kind of check would still save the day, as we're not really expecting the common case to be that the task is in the kernel on the way to blocking. And it'd be kind of annoying to have to cater to that scenario by slowing down the fast path.
Suggestions?