 
            On 8/10/20 4:41 PM, Jann Horn wrote:
On Tue, Aug 11, 2020 at 12:01 AM Jens Axboe axboe@kernel.dk wrote:
On 8/10/20 3:28 PM, Jens Axboe wrote:
On 8/10/20 3:26 PM, Jann Horn wrote:
On Mon, Aug 10, 2020 at 11:12 PM Jens Axboe axboe@kernel.dk wrote:
On 8/10/20 3:10 PM, Peter Zijlstra wrote:
On Mon, Aug 10, 2020 at 03:06:49PM -0600, Jens Axboe wrote:
> should work as far as I can tell, but I don't even know if there's a > reliable way to do task_in_kernel().
Only on NOHZ_FULL, and tracking that is one of the things that makes it so horribly expensive.
Probably no other way than to bite the bullet and just use TWA_SIGNAL unconditionally...
Why are you trying to avoid using TWA_SIGNAL? Is there a specific part of handling it that's particularly slow?
Not particularly slow, but it's definitely heavier than TWA_RESUME. And as we're driving any pollable async IO through this, just trying to ensure it's as light as possible.
It's not a functional thing, just efficiency.
Ran some quick testing in a vm, which is worst case for this kind of thing as any kind of mucking with interrupts is really slow. And the hit is substantial. Though with the below, we're basically at parity again. Just for discussion...
diff --git a/kernel/task_work.c b/kernel/task_work.c index 5c0848ca1287..ea2c683c8563 100644 --- a/kernel/task_work.c +++ b/kernel/task_work.c @@ -42,7 +42,8 @@ task_work_add(struct task_struct *task, struct callback_head *work, int notify) set_notify_resume(task); break; case TWA_SIGNAL:
if (lock_task_sighand(task, &flags)) {
if (!(task->jobctl & JOBCTL_TASK_WORK) &&
lock_task_sighand(task, &flags)) { task->jobctl |= JOBCTL_TASK_WORK; signal_wake_up(task, 0); unlock_task_sighand(task, &flags);I think that should work in theory, but if you want to be able to do a proper unlocked read of task->jobctl here, then I think you'd have to use READ_ONCE() here and make all existing writes to ->jobctl use WRITE_ONCE().
Also, I think that to make this work, stuff like get_signal() will need to use memory barriers to ensure that reads from ->task_works are ordered after ->jobctl has been cleared - ideally written such that on the fastpath, the memory barrier doesn't execute.
I wonder if it's possible to just make it safe for the io_uring case, since a bigger change would make this performance regression persistent in this release... Would still require the split add/notification patch, but that one is trivial.