On Tue, May 27, 2025 at 05:01:24PM -0600, Uday Shankar wrote:
Currently, ublk_drv associates with each hardware queue (hctx) a unique task (called the queue's ubq_daemon), which is the only task allowed to issue COMMIT_AND_FETCH commands against the hctx. If any other task attempts to do so, the command fails immediately with EINVAL. Combined with the block layer architecture, the result is that for each CPU C on the system, there is exactly one ublk server thread allowed to handle I/O submitted on CPU C. This can lead to suboptimal performance under imbalanced load. For an extreme example, suppose all the load is generated on CPUs mapping to a single ublk server thread. That thread may then be fully utilized and become the bottleneck in the system, while the other ublk server threads sit totally idle.
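To make the constraint concrete, here is a minimal kernel-style sketch of the pre-patch ownership rule. The function name is illustrative and the body is simplified; it is not the literal ublk_drv source:

        /*
         * Simplified model of the old rule: the task that issued the
         * queue's FETCH_REQs is recorded as ubq->ubq_daemon, and any
         * command arriving from a different task is rejected outright.
         */
        static int ublk_check_cmd_task(const struct ublk_queue *ubq)
        {
                if (ubq->ubq_daemon != current)
                        return -EINVAL; /* only the queue's daemon may proceed */
                return 0;
        }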
This issue can also be addressed directly in the ublk server, without kernel support, by having threads dequeue I/Os and pass them around to even out the load. But this solution requires inter-thread communication at least twice per I/O (submission and completion), which is generally a bad pattern for performance. The problem gets even worse with zero copy, since additional inter-thread communication is required to make the buffer register/unregister calls come from the correct thread.
Therefore, address this issue in ublk_drv by allowing each I/O to have its own daemon task. Two I/Os in the same queue may now be serviced by different daemon tasks, which was not possible before. Imbalanced load can then be spread across all ublk server threads by having them issue their FETCH_REQs in a round-robin manner. As a small toy example, consider a system with a single ublk device having 2 queues, each of depth 4. A ublk server with 4 threads could issue its FETCH_REQs against this device as follows (each entry is the qid,tag pair that the FETCH_REQ targets):
ublk server thread:     T0      T1      T2      T3
                        0,0     0,1     0,2     0,3
                        1,3     1,0     1,1     1,2
This setup allows load that is concentrated on one hctx/ublk_queue to be spread out across all ublk server threads, alleviating the issue described above.
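One way to realize this assignment is a simple arithmetic mapping from (qid, tag) to a server thread. The helper below is illustrative (not taken from the patch), but it reproduces the table above exactly:

        /*
         * Round-robin FETCH_REQ placement: offsetting by qid makes
         * consecutive queues start on different threads, so no single
         * thread ends up owning a whole queue. For the 2-queue,
         * depth-4, 4-thread example this yields precisely the table
         * shown above.
         */
        static inline unsigned int io_to_thread(unsigned int qid,
                                                unsigned int tag,
                                                unsigned int nthreads)
        {
                return (qid + tag) % nthreads;
        }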
Add the new UBLK_F_PER_IO_DAEMON feature to ublk_drv, which ublk servers can use to essentially test for the presence of this change and tailor their behavior accordingly.
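For reference, a server can probe for the flag at runtime with the existing UBLK_U_CMD_GET_FEATURES control command. Below is a minimal liburing-based sketch; the function name is made up, and the setup and error handling are pared down compared to a real server:

        #include <fcntl.h>
        #include <stdint.h>
        #include <string.h>
        #include <unistd.h>
        #include <liburing.h>
        #include <linux/ublk_cmd.h>

        /* Returns 1 if UBLK_F_PER_IO_DAEMON is supported, 0 if not, <0 on error. */
        static int ublk_has_per_io_daemon(void)
        {
                struct io_uring ring;
                struct io_uring_sqe *sqe;
                struct io_uring_cqe *cqe;
                struct ublksrv_ctrl_cmd *cmd;
                __u64 features = 0;
                int fd, ret;

                fd = open("/dev/ublk-control", O_RDWR);
                if (fd < 0)
                        return -1;

                /* ublk control commands are 128-byte-SQE uring commands */
                ret = io_uring_queue_init(4, &ring, IORING_SETUP_SQE128);
                if (ret < 0)
                        goto out_close;

                sqe = io_uring_get_sqe(&ring);
                memset(sqe, 0, 128);    /* SQE128: clear header and payload */
                sqe->opcode = IORING_OP_URING_CMD;
                sqe->fd = fd;
                sqe->cmd_op = UBLK_U_CMD_GET_FEATURES;

                cmd = (struct ublksrv_ctrl_cmd *)sqe->cmd;
                cmd->dev_id = -1;       /* GET_FEATURES is not tied to a device */
                cmd->addr = (__u64)(uintptr_t)&features;
                cmd->len = sizeof(features);

                io_uring_submit(&ring);
                if (io_uring_wait_cqe(&ring, &cqe) == 0) {
                        ret = cqe->res < 0 ? cqe->res
                                           : !!(features & UBLK_F_PER_IO_DAEMON);
                        io_uring_cqe_seen(&ring, cqe);
                } else {
                        ret = -1;
                }
                io_uring_queue_exit(&ring);
        out_close:
                close(fd);
                return ret;
        }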
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
This patch looks close to ready, but there is one panic, triggered immediately by the following steps, which I think needs to be addressed first.
Maybe we need to add one such stress test for UBLK_F_PER_IO_DAEMON too.
1) run heavy IO:
[root@ktest-40 ublk]# ./kublk add -t null -q 2 --nthreads 4 --per_io_tasks
dev id 0: nr_hw_queues 2 queue_depth 128 block size 512 dev_capacity 524288000
        max rq size 1048576 daemon pid 1283 flags 0x2042 state LIVE
        queue 0: affinity(0 )
        queue 1: affinity(8 )
[root@ktest-40 ublk]#
[root@ktest-40 ublk]# ~/git/fio/t/io_uring -p 0 -n 8 /dev/ublkb0
Or
fio --numjobs=8 --ioengine=libaio --iodepth=128 --iodepth_batch_submit=32 \
    --iodepth_batch_complete_min=32
2) panic immediately:
[ 51.297750] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 51.298719] #PF: supervisor read access in kernel mode
[ 51.299403] #PF: error_code(0x0000) - not-present page
[ 51.300069] PGD 1161c8067 P4D 1161c8067 PUD 11a793067 PMD 0
[ 51.300825] Oops: Oops: 0000 [#1] SMP NOPTI
[ 51.301389] CPU: 0 UID: 0 PID: 1285 Comm: kublk Not tainted 6.15.0+ #288 PREEMPT(full)
[ 51.302375] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014
[ 51.303551] RIP: 0010:io_uring_cmd_done+0xa7/0x1d0
[ 51.304226] Code: 48 89 f1 48 89 f0 48 83 e1 bf 80 cc 01 48 81 c9 00 01 80 00 83 e6 40 48 0f 45 c1 48 89 43 48 44 89 6b 58 c7 43 5c 00 00 00 00 <8b> 07 f6 c4 08 74 12 48 89 93 e8 00 00 0
[ 51.306554] RSP: 0018:ffffd1da436e3a40 EFLAGS: 00010246
[ 51.307253] RAX: 0000000000000100 RBX: ffff8d9cd3737300 RCX: 0000000000000001
[ 51.308178] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 51.309333] RBP: 0000000000000001 R08: 0000000000000018 R09: 0000000000190015
[ 51.310744] R10: 0000000000190015 R11: 0000000000000035 R12: ffff8d9cd1c7c000
[ 51.311986] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 51.313386] FS:  00007f2c293916c0(0000) GS:ffff8da179df6000(0000) knlGS:0000000000000000
[ 51.314899] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 51.315926] CR2: 0000000000000000 CR3: 00000001161c9002 CR4: 0000000000772ef0
[ 51.317179] PKRU: 55555554
[ 51.317682] Call Trace:
[ 51.318040]  <TASK>
[ 51.318355]  ublk_cmd_list_tw_cb+0x30/0x40 [ublk_drv]
[ 51.319061]  __io_run_local_work_loop+0x72/0x80
[ 51.319696]  __io_run_local_work+0x69/0x1e0
[ 51.320274]  io_cqring_wait+0x8f/0x6a0
[ 51.320794]  __do_sys_io_uring_enter+0x500/0x770
[ 51.321422]  do_syscall_64+0x82/0x170
[ 51.321891]  ? __do_sys_io_uring_enter+0x500/0x770
Thanks,
Ming