The patch set allows registering a dmabuf with an io_uring instance for a specified file and using it with io_uring read / write requests. The infrastructure is not tied to io_uring, and there could be more users in the future. A similar idea was attempted some years ago by Keith [1], from which I borrowed a good number of changes, and it was later brought up by Tushar and Vishal from Intel.
It's an opt-in feature for files, and they need to implement a new file operation to use it. Only NVMe block devices are supported in this series. The user API is built on top of io_uring's "registered buffers": a dmabuf is registered in a special way, but afterwards it can be used like any other "registered buffer" with IORING_OP_{READ,WRITE}_FIXED requests. The map is created via the new file operation and is then passed through the I/O stack in a new iterator type. There is some additional infrastructure to bind it all together; it also counts requests using a dmabuf map and manages lifetimes, which is used to implement map invalidation.
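As a rough illustration of the request counting mentioned above, here is a minimal single-threaded userspace sketch. It is not code from this series: struct toy_map and the toy_map_*() helpers are hypothetical names, and the real infrastructure has to deal with concurrency and waiting, which this toy model ignores. It only shows the idea that invalidation stops new users and the map is released once the in-flight count drains.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; not the structures added by this series. */
struct toy_map {
	unsigned int inflight;	/* requests currently using the map */
	bool dying;		/* set once invalidation has started */
	bool released;		/* set when the last user is gone */
};

/* A request takes a reference before using the map; fails if it is dying. */
static bool toy_map_get(struct toy_map *m)
{
	if (m->dying)
		return false;
	m->inflight++;
	return true;
}

/* A request drops its reference on completion. */
static void toy_map_put(struct toy_map *m)
{
	assert(m->inflight > 0);
	if (--m->inflight == 0 && m->dying)
		m->released = true;
}

/* Invalidation: refuse new users, release once nothing is in flight. */
static void toy_map_invalidate(struct toy_map *m)
{
	m->dying = true;
	if (m->inflight == 0)
		m->released = true;
}
```

In this model, a request that fails toy_map_get() would have to wait for or trigger the creation of a new map; how that is handled is outside the scope of the sketch.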
It was tested for GPU <-> NVMe transfers. Also, as it maintains a long-term DMA mapping, it reduces the IOMMU cost. The numbers below are for udmabuf reads, previously run by Anuj, for different IOMMU modes:
- STRICT:      before = 570 KIOPS,  after = 5.01 MIOPS
- LAZY:        before = 1.93 MIOPS, after = 5.01 MIOPS
- PASSTHROUGH: before = 5.01 MIOPS, after = 5.01 MIOPS
There are some liburing tests that can serve as an example:

git: https://github.com/isilence/liburing.git rw-dmabuf-tests-v3
url: https://github.com/isilence/liburing/tree/rw-dmabuf-tests-v3
[1] https://lore.kernel.org/io-uring/20220805162444.3985535-1-kbusch@fb.com/
v3:
- Rework io_uring registration
- Move token/map infrastructure code out of blk-mq
- Simplify callbacks: remove a separate blk-mq table, which was mostly
  just forwarding calls (to nvme).
- Don't skip dma sync depending on request direction
- Fix a couple of hangs
- Rename s/dma/dmabuf/
- Other small changes
v2:
- Don't pass raw dma addresses, wrap it into a driver specific object
- Split into two objects: token and map
- Implement move_notify
Pavel Begunkov (10):
  file: add callback for creating long-term dmabuf maps
  iov_iter: add iterator type for dmabuf maps
  block: move bvec init into __bio_clone
  block: introduce dma map backed bio type
  lib: add dmabuf token infrastructure
  block: forward create_dmabuf_token to drivers
  nvme-pci: implement dma_token backed requests
  io_uring/rsrc: introduce buf registration structure
  io_uring/rsrc: extend buffer update
  io_uring/rsrc: add dmabuf backed registered buffers
 block/bio.c                     |  28 +++-
 block/blk-merge.c               |  14 ++
 block/blk.h                     |   3 +-
 block/fops.c                    |  16 ++
 drivers/nvme/host/pci.c         | 282 ++++++++++++++++++++++++++++++++
 include/linux/bio.h             |  19 ++-
 include/linux/blk-mq.h          |   9 +
 include/linux/blk_types.h       |   8 +-
 include/linux/fs.h              |   2 +
 include/linux/io_dmabuf_token.h |  92 +++++++++++
 include/linux/io_uring_types.h  |   5 +
 include/linux/uio.h             |  11 ++
 include/uapi/linux/io_uring.h   |  31 +++-
 io_uring/io_uring.c             |   3 +-
 io_uring/rsrc.c                 | 266 +++++++++++++++++++++++++-----
 io_uring/rsrc.h                 |  30 +++-
 io_uring/rw.c                   |   4 +-
 lib/Kconfig                     |   4 +
 lib/Makefile                    |   2 +
 lib/io_dmabuf_token.c           | 272 ++++++++++++++++++++++++++++++
 lib/iov_iter.c                  |  29 +++-
 21 files changed, 1071 insertions(+), 59 deletions(-)
 create mode 100644 include/linux/io_dmabuf_token.h
 create mode 100644 lib/io_dmabuf_token.c