From: Leon Romanovsky <leonro(a)nvidia.com>
From Yishai,
Overview
--------
This patch series aims to enable multi-path DMA support, allowing an
mlx5 RDMA device to issue DMA commands through multiple paths. This
feature is critical for improving performance and reaching line rate
in certain environments where issuing PCI transactions over one path
may be significantly faster than over another. These differences can
arise from various PCI generations in the system or the specific system
topology.
To achieve this functionality, we introduced a data direct DMA device
that can serve the RDMA device by issuing DMA transactions on its behalf.
The key features and changes are described below.
Multi-Path Discovery
--------------------
API Implementation:
* Introduced an API to discover multiple paths for a given mlx5 RDMA device.
IOCTL Command:
* Added a new IOCTL command, MLX5_IB_METHOD_GET_DATA_DIRECT_SYSFS_PATH, to
the DEVICE object. When an affiliated Data-Direct/DMA device is present,
its sysfs path is returned.
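For illustration, a minimal sketch of how such a method can be declared with the
uverbs ioctl infrastructure; the attribute name and sizing below are assumptions
for the sketch, not necessarily the exact definitions used in the patches:

DECLARE_UVERBS_NAMED_METHOD(
	MLX5_IB_METHOD_GET_DATA_DIRECT_SYSFS_PATH,
	/* Output buffer receiving the sysfs path of the affiliated
	 * Data-Direct/DMA device; the handler fails when none is present.
	 * The method is attached to the DEVICE object. */
	UVERBS_ATTR_PTR_OUT(MLX5_IB_ATTR_GET_DATA_DIRECT_SYSFS_PATH,
			    UVERBS_ATTR_MIN_SIZE(0),
			    UA_MANDATORY));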
Feature Activation by mlx5 RDMA Application
-------------------------------------------
UVERBS Extension:
* Extended UVERBS_METHOD_REG_DMABUF_MR over UVERBS_OBJECT_MR to include
mlx5 extended flags.
Access Flag:
* Introduced the MLX5_IB_UAPI_REG_DMABUF_ACCESS_DATA_DIRECT flag, allowing
applications to request the use of the affiliated DMA device for DMABUF
registration.
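As a rough usage sketch from the application side; the mlx5dv_reg_dmabuf_mr()
helper and its signature are assumptions about the rdma-core counterpart, and
the access flags are arbitrary. Only the MLX5_IB_UAPI_REG_DMABUF_ACCESS_DATA_DIRECT
flag comes from this series:

	struct ibv_mr *mr;

	/* Hypothetical snippet: register a dmabuf MR and ask the driver to
	 * issue DMA through the affiliated data-direct device. */
	mr = mlx5dv_reg_dmabuf_mr(pd, 0 /* offset */, length, 0 /* iova */,
				  dmabuf_fd,
				  IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ,
				  MLX5_IB_UAPI_REG_DMABUF_ACCESS_DATA_DIRECT);
	if (!mr)
		/* Fall back to a regular dmabuf registration without the flag. */
		mr = ibv_reg_dmabuf_mr(pd, 0, length, 0, dmabuf_fd,
				       IBV_ACCESS_LOCAL_WRITE |
				       IBV_ACCESS_REMOTE_READ);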
Data-Direct/DMA Device
----------------------
New Driver:
* Introduced a new driver to manage the new DMA PF (device ID 0x2100).
Its registration/unregistration is handled as part of the mlx5_ib init/exit
flows, with mlx5 IB devices as its clients.
Functionality:
* The driver does not interface directly with the firmware (no command interface,
no caps, etc.) but works over PCI to activate its DMA functionality. It serves
as the DMA device for efficiently accessing other PCI devices (e.g., a GPU PF) and
reads its VUID over PCI so it can handle registrations from NICs that report the
same VUID.
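A minimal sketch of the PCI plumbing described above; the driver name and the
probe/remove symbols are illustrative, not necessarily the ones used in the patch:

static const struct pci_device_id mlx5_data_direct_pci_table[] = {
	{ PCI_VDEVICE(MELLANOX, 0x2100) },	/* Data Direct DMA PF */
	{}
};

static struct pci_driver mlx5_data_direct_driver = {
	.name = "mlx5_data_direct",
	.id_table = mlx5_data_direct_pci_table,
	.probe = mlx5_data_direct_probe,	/* enable PCI, read VUID, notify clients */
	.remove = mlx5_data_direct_remove,
};

/* Registered/unregistered from the mlx5_ib module init/exit flows,
 * e.g. via pci_register_driver(&mlx5_data_direct_driver). */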
mlx5 IB RDMA Device
-------------------
VUID Query:
* Reads its affiliated DMA PF VUID via the QUERY_VUID command with the data_direct
bit set; a sketch of this query appears at the end of this section.
Driver Registration:
* Registers with the DMA PF driver to be notified upon bind/unbind.
Application Request Handling:
* Uses the DMA PF device upon application request as described above.
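As an illustration of the VUID query above, a hedged sketch using the standard
mlx5 command macros. The query_vuid IFC names follow the first patch in the
series; the exact opcode name, field names, and output layout are assumptions:

static int mlx5_ib_query_data_direct_vuid(struct mlx5_core_dev *mdev,
					  void *out, int outlen)
{
	u32 in[MLX5_ST_SZ_DW(query_vuid_in)] = {};

	MLX5_SET(query_vuid_in, in, opcode, MLX5_CMD_OPCODE_QUERY_VUID);
	MLX5_SET(query_vuid_in, in, data_direct, 1);
	MLX5_SET(query_vuid_in, in, vhca_id, MLX5_CAP_GEN(mdev, vhca_id));

	/* The returned VUID is then matched against the VUID the
	 * data-direct PF driver read over PCI. */
	return mlx5_cmd_exec(mdev, in, sizeof(in), out, outlen);
}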
DMABUF over Umem
----------------
Introduced an option to obtain a DMABUF umem using a DMA device other than the
IB device, allowing the buffer to be mapped through the IOMMU with the DMA
device that is expected to access it for a given registration.
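For reference, a sketch of what the extended umem interface might look like; the
names follow the patch titles, but the exact prototypes are assumptions modeled
on the existing ib_umem_dmabuf_get_pinned():

/* Pin a dmabuf-backed umem, but map it for 'dma_device' (e.g. the
 * data-direct PF) rather than for the IB device itself. */
struct ib_umem_dmabuf *
ib_umem_dmabuf_get_pinned_with_dma_device(struct ib_device *device,
					   struct device *dma_device,
					   unsigned long offset, size_t size,
					   int fd, int access);

/* Revoke the umem, e.g. when the data-direct device unbinds, so the
 * mapping is torn down even though the MR object still exists. */
void ib_umem_dmabuf_revoke(struct ib_umem_dmabuf *umem_dmabuf);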
Further details are provided in the commit logs of the patches in this
series.
Thanks
Yishai Hadas (8):
net/mlx5: Add IFC related stuff for data direct
RDMA/mlx5: Introduce the 'data direct' driver
RDMA/mlx5: Add the initialization flow to utilize the 'data direct'
device
RDMA/umem: Add support for creating pinned DMABUF umem with a given
dma device
RDMA/umem: Introduce an option to revoke DMABUF umem
RDMA: Pass uverbs_attr_bundle as part of '.reg_user_mr_dmabuf' API
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct
RDMA/mlx5: Introduce GET_DATA_DIRECT_SYSFS_PATH ioctl
drivers/infiniband/core/umem_dmabuf.c | 66 +++-
drivers/infiniband/core/uverbs_std_types_mr.c | 2 +-
drivers/infiniband/hw/bnxt_re/ib_verbs.c | 3 +-
drivers/infiniband/hw/bnxt_re/ib_verbs.h | 2 +-
drivers/infiniband/hw/efa/efa.h | 2 +-
drivers/infiniband/hw/efa/efa_verbs.c | 4 +-
drivers/infiniband/hw/irdma/verbs.c | 2 +-
drivers/infiniband/hw/mlx5/Makefile | 1 +
drivers/infiniband/hw/mlx5/cmd.c | 21 ++
drivers/infiniband/hw/mlx5/cmd.h | 2 +
drivers/infiniband/hw/mlx5/data_direct.c | 227 +++++++++++++
drivers/infiniband/hw/mlx5/data_direct.h | 23 ++
drivers/infiniband/hw/mlx5/main.c | 125 +++++++
drivers/infiniband/hw/mlx5/mlx5_ib.h | 22 +-
drivers/infiniband/hw/mlx5/mr.c | 304 +++++++++++++++---
drivers/infiniband/hw/mlx5/odp.c | 5 +-
drivers/infiniband/hw/mlx5/std_types.c | 55 +++-
drivers/infiniband/hw/mlx5/umr.c | 93 ++++--
drivers/infiniband/hw/mlx5/umr.h | 1 +
include/linux/mlx5/mlx5_ifc.h | 51 ++-
include/rdma/ib_umem.h | 18 ++
include/rdma/ib_verbs.h | 2 +-
include/uapi/rdma/mlx5_user_ioctl_cmds.h | 9 +
include/uapi/rdma/mlx5_user_ioctl_verbs.h | 4 +
24 files changed, 944 insertions(+), 100 deletions(-)
create mode 100644 drivers/infiniband/hw/mlx5/data_direct.c
create mode 100644 drivers/infiniband/hw/mlx5/data_direct.h
--
2.45.2
On Thu, Aug 01, 2024 at 10:53:45AM +0800, Huan Yang wrote:
>
> On 2024/8/1 4:46, Daniel Vetter wrote:
> > On Tue, Jul 30, 2024 at 08:04:04PM +0800, Huan Yang wrote:
> > > On 2024/7/30 17:05, Huan Yang wrote:
> > > > On 2024/7/30 16:56, Daniel Vetter wrote:
> > > > >
> > > > > On Tue, Jul 30, 2024 at 03:57:44PM +0800, Huan Yang wrote:
> > > > > > UDMA-BUF step:
> > > > > > 1. memfd_create
> > > > > > 2. open file(buffer/direct)
> > > > > > 3. udmabuf create
> > > > > > 4. mmap memfd
> > > > > > 5. read file into memfd vaddr
> > > > > Yeah this is really slow and the worst way to do it. You absolutely want
> > > > > to start _all_ the I/O before you start creating the dma-buf, ideally
> > > > > with everything running in parallel. But just starting the direct I/O
> > > > > with async and then creating the udmabuf should be a lot faster and avoid
> > > > That's great. Let me rephrase that, and please correct me if I'm wrong.
> > > >
> > > > UDMA-BUF step:
> > > > 1. memfd_create
> > > > 2. mmap memfd
> > > > 3. open file(buffer/direct)
> > > > 4. start thread to async read
> > > > 5. udmabuf create
> > > >
> > > > With this, performance can improve.
> > > I just tested it. The steps are:
> > >
> > > UDMA-BUF step:
> > > 1. memfd_create
> > > 2. mmap memfd
> > > 3. open file(buffer/direct)
> > > 4. start thread to async read
> > > 5. udmabuf create
> > >
> > > 6. join and wait
> > >
> > > Reading the 3G file, all steps cost 1,527,103,431 ns; that's great.
> > Ok that's almost the throughput of your patch set, which I think is close
> > enough. The remaining difference is probably just the mmap overhead, not
> > sure whether/how we can do direct i/o to an fd directly ... in principle
> > it's possible for any file that uses the standard pagecache.
>
> Yes, for mmap, IMO, we already get and pin all the folios, which means all
> the PFNs are obtained when the udmabuf is created.
>
> So I think mmap with page faults does not help save memory but increases
> the mmap access cost (maybe it can save a little page table memory).
>
> I want to offer a patchset to remove it and make it more suitable for folio
> operations (and remove the unpin list). It also contains some fix patches.
>
> I'll send it once my testing shows it's good.
>
>
> Regarding fd operations for direct I/O, maybe use sendfile or copy_file_range?
>
> sendfile is based on pipe buffers, and its performance was low when I tested it.
>
> copy_file_range can't work because the files are not on the same file system.
>
> So I can't find another way to do it. Can someone give some suggestions?
Yeah, direct I/O to the pagecache without an mmap might be too niche to be
supported. Maybe io_uring has something, but I guess that's as unlikely as
anything else.
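For reference, a rough sketch of the flow measured above (memfd + O_DIRECT read
in a helper thread, then udmabuf creation via the standard UDMABUF_CREATE uapi);
O_DIRECT alignment constraints and most error handling are glossed over:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/udmabuf.h>

struct read_args { void *dst; int fd; size_t len; };

/* 4. async O_DIRECT read into the memfd mapping */
static void *read_worker(void *p)
{
	struct read_args *a = p;
	size_t done = 0;
	ssize_t n;

	while (done < a->len &&
	       (n = pread(a->fd, (char *)a->dst + done, a->len - done, done)) > 0)
		done += n;
	return NULL;
}

/* Returns a dma-buf fd backed by the memfd pages; error handling trimmed. */
static int create_udmabuf_from_file(const char *path, size_t size)
{
	int memfd = memfd_create("buf", MFD_ALLOW_SEALING);	/* 1. memfd_create */
	pthread_t t;

	ftruncate(memfd, size);
	fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);	/* udmabuf requires this seal */

	void *vaddr = mmap(NULL, size, PROT_READ | PROT_WRITE,	/* 2. mmap memfd */
			   MAP_SHARED, memfd, 0);
	int file_fd = open(path, O_RDONLY | O_DIRECT);		/* 3. open file */

	struct read_args args = { vaddr, file_fd, size };
	pthread_create(&t, NULL, read_worker, &args);		/* 4. start reader */

	int dev = open("/dev/udmabuf", O_RDWR);
	struct udmabuf_create create = {
		.memfd = memfd, .flags = UDMABUF_FLAGS_CLOEXEC,
		.offset = 0, .size = size,
	};
	int dmabuf_fd = ioctl(dev, UDMABUF_CREATE, &create);	/* 5. udmabuf create */

	pthread_join(t, NULL);					/* 6. join/wait */
	return dmabuf_fd;
}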
-Sima
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On 2024-08-01 20:32, Kasireddy, Vivek wrote:
> Hi Huan,
>
>> This patchset attempts to fix some errors in udmabuf and remove the
>> unpin_list structure.
>>
>> Some of these fixes just gather patches which I uploaded before.
>>
>> Patch1
>> ===
>> Try to remove the page-fault-based mmap and map the buffer directly.
>> The current udmabuf has already obtained and pinned the folios upon
>> completion of creation. This means that the physical memory has already
>> been acquired, rather than being accessed dynamically. The current page
>> fault method only saves some page table memory.
>>
>> As a result, the page fault mechanism has lost its purpose as demand
>> paging. Since a page fault requires trapping into kernel mode and filling
>> in the page when the corresponding virtual address in the mmap is
>> accessed, user mode access to these virtual addresses must trap into
>> kernel mode.
>>
>> Therefore, when creating a large udmabuf, this represents considerable
>> overhead.
> Just want to mention that for the main use-case the udmabuf driver is designed
> for (sharing Qemu Guest FB with Host for GPU DMA), udmabufs are not created very
> frequently. And I think providing CPU access via mmap is just a backup, mainly
> intended for debugging purposes.
FYI, Mesa now uses udmabuf for supporting dma-bufs with software rendering.
--
Earthling Michel Dänzer | https://redhat.com
Libre software enthusiast | Mesa and Xwayland developer