dma-buf has become a way to safely acquire a handle to non-struct page memory that can still have lifetime controlled by the exporter. Notably RDMA can now import dma-buf FDs and build them into MRs which allows for PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory from PCI device BARs.
This series supports a use case for SPDK where an NVMe device will be owned by SPDK through VFIO but interacting with an RDMA device. The RDMA device may directly access the NVMe CMB or directly manipulate the NVMe device's doorbell using PCI P2P.
However, as a general mechanism, it can support many other scenarios with VFIO. I imagine this dmabuf approach to be usable by iommufd as well for generic and safe P2P mappings.
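A rough sketch of the intended userspace flow, assuming the VFIO_DEVICE_FEATURE_DMA_BUF uapi added in the last patch and rdma-core's ibv_reg_dmabuf_mr(); the helper name, BAR index, lengths and access flags below are only for illustration (e.g. exporting an NVMe CMB BAR and registering it as an RDMA MR):

	#include <fcntl.h>
	#include <stdlib.h>
	#include <sys/ioctl.h>
	#include <linux/vfio.h>
	#include <infiniband/verbs.h>

	/* Illustrative helper: export BAR 'index' of an already-open VFIO
	 * device FD as a dmabuf and register it as an RDMA MR. */
	static struct ibv_mr *mr_from_vfio_bar(int device_fd, struct ibv_pd *pd,
					       __u32 index, size_t bar_len)
	{
		size_t sz = sizeof(struct vfio_device_feature) +
			    sizeof(struct vfio_device_feature_dma_buf);
		struct vfio_device_feature *feature = calloc(1, sz);
		struct vfio_device_feature_dma_buf *get_buf;
		int dmabuf_fd;

		if (!feature)
			return NULL;
		feature->argsz = sz;
		feature->flags = VFIO_DEVICE_FEATURE_GET |
				 VFIO_DEVICE_FEATURE_DMA_BUF;
		get_buf = (void *)feature->data;
		get_buf->region_index = index;
		get_buf->open_flags = O_CLOEXEC | O_RDWR;
		get_buf->offset = 0;
		get_buf->length = 0;		/* 0/0 means the whole BAR */

		dmabuf_fd = ioctl(device_fd, VFIO_DEVICE_FEATURE, feature);
		free(feature);
		if (dmabuf_fd < 0)
			return NULL;

		/* The RDMA side imports the FD; the resulting MR points at
		 * the device BAR and can be used for PCI P2P transfers. */
		return ibv_reg_dmabuf_mr(pd, 0, bar_len, 0, dmabuf_fd,
					 IBV_ACCESS_LOCAL_WRITE |
					 IBV_ACCESS_REMOTE_WRITE);
	}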
This series goes after the "Break up ioctl dispatch functions to one function per ioctl" series.
This is on github: https://github.com/jgunthorpe/linux/commits/vfio_dma_buf
Jason Gunthorpe (4):
  dma-buf: Add dma_buf_try_get()
  vfio: Add vfio_device_get()
  vfio_pci: Do not open code pci_try_reset_function()
  vfio/pci: Allow MMIO regions to be exported through dma-buf
drivers/vfio/pci/Makefile | 1 + drivers/vfio/pci/vfio_pci_config.c | 22 ++- drivers/vfio/pci/vfio_pci_core.c | 33 +++- drivers/vfio/pci/vfio_pci_dma_buf.c | 265 ++++++++++++++++++++++++++++ drivers/vfio/pci/vfio_pci_priv.h | 24 +++ drivers/vfio/vfio_main.c | 3 +- include/linux/dma-buf.h | 13 ++ include/linux/vfio.h | 6 + include/linux/vfio_pci_core.h | 1 + include/uapi/linux/vfio.h | 18 ++ 10 files changed, 364 insertions(+), 22 deletions(-) create mode 100644 drivers/vfio/pci/vfio_pci_dma_buf.c
base-commit: 385f0411fcd2780b5273992832cdc8edcd5b8ea9
Add vfio_device_get() to increment a reference the caller already holds. Export vfio_device_put() to pair with it.
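For illustration, the expected usage is the usual get/put pairing around a longer-lived back-reference; the structure and function names below are made up, but the pattern matches how the last patch uses the helpers:

	struct foo {				/* hypothetical driver object */
		struct vfio_device *vdev;
	};

	/* Called from an ioctl handler, so a reference is already held */
	static void foo_bind(struct foo *foo, struct vfio_device *vdev)
	{
		vfio_device_get(vdev);
		foo->vdev = vdev;
	}

	static void foo_release(struct foo *foo)
	{
		vfio_device_put(foo->vdev);
		kfree(foo);
	}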
Signed-off-by: Jason Gunthorpe jgg@nvidia.com --- drivers/vfio/vfio_main.c | 3 ++- include/linux/vfio.h | 6 ++++++ 2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c index 6f96e6d07a5e98..5ad50aec7dc94c 100644 --- a/drivers/vfio/vfio_main.c +++ b/drivers/vfio/vfio_main.c @@ -451,11 +451,12 @@ static void vfio_group_get(struct vfio_group *group) * Device objects - create, release, get, put, search */ /* Device reference always implies a group reference */ -static void vfio_device_put(struct vfio_device *device) +void vfio_device_put(struct vfio_device *device) { if (refcount_dec_and_test(&device->refcount)) complete(&device->comp); } +EXPORT_SYMBOL_GPL(vfio_device_put);
static bool vfio_device_try_get(struct vfio_device *device) { diff --git a/include/linux/vfio.h b/include/linux/vfio.h index e05ddc6fe6a556..25850b1e08cba9 100644 --- a/include/linux/vfio.h +++ b/include/linux/vfio.h @@ -143,6 +143,12 @@ void vfio_uninit_group_dev(struct vfio_device *device); int vfio_register_group_dev(struct vfio_device *device); int vfio_register_emulated_iommu_dev(struct vfio_device *device); void vfio_unregister_group_dev(struct vfio_device *device); +void vfio_device_put(struct vfio_device *device); + +static inline void vfio_device_get(struct vfio_device *device) +{ + refcount_inc(&device->refcount); +}
int vfio_assign_device_set(struct vfio_device *device, void *set_id);
dma_buf_try_get() increments the refcount of the dma-buf's struct file, but only if the refcount is not already zero. This is useful to allow the struct file's lifetime to control the lifetime of the dmabuf while still letting the driver keep track of the dmabufs it has created.
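A sketch of the required interlock (names here are hypothetical): the exporter keeps its dmabufs on a list protected by a lock that its dma_buf release callback also takes before unlinking, so a dmabuf found under that lock either yields a successful dma_buf_try_get() or is already being released and is skipped:

	/* Walker side of a hypothetical exporter */
	mutex_lock(&exporter->lock);
	list_for_each_entry(cur, &exporter->dmabufs, elm) {
		if (!dma_buf_try_get(cur->dmabuf))
			continue;	/* final fput() already ran */
		/* ... safe to use cur->dmabuf here ... */
		dma_buf_put(cur->dmabuf);
	}
	mutex_unlock(&exporter->lock);

	/* Release side: the dma_buf_ops->release callback takes the same
	 * exporter->lock before list_del(), providing the interlock. */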
Signed-off-by: Jason Gunthorpe jgg@nvidia.com --- include/linux/dma-buf.h | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 71731796c8c3a8..a35f1554f2fb36 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -618,6 +618,19 @@ int dma_buf_fd(struct dma_buf *dmabuf, int flags); struct dma_buf *dma_buf_get(int fd); void dma_buf_put(struct dma_buf *dmabuf);
+/** + * dma_buf_try_get - try to get a reference on a dmabuf + * @dmabuf - the dmabuf to get + * + * Returns true if a reference was successfully obtained. The caller must + * interlock with the dmabuf's release function in some way, such as RCU, to + * ensure that this is not called on freed memory. + */ +static inline bool dma_buf_try_get(struct dma_buf *dmabuf) +{ + return get_file_rcu(dmabuf->file); +} + struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *, enum dma_data_direction); void dma_buf_unmap_attachment(struct dma_buf_attachment *, struct sg_table *,
An FLR triggered by an emulated config space write should not behave differently from an FLR triggered by VFIO_DEVICE_RESET, yet currently the config space path misses the power management handling.
Consolidate all the call sites to invoke a single function.
Signed-off-by: Jason Gunthorpe jgg@nvidia.com --- drivers/vfio/pci/vfio_pci_config.c | 14 ++++---------- drivers/vfio/pci/vfio_pci_core.c | 5 ++--- drivers/vfio/pci/vfio_pci_priv.h | 1 + 3 files changed, 7 insertions(+), 13 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index 4a350421c5f62a..d22921efa25987 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -893,11 +893,8 @@ static int vfio_exp_config_write(struct vfio_pci_core_device *vdev, int pos, pos - offset + PCI_EXP_DEVCAP, &cap);
- if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) { - vfio_pci_zap_and_down_write_memory_lock(vdev); - pci_try_reset_function(vdev->pdev); - up_write(&vdev->memory_lock); - } + if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) + vfio_pci_try_reset_function(vdev); }
/* @@ -975,11 +972,8 @@ static int vfio_af_config_write(struct vfio_pci_core_device *vdev, int pos, pos - offset + PCI_AF_CAP, &cap);
- if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) { - vfio_pci_zap_and_down_write_memory_lock(vdev); - pci_try_reset_function(vdev->pdev); - up_write(&vdev->memory_lock); - } + if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) + vfio_pci_try_reset_function(vdev); }
return count; diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 050b9d4b8c290c..d13e22021860cc 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -960,8 +960,7 @@ static int vfio_pci_ioctl_set_irqs(struct vfio_pci_core_device *vdev, return ret; }
-static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev, - void __user *arg) +int vfio_pci_try_reset_function(struct vfio_pci_core_device *vdev) { int ret;
@@ -1202,7 +1201,7 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd, case VFIO_DEVICE_PCI_HOT_RESET: return vfio_pci_ioctl_pci_hot_reset(vdev, uarg); case VFIO_DEVICE_RESET: - return vfio_pci_ioctl_reset(vdev, uarg); + return vfio_pci_try_reset_function(vdev); case VFIO_DEVICE_SET_IRQS: return vfio_pci_ioctl_set_irqs(vdev, uarg); default: diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h index 58b8d34c162cd6..5b1cb9a9838442 100644 --- a/drivers/vfio/pci/vfio_pci_priv.h +++ b/drivers/vfio/pci/vfio_pci_priv.h @@ -60,6 +60,7 @@ void vfio_config_free(struct vfio_pci_core_device *vdev); int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state);
+int vfio_pci_try_reset_function(struct vfio_pci_core_device *vdev); bool __vfio_pci_memory_enabled(struct vfio_pci_core_device *vdev); void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_core_device *vdev); u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_core_device *vdev);
dma-buf has become a way to safely acquire a handle to non-struct page memory that can still have lifetime controlled by the exporter. Notably RDMA can now import dma-buf FDs and build them into MRs which allows for PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory from PCI device BARs.
The patch design loosely follows the pattern in commit db1a8dd916aa ("habanalabs: add support for dma-buf exporter") except this does not support pinning.
Instead, this implements what, in the past, we've called a revocable attachment using move. In normal situations the attachment is pinned, as a BAR does not change physical address. However when the VFIO device is closed, or a PCI reset is issued, access to the MMIO memory is revoked.
Revoked means that a move occurs, but an attempt to immediately re-map the memory will fail. In the reset case a future move will be triggered when MMIO access returns. As both close and reset are under userspace control, it is expected that userspace will suspend use of the dma-buf before doing these operations; the revoke is purely for kernel self-defense against a hostile userspace.
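For context, an importer that copes with this revoke protocol would use the existing dynamic attachment interface; a minimal sketch, with importer-side names entirely hypothetical:

	#include <linux/dma-buf.h>
	#include <linux/dma-resv.h>

	static void my_move_notify(struct dma_buf_attachment *attach)
	{
		/* Called with the reservation lock held; tear down any HW use
		 * of the old mapping. Re-mapping may fail while access is
		 * revoked and can be retried after a later move. */
		my_stop_hw_and_unmap(attach->importer_priv);
	}

	static const struct dma_buf_attach_ops my_attach_ops = {
		.allow_peer2peer = true,
		.move_notify = my_move_notify,
	};

	static int my_map(struct dma_buf *dmabuf, struct device *dev, void *my_priv)
	{
		struct dma_buf_attachment *attach;
		struct sg_table *sgt;

		attach = dma_buf_dynamic_attach(dmabuf, dev, &my_attach_ops, my_priv);
		if (IS_ERR(attach))
			return PTR_ERR(attach);

		dma_resv_lock(dmabuf->resv, NULL);
		sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
		dma_resv_unlock(dmabuf->resv);
		if (IS_ERR(sgt))
			return PTR_ERR(sgt);	/* e.g. -ENODEV while revoked */
		/* ... program HW with the DMA addresses in the sgt ... */
		return 0;
	}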
Signed-off-by: Jason Gunthorpe jgg@nvidia.com --- drivers/vfio/pci/Makefile | 1 + drivers/vfio/pci/vfio_pci_config.c | 8 +- drivers/vfio/pci/vfio_pci_core.c | 28 ++- drivers/vfio/pci/vfio_pci_dma_buf.c | 265 ++++++++++++++++++++++++++++ drivers/vfio/pci/vfio_pci_priv.h | 23 +++ include/linux/vfio_pci_core.h | 1 + include/uapi/linux/vfio.h | 18 ++ 7 files changed, 336 insertions(+), 8 deletions(-) create mode 100644 drivers/vfio/pci/vfio_pci_dma_buf.c
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile index 24c524224da5a3..81006a157cde14 100644 --- a/drivers/vfio/pci/Makefile +++ b/drivers/vfio/pci/Makefile @@ -2,6 +2,7 @@
vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o +vfio-pci-core-$(CONFIG_DMA_SHARED_BUFFER) += vfio_pci_dma_buf.o obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
vfio-pci-y := vfio_pci.o diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index d22921efa25987..f8a9c24d04aeb7 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -584,10 +584,12 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos, virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY); new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
- if (!new_mem) + if (!new_mem) { vfio_pci_zap_and_down_write_memory_lock(vdev); - else + vfio_pci_dma_buf_move(vdev, true); + } else { down_write(&vdev->memory_lock); + }
/* * If the user is writing mem/io enable (new_mem/io) and we @@ -622,6 +624,8 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos, *virt_cmd &= cpu_to_le16(~mask); *virt_cmd |= cpu_to_le16(new_cmd & mask);
+ if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock); }
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index d13e22021860cc..206f159c480e42 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -504,6 +504,8 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev) vfio_spapr_pci_eeh_release(vdev->pdev); vfio_pci_core_disable(vdev);
+ vfio_pci_dma_buf_cleanup(vdev); + mutex_lock(&vdev->igate); if (vdev->err_trigger) { eventfd_ctx_put(vdev->err_trigger); @@ -980,7 +982,10 @@ int vfio_pci_try_reset_function(struct vfio_pci_core_device *vdev) */ vfio_pci_set_power_state(vdev, PCI_D0);
+ vfio_pci_dma_buf_move(vdev, true); ret = pci_try_reset_function(vdev->pdev); + if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock);
return ret; @@ -1210,11 +1215,10 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd, } EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
-static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags, - uuid_t __user *arg, size_t argsz) +static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev, + u32 flags, uuid_t __user *arg, + size_t argsz) { - struct vfio_pci_core_device *vdev = - container_of(device, struct vfio_pci_core_device, vdev); uuid_t uuid; int ret;
@@ -1241,9 +1245,14 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags, int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags, void __user *arg, size_t argsz) { + struct vfio_pci_core_device *vdev = + container_of(device, struct vfio_pci_core_device, vdev); + switch (flags & VFIO_DEVICE_FEATURE_MASK) { case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN: - return vfio_pci_core_feature_token(device, flags, arg, argsz); + return vfio_pci_core_feature_token(vdev, flags, arg, argsz); + case VFIO_DEVICE_FEATURE_DMA_BUF: + return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz); default: return -ENOTTY; } @@ -1881,6 +1890,7 @@ void vfio_pci_core_init_device(struct vfio_pci_core_device *vdev, INIT_LIST_HEAD(&vdev->vma_list); INIT_LIST_HEAD(&vdev->sriov_pfs_item); init_rwsem(&vdev->memory_lock); + INIT_LIST_HEAD(&vdev->dmabufs); } EXPORT_SYMBOL_GPL(vfio_pci_core_init_device);
@@ -2227,11 +2237,17 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, * cause the PCI config space reset without restoring the original * state (saved locally in 'vdev->pm_save'). */ - list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) + list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) { + vfio_pci_dma_buf_move(cur, true); vfio_pci_set_power_state(cur, PCI_D0); + }
ret = pci_reset_bus(pdev);
+ list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) + if (__vfio_pci_memory_enabled(cur)) + vfio_pci_dma_buf_move(cur, false); + err_undo: list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) { if (cur == cur_mem) diff --git a/drivers/vfio/pci/vfio_pci_dma_buf.c b/drivers/vfio/pci/vfio_pci_dma_buf.c new file mode 100644 index 00000000000000..ac32405de5e48d --- /dev/null +++ b/drivers/vfio/pci/vfio_pci_dma_buf.c @@ -0,0 +1,265 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. + */ +#include <linux/dma-buf.h> +#include <linux/pci-p2pdma.h> +#include <linux/dma-resv.h> + +#include "vfio_pci_priv.h" + +MODULE_IMPORT_NS(DMA_BUF); + +struct vfio_pci_dma_buf { + struct dma_buf *dmabuf; + struct vfio_pci_core_device *vdev; + struct list_head dmabufs_elm; + unsigned int index; + size_t offset; + bool revoked; +}; + +static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf, + struct dma_buf_attachment *attachment) +{ + struct vfio_pci_dma_buf *priv = dmabuf->priv; + int rc; + + rc = pci_p2pdma_distance_many(priv->vdev->pdev, &attachment->dev, 1, + true); + if (rc < 0) + attachment->peer2peer = false; + return 0; +} + +static void vfio_pci_dma_buf_unpin(struct dma_buf_attachment *attachment) +{ +} + +static int vfio_pci_dma_buf_pin(struct dma_buf_attachment *attachment) +{ + /* + * Uses the dynamic interface but must always allow for + * dma_buf_move_notify() to do revoke + */ + return -EINVAL; +} + +static struct sg_table * +vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment, + enum dma_data_direction dir) +{ + size_t sgl_size = dma_get_max_seg_size(attachment->dev); + struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv; + struct scatterlist *sgl; + struct sg_table *sgt; + dma_addr_t dma_addr; + unsigned int nents; + size_t offset; + int ret; + + dma_resv_assert_held(priv->dmabuf->resv); + + if (!attachment->peer2peer) + return ERR_PTR(-EPERM); + + if (!priv->revoked) + return ERR_PTR(-ENODEV); + + sgt = kzalloc(sizeof(*sgt), GFP_KERNEL); + if (!sgt) + return ERR_PTR(-ENOMEM); + + nents = DIV_ROUND_UP(priv->dmabuf->size, sgl_size); + ret = sg_alloc_table(sgt, nents, GFP_KERNEL); + if (ret) + goto err_kfree_sgt; + + /* + * Since the memory being mapped is a device memory it could never be in + * CPU caches. + */ + dma_addr = dma_map_resource( + attachment->dev, + pci_resource_start(priv->vdev->pdev, priv->index) + + priv->offset, + priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC); + ret = dma_mapping_error(attachment->dev, dma_addr); + if (ret) + goto err_free_sgt; + + /* + * Break the BAR's physical range up into max sized SGL's according to + * the device's requirement. 
+ */ + sgl = sgt->sgl; + for (offset = 0; offset != priv->dmabuf->size;) { + size_t chunk_size = min(priv->dmabuf->size - offset, sgl_size); + + sg_set_page(sgl, NULL, chunk_size, 0); + sg_dma_address(sgl) = dma_addr + offset; + sg_dma_len(sgl) = chunk_size; + sgl = sg_next(sgl); + offset += chunk_size; + } + + /* + * Because we are not going to include a CPU list we want to have some + * chance that other users will detect this by setting the orig_nents to + * 0 and using only nents (length of DMA list) when going over the sgl + */ + sgt->orig_nents = 0; + return sgt; + +err_free_sgt: + sg_free_table(sgt); +err_kfree_sgt: + kfree(sgt); + return ERR_PTR(ret); +} + +static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment, + struct sg_table *sgt, + enum dma_data_direction dir) +{ + struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv; + + dma_unmap_resource(attachment->dev, sg_dma_address(sgt->sgl), + priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC); + sg_free_table(sgt); + kfree(sgt); +} + +static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf) +{ + struct vfio_pci_dma_buf *priv = dmabuf->priv; + + /* + * Either this or vfio_pci_dma_buf_cleanup() will remove from the list. + * The refcount prevents both. + */ + if (priv->vdev) { + down_write(&priv->vdev->memory_lock); + list_del_init(&priv->dmabufs_elm); + up_write(&priv->vdev->memory_lock); + vfio_device_put(&priv->vdev->vdev); + } + kfree(priv); +} + +static const struct dma_buf_ops vfio_pci_dmabuf_ops = { + .attach = vfio_pci_dma_buf_attach, + .map_dma_buf = vfio_pci_dma_buf_map, + .pin = vfio_pci_dma_buf_pin, + .unpin = vfio_pci_dma_buf_unpin, + .release = vfio_pci_dma_buf_release, + .unmap_dma_buf = vfio_pci_dma_buf_unmap, +}; + +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf __user *arg, + size_t argsz) +{ + struct vfio_device_feature_dma_buf get_dma_buf; + DEFINE_DMA_BUF_EXPORT_INFO(exp_info); + struct vfio_pci_dma_buf *priv; + int ret; + + ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET, + sizeof(get_dma_buf)); + if (ret != 1) + return ret; + + if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf))) + return -EFAULT; + + /* For PCI the region_index is the BAR number like everything else */ + if (get_dma_buf.region_index >= VFIO_PCI_ROM_REGION_INDEX) + return -EINVAL; + + exp_info.ops = &vfio_pci_dmabuf_ops; + exp_info.size = pci_resource_len(vdev->pdev, get_dma_buf.region_index); + if (!exp_info.size) + return -EINVAL; + if (get_dma_buf.offset || get_dma_buf.length) { + if (get_dma_buf.length > exp_info.size || + get_dma_buf.offset >= exp_info.size || + get_dma_buf.length > exp_info.size - get_dma_buf.offset || + get_dma_buf.offset % PAGE_SIZE || + get_dma_buf.length % PAGE_SIZE) + return -EINVAL; + exp_info.size = get_dma_buf.length; + } + exp_info.flags = get_dma_buf.open_flags; + + priv = kzalloc(sizeof(*priv), GFP_KERNEL); + if (!priv) + return -ENOMEM; + INIT_LIST_HEAD(&priv->dmabufs_elm); + priv->offset = get_dma_buf.offset; + + exp_info.priv = priv; + priv->dmabuf = dma_buf_export(&exp_info); + if (IS_ERR(priv->dmabuf)) { + ret = PTR_ERR(priv->dmabuf); + kfree(priv); + return ret; + } + + /* dma_buf_put() now frees priv */ + + down_write(&vdev->memory_lock); + dma_resv_lock(priv->dmabuf->resv, NULL); + priv->revoked = __vfio_pci_memory_enabled(vdev); + priv->vdev = vdev; + vfio_device_get(&vdev->vdev); + list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs); + dma_resv_unlock(priv->dmabuf->resv); + 
up_write(&vdev->memory_lock); + + /* + * dma_buf_fd() consumes the reference, when the file closes the dmabuf + * will be released. + */ + return dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags); +} + +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) +{ + struct vfio_pci_dma_buf *priv; + struct vfio_pci_dma_buf *tmp; + + lockdep_assert_held_write(&vdev->memory_lock); + + list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) { + if (!dma_buf_try_get(priv->dmabuf)) + continue; + if (priv->revoked != revoked) { + dma_resv_lock(priv->dmabuf->resv, NULL); + priv->revoked = revoked; + dma_buf_move_notify(priv->dmabuf); + dma_resv_unlock(priv->dmabuf->resv); + } + dma_buf_put(priv->dmabuf); + } +} + +void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) +{ + struct vfio_pci_dma_buf *priv; + struct vfio_pci_dma_buf *tmp; + + down_write(&vdev->memory_lock); + list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) { + if (!dma_buf_try_get(priv->dmabuf)) + continue; + dma_resv_lock(priv->dmabuf->resv, NULL); + list_del_init(&priv->dmabufs_elm); + priv->vdev = NULL; + priv->revoked = true; + dma_buf_move_notify(priv->dmabuf); + dma_resv_unlock(priv->dmabuf->resv); + vfio_device_put(&vdev->vdev); + dma_buf_put(priv->dmabuf); + } + up_write(&vdev->memory_lock); +} diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h index 5b1cb9a9838442..c295a1166e7a9c 100644 --- a/drivers/vfio/pci/vfio_pci_priv.h +++ b/drivers/vfio/pci/vfio_pci_priv.h @@ -102,4 +102,27 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev) return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA; }
+#ifdef CONFIG_DMA_SHARED_BUFFER +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf __user *arg, + size_t argsz); +void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev); +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked); +#else +static int +vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf __user *arg, + size_t argsz) +{ + return -ENOTTY; +} +static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) +{ +} +static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, + bool revoked) +{ +} +#endif + #endif diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h index 9d18b832e61a0d..b57f4ecc2665e1 100644 --- a/include/linux/vfio_pci_core.h +++ b/include/linux/vfio_pci_core.h @@ -93,6 +93,7 @@ struct vfio_pci_core_device { struct mutex vma_lock; struct list_head vma_list; struct rw_semaphore memory_lock; + struct list_head dmabufs; };
/* Will be exported for vfio pci drivers usage */ diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 733a1cddde30a5..1dcfad6dcb6863 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -986,6 +986,24 @@ enum vfio_device_mig_state { VFIO_DEVICE_STATE_RUNNING_P2P = 5, };
+/** + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the + * region selected. + * + * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC, + * etc. offset/length specify a slice of the region to create the dmabuf from. + * If both are 0 then the whole region is used. + * + * Return: The fd number on success, -1 and errno is set on failure. + */ +struct vfio_device_feature_dma_buf { + __u32 region_index; + __u32 open_flags; + __u32 offset; + __u64 length; +}; +#define VFIO_DEVICE_FEATURE_DMA_BUF 3 + /* -------- API for Type1 VFIO IOMMU -------- */
/**
On Wed, Aug 17, 2022 at 7:11 PM Jason Gunthorpe jgg@nvidia.com wrote:
dma-buf has become a way to safely acquire a handle to non-struct page memory that can still have lifetime controlled by the exporter. Notably RDMA can now import dma-buf FDs and build them into MRs which allows for PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory from PCI device BARs.
The patch design loosely follows the pattern in commit db1a8dd916aa ("habanalabs: add support for dma-buf exporter") except this does not support pinning.
Instead, this implements what, in the past, we've called a revocable attachment using move. In normal situations the attachment is pinned, as a BAR does not change physical address. However when the VFIO device is closed, or a PCI reset is issued, access to the MMIO memory is revoked.
Revoked means that move occurs, but an attempt to immediately re-map the memory will fail. In the reset case a future move will be triggered when MMIO access returns. As both close and reset are under userspace control it is expected that userspace will suspend use of the dma-buf before doing these operations, the revoke is purely for kernel self-defense against a hostile userspace.
Signed-off-by: Jason Gunthorpe jgg@nvidia.com
drivers/vfio/pci/Makefile | 1 + drivers/vfio/pci/vfio_pci_config.c | 8 +- drivers/vfio/pci/vfio_pci_core.c | 28 ++- drivers/vfio/pci/vfio_pci_dma_buf.c | 265 ++++++++++++++++++++++++++++ drivers/vfio/pci/vfio_pci_priv.h | 23 +++ include/linux/vfio_pci_core.h | 1 + include/uapi/linux/vfio.h | 18 ++ 7 files changed, 336 insertions(+), 8 deletions(-) create mode 100644 drivers/vfio/pci/vfio_pci_dma_buf.c
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile index 24c524224da5a3..81006a157cde14 100644 --- a/drivers/vfio/pci/Makefile +++ b/drivers/vfio/pci/Makefile @@ -2,6 +2,7 @@
vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o +vfio-pci-core-$(CONFIG_DMA_SHARED_BUFFER) += vfio_pci_dma_buf.o obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
vfio-pci-y := vfio_pci.o diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index d22921efa25987..f8a9c24d04aeb7 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -584,10 +584,12 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos, virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY); new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
if (!new_mem)
if (!new_mem) { vfio_pci_zap_and_down_write_memory_lock(vdev);
else
vfio_pci_dma_buf_move(vdev, true);
} else { down_write(&vdev->memory_lock);
} /* * If the user is writing mem/io enable (new_mem/io) and we
@@ -622,6 +624,8 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos, *virt_cmd &= cpu_to_le16(~mask); *virt_cmd |= cpu_to_le16(new_cmd & mask);
if (__vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock); }
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index d13e22021860cc..206f159c480e42 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -504,6 +504,8 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev) vfio_spapr_pci_eeh_release(vdev->pdev); vfio_pci_core_disable(vdev);
vfio_pci_dma_buf_cleanup(vdev);
mutex_lock(&vdev->igate); if (vdev->err_trigger) { eventfd_ctx_put(vdev->err_trigger);
@@ -980,7 +982,10 @@ int vfio_pci_try_reset_function(struct vfio_pci_core_device *vdev) */ vfio_pci_set_power_state(vdev, PCI_D0);
vfio_pci_dma_buf_move(vdev, true); ret = pci_try_reset_function(vdev->pdev);
if (__vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock); return ret;
@@ -1210,11 +1215,10 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd, } EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
-static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
uuid_t __user *arg, size_t argsz)
+static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev,
u32 flags, uuid_t __user *arg,
size_t argsz)
{
struct vfio_pci_core_device *vdev =
container_of(device, struct vfio_pci_core_device, vdev); uuid_t uuid; int ret;
@@ -1241,9 +1245,14 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags, int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags, void __user *arg, size_t argsz) {
struct vfio_pci_core_device *vdev =
container_of(device, struct vfio_pci_core_device, vdev);
switch (flags & VFIO_DEVICE_FEATURE_MASK) { case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
return vfio_pci_core_feature_token(device, flags, arg, argsz);
return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
case VFIO_DEVICE_FEATURE_DMA_BUF:
return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz); default: return -ENOTTY; }
@@ -1881,6 +1890,7 @@ void vfio_pci_core_init_device(struct vfio_pci_core_device *vdev, INIT_LIST_HEAD(&vdev->vma_list); INIT_LIST_HEAD(&vdev->sriov_pfs_item); init_rwsem(&vdev->memory_lock);
INIT_LIST_HEAD(&vdev->dmabufs);
} EXPORT_SYMBOL_GPL(vfio_pci_core_init_device);
@@ -2227,11 +2237,17 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, * cause the PCI config space reset without restoring the original * state (saved locally in 'vdev->pm_save'). */
list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
vfio_pci_dma_buf_move(cur, true); vfio_pci_set_power_state(cur, PCI_D0);
} ret = pci_reset_bus(pdev);
list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
if (__vfio_pci_memory_enabled(cur))
vfio_pci_dma_buf_move(cur, false);
err_undo: list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) { if (cur == cur_mem) diff --git a/drivers/vfio/pci/vfio_pci_dma_buf.c b/drivers/vfio/pci/vfio_pci_dma_buf.c new file mode 100644 index 00000000000000..ac32405de5e48d --- /dev/null +++ b/drivers/vfio/pci/vfio_pci_dma_buf.c @@ -0,0 +1,265 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.
- */
+#include <linux/dma-buf.h> +#include <linux/pci-p2pdma.h> +#include <linux/dma-resv.h>
+#include "vfio_pci_priv.h"
+MODULE_IMPORT_NS(DMA_BUF);
+struct vfio_pci_dma_buf {
struct dma_buf *dmabuf;
struct vfio_pci_core_device *vdev;
struct list_head dmabufs_elm;
unsigned int index;
size_t offset;
bool revoked;
+};
+static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
struct dma_buf_attachment *attachment)
+{
struct vfio_pci_dma_buf *priv = dmabuf->priv;
int rc;
rc = pci_p2pdma_distance_many(priv->vdev->pdev, &attachment->dev, 1,
true);
if (rc < 0)
attachment->peer2peer = false;
return 0;
+}
+static void vfio_pci_dma_buf_unpin(struct dma_buf_attachment *attachment) +{ +}
+static int vfio_pci_dma_buf_pin(struct dma_buf_attachment *attachment) +{
/*
* Uses the dynamic interface but must always allow for
* dma_buf_move_notify() to do revoke
*/
return -EINVAL;
+}
+static struct sg_table * +vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
enum dma_data_direction dir)
+{
size_t sgl_size = dma_get_max_seg_size(attachment->dev);
struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
struct scatterlist *sgl;
struct sg_table *sgt;
dma_addr_t dma_addr;
unsigned int nents;
size_t offset;
int ret;
dma_resv_assert_held(priv->dmabuf->resv);
if (!attachment->peer2peer)
return ERR_PTR(-EPERM);
if (priv->revoked)
return ERR_PTR(-ENODEV);
sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
if (!sgt)
return ERR_PTR(-ENOMEM);
nents = DIV_ROUND_UP(priv->dmabuf->size, sgl_size);
ret = sg_alloc_table(sgt, nents, GFP_KERNEL);
if (ret)
goto err_kfree_sgt;
/*
* Since the memory being mapped is a device memory it could never be in
* CPU caches.
*/
dma_addr = dma_map_resource(
attachment->dev,
pci_resource_start(priv->vdev->pdev, priv->index) +
priv->offset,
priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
ret = dma_mapping_error(attachment->dev, dma_addr);
if (ret)
goto err_free_sgt;
/*
* Break the BAR's physical range up into max sized SGL's according to
* the device's requirement.
*/
sgl = sgt->sgl;
for (offset = 0; offset != priv->dmabuf->size;) {
size_t chunk_size = min(priv->dmabuf->size - offset, sgl_size);
sg_set_page(sgl, NULL, chunk_size, 0);
sg_dma_address(sgl) = dma_addr + offset;
sg_dma_len(sgl) = chunk_size;
sgl = sg_next(sgl);
offset += chunk_size;
}
/*
* Because we are not going to include a CPU list we want to have some
* chance that other users will detect this by setting the orig_nents to
* 0 and using only nents (length of DMA list) when going over the sgl
*/
sgt->orig_nents = 0;
return sgt;
+err_free_sgt:
sg_free_table(sgt);
+err_kfree_sgt:
kfree(sgt);
return ERR_PTR(ret);
+}
+static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
struct sg_table *sgt,
enum dma_data_direction dir)
+{
struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
dma_unmap_resource(attachment->dev, sg_dma_address(sgt->sgl),
priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
sg_free_table(sgt);
Before calling sg_free_table(), you need to restore the orig_nents as it is used in that function to free the allocated memory of the sgt.
kfree(sgt);
+}
+static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf) +{
struct vfio_pci_dma_buf *priv = dmabuf->priv;
/*
* Either this or vfio_pci_dma_buf_cleanup() will remove from the list.
* The refcount prevents both.
*/
if (priv->vdev) {
down_write(&priv->vdev->memory_lock);
list_del_init(&priv->dmabufs_elm);
up_write(&priv->vdev->memory_lock);
vfio_device_put(&priv->vdev->vdev);
}
kfree(priv);
+}
+static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
.attach = vfio_pci_dma_buf_attach,
.map_dma_buf = vfio_pci_dma_buf_map,
.pin = vfio_pci_dma_buf_pin,
.unpin = vfio_pci_dma_buf_unpin,
.release = vfio_pci_dma_buf_release,
.unmap_dma_buf = vfio_pci_dma_buf_unmap,
+};
+int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
struct vfio_device_feature_dma_buf __user *arg,
size_t argsz)
+{
struct vfio_device_feature_dma_buf get_dma_buf;
DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
struct vfio_pci_dma_buf *priv;
int ret;
ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
sizeof(get_dma_buf));
if (ret != 1)
return ret;
if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf)))
return -EFAULT;
/* For PCI the region_index is the BAR number like everything else */
if (get_dma_buf.region_index >= VFIO_PCI_ROM_REGION_INDEX)
return -EINVAL;
exp_info.ops = &vfio_pci_dmabuf_ops;
exp_info.size = pci_resource_len(vdev->pdev, get_dma_buf.region_index);
if (!exp_info.size)
return -EINVAL;
if (get_dma_buf.offset || get_dma_buf.length) {
if (get_dma_buf.length > exp_info.size ||
get_dma_buf.offset >= exp_info.size ||
get_dma_buf.length > exp_info.size - get_dma_buf.offset ||
get_dma_buf.offset % PAGE_SIZE ||
get_dma_buf.length % PAGE_SIZE)
return -EINVAL;
exp_info.size = get_dma_buf.length;
}
exp_info.flags = get_dma_buf.open_flags;
priv = kzalloc(sizeof(*priv), GFP_KERNEL);
if (!priv)
return -ENOMEM;
INIT_LIST_HEAD(&priv->dmabufs_elm);
priv->offset = get_dma_buf.offset;
exp_info.priv = priv;
priv->dmabuf = dma_buf_export(&exp_info);
if (IS_ERR(priv->dmabuf)) {
ret = PTR_ERR(priv->dmabuf);
kfree(priv);
return ret;
}
/* dma_buf_put() now frees priv */
down_write(&vdev->memory_lock);
dma_resv_lock(priv->dmabuf->resv, NULL);
priv->revoked = !__vfio_pci_memory_enabled(vdev);
priv->vdev = vdev;
vfio_device_get(&vdev->vdev);
list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
dma_resv_unlock(priv->dmabuf->resv);
up_write(&vdev->memory_lock);
/*
* dma_buf_fd() consumes the reference, when the file closes the dmabuf
* will be released.
*/
return dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags);
+}
+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) +{
struct vfio_pci_dma_buf *priv;
struct vfio_pci_dma_buf *tmp;
lockdep_assert_held_write(&vdev->memory_lock);
list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
if (!dma_buf_try_get(priv->dmabuf))
continue;
if (priv->revoked != revoked) {
dma_resv_lock(priv->dmabuf->resv, NULL);
priv->revoked = revoked;
dma_buf_move_notify(priv->dmabuf);
dma_resv_unlock(priv->dmabuf->resv);
}
dma_buf_put(priv->dmabuf);
}
+}
+void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) +{
struct vfio_pci_dma_buf *priv;
struct vfio_pci_dma_buf *tmp;
down_write(&vdev->memory_lock);
list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
if (!dma_buf_try_get(priv->dmabuf))
continue;
dma_resv_lock(priv->dmabuf->resv, NULL);
list_del_init(&priv->dmabufs_elm);
priv->vdev = NULL;
priv->revoked = true;
dma_buf_move_notify(priv->dmabuf);
dma_resv_unlock(priv->dmabuf->resv);
vfio_device_put(&vdev->vdev);
dma_buf_put(priv->dmabuf);
}
up_write(&vdev->memory_lock);
+} diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h index 5b1cb9a9838442..c295a1166e7a9c 100644 --- a/drivers/vfio/pci/vfio_pci_priv.h +++ b/drivers/vfio/pci/vfio_pci_priv.h @@ -102,4 +102,27 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev) return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA; }
+#ifdef CONFIG_DMA_SHARED_BUFFER +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
struct vfio_device_feature_dma_buf __user *arg,
size_t argsz);
+void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev); +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked); +#else +static int +vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
struct vfio_device_feature_dma_buf __user *arg,
size_t argsz)
+{
return -ENOTTY;
+} +static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) +{ +} +static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
bool revoked)
+{ +} +#endif
#endif diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h index 9d18b832e61a0d..b57f4ecc2665e1 100644 --- a/include/linux/vfio_pci_core.h +++ b/include/linux/vfio_pci_core.h @@ -93,6 +93,7 @@ struct vfio_pci_core_device { struct mutex vma_lock; struct list_head vma_list; struct rw_semaphore memory_lock;
struct list_head dmabufs;
};
/* Will be exported for vfio pci drivers usage */ diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 733a1cddde30a5..1dcfad6dcb6863 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -986,6 +986,24 @@ enum vfio_device_mig_state { VFIO_DEVICE_STATE_RUNNING_P2P = 5, };
+/**
* Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
* region selected.
*
* open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC,
* etc. offset/length specify a slice of the region to create the dmabuf from.
* If both are 0 then the whole region is used.
*
* Return: The fd number on success, -1 and errno is set on failure.
*/
+struct vfio_device_feature_dma_buf {
__u32 region_index;
__u32 open_flags;
__u32 offset;
__u64 length;
+}; +#define VFIO_DEVICE_FEATURE_DMA_BUF 3
/* -------- API for Type1 VFIO IOMMU -------- */
/**
2.37.2
On Sun, Aug 21, 2022 at 04:51:34PM +0300, Oded Gabbay wrote:
+static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
struct sg_table *sgt,
enum dma_data_direction dir)
+{
struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
dma_unmap_resource(attachment->dev, sg_dma_address(sgt->sgl),
priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
sg_free_table(sgt);
Before calling sg_free_table(), you need to restore the orig_nents as it is used in that function to free the allocated memory of the sgt.
Oops, right, thanks good catch
Jason
On Wed, Aug 17, 2022 at 01:11:42PM -0300, Jason Gunthorpe wrote:
dma-buf has become a way to safely acquire a handle to non-struct page memory that can still have lifetime controlled by the exporter. Notably RDMA can now import dma-buf FDs and build them into MRs which allows for PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory from PCI device BARs.
The patch design loosely follows the pattern in commit db1a8dd916aa ("habanalabs: add support for dma-buf exporter") except this does not support pinning.
Instead, this implements what, in the past, we've called a revocable attachment using move. In normal situations the attachment is pinned, as a BAR does not change physical address. However when the VFIO device is closed, or a PCI reset is issued, access to the MMIO memory is revoked.
Revoked means that move occurs, but an attempt to immediately re-map the memory will fail. In the reset case a future move will be triggered when MMIO access returns. As both close and reset are under userspace control it is expected that userspace will suspend use of the dma-buf before doing these operations, the revoke is purely for kernel self-defense against a hostile userspace.
Signed-off-by: Jason Gunthorpe jgg@nvidia.com
drivers/vfio/pci/Makefile | 1 + drivers/vfio/pci/vfio_pci_config.c | 8 +- drivers/vfio/pci/vfio_pci_core.c | 28 ++- drivers/vfio/pci/vfio_pci_dma_buf.c | 265 ++++++++++++++++++++++++++++ drivers/vfio/pci/vfio_pci_priv.h | 23 +++ include/linux/vfio_pci_core.h | 1 + include/uapi/linux/vfio.h | 18 ++ 7 files changed, 336 insertions(+), 8 deletions(-) create mode 100644 drivers/vfio/pci/vfio_pci_dma_buf.c
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile index 24c524224da5a3..81006a157cde14 100644 --- a/drivers/vfio/pci/Makefile +++ b/drivers/vfio/pci/Makefile @@ -2,6 +2,7 @@ vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o +vfio-pci-core-$(CONFIG_DMA_SHARED_BUFFER) += vfio_pci_dma_buf.o obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o vfio-pci-y := vfio_pci.o diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index d22921efa25987..f8a9c24d04aeb7 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -584,10 +584,12 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos, virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY); new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
if (!new_mem)
if (!new_mem) { vfio_pci_zap_and_down_write_memory_lock(vdev);
else
vfio_pci_dma_buf_move(vdev, true);
} else { down_write(&vdev->memory_lock);
}
/* * If the user is writing mem/io enable (new_mem/io) and we @@ -622,6 +624,8 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos, *virt_cmd &= cpu_to_le16(~mask); *virt_cmd |= cpu_to_le16(new_cmd & mask);
if (__vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock); }
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index d13e22021860cc..206f159c480e42 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -504,6 +504,8 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev) vfio_spapr_pci_eeh_release(vdev->pdev); vfio_pci_core_disable(vdev);
- vfio_pci_dma_buf_cleanup(vdev);
- mutex_lock(&vdev->igate); if (vdev->err_trigger) { eventfd_ctx_put(vdev->err_trigger);
@@ -980,7 +982,10 @@ int vfio_pci_try_reset_function(struct vfio_pci_core_device *vdev) */ vfio_pci_set_power_state(vdev, PCI_D0);
- vfio_pci_dma_buf_move(vdev, true); ret = pci_try_reset_function(vdev->pdev);
- if (__vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock);
return ret; @@ -1210,11 +1215,10 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd, } EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl); -static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
uuid_t __user *arg, size_t argsz)
+static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev,
u32 flags, uuid_t __user *arg,
size_t argsz)
{
- struct vfio_pci_core_device *vdev =
container_of(device, struct vfio_pci_core_device, vdev); uuid_t uuid; int ret;
@@ -1241,9 +1245,14 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags, int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags, void __user *arg, size_t argsz) {
- struct vfio_pci_core_device *vdev =
container_of(device, struct vfio_pci_core_device, vdev);
- switch (flags & VFIO_DEVICE_FEATURE_MASK) { case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
return vfio_pci_core_feature_token(device, flags, arg, argsz);
return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
- case VFIO_DEVICE_FEATURE_DMA_BUF:
return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz); default: return -ENOTTY; }
@@ -1881,6 +1890,7 @@ void vfio_pci_core_init_device(struct vfio_pci_core_device *vdev, INIT_LIST_HEAD(&vdev->vma_list); INIT_LIST_HEAD(&vdev->sriov_pfs_item); init_rwsem(&vdev->memory_lock);
- INIT_LIST_HEAD(&vdev->dmabufs);
} EXPORT_SYMBOL_GPL(vfio_pci_core_init_device); @@ -2227,11 +2237,17 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, * cause the PCI config space reset without restoring the original * state (saved locally in 'vdev->pm_save'). */
- list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
- list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
vfio_pci_dma_buf_move(cur, true); vfio_pci_set_power_state(cur, PCI_D0);
- }
ret = pci_reset_bus(pdev);
- list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
if (__vfio_pci_memory_enabled(cur))
vfio_pci_dma_buf_move(cur, false);
err_undo: list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) { if (cur == cur_mem) diff --git a/drivers/vfio/pci/vfio_pci_dma_buf.c b/drivers/vfio/pci/vfio_pci_dma_buf.c new file mode 100644 index 00000000000000..ac32405de5e48d --- /dev/null +++ b/drivers/vfio/pci/vfio_pci_dma_buf.c @@ -0,0 +1,265 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.
- */
+#include <linux/dma-buf.h> +#include <linux/pci-p2pdma.h> +#include <linux/dma-resv.h>
+#include "vfio_pci_priv.h"
+MODULE_IMPORT_NS(DMA_BUF);
+struct vfio_pci_dma_buf {
- struct dma_buf *dmabuf;
- struct vfio_pci_core_device *vdev;
- struct list_head dmabufs_elm;
- unsigned int index;
- size_t offset;
- bool revoked;
+};
+static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
struct dma_buf_attachment *attachment)
+{
- struct vfio_pci_dma_buf *priv = dmabuf->priv;
- int rc;
- rc = pci_p2pdma_distance_many(priv->vdev->pdev, &attachment->dev, 1,
true);
- if (rc < 0)
attachment->peer2peer = false;
- return 0;
+}
+static void vfio_pci_dma_buf_unpin(struct dma_buf_attachment *attachment) +{ +}
+static int vfio_pci_dma_buf_pin(struct dma_buf_attachment *attachment) +{
- /*
* Uses the dynamic interface but must always allow for
* dma_buf_move_notify() to do revoke
*/
- return -EINVAL;
+}
+static struct sg_table * +vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
enum dma_data_direction dir)
+{
- size_t sgl_size = dma_get_max_seg_size(attachment->dev);
- struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
- struct scatterlist *sgl;
- struct sg_table *sgt;
- dma_addr_t dma_addr;
- unsigned int nents;
- size_t offset;
- int ret;
- dma_resv_assert_held(priv->dmabuf->resv);
- if (!attachment->peer2peer)
return ERR_PTR(-EPERM);
- if (priv->revoked)
return ERR_PTR(-ENODEV);
- sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
- if (!sgt)
return ERR_PTR(-ENOMEM);
- nents = DIV_ROUND_UP(priv->dmabuf->size, sgl_size);
- ret = sg_alloc_table(sgt, nents, GFP_KERNEL);
- if (ret)
goto err_kfree_sgt;
- /*
* Since the memory being mapped is a device memory it could never be in
* CPU caches.
*/
- dma_addr = dma_map_resource(
attachment->dev,
pci_resource_start(priv->vdev->pdev, priv->index) +
priv->offset,
priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
dma_map_resource() maps the phys to an IOVA in the device's default_domain, which, however, may not be the domain that the device is currently attached to. So, will the importer of this sgt convert this dma_addr back to phys?
BTW, I don't see the assignment of priv->index in vfio_pci_core_feature_dma_buf() below; is it equal to get_dma_buf.region_index?
Thanks Yan
- ret = dma_mapping_error(attachment->dev, dma_addr);
- if (ret)
goto err_free_sgt;
- /*
* Break the BAR's physical range up into max sized SGL's according to
* the device's requirement.
*/
- sgl = sgt->sgl;
- for (offset = 0; offset != priv->dmabuf->size;) {
size_t chunk_size = min(priv->dmabuf->size - offset, sgl_size);
sg_set_page(sgl, NULL, chunk_size, 0);
sg_dma_address(sgl) = dma_addr + offset;
sg_dma_len(sgl) = chunk_size;
sgl = sg_next(sgl);
offset += chunk_size;
- }
- /*
* Because we are not going to include a CPU list we want to have some
* chance that other users will detect this by setting the orig_nents to
* 0 and using only nents (length of DMA list) when going over the sgl
*/
- sgt->orig_nents = 0;
- return sgt;
+err_free_sgt:
- sg_free_table(sgt);
+err_kfree_sgt:
- kfree(sgt);
- return ERR_PTR(ret);
+}
+static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
struct sg_table *sgt,
enum dma_data_direction dir)
+{
- struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
- dma_unmap_resource(attachment->dev, sg_dma_address(sgt->sgl),
priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
- sg_free_table(sgt);
- kfree(sgt);
+}
+static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf) +{
- struct vfio_pci_dma_buf *priv = dmabuf->priv;
- /*
* Either this or vfio_pci_dma_buf_cleanup() will remove from the list.
* The refcount prevents both.
*/
- if (priv->vdev) {
down_write(&priv->vdev->memory_lock);
list_del_init(&priv->dmabufs_elm);
up_write(&priv->vdev->memory_lock);
vfio_device_put(&priv->vdev->vdev);
- }
- kfree(priv);
+}
+static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
- .attach = vfio_pci_dma_buf_attach,
- .map_dma_buf = vfio_pci_dma_buf_map,
- .pin = vfio_pci_dma_buf_pin,
- .unpin = vfio_pci_dma_buf_unpin,
- .release = vfio_pci_dma_buf_release,
- .unmap_dma_buf = vfio_pci_dma_buf_unmap,
+};
+int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
struct vfio_device_feature_dma_buf __user *arg,
size_t argsz)
+{
- struct vfio_device_feature_dma_buf get_dma_buf;
- DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
- struct vfio_pci_dma_buf *priv;
- int ret;
- ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
sizeof(get_dma_buf));
- if (ret != 1)
return ret;
- if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf)))
return -EFAULT;
- /* For PCI the region_index is the BAR number like everything else */
- if (get_dma_buf.region_index >= VFIO_PCI_ROM_REGION_INDEX)
return -EINVAL;
- exp_info.ops = &vfio_pci_dmabuf_ops;
- exp_info.size = pci_resource_len(vdev->pdev, get_dma_buf.region_index);
- if (!exp_info.size)
return -EINVAL;
- if (get_dma_buf.offset || get_dma_buf.length) {
if (get_dma_buf.length > exp_info.size ||
get_dma_buf.offset >= exp_info.size ||
get_dma_buf.length > exp_info.size - get_dma_buf.offset ||
get_dma_buf.offset % PAGE_SIZE ||
get_dma_buf.length % PAGE_SIZE)
return -EINVAL;
exp_info.size = get_dma_buf.length;
- }
- exp_info.flags = get_dma_buf.open_flags;
- priv = kzalloc(sizeof(*priv), GFP_KERNEL);
- if (!priv)
return -ENOMEM;
- INIT_LIST_HEAD(&priv->dmabufs_elm);
- priv->offset = get_dma_buf.offset;
- exp_info.priv = priv;
- priv->dmabuf = dma_buf_export(&exp_info);
- if (IS_ERR(priv->dmabuf)) {
ret = PTR_ERR(priv->dmabuf);
kfree(priv);
return ret;
- }
- /* dma_buf_put() now frees priv */
- down_write(&vdev->memory_lock);
- dma_resv_lock(priv->dmabuf->resv, NULL);
- priv->revoked = !__vfio_pci_memory_enabled(vdev);
- priv->vdev = vdev;
- vfio_device_get(&vdev->vdev);
- list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
- dma_resv_unlock(priv->dmabuf->resv);
- up_write(&vdev->memory_lock);
- /*
* dma_buf_fd() consumes the reference, when the file closes the dmabuf
* will be released.
*/
- return dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags);
+}
+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) +{
- struct vfio_pci_dma_buf *priv;
- struct vfio_pci_dma_buf *tmp;
- lockdep_assert_held_write(&vdev->memory_lock);
- list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
if (!dma_buf_try_get(priv->dmabuf))
continue;
if (priv->revoked != revoked) {
dma_resv_lock(priv->dmabuf->resv, NULL);
priv->revoked = revoked;
dma_buf_move_notify(priv->dmabuf);
dma_resv_unlock(priv->dmabuf->resv);
}
dma_buf_put(priv->dmabuf);
- }
+}
+void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) +{
- struct vfio_pci_dma_buf *priv;
- struct vfio_pci_dma_buf *tmp;
- down_write(&vdev->memory_lock);
- list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
if (!dma_buf_try_get(priv->dmabuf))
continue;
dma_resv_lock(priv->dmabuf->resv, NULL);
list_del_init(&priv->dmabufs_elm);
priv->vdev = NULL;
priv->revoked = true;
dma_buf_move_notify(priv->dmabuf);
dma_resv_unlock(priv->dmabuf->resv);
vfio_device_put(&vdev->vdev);
dma_buf_put(priv->dmabuf);
- }
- up_write(&vdev->memory_lock);
+} diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h index 5b1cb9a9838442..c295a1166e7a9c 100644 --- a/drivers/vfio/pci/vfio_pci_priv.h +++ b/drivers/vfio/pci/vfio_pci_priv.h @@ -102,4 +102,27 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev) return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA; } +#ifdef CONFIG_DMA_SHARED_BUFFER +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
struct vfio_device_feature_dma_buf __user *arg,
size_t argsz);
+void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev); +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked); +#else +static int +vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
struct vfio_device_feature_dma_buf __user *arg,
size_t argsz)
+{
- return -ENOTTY;
+} +static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) +{ +} +static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
bool revoked)
+{ +} +#endif
#endif diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h index 9d18b832e61a0d..b57f4ecc2665e1 100644 --- a/include/linux/vfio_pci_core.h +++ b/include/linux/vfio_pci_core.h @@ -93,6 +93,7 @@ struct vfio_pci_core_device { struct mutex vma_lock; struct list_head vma_list; struct rw_semaphore memory_lock;
- struct list_head dmabufs;
}; /* Will be exported for vfio pci drivers usage */ diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 733a1cddde30a5..1dcfad6dcb6863 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -986,6 +986,24 @@ enum vfio_device_mig_state { VFIO_DEVICE_STATE_RUNNING_P2P = 5, }; +/**
- Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
- region selected.
- open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC,
- etc. offset/length specify a slice of the region to create the dmabuf from.
- If both are 0 then the whole region is used.
- Return: The fd number on success, -1 and errno is set on failure.
- */
+struct vfio_device_feature_dma_buf {
- __u32 region_index;
- __u32 open_flags;
- __u32 offset;
- __u64 length;
+}; +#define VFIO_DEVICE_FEATURE_DMA_BUF 3
/* -------- API for Type1 VFIO IOMMU -------- */ /** -- 2.37.2
On Mon, Aug 29, 2022 at 01:04:05PM +0800, Yan Zhao wrote:
+static struct sg_table * +vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
enum dma_data_direction dir)
+{
- size_t sgl_size = dma_get_max_seg_size(attachment->dev);
- struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
- struct scatterlist *sgl;
- struct sg_table *sgt;
- dma_addr_t dma_addr;
- unsigned int nents;
- size_t offset;
- int ret;
- dma_resv_assert_held(priv->dmabuf->resv);
- if (!attachment->peer2peer)
return ERR_PTR(-EPERM);
- if (priv->revoked)
return ERR_PTR(-ENODEV);
- sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
- if (!sgt)
return ERR_PTR(-ENOMEM);
- nents = DIV_ROUND_UP(priv->dmabuf->size, sgl_size);
- ret = sg_alloc_table(sgt, nents, GFP_KERNEL);
- if (ret)
goto err_kfree_sgt;
- /*
* Since the memory being mapped is a device memory it could never be in
* CPU caches.
*/
- dma_addr = dma_map_resource(
attachment->dev,
pci_resource_start(priv->vdev->pdev, priv->index) +
priv->offset,
priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
dma_map_resource() maps the phys to an IOVA in the device's default_domain, which, however, may not be the domain that the device is currently attached to.
dmabuf is defined only in terms of the DMA API, so it creates a SGL with DMA API mapped dma_addr's
So, will the importer of this sgt convert this dma_addr back to phys?
No, it is illegal to call this API if the importer is not using the DMA API.
I'm expecting to propose some new dmabuf API for this use-case, when we get to it. For now all importers use the DMA API.
BTW, I don't see the assignment of priv->index in vfio_pci_core_feature_dma_buf() below; is it equal to get_dma_buf.region_index?
Yes, I noticed that too and have fixed it - the testing was all done on index 0.
Jason
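For reference, the missing priv->index assignment presumably ends up as a one-liner in vfio_pci_core_feature_dma_buf(), next to the existing offset assignment (a sketch, not a quote of the updated patch):

	INIT_LIST_HEAD(&priv->dmabufs_elm);
	priv->index = get_dma_buf.region_index;
	priv->offset = get_dma_buf.offset;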
On 17.08.22 at 18:11, Jason Gunthorpe wrote:
dma-buf has become a way to safely acquire a handle to non-struct page memory that can still have lifetime controlled by the exporter. Notably RDMA can now import dma-buf FDs and build them into MRs which allows for PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory from PCI device BARs.
This series supports a use case for SPDK where an NVMe device will be owned by SPDK through VFIO but interacting with an RDMA device. The RDMA device may directly access the NVMe CMB or directly manipulate the NVMe device's doorbell using PCI P2P.
However, as a general mechanism, it can support many other scenarios with VFIO. I imagine this dmabuf approach to be usable by iommufd as well for generic and safe P2P mappings.
In general looks good to me, but we really need to get away from using sg_tables for this here.
The only thing I'm not 100% convinced of is dma_buf_try_get(), I've seen this incorrectly used so many times that I can't count them any more.
Would that be somehow avoidable? Or could you at least explain the use case a bit better.
Thanks, Christian.
On Thu, Aug 18, 2022 at 01:07:16PM +0200, Christian König wrote:
Am 17.08.22 um 18:11 schrieb Jason Gunthorpe:
dma-buf has become a way to safely acquire a handle to non-struct page memory that can still have lifetime controlled by the exporter. Notably RDMA can now import dma-buf FDs and build them into MRs which allows for PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory from PCI device BARs.
This series supports a use case for SPDK where a NVMe device will be owned by SPDK through VFIO but interacting with a RDMA device. The RDMA device may directly access the NVMe CMB or directly manipulate the NVMe device's doorbell using PCI P2P.
However, as a general mechanism, it can support many other scenarios with VFIO. I imagine this dmabuf approach to be usable by iommufd as well for generic and safe P2P mappings.
In general looks good to me, but we really need to get away from using sg_tables for this here.
The only thing I'm not 100% convinced of is dma_buf_try_get(), I've seen this incorrectly used so many times that I can't count them any more.
Would that be somehow avoidable? Or could you at least explain the use case a bit better.
I didn't see a way, maybe you know of one
VFIO needs to maintain a list of dmabuf FDs that have been created by the user attached to each vfio_device:
int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
				  struct vfio_device_feature_dma_buf __user *arg,
				  size_t argsz)
{
	down_write(&vdev->memory_lock);
	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
	up_write(&vdev->memory_lock);
And dmabuf FD's are removed from the list when the user closes the FD:
static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
{
	down_write(&priv->vdev->memory_lock);
	list_del_init(&priv->dmabufs_elm);
	up_write(&priv->vdev->memory_lock);
Which then poses the problem: How do you iterate over only dma_buf's that are still alive to execute move?
This seems necessary as parts of the dma_buf have already been destroyed by the time the user's release function is called.
Which I solved like this:
	down_write(&vdev->memory_lock);
	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
		if (!dma_buf_try_get(priv->dmabuf))
			continue;
So the scenarios resolve as:

- Concurrent release is not in progress: dma_buf_try_get() succeeds and prevents concurrent release from starting
- Release has started but not reached its memory_lock: dma_buf_try_get() fails
- Release has started but passed its memory_lock: dmabuf is not on the list so dma_buf_try_get() is not called.
Jason
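Pulling the quoted fragments together, the revoke walk under discussion would look roughly like the sketch below. This is a reconstruction for illustration only; the function name vfio_pci_dma_buf_move() and the exact layout of struct vfio_pci_dma_buf are assumptions based on the snippets above.

static void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
				  bool revoked)
{
	struct vfio_pci_dma_buf *priv;
	struct vfio_pci_dma_buf *tmp;

	lockdep_assert_held_write(&vdev->memory_lock);

	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
		if (!dma_buf_try_get(priv->dmabuf))
			continue;	/* release already in progress */
		if (priv->revoked != revoked) {
			dma_resv_lock(priv->dmabuf->resv, NULL);
			priv->revoked = revoked;
			dma_buf_move_notify(priv->dmabuf);
			dma_resv_unlock(priv->dmabuf->resv);
		}
		dma_buf_put(priv->dmabuf);
	}
}

The caller already holds memory_lock for write, which is what serializes this walk against vfio_pci_dma_buf_release() taking the dmabuf off the list.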
Am 18.08.22 um 14:03 schrieb Jason Gunthorpe:
On Thu, Aug 18, 2022 at 01:07:16PM +0200, Christian König wrote:
Am 17.08.22 um 18:11 schrieb Jason Gunthorpe:
dma-buf has become a way to safely acquire a handle to non-struct page memory that can still have lifetime controlled by the exporter. Notably RDMA can now import dma-buf FDs and build them into MRs which allows for PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory from PCI device BARs.
This series supports a use case for SPDK where a NVMe device will be owned by SPDK through VFIO but interacting with a RDMA device. The RDMA device may directly access the NVMe CMB or directly manipulate the NVMe device's doorbell using PCI P2P.
However, as a general mechanism, it can support many other scenarios with VFIO. I imagine this dmabuf approach to be usable by iommufd as well for generic and safe P2P mappings.
In general looks good to me, but we really need to get away from using sg_tables for this here.
The only thing I'm not 100% convinced of is dma_buf_try_get(), I've seen this incorrectly used so many times that I can't count them any more.
Would that be somehow avoidable? Or could you at least explain the use case a bit better.
I didn't see a way, maybe you know of one
For GEM objects we usually don't use the reference count of the DMA-buf, but rather that of the GEM object for this. But that's not an ideal solution either.
VFIO needs to maintain a list of dmabuf FDs that have been created by the user attached to each vfio_device:
int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
				  struct vfio_device_feature_dma_buf __user *arg,
				  size_t argsz)
{
	down_write(&vdev->memory_lock);
	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
	up_write(&vdev->memory_lock);
And dmabuf FD's are removed from the list when the user closes the FD:
static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
{
	down_write(&priv->vdev->memory_lock);
	list_del_init(&priv->dmabufs_elm);
	up_write(&priv->vdev->memory_lock);
Which then poses the problem: How do you iterate over only dma_buf's that are still alive to execute move?
This seems necessary as parts of the dma_buf have already been destroyed by the time the user's release function is called.
Which I solved like this:
	down_write(&vdev->memory_lock);
	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
		if (!dma_buf_try_get(priv->dmabuf))
			continue;
What would happen if you don't skip destroyed dma-bufs here? In other words why do you maintain that list in the first place?
Regards, Christian.
So the scenarios resolve as:
- Concurrent release is not in progress: dma_buf_try_get() succeeds and prevents concurrent release from starting
- Release has started but not reached its memory_lock: dma_buf_try_get() fails
- Release has started but passed its memory_lock: dmabuf is not on the list so dma_buf_try_get() is not called.
Jason
On Thu, Aug 18, 2022 at 02:58:10PM +0200, Christian König wrote:
The only thing I'm not 100% convinced of is dma_buf_try_get(), I've seen this incorrectly used so many times that I can't count them any more.
Would that be somehow avoidable? Or could you at least explain the use case a bit better.
I didn't see a way, maybe you know of one
For GEM objects we usually don't use the reference count of the DMA-buf, but rather that of the GEM object for this. But that's not an ideal solution either.
You can't really ignore the dmabuf refcount. At some point you have to deal with the dmabuf being asynchronously released by userspace.
	down_write(&vdev->memory_lock);
	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
		if (!dma_buf_try_get(priv->dmabuf))
			continue;
What would happen if you don't skip destroyed dma-bufs here? In other words why do you maintain that list in the first place?
The list is to keep track of the dmabufs that were created, it is not optional.
The only question is what do you do about invoking dma_buf_move_notify() on a dmabuf that is already undergoing destruction.
For instance undergoing destruction means the dmabuf core has already done this:
	mutex_lock(&db_list.lock);
	list_del(&dmabuf->list_node);
	mutex_unlock(&db_list.lock);
	dma_buf_stats_teardown(dmabuf);
So it seems non-ideal to continue to use it.
However, dma_buf_move_notify() specifically has no issue with that level of destruction since it only does
list_for_each_entry(attach, &dmabuf->attachments, node)
And attachments must be empty if the file refcount is zero.
So we could delete the try_get and just rely on move being safe on partially destroyed dma_buf's as part of the API design.
Jason
Am 18.08.22 um 15:16 schrieb Jason Gunthorpe:
On Thu, Aug 18, 2022 at 02:58:10PM +0200, Christian König wrote:
The only thing I'm not 100% convinced of is dma_buf_try_get(), I've seen this incorrectly used so many times that I can't count them any more.
Would that be somehow avoidable? Or could you at least explain the use case a bit better.
I didn't see a way, maybe you know of one
For GEM objects we usually don't use the reference count of the DMA-buf, but rather that of the GEM object for this. But that's not an ideal solution either.
You can't really ignore the dmabuf refcount. At some point you have to deal with the dmabuf being asynchronously released by userspace.
Yeah, but in this case the dma-buf is just a reference to the real/private object which holds the backing store.
When the dma-buf is released you drop the real object reference, and from your driver internals you only try_get the real object.
The advantage is that only your driver can use the try_get function and not some importing driver which doesn't know about the internals of the exporter.
We've just had too many cases where developers weren't sure if a pointer was still valid, and by using try_get it just "magically" started working (well, I have to admit it made the crashing less likely...).
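As an illustration of that GEM-style layering (all names here are hypothetical, not taken from any driver): the dma-buf holds a strong reference on an exporter-private object that owns the backing store, and the exporter's internal bookkeeping only ever try_gets that private object, never the dma-buf itself.

#include <linux/dma-buf.h>
#include <linux/kref.h>
#include <linux/slab.h>

/* Hypothetical exporter-private object that owns the backing store. */
struct my_exporter_obj {
	struct kref kref;
	struct dma_buf *dmabuf;		/* dmabuf->priv points back at this */
};

static void my_exporter_obj_free(struct kref *kref)
{
	kfree(container_of(kref, struct my_exporter_obj, kref));
}

/* Internal lists upgrade a weak reference on the private object only. */
static struct my_exporter_obj *my_obj_try_get(struct my_exporter_obj *obj)
{
	return kref_get_unless_zero(&obj->kref) ? obj : NULL;
}

/* Wired into dma_buf_ops.release: the dma-buf's reference on the private
 * object is dropped when userspace closes the last fd. */
static void my_dmabuf_release(struct dma_buf *dmabuf)
{
	struct my_exporter_obj *obj = dmabuf->priv;

	kref_put(&obj->kref, my_exporter_obj_free);
}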
	down_write(&vdev->memory_lock);
	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
		if (!dma_buf_try_get(priv->dmabuf))
			continue;
What would happen if you don't skip destroyed dma-bufs here? In other words why do you maintain that list in the first place?
The list is to keep track of the dmabufs that were created, it is not optional.
The only question is what do you do about invoking dma_buf_move_notify() on a dmabuf that is already undergoing destruction.
Ah, yes. Really good point.
For instance undergoing destruction means the dmabuf core has already done this:
	mutex_lock(&db_list.lock);
	list_del(&dmabuf->list_node);
	mutex_unlock(&db_list.lock);
	dma_buf_stats_teardown(dmabuf);
So it seems non-ideal to continue to use it.
However, dma_buf_move_notify() specifically has no issue with that level of destruction since it only does
list_for_each_entry(attach, &dmabuf->attachments, node)
And attachments must be empty if the file refcount is zero.
So we could delete the try_get and just rely on move being safe on partially destroyed dma_buf's as part of the API design.
I think that might be the more defensive approach. A comment on the dma_buf_move_notify() function should probably be a good idea.
Thanks, Christian.
Jason
On Thu, Aug 18, 2022 at 03:37:01PM +0200, Christian König wrote:
Am 18.08.22 um 15:16 schrieb Jason Gunthorpe:
On Thu, Aug 18, 2022 at 02:58:10PM +0200, Christian König wrote:
The only thing I'm not 100% convinced of is dma_buf_try_get(), I've seen this incorrectly used so many times that I can't count them any more.
Would that be somehow avoidable? Or could you at least explain the use case a bit better.
I didn't see a way, maybe you know of one
For GEM objects we usually don't use the reference count of the DMA-buf, but rather that of the GEM object for this. But that's not an ideal solution either.
You can't really ignore the dmabuf refcount. At some point you have to deal with the dmabuf being asynchronously released by userspace.
Yeah, but in this case the dma-buf is just a reference to the real/private object which holds the backing store.
The gem approach is backwards to what I did here.
GEM holds a singleton pointer to the dmabuf and holds a reference on it as long as it has the pointer. This means the dmabuf can not be freed until the GEM object is freed.
For this I held a "weak reference" on the dmabuf in a list, and we convert the weak reference to a strong reference in the usual way using a try_get.
The reason it is different is that the VFIO interface allows creating a DMABUF with unique parameters on every user request. Eg the user can select a BAR index and a slice of the MMIO space unique to each request, and this results in a unique DMABUF.
Due to this we have to store a list of DMABUFs and we need the DMABUF's to clean up their memory when the user closes the file.
So we could delete the try_get and just rely on move being safe on partially destroyed dma_buf's as part of the API design.
I think that might be the more defensive approach. A comment on the dma_buf_move_notify() function should probably be a good idea.
IMHO, it is an anti-pattern. The caller should hold a strong reference on an object before invoking any API surface. Upgrading a weak reference to a strong reference requires the standard "try get" API.
But if you feel strongly I don't mind dropping the try_get around move.
Jason
Am 19.08.22 um 15:11 schrieb Jason Gunthorpe:
On Thu, Aug 18, 2022 at 03:37:01PM +0200, Christian König wrote:
Am 18.08.22 um 15:16 schrieb Jason Gunthorpe:
On Thu, Aug 18, 2022 at 02:58:10PM +0200, Christian König wrote:
The only thing I'm not 100% convinced of is dma_buf_try_get(), I've seen this incorrectly used so many times that I can't count them any more.
Would that be somehow avoidable? Or could you at least explain the use case a bit better.
I didn't see a way, maybe you know of one
For GEM objects we usually don't use the reference count of the DMA-buf, but rather that of the GEM object for this. But that's not an ideal solution either.
You can't really ignore the dmabuf refcount. At some point you have to deal with the dmabuf being asynchronously released by userspace.
Yeah, but in this case the dma-buf is just a reference to the real/private object which holds the backing store.
The gem approach is backwards to what I did here.
As I said, what GEM does is not necessary the best approach either.
GEM holds a singleton pointer to the dmabuf and holds a reference on it as long as it has the pointer. This means the dmabuf can not be freed until the GEM object is freed.
For this I held a "weak reference" on the dmabuf in a list, and we convert the weak reference to a strong reference in the usual way using a try_get.
The reason it is different is that the VFIO interface allows creating a DMABUF with unique parameters on every user request. Eg the user can select a BAR index and a slice of the MMIO space unique to each request, and this results in a unique DMABUF.
Due to this we have to store a list of DMABUFs and we need the DMABUF's to clean up their memory when the user closes the file.
Yeah, that makes sense.
So we could delete the try_get and just rely on move being safe on partially destroyed dma_buf's as part of the API design.
I think that might be the more defensive approach. A comment on the dma_buf_move_notify() function should probably be a good idea.
IMHO, it is an anti-pattern. The caller should hold a strong reference on an object before invoking any API surface. Upgrading a weak reference to a strong reference requires the standard "try get" API.
But if you feel strongly I don't mind dropping the try_get around move.
Well, I also see that both approaches are not ideal, but my gut feeling tells me that just documenting that dma_buf_move_notify() can still be called as long as the release callback hasn't been called yet is probably the better approach.
On the other hand this is really just a gut feeling without strong arguments backing it. So if somebody has an argument which makes try_get necessary I'm happy to hear it.
Regards, Christian.
Jason
On Fri, Aug 19, 2022 at 03:33:04PM +0200, Christian König wrote:
So we could delete the try_get and just rely on move being safe on partially destroyed dma_buf's as part of the API design.
I think that might be the more defensive approach. A comment on the dma_buf_move_notify() function should probably be a good idea.
IMHO, it is an anti-pattern. The caller should hold a strong reference on an object before invoking any API surface. Upgrading a weak reference to a strong reference requires the standard "try get" API.
But if you feel strongly I don't mind dropping the try_get around move.
Well, I also see that both approaches are not ideal, but my gut feeling tells me that just documenting that dma_buf_move_notify() can still be called as long as the release callback hasn't been called yet is probably the better approach.
The comment would say something like:
"dma_resv_lock(), dma_buf_move_notify(), dma_resv_unlock() may be called with a 0 refcount so long as ops->release() hasn't returned"
Which is a really abnormal API design, IMHO.
Jason
Am 19.08.22 um 15:39 schrieb Jason Gunthorpe:
On Fri, Aug 19, 2022 at 03:33:04PM +0200, Christian König wrote:
So we could delete the try_get and just rely on move being safe on partially destroyed dma_buf's as part of the API design.
I think that might be the more defensive approach. A comment on the dma_buf_move_notify() function should probably be a good idea.
IMHO, it is an anti-pattern. The caller should hold a strong reference on an object before invoking any API surface. Upgrading a weak reference to a strong reference requires the standard "try get" API.
But if you feel strongly I don't mind dropping the try_get around move.
Well, I also see that both approaches are not ideal, but my gut feeling tells me that just documenting that dma_buf_move_notify() can still be called as long as the release callback hasn't been called yet is probably the better approach.
The comment would say something like:
"dma_resv_lock(), dma_buf_move_notify(), dma_resv_unlock() may be called with a 0 refcount so long as ops->release() hasn't returned"
Which is a really abnormal API design, IMHO.
Mhm, Daniel or others, do you have any opinion on that as well?
Thanks, Christian.
Jason
On Wed, Aug 17, 2022 at 01:11:38PM -0300, Jason Gunthorpe wrote:
dma-buf has become a way to safely acquire a handle to non-struct page memory that can still have lifetime controlled by the exporter. Notably RDMA can now import dma-buf FDs and build them into MRs which allows for PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory from PCI device BARs.
This series supports a use case for SPDK where a NVMe device will be owned by SPDK through VFIO but interacting with a RDMA device. The RDMA device may directly access the NVMe CMB or directly manipulate the NVMe device's doorbell using PCI P2P.
However, as a general mechanism, it can support many other scenarios with VFIO. I imagine this dmabuf approach to be usable by iommufd as well for generic and safe P2P mappings.
This series goes after the "Break up ioctl dispatch functions to one function per ioctl" series.
This is on github: https://github.com/jgunthorpe/linux/commits/vfio_dma_buf
Jason Gunthorpe (4): dma-buf: Add dma_buf_try_get() vfio: Add vfio_device_get() vfio_pci: Do not open code pci_try_reset_function() vfio/pci: Allow MMIO regions to be exported through dma-buf
drivers/vfio/pci/Makefile | 1 + drivers/vfio/pci/vfio_pci_config.c | 22 ++- drivers/vfio/pci/vfio_pci_core.c | 33 +++- drivers/vfio/pci/vfio_pci_dma_buf.c | 265 ++++++++++++++++++++++++++++
I forgot about this..
Alex, do you want to start doing as Linus discussed, and I will rename this new file to "dma_buf.c"?
Or keep this directory as having the vfio_pci_* prefix for consistency?
Jason
On Thu, 18 Aug 2022 09:05:24 -0300 Jason Gunthorpe jgg@nvidia.com wrote:
On Wed, Aug 17, 2022 at 01:11:38PM -0300, Jason Gunthorpe wrote:
dma-buf has become a way to safely acquire a handle to non-struct page memory that can still have lifetime controlled by the exporter. Notably RDMA can now import dma-buf FDs and build them into MRs which allows for PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory from PCI device BARs.
This series supports a use case for SPDK where a NVMe device will be owned by SPDK through VFIO but interacting with a RDMA device. The RDMA device may directly access the NVMe CMB or directly manipulate the NVMe device's doorbell using PCI P2P.
However, as a general mechanism, it can support many other scenarios with VFIO. I imagine this dmabuf approach to be usable by iommufd as well for generic and safe P2P mappings.
This series goes after the "Break up ioctl dispatch functions to one function per ioctl" series.
This is on github: https://github.com/jgunthorpe/linux/commits/vfio_dma_buf
Jason Gunthorpe (4): dma-buf: Add dma_buf_try_get() vfio: Add vfio_device_get() vfio_pci: Do not open code pci_try_reset_function() vfio/pci: Allow MMIO regions to be exported through dma-buf
drivers/vfio/pci/Makefile | 1 + drivers/vfio/pci/vfio_pci_config.c | 22 ++- drivers/vfio/pci/vfio_pci_core.c | 33 +++- drivers/vfio/pci/vfio_pci_dma_buf.c | 265 ++++++++++++++++++++++++++++
I forgot about this..
Alex, do you want to start doing as Linus discussed, and I will rename this new file to "dma_buf.c"?
Or keep this directory as having the vfio_pci_* prefix for consistency?
I have a hard time generating a strong opinion over file name redundancy relative to directory structure. By my count, over 17% of files in drivers/ have some file name redundancy to their parent directory structure (up to two levels). I see we already have two $FOO_dma_buf.c files in the tree, virtio and amdgpu among these. In the virtio case this is somewhat justified, to me at least, as the virtio_dma_buf.h file exists in a shared include namespace. However, this justification only accounts for about 1% of such files, for many others the redundancy exists in the include path as well.
I guess if we don't have a reason other than naming consistency and accept an end goal to incrementally remove file name vs directory structure redundancy where it makes sense, sure, name it dma_buf.c. Ugh. Thanks,
Alex