dma-buf has become a way to safely acquire a handle to non-struct page memory that can still have lifetime controlled by the exporter. Notably RDMA can now import dma-buf FDs and build them into MRs which allows for PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory from PCI device BARs.
This series supports a use case for SPDK where a NVMe device will be owned by SPDK through VFIO but interacting with a RDMA device. The RDMA device may directly access the NVMe CMB or directly manipulate the NVMe device's doorbell using PCI P2P.
However, as a general mechanism, it can support many other scenarios with VFIO. I imagine this dmabuf approach to be usable by iommufd as well for generic and safe P2P mappings.
This series goes after the "Break up ioctl dispatch functions to one function per ioctl" series.
This is on github: https://github.com/jgunthorpe/linux/commits/vfio_dma_buf
v2: - Name the new file dma_buf.c - Restore orig_nents before freeing - Fix reversed logic around priv->revoked - Set priv->index - Rebased on v2 "Break up ioctl dispatch functions" v1: https://lore.kernel.org/r/0-v1-9e6e1739ed95+5fa-vfio_dma_buf_jgg@nvidia.com
Cc: linux-rdma@vger.kernel.org Cc: Oded Gabbay ogabbay@kernel.org Cc: Christian König christian.koenig@amd.com Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: Leon Romanovsky leon@kernel.org Cc: Maor Gottlieb maorg@nvidia.com Cc: dri-devel@lists.freedesktop.org Signed-off-by: Jason Gunthorpe jgg@nvidia.com
Jason Gunthorpe (4): dma-buf: Add dma_buf_try_get() vfio: Add vfio_device_get() vfio_pci: Do not open code pci_try_reset_function() vfio/pci: Allow MMIO regions to be exported through dma-buf
drivers/vfio/pci/Makefile | 1 + drivers/vfio/pci/dma_buf.c | 269 +++++++++++++++++++++++++++++ drivers/vfio/pci/vfio_pci_config.c | 22 ++- drivers/vfio/pci/vfio_pci_core.c | 33 +++- drivers/vfio/pci/vfio_pci_priv.h | 24 +++ drivers/vfio/vfio_main.c | 3 +- include/linux/dma-buf.h | 13 ++ include/linux/vfio.h | 6 + include/linux/vfio_pci_core.h | 1 + include/uapi/linux/vfio.h | 18 ++ 10 files changed, 368 insertions(+), 22 deletions(-) create mode 100644 drivers/vfio/pci/dma_buf.c
base-commit: 285fef0ff7f1a97d8acd380971c061985d8dafb5
To increment a reference the caller already holds. Export vfio_device_put() to pair with it.
Signed-off-by: Jason Gunthorpe jgg@nvidia.com --- drivers/vfio/vfio_main.c | 3 ++- include/linux/vfio.h | 6 ++++++ 2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c index eb714a484662fc..dadbb5a7bb4c5b 100644 --- a/drivers/vfio/vfio_main.c +++ b/drivers/vfio/vfio_main.c @@ -451,11 +451,12 @@ static void vfio_group_get(struct vfio_group *group) * Device objects - create, release, get, put, search */ /* Device reference always implies a group reference */ -static void vfio_device_put(struct vfio_device *device) +void vfio_device_put(struct vfio_device *device) { if (refcount_dec_and_test(&device->refcount)) complete(&device->comp); } +EXPORT_SYMBOL_GPL(vfio_device_put);
static bool vfio_device_try_get(struct vfio_device *device) { diff --git a/include/linux/vfio.h b/include/linux/vfio.h index e05ddc6fe6a556..25850b1e08cba9 100644 --- a/include/linux/vfio.h +++ b/include/linux/vfio.h @@ -143,6 +143,12 @@ void vfio_uninit_group_dev(struct vfio_device *device); int vfio_register_group_dev(struct vfio_device *device); int vfio_register_emulated_iommu_dev(struct vfio_device *device); void vfio_unregister_group_dev(struct vfio_device *device); +void vfio_device_put(struct vfio_device *device); + +static inline void vfio_device_get(struct vfio_device *device) +{ + refcount_inc(&device->refcount); +}
int vfio_assign_device_set(struct vfio_device *device, void *set_id);
FLR triggered by an emulated config space write should not behave differently from a FLR triggered by VFIO_DEVICE_RESET, currently the config space path misses the power management.
Consolidate all the call sites to invoke a single function.
Signed-off-by: Jason Gunthorpe jgg@nvidia.com --- drivers/vfio/pci/vfio_pci_config.c | 14 ++++---------- drivers/vfio/pci/vfio_pci_core.c | 5 ++--- drivers/vfio/pci/vfio_pci_priv.h | 1 + 3 files changed, 7 insertions(+), 13 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index 4a350421c5f62a..d22921efa25987 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -893,11 +893,8 @@ static int vfio_exp_config_write(struct vfio_pci_core_device *vdev, int pos, pos - offset + PCI_EXP_DEVCAP, &cap);
- if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) { - vfio_pci_zap_and_down_write_memory_lock(vdev); - pci_try_reset_function(vdev->pdev); - up_write(&vdev->memory_lock); - } + if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) + vfio_pci_try_reset_function(vdev); }
/* @@ -975,11 +972,8 @@ static int vfio_af_config_write(struct vfio_pci_core_device *vdev, int pos, pos - offset + PCI_AF_CAP, &cap);
- if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) { - vfio_pci_zap_and_down_write_memory_lock(vdev); - pci_try_reset_function(vdev->pdev); - up_write(&vdev->memory_lock); - } + if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) + vfio_pci_try_reset_function(vdev); }
return count; diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 9273f1ffd0ddd0..885706944e4d96 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -960,8 +960,7 @@ static int vfio_pci_ioctl_set_irqs(struct vfio_pci_core_device *vdev, return ret; }
-static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev, - void __user *arg) +int vfio_pci_try_reset_function(struct vfio_pci_core_device *vdev) { int ret;
@@ -1202,7 +1201,7 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd, case VFIO_DEVICE_PCI_HOT_RESET: return vfio_pci_ioctl_pci_hot_reset(vdev, uarg); case VFIO_DEVICE_RESET: - return vfio_pci_ioctl_reset(vdev, uarg); + return vfio_pci_try_reset_function(vdev); case VFIO_DEVICE_SET_IRQS: return vfio_pci_ioctl_set_irqs(vdev, uarg); default: diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h index 58b8d34c162cd6..5b1cb9a9838442 100644 --- a/drivers/vfio/pci/vfio_pci_priv.h +++ b/drivers/vfio/pci/vfio_pci_priv.h @@ -60,6 +60,7 @@ void vfio_config_free(struct vfio_pci_core_device *vdev); int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state);
+int vfio_pci_try_reset_function(struct vfio_pci_core_device *vdev); bool __vfio_pci_memory_enabled(struct vfio_pci_core_device *vdev); void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_core_device *vdev); u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_core_device *vdev);
Used to increment the refcount of the dma buf's struct file, only if the refcount is not zero. Useful to allow the struct file's lifetime to control the lifetime of the dmabuf while still letting the driver to keep track of created dmabufs.
Signed-off-by: Jason Gunthorpe jgg@nvidia.com --- include/linux/dma-buf.h | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 71731796c8c3a8..a35f1554f2fb36 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -618,6 +618,19 @@ int dma_buf_fd(struct dma_buf *dmabuf, int flags); struct dma_buf *dma_buf_get(int fd); void dma_buf_put(struct dma_buf *dmabuf);
+/** + * dma_buf_try_get - try to get a reference on a dmabuf + * @dmabuf - the dmabuf to get + * + * Returns true if a reference was successfully obtained. The caller must + * interlock with the dmabuf's release function in some way, such as RCU, to + * ensure that this is not called on freed memory. + */ +static inline bool dma_buf_try_get(struct dma_buf *dmabuf) +{ + return get_file_rcu(dmabuf->file); +} + struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *, enum dma_data_direction); void dma_buf_unmap_attachment(struct dma_buf_attachment *, struct sg_table *,
Am 01.09.22 um 01:12 schrieb Jason Gunthorpe:
Used to increment the refcount of the dma buf's struct file, only if the refcount is not zero. Useful to allow the struct file's lifetime to control the lifetime of the dmabuf while still letting the driver to keep track of created dmabufs.
Signed-off-by: Jason Gunthorpe jgg@nvidia.com
include/linux/dma-buf.h | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 71731796c8c3a8..a35f1554f2fb36 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -618,6 +618,19 @@ int dma_buf_fd(struct dma_buf *dmabuf, int flags); struct dma_buf *dma_buf_get(int fd); void dma_buf_put(struct dma_buf *dmabuf); +/**
- dma_buf_try_get - try to get a reference on a dmabuf
- @dmabuf - the dmabuf to get
- Returns true if a reference was successfully obtained. The caller must
- interlock with the dmabuf's release function in some way, such as RCU, to
- ensure that this is not called on freed memory.
I still have a bad feeling about this, but I also see that we can only choose between evils here.
Could you just call get_file_rcu() from the exporter with a comment explaining why this works instead?
That would at least not give importers the opportunity to abuse this.
Thanks, Christian.
- */
+static inline bool dma_buf_try_get(struct dma_buf *dmabuf) +{
- return get_file_rcu(dmabuf->file);
+}
- struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *, enum dma_data_direction); void dma_buf_unmap_attachment(struct dma_buf_attachment *, struct sg_table *,
On Thu, Sep 01, 2022 at 09:55:08AM +0200, Christian König wrote:
Am 01.09.22 um 01:12 schrieb Jason Gunthorpe:
Used to increment the refcount of the dma buf's struct file, only if the refcount is not zero. Useful to allow the struct file's lifetime to control the lifetime of the dmabuf while still letting the driver to keep track of created dmabufs.
Signed-off-by: Jason Gunthorpe jgg@nvidia.com
include/linux/dma-buf.h | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 71731796c8c3a8..a35f1554f2fb36 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -618,6 +618,19 @@ int dma_buf_fd(struct dma_buf *dmabuf, int flags); struct dma_buf *dma_buf_get(int fd); void dma_buf_put(struct dma_buf *dmabuf); +/**
- dma_buf_try_get - try to get a reference on a dmabuf
- @dmabuf - the dmabuf to get
- Returns true if a reference was successfully obtained. The caller must
- interlock with the dmabuf's release function in some way, such as RCU, to
- ensure that this is not called on freed memory.
I still have a bad feeling about this, but I also see that we can only choose between evils here.
Could you just call get_file_rcu() from the exporter with a comment explaining why this works instead?
I guess, are you sure? It seems very hacky.
Jason
Am 06.09.22 um 18:44 schrieb Jason Gunthorpe:
On Thu, Sep 01, 2022 at 09:55:08AM +0200, Christian König wrote:
Am 01.09.22 um 01:12 schrieb Jason Gunthorpe:
Used to increment the refcount of the dma buf's struct file, only if the refcount is not zero. Useful to allow the struct file's lifetime to control the lifetime of the dmabuf while still letting the driver to keep track of created dmabufs.
Signed-off-by: Jason Gunthorpe jgg@nvidia.com
include/linux/dma-buf.h | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 71731796c8c3a8..a35f1554f2fb36 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -618,6 +618,19 @@ int dma_buf_fd(struct dma_buf *dmabuf, int flags); struct dma_buf *dma_buf_get(int fd); void dma_buf_put(struct dma_buf *dmabuf); +/**
- dma_buf_try_get - try to get a reference on a dmabuf
- @dmabuf - the dmabuf to get
- Returns true if a reference was successfully obtained. The caller must
- interlock with the dmabuf's release function in some way, such as RCU, to
- ensure that this is not called on freed memory.
I still have a bad feeling about this, but I also see that we can only choose between evils here.
Could you just call get_file_rcu() from the exporter with a comment explaining why this works instead?
I guess, are you sure? It seems very hacky.
Yes, it's still better than exposing a dma_buf_try_get() interface to everyone.
Keep in mind that those functions here are mostly supposed to be used by the importer and not the exporter.
Christian.
Jason
dma-buf has become a way to safely acquire a handle to non-struct page memory that can still have lifetime controlled by the exporter. Notably RDMA can now import dma-buf FDs and build them into MRs which allows for PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory from PCI device BARs.
The patch design loosely follows the pattern in commit db1a8dd916aa ("habanalabs: add support for dma-buf exporter") except this does not support pinning.
Instead, this implements what, in the past, we've called a revocable attachment using move. In normal situations the attachment is pinned, as a BAR does not change physical address. However when the VFIO device is closed, or a PCI reset is issued, access to the MMIO memory is revoked.
Revoked means that move occurs, but an attempt to immediately re-map the memory will fail. In the reset case a future move will be triggered when MMIO access returns. As both close and reset are under userspace control it is expected that userspace will suspend use of the dma-buf before doing these operations, the revoke is purely for kernel self-defense against a hostile userspace.
Signed-off-by: Jason Gunthorpe jgg@nvidia.com --- drivers/vfio/pci/Makefile | 1 + drivers/vfio/pci/dma_buf.c | 269 +++++++++++++++++++++++++++++ drivers/vfio/pci/vfio_pci_config.c | 8 +- drivers/vfio/pci/vfio_pci_core.c | 28 ++- drivers/vfio/pci/vfio_pci_priv.h | 23 +++ include/linux/vfio_pci_core.h | 1 + include/uapi/linux/vfio.h | 18 ++ 7 files changed, 340 insertions(+), 8 deletions(-) create mode 100644 drivers/vfio/pci/dma_buf.c
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile index 24c524224da5a3..514c12a81127d6 100644 --- a/drivers/vfio/pci/Makefile +++ b/drivers/vfio/pci/Makefile @@ -2,6 +2,7 @@
vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o +vfio-pci-core-$(CONFIG_DMA_SHARED_BUFFER) += dma_buf.o obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
vfio-pci-y := vfio_pci.o diff --git a/drivers/vfio/pci/dma_buf.c b/drivers/vfio/pci/dma_buf.c new file mode 100644 index 00000000000000..11396aeabff405 --- /dev/null +++ b/drivers/vfio/pci/dma_buf.c @@ -0,0 +1,269 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. + */ +#include <linux/dma-buf.h> +#include <linux/pci-p2pdma.h> +#include <linux/dma-resv.h> + +#include "vfio_pci_priv.h" + +MODULE_IMPORT_NS(DMA_BUF); + +struct vfio_pci_dma_buf { + struct dma_buf *dmabuf; + struct vfio_pci_core_device *vdev; + struct list_head dmabufs_elm; + unsigned int index; + unsigned int orig_nents; + size_t offset; + bool revoked; +}; + +static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf, + struct dma_buf_attachment *attachment) +{ + struct vfio_pci_dma_buf *priv = dmabuf->priv; + int rc; + + rc = pci_p2pdma_distance_many(priv->vdev->pdev, &attachment->dev, 1, + true); + if (rc < 0) + attachment->peer2peer = false; + return 0; +} + +static void vfio_pci_dma_buf_unpin(struct dma_buf_attachment *attachment) +{ +} + +static int vfio_pci_dma_buf_pin(struct dma_buf_attachment *attachment) +{ + /* + * Uses the dynamic interface but must always allow for + * dma_buf_move_notify() to do revoke + */ + return -EINVAL; +} + +static struct sg_table * +vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment, + enum dma_data_direction dir) +{ + size_t sgl_size = dma_get_max_seg_size(attachment->dev); + struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv; + struct scatterlist *sgl; + struct sg_table *sgt; + dma_addr_t dma_addr; + unsigned int nents; + size_t offset; + int ret; + + dma_resv_assert_held(priv->dmabuf->resv); + + if (!attachment->peer2peer) + return ERR_PTR(-EPERM); + + if (priv->revoked) + return ERR_PTR(-ENODEV); + + sgt = kzalloc(sizeof(*sgt), GFP_KERNEL); + if (!sgt) + return ERR_PTR(-ENOMEM); + + nents = DIV_ROUND_UP(priv->dmabuf->size, sgl_size); + ret = sg_alloc_table(sgt, nents, GFP_KERNEL); + if (ret) + goto err_kfree_sgt; + + /* + * Since the memory being mapped is a device memory it could never be in + * CPU caches. + */ + dma_addr = dma_map_resource( + attachment->dev, + pci_resource_start(priv->vdev->pdev, priv->index) + + priv->offset, + priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC); + ret = dma_mapping_error(attachment->dev, dma_addr); + if (ret) + goto err_free_sgt; + + /* + * Break the BAR's physical range up into max sized SGL's according to + * the device's requirement. + */ + sgl = sgt->sgl; + for (offset = 0; offset != priv->dmabuf->size;) { + size_t chunk_size = min(priv->dmabuf->size - offset, sgl_size); + + sg_set_page(sgl, NULL, chunk_size, 0); + sg_dma_address(sgl) = dma_addr + offset; + sg_dma_len(sgl) = chunk_size; + sgl = sg_next(sgl); + offset += chunk_size; + } + + /* + * Because we are not going to include a CPU list we want to have some + * chance that other users will detect this by setting the orig_nents to + * 0 and using only nents (length of DMA list) when going over the sgl + */ + priv->orig_nents = sgt->orig_nents; + sgt->orig_nents = 0; + return sgt; + +err_free_sgt: + sg_free_table(sgt); +err_kfree_sgt: + kfree(sgt); + return ERR_PTR(ret); +} + +static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment, + struct sg_table *sgt, + enum dma_data_direction dir) +{ + struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv; + + sgt->orig_nents = priv->orig_nents; + dma_unmap_resource(attachment->dev, sg_dma_address(sgt->sgl), + priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC); + sg_free_table(sgt); + kfree(sgt); +} + +static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf) +{ + struct vfio_pci_dma_buf *priv = dmabuf->priv; + + /* + * Either this or vfio_pci_dma_buf_cleanup() will remove from the list. + * The refcount prevents both. + */ + if (priv->vdev) { + down_write(&priv->vdev->memory_lock); + list_del_init(&priv->dmabufs_elm); + up_write(&priv->vdev->memory_lock); + vfio_device_put(&priv->vdev->vdev); + } + kfree(priv); +} + +static const struct dma_buf_ops vfio_pci_dmabuf_ops = { + .attach = vfio_pci_dma_buf_attach, + .map_dma_buf = vfio_pci_dma_buf_map, + .pin = vfio_pci_dma_buf_pin, + .unpin = vfio_pci_dma_buf_unpin, + .release = vfio_pci_dma_buf_release, + .unmap_dma_buf = vfio_pci_dma_buf_unmap, +}; + +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf __user *arg, + size_t argsz) +{ + struct vfio_device_feature_dma_buf get_dma_buf; + DEFINE_DMA_BUF_EXPORT_INFO(exp_info); + struct vfio_pci_dma_buf *priv; + int ret; + + ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET, + sizeof(get_dma_buf)); + if (ret != 1) + return ret; + + if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf))) + return -EFAULT; + + /* For PCI the region_index is the BAR number like everything else */ + if (get_dma_buf.region_index >= VFIO_PCI_ROM_REGION_INDEX) + return -EINVAL; + + exp_info.ops = &vfio_pci_dmabuf_ops; + exp_info.size = pci_resource_len(vdev->pdev, get_dma_buf.region_index); + if (!exp_info.size) + return -EINVAL; + if (get_dma_buf.offset || get_dma_buf.length) { + if (get_dma_buf.length > exp_info.size || + get_dma_buf.offset >= exp_info.size || + get_dma_buf.length > exp_info.size - get_dma_buf.offset || + get_dma_buf.offset % PAGE_SIZE || + get_dma_buf.length % PAGE_SIZE) + return -EINVAL; + exp_info.size = get_dma_buf.length; + } + exp_info.flags = get_dma_buf.open_flags; + + priv = kzalloc(sizeof(*priv), GFP_KERNEL); + if (!priv) + return -ENOMEM; + INIT_LIST_HEAD(&priv->dmabufs_elm); + priv->offset = get_dma_buf.offset; + priv->index = get_dma_buf.region_index; + + exp_info.priv = priv; + priv->dmabuf = dma_buf_export(&exp_info); + if (IS_ERR(priv->dmabuf)) { + ret = PTR_ERR(priv->dmabuf); + kfree(priv); + return ret; + } + + /* dma_buf_put() now frees priv */ + + down_write(&vdev->memory_lock); + dma_resv_lock(priv->dmabuf->resv, NULL); + priv->revoked = !__vfio_pci_memory_enabled(vdev); + priv->vdev = vdev; + vfio_device_get(&vdev->vdev); + list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs); + dma_resv_unlock(priv->dmabuf->resv); + up_write(&vdev->memory_lock); + + /* + * dma_buf_fd() consumes the reference, when the file closes the dmabuf + * will be released. + */ + return dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags); +} + +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) +{ + struct vfio_pci_dma_buf *priv; + struct vfio_pci_dma_buf *tmp; + + lockdep_assert_held_write(&vdev->memory_lock); + + list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) { + if (!dma_buf_try_get(priv->dmabuf)) + continue; + if (priv->revoked != revoked) { + dma_resv_lock(priv->dmabuf->resv, NULL); + priv->revoked = revoked; + dma_buf_move_notify(priv->dmabuf); + dma_resv_unlock(priv->dmabuf->resv); + } + dma_buf_put(priv->dmabuf); + } +} + +void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) +{ + struct vfio_pci_dma_buf *priv; + struct vfio_pci_dma_buf *tmp; + + down_write(&vdev->memory_lock); + list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) { + if (!dma_buf_try_get(priv->dmabuf)) + continue; + dma_resv_lock(priv->dmabuf->resv, NULL); + list_del_init(&priv->dmabufs_elm); + priv->vdev = NULL; + priv->revoked = true; + dma_buf_move_notify(priv->dmabuf); + dma_resv_unlock(priv->dmabuf->resv); + vfio_device_put(&vdev->vdev); + dma_buf_put(priv->dmabuf); + } + up_write(&vdev->memory_lock); +} diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index d22921efa25987..f8a9c24d04aeb7 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -584,10 +584,12 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos, virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY); new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
- if (!new_mem) + if (!new_mem) { vfio_pci_zap_and_down_write_memory_lock(vdev); - else + vfio_pci_dma_buf_move(vdev, true); + } else { down_write(&vdev->memory_lock); + }
/* * If the user is writing mem/io enable (new_mem/io) and we @@ -622,6 +624,8 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos, *virt_cmd &= cpu_to_le16(~mask); *virt_cmd |= cpu_to_le16(new_cmd & mask);
+ if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock); }
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 885706944e4d96..7e27bfc60440c2 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -504,6 +504,8 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev) vfio_spapr_pci_eeh_release(vdev->pdev); vfio_pci_core_disable(vdev);
+ vfio_pci_dma_buf_cleanup(vdev); + mutex_lock(&vdev->igate); if (vdev->err_trigger) { eventfd_ctx_put(vdev->err_trigger); @@ -980,7 +982,10 @@ int vfio_pci_try_reset_function(struct vfio_pci_core_device *vdev) */ vfio_pci_set_power_state(vdev, PCI_D0);
+ vfio_pci_dma_buf_move(vdev, true); ret = pci_try_reset_function(vdev->pdev); + if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock);
return ret; @@ -1210,11 +1215,10 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd, } EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
-static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags, - uuid_t __user *arg, size_t argsz) +static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev, + u32 flags, uuid_t __user *arg, + size_t argsz) { - struct vfio_pci_core_device *vdev = - container_of(device, struct vfio_pci_core_device, vdev); uuid_t uuid; int ret;
@@ -1241,9 +1245,14 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags, int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags, void __user *arg, size_t argsz) { + struct vfio_pci_core_device *vdev = + container_of(device, struct vfio_pci_core_device, vdev); + switch (flags & VFIO_DEVICE_FEATURE_MASK) { case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN: - return vfio_pci_core_feature_token(device, flags, arg, argsz); + return vfio_pci_core_feature_token(vdev, flags, arg, argsz); + case VFIO_DEVICE_FEATURE_DMA_BUF: + return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz); default: return -ENOTTY; } @@ -1881,6 +1890,7 @@ void vfio_pci_core_init_device(struct vfio_pci_core_device *vdev, INIT_LIST_HEAD(&vdev->vma_list); INIT_LIST_HEAD(&vdev->sriov_pfs_item); init_rwsem(&vdev->memory_lock); + INIT_LIST_HEAD(&vdev->dmabufs); } EXPORT_SYMBOL_GPL(vfio_pci_core_init_device);
@@ -2227,11 +2237,17 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, * cause the PCI config space reset without restoring the original * state (saved locally in 'vdev->pm_save'). */ - list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) + list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) { + vfio_pci_dma_buf_move(cur, true); vfio_pci_set_power_state(cur, PCI_D0); + }
ret = pci_reset_bus(pdev);
+ list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) + if (__vfio_pci_memory_enabled(cur)) + vfio_pci_dma_buf_move(cur, false); + err_undo: list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) { if (cur == cur_mem) diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h index 5b1cb9a9838442..c295a1166e7a9c 100644 --- a/drivers/vfio/pci/vfio_pci_priv.h +++ b/drivers/vfio/pci/vfio_pci_priv.h @@ -102,4 +102,27 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev) return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA; }
+#ifdef CONFIG_DMA_SHARED_BUFFER +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf __user *arg, + size_t argsz); +void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev); +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked); +#else +static int +vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf __user *arg, + size_t argsz) +{ + return -ENOTTY; +} +static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) +{ +} +static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, + bool revoked) +{ +} +#endif + #endif diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h index e5cf0d3313a694..fb9769d9e5d149 100644 --- a/include/linux/vfio_pci_core.h +++ b/include/linux/vfio_pci_core.h @@ -93,6 +93,7 @@ struct vfio_pci_core_device { struct mutex vma_lock; struct list_head vma_list; struct rw_semaphore memory_lock; + struct list_head dmabufs; };
/* Will be exported for vfio pci drivers usage */ diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 733a1cddde30a5..1dcfad6dcb6863 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -986,6 +986,24 @@ enum vfio_device_mig_state { VFIO_DEVICE_STATE_RUNNING_P2P = 5, };
+/** + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the + * region selected. + * + * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC, + * etc. offset/length specify a slice of the region to create the dmabuf from. + * If both are 0 then the whole region is used. + * + * Return: The fd number on success, -1 and errno is set on failure. + */ +struct vfio_device_feature_dma_buf { + __u32 region_index; + __u32 open_flags; + __u32 offset; + __u64 length; +}; +#define VFIO_DEVICE_FEATURE_DMA_BUF 3 + /* -------- API for Type1 VFIO IOMMU -------- */
/**
+{
- struct vfio_pci_dma_buf *priv = dmabuf->priv;
- int rc;
- rc = pci_p2pdma_distance_many(priv->vdev->pdev, &attachment->dev, 1,
true);
This should just use pci_p2pdma_distance.
- /*
* Since the memory being mapped is a device memory it could never be in
* CPU caches.
*/
DMA_ATTR_SKIP_CPU_SYNC doesn't even apply to dma_map_resource, not sure where this wisdom comes from.
- dma_addr = dma_map_resource(
attachment->dev,
pci_resource_start(priv->vdev->pdev, priv->index) +
priv->offset,
priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
This is not how P2P addresses are mapped. You need to use dma_map_sgtable and have the proper pgmap for it.
The above is just a badly implemented version of the dma-direct PCI_P2PDMA_MAP_BUS_ADDR case, ignoring mappings through the host bridge or dma-map-ops interactions.
Am 06.09.22 um 11:51 schrieb Christoph Hellwig:
+{
- struct vfio_pci_dma_buf *priv = dmabuf->priv;
- int rc;
- rc = pci_p2pdma_distance_many(priv->vdev->pdev, &attachment->dev, 1,
true);
This should just use pci_p2pdma_distance.
- /*
* Since the memory being mapped is a device memory it could never be in
* CPU caches.
*/
DMA_ATTR_SKIP_CPU_SYNC doesn't even apply to dma_map_resource, not sure where this wisdom comes from.
- dma_addr = dma_map_resource(
attachment->dev,
pci_resource_start(priv->vdev->pdev, priv->index) +
priv->offset,
priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
This is not how P2P addresses are mapped. You need to use dma_map_sgtable and have the proper pgmap for it.
The problem is once more that this is MMIO space, in other words register BARs which needs to be exported/imported.
Adding struct pages for it generally sounds like the wrong approach here. You can't even access this with the CPU or would trigger potentially unwanted hardware actions.
Would you mind if I start to tackle this problem?
Regards, Christian.
The above is just a badly implemented version of the dma-direct PCI_P2PDMA_MAP_BUS_ADDR case, ignoring mappings through the host bridge or dma-map-ops interactions.
On Tue, Sep 06, 2022 at 12:38:44PM +0200, Christian König wrote:
Am 06.09.22 um 11:51 schrieb Christoph Hellwig:
+{
- struct vfio_pci_dma_buf *priv = dmabuf->priv;
- int rc;
- rc = pci_p2pdma_distance_many(priv->vdev->pdev, &attachment->dev, 1,
true);
This should just use pci_p2pdma_distance.
OK
- /*
* Since the memory being mapped is a device memory it could never be in
* CPU caches.
*/
DMA_ATTR_SKIP_CPU_SYNC doesn't even apply to dma_map_resource, not sure where this wisdom comes from.
Habana driver
- dma_addr = dma_map_resource(
attachment->dev,
pci_resource_start(priv->vdev->pdev, priv->index) +
priv->offset,
priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
This is not how P2P addresses are mapped. You need to use dma_map_sgtable and have the proper pgmap for it.
The problem is once more that this is MMIO space, in other words register BARs which needs to be exported/imported.
Adding struct pages for it generally sounds like the wrong approach here. You can't even access this with the CPU or would trigger potentially unwanted hardware actions.
Right, this whole thing is the "standard" that dmabuf has adopted instead of the struct pages. Once the AMD GPU driver started doing this some time ago other drivers followed.
Now we have struct pages, almost, but I'm not sure if their limits are compatible with VFIO? This has to work for small bars as well.
Jason
On Tue, Sep 6, 2022 at 2:48 PM Jason Gunthorpe jgg@nvidia.com wrote:
On Tue, Sep 06, 2022 at 12:38:44PM +0200, Christian König wrote:
Am 06.09.22 um 11:51 schrieb Christoph Hellwig:
+{
- struct vfio_pci_dma_buf *priv = dmabuf->priv;
- int rc;
- rc = pci_p2pdma_distance_many(priv->vdev->pdev, &attachment->dev, 1,
true);
This should just use pci_p2pdma_distance.
OK
- /*
- Since the memory being mapped is a device memory it could never be in
- CPU caches.
- */
DMA_ATTR_SKIP_CPU_SYNC doesn't even apply to dma_map_resource, not sure where this wisdom comes from.
Habana driver
I hate to throw the ball at someone else, but I actually copied the code from the amdgpu driver, from amdgpu_vram_mgr_alloc_sgt() iirc. And if you remember Jason, you asked why we use this specific define in the original review you did and I replied the following (to which you agreed and that's why we added the comment):
"The memory behind this specific dma-buf has *always* resided on the device itself, i.e. it lives only in the 'device' domain (after all, it maps a PCI bar address which points to the device memory). Therefore, it was never in the 'CPU' domain and hence, there is no need to perform a sync of the memory to the CPU's cache, as it was never inside that cache to begin with.
This is not the same case as with regular memory which is dma-mapped and then copied into the device using a dma engine. In that case, the memory started in the 'CPU' domain and moved to the 'device' domain. When it is unmapped it will indeed be recycled to be used for another purpose and therefore we need to sync the CPU cache."
Oded
- dma_addr = dma_map_resource(
attachment->dev,
pci_resource_start(priv->vdev->pdev, priv->index) +
priv->offset,
priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
This is not how P2P addresses are mapped. You need to use dma_map_sgtable and have the proper pgmap for it.
The problem is once more that this is MMIO space, in other words register BARs which needs to be exported/imported.
Adding struct pages for it generally sounds like the wrong approach here. You can't even access this with the CPU or would trigger potentially unwanted hardware actions.
Right, this whole thing is the "standard" that dmabuf has adopted instead of the struct pages. Once the AMD GPU driver started doing this some time ago other drivers followed.
Now we have struct pages, almost, but I'm not sure if their limits are compatible with VFIO? This has to work for small bars as well.
Jason
On Tue, Sep 06, 2022 at 03:34:02PM +0300, Oded Gabbay wrote:
- /*
- Since the memory being mapped is a device memory it could never be in
- CPU caches.
- */
DMA_ATTR_SKIP_CPU_SYNC doesn't even apply to dma_map_resource, not sure where this wisdom comes from.
Habana driver
I hate to throw the ball at someone else, but I actually copied the code from the amdgpu driver, from amdgpu_vram_mgr_alloc_sgt() iirc. And if you remember Jason, you asked why we use this specific define in the original review you did and I replied the following (to which you agreed and that's why we added the comment):
Yes, I remember, but Christophs remark is that DMA_ATTR_SKIP_CPU_SYNC doesn't even do anything when passed to dma_map_resource().
The only attr that seems to be used is DMA_ATTR_PRIVILEGED from what I can see.
Jason
On Tue, Sep 6, 2022 at 8:59 PM Jason Gunthorpe jgg@nvidia.com wrote:
On Tue, Sep 06, 2022 at 03:34:02PM +0300, Oded Gabbay wrote:
- /*
- Since the memory being mapped is a device memory it could never be in
- CPU caches.
- */
DMA_ATTR_SKIP_CPU_SYNC doesn't even apply to dma_map_resource, not sure where this wisdom comes from.
Habana driver
I hate to throw the ball at someone else, but I actually copied the code from the amdgpu driver, from amdgpu_vram_mgr_alloc_sgt() iirc. And if you remember Jason, you asked why we use this specific define in the original review you did and I replied the following (to which you agreed and that's why we added the comment):
Yes, I remember, but Christophs remark is that DMA_ATTR_SKIP_CPU_SYNC doesn't even do anything when passed to dma_map_resource().
The only attr that seems to be used is DMA_ATTR_PRIVILEGED from what I can see.
Jason
Yes, it appears he is correct. Probably what happened is that either this was originally copied from a use of dma_map_page or something similar that does check this attribute, or maybe dma_map_resource used it in the past and the underlying code has changed.
Regardless, it seems we can remove it from the calls to dma_map_resource. I went over the kernel code and it seems only habana and amdgpu (and amdkfd) are passing this property to dma_map_resource. All other callers just pass 0.
Oded
On Tue, Sep 06, 2022 at 10:44:45PM +0300, Oded Gabbay wrote:
Regardless, it seems we can remove it from the calls to dma_map_resource. I went over the kernel code and it seems only habana and amdgpu (and amdkfd) are passing this property to dma_map_resource. All other callers just pass 0.
What really need to happen is a revert of that entire commit, given that dma_map_resource must not be used for PCIe P2P transactions as explained in my previous mail.
On Tue, Sep 06, 2022 at 08:48:28AM -0300, Jason Gunthorpe wrote:
Right, this whole thing is the "standard" that dmabuf has adopted instead of the struct pages. Once the AMD GPU driver started doing this some time ago other drivers followed.
But it is simple wrong. The scatterlist requires struct page backing. In theory a physical address would be enough, but when Dan Williams sent patches for that Linus shot them down.
That being said the scatterlist is the wrong interface here (and probably for most of it's uses). We really want a lot-level struct with just the dma_address and length for the DMA side, and leave it separate from that what is used to generate it (in most cases that would be a bio_vec).
Now we have struct pages, almost, but I'm not sure if their limits are compatible with VFIO? This has to work for small bars as well.
Why would small BARs be problematic for the pages? The pages are more a problem for gigantic BARs do the memory overhead.
On Tue, Sep 06, 2022 at 12:38:44PM +0200, Christian König wrote:
The problem is once more that this is MMIO space, in other words register BARs which needs to be exported/imported.
Everything used for P2P is bar space.
Adding struct pages for it generally sounds like the wrong approach here. You can't even access this with the CPU or would trigger potentially unwanted hardware actions.
How would an access from the CPU vs anther device make any difference?
Would you mind if I start to tackle this problem?
Yes, I've been waiting forever for someone to tacke how the scatterlist is abused in dma-buf..
Am 07.09.22 um 14:03 schrieb Christoph Hellwig:
On Tue, Sep 06, 2022 at 12:38:44PM +0200, Christian König wrote:
The problem is once more that this is MMIO space, in other words register BARs which needs to be exported/imported.
Everything used for P2P is bar space.
Adding struct pages for it generally sounds like the wrong approach here. You can't even access this with the CPU or would trigger potentially unwanted hardware actions.
How would an access from the CPU vs anther device make any difference?
The key point is that you can't do any CPU fallback with this as long as the CPU wouldn't do exactly the same thing as the original hardware device. E.g. not write combine nor do any fully page copies etc...
See what happens here is not really P2P DMA transfer, but rather P2P signaling of events.
For a simple example think of a camera and a video codec. The camera is pumping video data into system memory the video codec should encode into an H264 stream.
So after every frame the camera hardware issues a P2P write into the BAR of the video codec to signal that the frame is completed and it can start decoding.
This is not even a memory write, but rather just some trigger event. That's essentially the background why I think having struct pages and sg_tables doesn't make any sense at all for this use case.
Would you mind if I start to tackle this problem?
Yes, I've been waiting forever for someone to tacke how the scatterlist is abused in dma-buf..
How about we separate the scatterlist into page and DMA address container?
Regards, Christian.
On Wed, Sep 07, 2022 at 05:08:37PM +0200, Christian König wrote:
The key point is that you can't do any CPU fallback with this as long as the CPU wouldn't do exactly the same thing as the original hardware device. E.g. not write combine nor do any fully page copies etc...
That is true for MMIO in general. You need to use the mmio accessors, and even then it needs to match what the hardware expects
This is not even a memory write, but rather just some trigger event. That's essentially the background why I think having struct pages and sg_tables doesn't make any sense at all for this use case.
Well, Jasons patch uses a sg_table, just in a very broken way.
Would you mind if I start to tackle this problem?
Yes, I've been waiting forever for someone to tacke how the scatterlist is abused in dma-buf..
How about we separate the scatterlist into page and DMA address container?
As said before that is fundamentally the right thing to do. It just takes a lot of work. I think on the dma mapping side we're finally getting ready for it now that arm uses the common dma-direct code, although getting rid of the arm iommu ops first would be even better.
Note that this only solves the problem of needing the pages to hold the physical address. The other thing the pages are used for in p2p is that (through the pgmap) it points to the owning device of the p2p memory and thus allows figuring out how it needs to be accessed.
linaro-mm-sig@lists.linaro.org