This adds the pasid attach/detach uAPIs for userspace to attach/detach a PASID of a device to/from a given ioas/hwpt. Only vfio-pci driver is enabled in this series. After this series, PASID-capable devices bound with vfio-pci can report PASID capability to userspace and VM to enable PASID usages like Shared Virtual Addressing (SVA).
Based on the discussion about reporting the vPASID to VM [1], it's agreed that we will let the userspace VMM to synthesize the vPASID capability. The VMM needs to figure out a hole to put the vPASID cap. This includes the hidden bits handling for some devices. While, it's up to the userspace, it's not the focus of this series.
This series first adds the helpers for pasid attach in vfio core and then adds the device cdev ioctls for pasid attach/detach. In the end of this series, the IOMMU_GET_HW_INFO ioctl is extended to report the PCI PASID capability to the userspace. Userspace should check this before using any PASID related uAPIs provided by VFIO, which is the agreement in [2]. This series depends on the iommufd pasid attach/detach series [3].
The completed code can be found at [4], tested with a hacky Qemu branch [5].
[1] https://lore.kernel.org/kvm/BN9PR11MB5276318969A212AD0649C7BE8CBE2@BN9PR11MB... [2] https://lore.kernel.org/kvm/4f2daf50-a5ad-4599-ab59-bcfc008688d8@intel.com/ [3] https://lore.kernel.org/linux-iommu/20240912131255.13305-1-yi.l.liu@intel.co... [4] https://github.com/yiliu1765/iommufd/tree/iommufd_pasid [5] https://github.com/yiliu1765/qemu/tree/wip/zhenzhong/iommufd_nesting_rfcv2-t...
Change log:
v3: - Misc enhancement on patch 01 of v2 (Alex, Jason) - Add Jason's r-b to patch 03 of v2 - Drop the logic that report PASID via VFIO_DEVICE_FEATURE ioctl - Extend IOMMU_GET_HW_INFO to report PASID support (Kevin, Jason, Alex)
v2: https://lore.kernel.org/kvm/20240412082121.33382-1-yi.l.liu@intel.com/ - Use IDA to track if PASID is attached or not in VFIO. (Jason) - Fix the issue of calling pasid_at[de]tach_ioas callback unconditionally (Alex) - Fix the wrong data copy in vfio_df_ioctl_pasid_detach_pt() (Zhenzhong) - Minor tweaks in comments (Kevin)
v1: https://lore.kernel.org/kvm/20231127063909.129153-1-yi.l.liu@intel.com/ - Report PASID capability via VFIO_DEVICE_FEATURE (Alex)
rfc: https://lore.kernel.org/linux-iommu/20230926093121.18676-1-yi.l.liu@intel.co...
Regards, Yi Liu
Yi Liu (4): ida: Add ida_find_first_range() vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT iommufd: Extend IOMMU_GET_HW_INFO to report PASID capability
drivers/iommu/iommufd/device.c | 27 +++++++++++++- drivers/pci/ats.c | 32 ++++++++++++++++ drivers/vfio/device_cdev.c | 51 ++++++++++++++++++++++++++ drivers/vfio/iommufd.c | 50 +++++++++++++++++++++++++ drivers/vfio/pci/vfio_pci.c | 2 + drivers/vfio/vfio.h | 4 ++ drivers/vfio/vfio_main.c | 8 ++++ include/linux/idr.h | 11 ++++++ include/linux/pci-ats.h | 3 ++ include/linux/vfio.h | 11 ++++++ include/uapi/linux/iommufd.h | 14 ++++++- include/uapi/linux/vfio.h | 55 ++++++++++++++++++++++++++++ lib/idr.c | 67 ++++++++++++++++++++++++++++++++++ 13 files changed, 333 insertions(+), 2 deletions(-)
There is no helpers for user to check if a given ID is allocated or not, neither a helper to loop all the allocated IDs in an IDA and do something for cleanup. With the two needs, a helper to get the lowest allocated ID of a range and two variants based on it.
Caller can check if a given ID is allocated or not by:
bool ida_exists(struct ida *ida, unsigned int id)
Caller can iterate all allocated IDs by:
int id; while ((id = ida_find_first(&pasid_ida)) > 0) { //anything to do with the allocated ID ida_free(pasid_ida, pasid); }
Cc: Matthew Wilcox (Oracle) willy@infradead.org Suggested-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com --- include/linux/idr.h | 11 ++++++++ lib/idr.c | 67 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 78 insertions(+)
diff --git a/include/linux/idr.h b/include/linux/idr.h index da5f5fa4a3a6..718f9b1b91af 100644 --- a/include/linux/idr.h +++ b/include/linux/idr.h @@ -257,6 +257,7 @@ struct ida { int ida_alloc_range(struct ida *, unsigned int min, unsigned int max, gfp_t); void ida_free(struct ida *, unsigned int id); void ida_destroy(struct ida *ida); +int ida_find_first_range(struct ida *ida, unsigned int min, unsigned int max);
/** * ida_alloc() - Allocate an unused ID. @@ -328,4 +329,14 @@ static inline bool ida_is_empty(const struct ida *ida) { return xa_empty(&ida->xa); } + +static inline bool ida_exists(struct ida *ida, unsigned int id) +{ + return ida_find_first_range(ida, id, id) == id; +} + +static inline int ida_find_first(struct ida *ida) +{ + return ida_find_first_range(ida, 0, ~0); +} #endif /* __IDR_H__ */ diff --git a/lib/idr.c b/lib/idr.c index da36054c3ca0..6644d3d1af02 100644 --- a/lib/idr.c +++ b/lib/idr.c @@ -476,6 +476,73 @@ int ida_alloc_range(struct ida *ida, unsigned int min, unsigned int max, } EXPORT_SYMBOL(ida_alloc_range);
+/** + * ida_find_first_range - Get the lowest used ID. + * @ida: IDA handle. + * @min: Lowest ID to get. + * @max: Highest ID to get. + * + * Get the lowest used ID between @min and @max, inclusive. The returned + * ID will not exceed %INT_MAX, even if @max is larger. + * + * Context: Any context. Takes and releases the xa_lock. + * Return: The lowest used ID, or errno if no used ID is found. + */ +int ida_find_first_range(struct ida *ida, unsigned int min, unsigned int max) +{ + unsigned long index = min / IDA_BITMAP_BITS; + unsigned int offset = min % IDA_BITMAP_BITS; + unsigned long *addr, size, bit; + unsigned long flags; + void *entry; + int ret; + + if ((int)min < 0) + return -EINVAL; + if ((int)max < 0) + max = INT_MAX; + + xa_lock_irqsave(&ida->xa, flags); + + entry = xa_find(&ida->xa, &index, max / IDA_BITMAP_BITS, XA_PRESENT); + if (!entry) { + ret = -ENOENT; + goto err_unlock; + } + + if (index > min / IDA_BITMAP_BITS) + offset = 0; + if (index * IDA_BITMAP_BITS + offset > max) { + ret = -ENOENT; + goto err_unlock; + } + + if (xa_is_value(entry)) { + unsigned long tmp = xa_to_value(entry); + + addr = &tmp; + size = BITS_PER_XA_VALUE; + } else { + addr = ((struct ida_bitmap *)entry)->bitmap; + size = IDA_BITMAP_BITS; + } + + bit = find_next_bit(addr, size, offset); + + xa_unlock_irqrestore(&ida->xa, flags); + + if (bit == size || + index * IDA_BITMAP_BITS + bit > max) + return -ENOENT; + + return index * IDA_BITMAP_BITS + bit; + +err_unlock: + xa_unlock_irqrestore(&ida->xa, flags); + return ret; +} +EXPORT_SYMBOL(ida_find_first_range); + /** * ida_free() - Release an allocated ID. * @ida: IDA handle.
On Thu, Sep 12, 2024 at 06:17:26AM -0700, Yi Liu wrote:
There is no helpers for user to check if a given ID is allocated or not, neither a helper to loop all the allocated IDs in an IDA and do something for cleanup. With the two needs, a helper to get the lowest allocated ID of a range and two variants based on it.
Caller can check if a given ID is allocated or not by:
bool ida_exists(struct ida *ida, unsigned int id)
Caller can iterate all allocated IDs by:
int id; while ((id = ida_find_first(&pasid_ida)) > 0) { //anything to do with the allocated ID ida_free(pasid_ida, pasid); }
Cc: Matthew Wilcox (Oracle) willy@infradead.org Suggested-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com
include/linux/idr.h | 11 ++++++++ lib/idr.c | 67 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 78 insertions(+)
No test cases for the test suite? ;-(
On 2024/9/12 23:11, Matthew Wilcox wrote:
On Thu, Sep 12, 2024 at 06:17:26AM -0700, Yi Liu wrote:
There is no helpers for user to check if a given ID is allocated or not, neither a helper to loop all the allocated IDs in an IDA and do something for cleanup. With the two needs, a helper to get the lowest allocated ID of a range and two variants based on it.
Caller can check if a given ID is allocated or not by:
bool ida_exists(struct ida *ida, unsigned int id)
Caller can iterate all allocated IDs by:
int id; while ((id = ida_find_first(&pasid_ida)) > 0) { //anything to do with the allocated ID ida_free(pasid_ida, pasid); }
Cc: Matthew Wilcox (Oracle) willy@infradead.org Suggested-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com
include/linux/idr.h | 11 ++++++++ lib/idr.c | 67 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 78 insertions(+)
No test cases for the test suite? ;-(
let me add something like the below. :)
diff --git a/lib/test_ida.c b/lib/test_ida.c index c80155a1956d..d89554ff5719 100644 --- a/lib/test_ida.c +++ b/lib/test_ida.c @@ -189,6 +189,75 @@ static void ida_check_bad_free(struct ida *ida) IDA_BUG_ON(ida, !ida_is_empty(ida)); }
+/* + * Check ida_find_first_range() and varriants. + */ +static void ida_check_find_first(struct ida *ida) +{ + /* IDA is empty; all of the below should be not exist */ + IDA_BUG_ON(ida, ida_exists(ida, 0)); + IDA_BUG_ON(ida, ida_exists(ida, 3)); + IDA_BUG_ON(ida, ida_exists(ida, 63)); + IDA_BUG_ON(ida, ida_exists(ida, 1023)); + IDA_BUG_ON(ida, ida_exists(ida, (1 << 20) - 1)); + /* IDA contains a single value entry */ + + IDA_BUG_ON(ida, ida_alloc_min(ida, 3, GFP_KERNEL) != 3); + IDA_BUG_ON(ida, ida_exists(ida, 0)); + IDA_BUG_ON(ida, !ida_exists(ida, 3)); + IDA_BUG_ON(ida, ida_exists(ida, 63)); + IDA_BUG_ON(ida, ida_exists(ida, 1023)); + IDA_BUG_ON(ida, ida_exists(ida, (1 << 20) - 1)); + + IDA_BUG_ON(ida, ida_alloc_min(ida, 63, GFP_KERNEL) != 63); + IDA_BUG_ON(ida, ida_exists(ida, 0)); + IDA_BUG_ON(ida, !ida_exists(ida, 3)); + IDA_BUG_ON(ida, !ida_exists(ida, 63)); + IDA_BUG_ON(ida, ida_exists(ida, 1023)); + IDA_BUG_ON(ida, ida_exists(ida, (1 << 20) - 1)); + + /* IDA contains a single bitmap */ + IDA_BUG_ON(ida, ida_alloc_min(ida, 1023, GFP_KERNEL) != 1023); + IDA_BUG_ON(ida, ida_exists(ida, 0)); + IDA_BUG_ON(ida, !ida_exists(ida, 3)); + IDA_BUG_ON(ida, !ida_exists(ida, 63)); + IDA_BUG_ON(ida, !ida_exists(ida, 1023)); + IDA_BUG_ON(ida, ida_exists(ida, (1 << 20) - 1)); + + /* IDA contains a tree */ + IDA_BUG_ON(ida, ida_alloc_min(ida, (1 << 20) - 1, GFP_KERNEL) != (1 << 20) - 1); + IDA_BUG_ON(ida, ida_exists(ida, 0)); + IDA_BUG_ON(ida, !ida_exists(ida, 3)); + IDA_BUG_ON(ida, !ida_exists(ida, 63)); + IDA_BUG_ON(ida, !ida_exists(ida, 1023)); + IDA_BUG_ON(ida, !ida_exists(ida, (1 << 20) - 1)); + + /* Now try to find first */ + IDA_BUG_ON(ida, ida_find_first(ida) != 3); + IDA_BUG_ON(ida, ida_find_first_range(ida, -1, 2) != -EINVAL); + IDA_BUG_ON(ida, ida_find_first_range(ida, 0, 2) != -ENOENT); // no used ID + IDA_BUG_ON(ida, ida_find_first_range(ida, 0, 3) != 3); + IDA_BUG_ON(ida, ida_find_first_range(ida, 1, 3) != 3); + IDA_BUG_ON(ida, ida_find_first_range(ida, 3, 3) != 3); + IDA_BUG_ON(ida, ida_find_first_range(ida, 2, 4) != 3); + IDA_BUG_ON(ida, ida_find_first_range(ida, 4, 3) != -ENOENT); // min > max, fail + IDA_BUG_ON(ida, ida_find_first_range(ida, 4, 60) != -ENOENT); // no used ID + IDA_BUG_ON(ida, ida_find_first_range(ida, 4, 64) != 63); + IDA_BUG_ON(ida, ida_find_first_range(ida, 63, 63) != 63); + IDA_BUG_ON(ida, ida_find_first_range(ida, 64, 1026) != 1023); + IDA_BUG_ON(ida, ida_find_first_range(ida, 1023, 1023) != 1023); + IDA_BUG_ON(ida, ida_find_first_range(ida, 1023, (1 << 20) - 1) != 1023); + IDA_BUG_ON(ida, ida_find_first_range(ida, 1024, (1 << 20) - 1) != (1 << 20) - 1); + IDA_BUG_ON(ida, ida_find_first_range(ida, (1 << 20), INT_MAX) != -ENOENT); + + ida_free(ida, 3); + ida_free(ida, 63); + ida_free(ida, 1023); + ida_free(ida, (1 << 20) - 1); + + IDA_BUG_ON(ida, !ida_is_empty(ida)); +} + static DEFINE_IDA(ida);
static int ida_checks(void) @@ -202,6 +270,7 @@ static int ida_checks(void) ida_check_max(&ida); ida_check_conv(&ida); ida_check_bad_free(&ida); + ida_check_find_first(&ida);
printk("IDA: %u of %u tests passed\n", tests_passed, tests_run); return (tests_run != tests_passed) ? 0 : -EINVAL;
On Fri, Sep 13, 2024 at 07:45:55PM +0800, Yi Liu wrote:
No test cases for the test suite? ;-(
let me add something like the below. :)
That looks pretty comprehensive, thanks!
Acked-by: Matthew Wilcox (Oracle) willy@infradead.org
On 2024/9/13 23:09, Matthew Wilcox wrote:
On Fri, Sep 13, 2024 at 07:45:55PM +0800, Yi Liu wrote:
No test cases for the test suite? ;-(
let me add something like the below. :)
That looks pretty comprehensive, thanks!
Acked-by: Matthew Wilcox (Oracle) willy@infradead.org
thanks, and FYI. I found a bug when running the unit test. will fix it in the next version. it's really helpful suggestion.
diff --git a/lib/idr.c b/lib/idr.c index 6644d3d1af02..f16eb3d172bc 100644 --- a/lib/idr.c +++ b/lib/idr.c @@ -494,6 +494,7 @@ int ida_find_first_range(struct ida *ida, unsigned int min, unsigned int max) unsigned int offset = min % IDA_BITMAP_BITS; unsigned long *addr, size, bit; unsigned long flags; + unsigned long tmp = 0; void *entry; int ret;
@@ -518,8 +519,7 @@ int ida_find_first_range(struct ida *ida, unsigned int min, unsigned int max) }
if (xa_is_value(entry)) { - unsigned long tmp = xa_to_value(entry); - + tmp = xa_to_value(entry); addr = &tmp; size = BITS_PER_XA_VALUE; } else {
On Thu, Sep 12, 2024 at 06:17:26AM -0700, Yi Liu wrote:
There is no helpers for user to check if a given ID is allocated or not, neither a helper to loop all the allocated IDs in an IDA and do something for cleanup. With the two needs, a helper to get the lowest allocated ID of a range and two variants based on it.
Caller can check if a given ID is allocated or not by:
bool ida_exists(struct ida *ida, unsigned int id)
Caller can iterate all allocated IDs by:
int id; while ((id = ida_find_first(&pasid_ida)) > 0) { //anything to do with the allocated ID ida_free(pasid_ida, pasid); }
Cc: Matthew Wilcox (Oracle) willy@infradead.org Suggested-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com
include/linux/idr.h | 11 ++++++++ lib/idr.c | 67 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 78 insertions(+)
Reviewed-by: Jason Gunthorpe jgg@nvidia.com
Jason
From: Liu, Yi L yi.l.liu@intel.com Sent: Thursday, September 12, 2024 9:17 PM
There is no helpers for user to check if a given ID is allocated or not, neither a helper to loop all the allocated IDs in an IDA and do something for cleanup. With the two needs, a helper to get the lowest allocated ID of a range and two variants based on it.
Caller can check if a given ID is allocated or not by:
bool ida_exists(struct ida *ida, unsigned int id)
Caller can iterate all allocated IDs by:
int id; while ((id = ida_find_first(&pasid_ida)) > 0) { //anything to do with the allocated ID ida_free(pasid_ida, pasid); }
Cc: Matthew Wilcox (Oracle) willy@infradead.org Suggested-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
This adds pasid_at|de]tach_ioas ops for attaching hwpt to pasid of a device and the helpers for it. For now, only vfio-pci supports pasid attach/detach.
Signed-off-by: Kevin Tian kevin.tian@intel.com Signed-off-by: Yi Liu yi.l.liu@intel.com --- drivers/vfio/iommufd.c | 50 +++++++++++++++++++++++++++++++++++++ drivers/vfio/pci/vfio_pci.c | 2 ++ include/linux/vfio.h | 11 ++++++++ 3 files changed, 63 insertions(+)
diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c index 82eba6966fa5..2f5cb4f616ce 100644 --- a/drivers/vfio/iommufd.c +++ b/drivers/vfio/iommufd.c @@ -119,14 +119,22 @@ int vfio_iommufd_physical_bind(struct vfio_device *vdev, if (IS_ERR(idev)) return PTR_ERR(idev); vdev->iommufd_device = idev; + ida_init(&vdev->pasids); return 0; } EXPORT_SYMBOL_GPL(vfio_iommufd_physical_bind);
void vfio_iommufd_physical_unbind(struct vfio_device *vdev) { + int pasid; + lockdep_assert_held(&vdev->dev_set->lock);
+ while ((pasid = ida_find_first(&vdev->pasids)) >= 0) { + iommufd_device_pasid_detach(vdev->iommufd_device, pasid); + ida_free(&vdev->pasids, pasid); + } + if (vdev->iommufd_attached) { iommufd_device_detach(vdev->iommufd_device); vdev->iommufd_attached = false; @@ -168,6 +176,48 @@ void vfio_iommufd_physical_detach_ioas(struct vfio_device *vdev) } EXPORT_SYMBOL_GPL(vfio_iommufd_physical_detach_ioas);
+int vfio_iommufd_physical_pasid_attach_ioas(struct vfio_device *vdev, + u32 pasid, u32 *pt_id) +{ + int rc; + + lockdep_assert_held(&vdev->dev_set->lock); + + if (WARN_ON(!vdev->iommufd_device)) + return -EINVAL; + + if (ida_exists(&vdev->pasids, pasid)) + return iommufd_device_pasid_replace(vdev->iommufd_device, + pasid, pt_id); + + rc = ida_alloc_range(&vdev->pasids, pasid, pasid, GFP_KERNEL); + if (rc < 0) + return rc; + + rc = iommufd_device_pasid_attach(vdev->iommufd_device, pasid, pt_id); + if (rc) + ida_free(&vdev->pasids, pasid); + + return 0; +} +EXPORT_SYMBOL_GPL(vfio_iommufd_physical_pasid_attach_ioas); + +void vfio_iommufd_physical_pasid_detach_ioas(struct vfio_device *vdev, + u32 pasid) +{ + lockdep_assert_held(&vdev->dev_set->lock); + + if (WARN_ON(!vdev->iommufd_device)) + return; + + if (!ida_exists(&vdev->pasids, pasid)) + return; + + iommufd_device_pasid_detach(vdev->iommufd_device, pasid); + ida_free(&vdev->pasids, pasid); +} +EXPORT_SYMBOL_GPL(vfio_iommufd_physical_pasid_detach_ioas); + /* * The emulated standard ops mean that vfio_device is going to use the * "mdev path" and will call vfio_pin_pages()/vfio_dma_rw(). Drivers using this diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index e727941f589d..6f7ae7e5b7b0 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -144,6 +144,8 @@ static const struct vfio_device_ops vfio_pci_ops = { .unbind_iommufd = vfio_iommufd_physical_unbind, .attach_ioas = vfio_iommufd_physical_attach_ioas, .detach_ioas = vfio_iommufd_physical_detach_ioas, + .pasid_attach_ioas = vfio_iommufd_physical_pasid_attach_ioas, + .pasid_detach_ioas = vfio_iommufd_physical_pasid_detach_ioas, };
static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id) diff --git a/include/linux/vfio.h b/include/linux/vfio.h index 000a6cab2d31..11b3b453752e 100644 --- a/include/linux/vfio.h +++ b/include/linux/vfio.h @@ -67,6 +67,7 @@ struct vfio_device { struct inode *inode; #if IS_ENABLED(CONFIG_IOMMUFD) struct iommufd_device *iommufd_device; + struct ida pasids; u8 iommufd_attached:1; #endif u8 cdev_opened:1; @@ -91,6 +92,8 @@ struct vfio_device { * bound iommufd. Undo in unbind_iommufd if @detach_ioas is not * called. * @detach_ioas: Opposite of attach_ioas + * @pasid_attach_ioas: The pasid variation of attach_ioas + * @pasid_detach_ioas: Opposite of pasid_attach_ioas * @open_device: Called when the first file descriptor is opened for this device * @close_device: Opposite of open_device * @read: Perform read(2) on device file descriptor @@ -115,6 +118,8 @@ struct vfio_device_ops { void (*unbind_iommufd)(struct vfio_device *vdev); int (*attach_ioas)(struct vfio_device *vdev, u32 *pt_id); void (*detach_ioas)(struct vfio_device *vdev); + int (*pasid_attach_ioas)(struct vfio_device *vdev, u32 pasid, u32 *pt_id); + void (*pasid_detach_ioas)(struct vfio_device *vdev, u32 pasid); int (*open_device)(struct vfio_device *vdev); void (*close_device)(struct vfio_device *vdev); ssize_t (*read)(struct vfio_device *vdev, char __user *buf, @@ -139,6 +144,8 @@ int vfio_iommufd_physical_bind(struct vfio_device *vdev, void vfio_iommufd_physical_unbind(struct vfio_device *vdev); int vfio_iommufd_physical_attach_ioas(struct vfio_device *vdev, u32 *pt_id); void vfio_iommufd_physical_detach_ioas(struct vfio_device *vdev); +int vfio_iommufd_physical_pasid_attach_ioas(struct vfio_device *vdev, u32 pasid, u32 *pt_id); +void vfio_iommufd_physical_pasid_detach_ioas(struct vfio_device *vdev, u32 pasid); int vfio_iommufd_emulated_bind(struct vfio_device *vdev, struct iommufd_ctx *ictx, u32 *out_device_id); void vfio_iommufd_emulated_unbind(struct vfio_device *vdev); @@ -166,6 +173,10 @@ vfio_iommufd_get_dev_id(struct vfio_device *vdev, struct iommufd_ctx *ictx) ((int (*)(struct vfio_device *vdev, u32 *pt_id)) NULL) #define vfio_iommufd_physical_detach_ioas \ ((void (*)(struct vfio_device *vdev)) NULL) +#define vfio_iommufd_physical_pasid_attach_ioas \ + ((int (*)(struct vfio_device *vdev, u32 pasid, u32 *pt_id)) NULL) +#define vfio_iommufd_physical_pasid_detach_ioas \ + ((void (*)(struct vfio_device *vdev, u32 pasid)) NULL) #define vfio_iommufd_emulated_bind \ ((int (*)(struct vfio_device *vdev, struct iommufd_ctx *ictx, \ u32 *out_device_id)) NULL)
On Thu, Sep 12, 2024 at 06:17:27AM -0700, Yi Liu wrote:
This adds pasid_at|de]tach_ioas ops for attaching hwpt to pasid of a device and the helpers for it. For now, only vfio-pci supports pasid attach/detach.
Signed-off-by: Kevin Tian kevin.tian@intel.com Signed-off-by: Yi Liu yi.l.liu@intel.com
drivers/vfio/iommufd.c | 50 +++++++++++++++++++++++++++++++++++++ drivers/vfio/pci/vfio_pci.c | 2 ++ include/linux/vfio.h | 11 ++++++++ 3 files changed, 63 insertions(+)
Reviewed-by: Jason Gunthorpe jgg@nvidia.com
Jason
This adds ioctls for the userspace to attach/detach a given pasid of a vfio device to/from an IOAS/HWPT.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com --- drivers/vfio/device_cdev.c | 51 +++++++++++++++++++++++++++++++++++ drivers/vfio/vfio.h | 4 +++ drivers/vfio/vfio_main.c | 8 ++++++ include/uapi/linux/vfio.h | 55 ++++++++++++++++++++++++++++++++++++++ 4 files changed, 118 insertions(+)
diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c index bb1817bd4ff3..275918275fb0 100644 --- a/drivers/vfio/device_cdev.c +++ b/drivers/vfio/device_cdev.c @@ -217,6 +217,57 @@ int vfio_df_ioctl_detach_pt(struct vfio_device_file *df, return 0; }
+int vfio_df_ioctl_pasid_attach_pt(struct vfio_device_file *df, + struct vfio_device_pasid_attach_iommufd_pt __user *arg) +{ + struct vfio_device_pasid_attach_iommufd_pt attach; + struct vfio_device *device = df->device; + unsigned long minsz; + int ret; + + minsz = offsetofend(struct vfio_device_pasid_attach_iommufd_pt, pt_id); + + if (copy_from_user(&attach, arg, minsz)) + return -EFAULT; + + if (attach.argsz < minsz || attach.flags) + return -EINVAL; + + if (!device->ops->pasid_attach_ioas) + return -EOPNOTSUPP; + + mutex_lock(&device->dev_set->lock); + ret = device->ops->pasid_attach_ioas(device, attach.pasid, &attach.pt_id); + mutex_unlock(&device->dev_set->lock); + + return ret; +} + +int vfio_df_ioctl_pasid_detach_pt(struct vfio_device_file *df, + struct vfio_device_pasid_detach_iommufd_pt __user *arg) +{ + struct vfio_device_pasid_detach_iommufd_pt detach; + struct vfio_device *device = df->device; + unsigned long minsz; + + minsz = offsetofend(struct vfio_device_pasid_detach_iommufd_pt, pasid); + + if (copy_from_user(&detach, arg, minsz)) + return -EFAULT; + + if (detach.argsz < minsz || detach.flags) + return -EINVAL; + + if (!device->ops->pasid_detach_ioas) + return -EOPNOTSUPP; + + mutex_lock(&device->dev_set->lock); + device->ops->pasid_detach_ioas(device, detach.pasid); + mutex_unlock(&device->dev_set->lock); + + return 0; +} + static char *vfio_device_devnode(const struct device *dev, umode_t *mode) { return kasprintf(GFP_KERNEL, "vfio/devices/%s", dev_name(dev)); diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h index 50128da18bca..20d3cb283ba0 100644 --- a/drivers/vfio/vfio.h +++ b/drivers/vfio/vfio.h @@ -353,6 +353,10 @@ int vfio_df_ioctl_attach_pt(struct vfio_device_file *df, struct vfio_device_attach_iommufd_pt __user *arg); int vfio_df_ioctl_detach_pt(struct vfio_device_file *df, struct vfio_device_detach_iommufd_pt __user *arg); +int vfio_df_ioctl_pasid_attach_pt(struct vfio_device_file *df, + struct vfio_device_pasid_attach_iommufd_pt __user *arg); +int vfio_df_ioctl_pasid_detach_pt(struct vfio_device_file *df, + struct vfio_device_pasid_detach_iommufd_pt __user *arg);
#if IS_ENABLED(CONFIG_VFIO_DEVICE_CDEV) void vfio_init_device_cdev(struct vfio_device *device); diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c index a5a62d9d963f..577cba9d3a01 100644 --- a/drivers/vfio/vfio_main.c +++ b/drivers/vfio/vfio_main.c @@ -1286,6 +1286,14 @@ static long vfio_device_fops_unl_ioctl(struct file *filep, case VFIO_DEVICE_DETACH_IOMMUFD_PT: ret = vfio_df_ioctl_detach_pt(df, uptr); goto out; + + case VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT: + ret = vfio_df_ioctl_pasid_attach_pt(df, uptr); + goto out; + + case VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT: + ret = vfio_df_ioctl_pasid_detach_pt(df, uptr); + goto out; } }
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 2b68e6cdf190..889702ae12bc 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -977,6 +977,61 @@ struct vfio_device_detach_iommufd_pt {
#define VFIO_DEVICE_DETACH_IOMMUFD_PT _IO(VFIO_TYPE, VFIO_BASE + 20)
+/* + * VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT - _IOW(VFIO_TYPE, VFIO_BASE + 21, + * struct vfio_device_pasid_attach_iommufd_pt) + * @argsz: User filled size of this data. + * @flags: Must be 0. + * @pasid: The pasid to be attached. + * @pt_id: Input the target id which can represent an ioas or a hwpt + * allocated via iommufd subsystem. + * Output the input ioas id or the attached hwpt id which could + * be the specified hwpt itself or a hwpt automatically created + * for the specified ioas by kernel during the attachment. + * + * Associate a pasid with an address space within the bound iommufd. Undo by + * VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT or device fd close. This is only allowed + * on cdev fds. + * + * If a pasid is currently attached to a valid hwpt, without doing a + * VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT, a second + * VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT ioctl passing in another hwpt id is + * allowed. This action, also known as a hwpt replacement, will replace the + * pasid's currently attached hwpt with a new hwpt corresponding to the given + * @pt_id. + * + * Return: 0 on success, -errno on failure. + */ +struct vfio_device_pasid_attach_iommufd_pt { + __u32 argsz; + __u32 flags; + __u32 pasid; + __u32 pt_id; +}; + +#define VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT _IO(VFIO_TYPE, VFIO_BASE + 21) + +/* + * VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT - _IOW(VFIO_TYPE, VFIO_BASE + 22, + * struct vfio_device_pasid_detach_iommufd_pt) + * @argsz: User filled size of this data. + * @flags: Must be 0. + * @pasid: The pasid to be detached. + * + * Remove the association of a pasid (of a cdev device) and its current + * associated address space. After it, the pasid of the device should be + * in a blocking DMA state. This is only allowed on cdev fds. + * + * Return: 0 on success, -errno on failure. + */ +struct vfio_device_pasid_detach_iommufd_pt { + __u32 argsz; + __u32 flags; + __u32 pasid; +}; + +#define VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT _IO(VFIO_TYPE, VFIO_BASE + 22) + /* * Provide support for setting a PCI VF Token, which is used as a shared * secret between PF and VF drivers. This feature may only be set on a
On Thu, Sep 12, 2024 at 06:17:28AM -0700, Yi Liu wrote:
This adds ioctls for the userspace to attach/detach a given pasid of a vfio device to/from an IOAS/HWPT.
Reviewed-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com
drivers/vfio/device_cdev.c | 51 +++++++++++++++++++++++++++++++++++ drivers/vfio/vfio.h | 4 +++ drivers/vfio/vfio_main.c | 8 ++++++ include/uapi/linux/vfio.h | 55 ++++++++++++++++++++++++++++++++++++++ 4 files changed, 118 insertions(+)
Reviewed-by: Jason Gunthorpe jgg@nvidia.com
Jason
From: Liu, Yi L yi.l.liu@intel.com Sent: Thursday, September 12, 2024 9:17 PM
+/*
- VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT - _IOW(VFIO_TYPE,
VFIO_BASE + 21,
struct
vfio_device_pasid_attach_iommufd_pt)
- @argsz: User filled size of this data.
- @flags: Must be 0.
- @pasid: The pasid to be attached.
- @pt_id: Input the target id which can represent an ioas or a hwpt
allocated via iommufd subsystem.
Output the input ioas id or the attached hwpt id which could
be the specified hwpt itself or a hwpt automatically created
for the specified ioas by kernel during the attachment.
- Associate a pasid with an address space within the bound iommufd.
Undo by
- VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT or device fd close. This is
only allowed
- on cdev fds.
- If a pasid is currently attached to a valid hwpt, without doing a
- VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT, a second
- VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT ioctl passing in another
hwpt id is
- allowed. This action, also known as a hwpt replacement, will replace the
- pasid's currently attached hwpt with a new hwpt corresponding to the
given
- @pt_id.
- Return: 0 on success, -errno on failure.
- */
+struct vfio_device_pasid_attach_iommufd_pt {
- __u32 argsz;
- __u32 flags;
- __u32 pasid;
- __u32 pt_id;
+};
+#define VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT _IO(VFIO_TYPE, VFIO_BASE + 21)
Not sure whether this was discussed before. Does it make sense to reuse the existing VFIO_DEVICE_ATTACH_IOMMUFD_PT by introducing a new pasid field and a new flag bit?
On Mon, Sep 30, 2024 at 07:55:08AM +0000, Tian, Kevin wrote:
+struct vfio_device_pasid_attach_iommufd_pt {
- __u32 argsz;
- __u32 flags;
- __u32 pasid;
- __u32 pt_id;
+};
+#define VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT _IO(VFIO_TYPE, VFIO_BASE + 21)
Not sure whether this was discussed before. Does it make sense to reuse the existing VFIO_DEVICE_ATTACH_IOMMUFD_PT by introducing a new pasid field and a new flag bit?
Maybe? I don't have a strong feeling either way.
There is somewhat less code if you reuse the ioctl at least
Jason
On 2024/10/1 20:11, Jason Gunthorpe wrote:
On Mon, Sep 30, 2024 at 07:55:08AM +0000, Tian, Kevin wrote:
+struct vfio_device_pasid_attach_iommufd_pt {
- __u32 argsz;
- __u32 flags;
- __u32 pasid;
- __u32 pt_id;
+};
+#define VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT _IO(VFIO_TYPE, VFIO_BASE + 21)
Not sure whether this was discussed before. Does it make sense to reuse the existing VFIO_DEVICE_ATTACH_IOMMUFD_PT by introducing a new pasid field and a new flag bit?
Maybe? I don't have a strong feeling either way.
There is somewhat less code if you reuse the ioctl at least
I had a rough memory that I was suggested to add a separate ioctl for PASID. Let's see Alex's opinion.
@Alex.
On Sat, 12 Oct 2024 21:49:05 +0800 Yi Liu yi.l.liu@intel.com wrote:
On 2024/10/1 20:11, Jason Gunthorpe wrote:
On Mon, Sep 30, 2024 at 07:55:08AM +0000, Tian, Kevin wrote:
+struct vfio_device_pasid_attach_iommufd_pt {
- __u32 argsz;
- __u32 flags;
- __u32 pasid;
- __u32 pt_id;
+};
+#define VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT _IO(VFIO_TYPE, VFIO_BASE + 21)
Not sure whether this was discussed before. Does it make sense to reuse the existing VFIO_DEVICE_ATTACH_IOMMUFD_PT by introducing a new pasid field and a new flag bit?
Maybe? I don't have a strong feeling either way.
There is somewhat less code if you reuse the ioctl at least
I had a rough memory that I was suggested to add a separate ioctl for PASID. Let's see Alex's opinion.
I don't recall any previous arguments for separate ioctls, but it seems to make a lot of sense to me to extend the existing ioctls with a flag to indicate pasid cscope and id. Thanks,
Alex
On 2024/10/14 23:49, Alex Williamson wrote:
On Sat, 12 Oct 2024 21:49:05 +0800 Yi Liu yi.l.liu@intel.com wrote:
On 2024/10/1 20:11, Jason Gunthorpe wrote:
On Mon, Sep 30, 2024 at 07:55:08AM +0000, Tian, Kevin wrote:
+struct vfio_device_pasid_attach_iommufd_pt {
- __u32 argsz;
- __u32 flags;
- __u32 pasid;
- __u32 pt_id;
+};
+#define VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT _IO(VFIO_TYPE, VFIO_BASE + 21)
Not sure whether this was discussed before. Does it make sense to reuse the existing VFIO_DEVICE_ATTACH_IOMMUFD_PT by introducing a new pasid field and a new flag bit?
Maybe? I don't have a strong feeling either way.
There is somewhat less code if you reuse the ioctl at least
I had a rough memory that I was suggested to add a separate ioctl for PASID. Let's see Alex's opinion.
I don't recall any previous arguments for separate ioctls, but it seems to make a lot of sense to me to extend the existing ioctls with a flag to indicate pasid cscope and id. Thanks,
thanks for the confirmation. How about the below?
diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c index bb1817bd4ff3..c78533bce3c6 100644 --- a/drivers/vfio/device_cdev.c +++ b/drivers/vfio/device_cdev.c @@ -162,21 +162,34 @@ void vfio_df_unbind_iommufd(struct vfio_device_file *df) int vfio_df_ioctl_attach_pt(struct vfio_device_file *df, struct vfio_device_attach_iommufd_pt __user *arg) { - struct vfio_device *device = df->device; + unsigned long minsz = offsetofend( + struct vfio_device_attach_iommufd_pt, pt_id); struct vfio_device_attach_iommufd_pt attach; - unsigned long minsz; + struct vfio_device *device = df->device; + u32 user_size; int ret;
- minsz = offsetofend(struct vfio_device_attach_iommufd_pt, pt_id); + ret = get_user(user_size, (u32 __user *)arg); + if (ret) + return ret;
- if (copy_from_user(&attach, arg, minsz)) - return -EFAULT; + ret = copy_struct_from_user(&attach, sizeof(attach), arg, user_size); + if (ret) + return ret;
- if (attach.argsz < minsz || attach.flags) + if (attach.argsz < minsz || attach.flags & (~VFIO_DEVICE_ATTACH_PASID)) return -EINVAL;
+ if ((attach.flags & VFIO_DEVICE_ATTACH_PASID) && + !device->ops->pasid_attach_ioas) + return -EOPNOTSUPP; + mutex_lock(&device->dev_set->lock); - ret = device->ops->attach_ioas(device, &attach.pt_id); + if (attach.flags & VFIO_DEVICE_ATTACH_PASID) + ret = device->ops->pasid_attach_ioas(device, attach.pasid, + &attach.pt_id); + else + ret = device->ops->attach_ioas(device, &attach.pt_id); if (ret) goto out_unlock;
@@ -198,20 +211,33 @@ int vfio_df_ioctl_attach_pt(struct vfio_device_file *df, int vfio_df_ioctl_detach_pt(struct vfio_device_file *df, struct vfio_device_detach_iommufd_pt __user *arg) { - struct vfio_device *device = df->device; + unsigned long minsz = + offsetofend(struct vfio_device_detach_iommufd_pt, flags); struct vfio_device_detach_iommufd_pt detach; - unsigned long minsz; + struct vfio_device *device = df->device; + u32 user_size; + int ret;
- minsz = offsetofend(struct vfio_device_detach_iommufd_pt, flags); + ret = get_user(user_size, (u32 __user *)arg); + if (ret) + return ret;
- if (copy_from_user(&detach, arg, minsz)) - return -EFAULT; + ret = copy_struct_from_user(&detach, sizeof(detach), arg, user_size); + if (ret) + return ret;
- if (detach.argsz < minsz || detach.flags) + if (detach.argsz < minsz || detach.flags & (~VFIO_DEVICE_DETACH_PASID)) return -EINVAL;
+ if ((detach.flags & VFIO_DEVICE_DETACH_PASID) && + !device->ops->pasid_detach_ioas) + return -EOPNOTSUPP; + mutex_lock(&device->dev_set->lock); - device->ops->detach_ioas(device); + if (detach.flags & VFIO_DEVICE_DETACH_PASID) + device->ops->pasid_detach_ioas(device, detach.pasid); + else + device->ops->detach_ioas(device); mutex_unlock(&device->dev_set->lock);
return 0; diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 2b68e6cdf190..40b414e642f5 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -931,29 +931,34 @@ struct vfio_device_bind_iommufd { * VFIO_DEVICE_ATTACH_IOMMUFD_PT - _IOW(VFIO_TYPE, VFIO_BASE + 19, * struct vfio_device_attach_iommufd_pt) * @argsz: User filled size of this data. - * @flags: Must be 0. + * @flags: Flags for attach. * @pt_id: Input the target id which can represent an ioas or a hwpt * allocated via iommufd subsystem. * Output the input ioas id or the attached hwpt id which could * be the specified hwpt itself or a hwpt automatically created * for the specified ioas by kernel during the attachment. + * @pasid: The pasid to be attached, only meaningful when + * VFIO_DEVICE_ATTACH_PASID is set in @flags * * Associate the device with an address space within the bound iommufd. * Undo by VFIO_DEVICE_DETACH_IOMMUFD_PT or device fd close. This is only * allowed on cdev fds. * - * If a vfio device is currently attached to a valid hw_pagetable, without doing - * a VFIO_DEVICE_DETACH_IOMMUFD_PT, a second VFIO_DEVICE_ATTACH_IOMMUFD_PT ioctl - * passing in another hw_pagetable (hwpt) id is allowed. This action, also known - * as a hw_pagetable replacement, will replace the device's currently attached - * hw_pagetable with a new hw_pagetable corresponding to the given pt_id. + * If a vfio device or a pasid of this device is currently attached to a valid + * hw_pagetable (hwpt), without doing a VFIO_DEVICE_DETACH_IOMMUFD_PT, a second + * VFIO_DEVICE_ATTACH_IOMMUFD_PT ioctl passing in another hwpt id is allowed. + * This action, also known as a hw_pagetable replacement, will replace the + * currently attached hwpt of the device or the pasid of this device with a new + * hwpt corresponding to the given pt_id. * * Return: 0 on success, -errno on failure. */ struct vfio_device_attach_iommufd_pt { __u32 argsz; __u32 flags; +#define VFIO_DEVICE_ATTACH_PASID (1 << 0) __u32 pt_id; + __u32 pasid; };
#define VFIO_DEVICE_ATTACH_IOMMUFD_PT _IO(VFIO_TYPE, VFIO_BASE + 19) @@ -962,17 +967,21 @@ struct vfio_device_attach_iommufd_pt { * VFIO_DEVICE_DETACH_IOMMUFD_PT - _IOW(VFIO_TYPE, VFIO_BASE + 20, * struct vfio_device_detach_iommufd_pt) * @argsz: User filled size of this data. - * @flags: Must be 0. + * @flags: Flags for detach. + * @pasid: The pasid to be detached, only meaningful when + * VFIO_DEVICE_DETACH_PASID is set in @flags * - * Remove the association of the device and its current associated address - * space. After it, the device should be in a blocking DMA state. This is only - * allowed on cdev fds. + * Remove the association of the device or a pasid of the device and its current + * associated address space. After it, the device or the pasid should be in a + * blocking DMA state. This is only allowed on cdev fds. * * Return: 0 on success, -errno on failure. */ struct vfio_device_detach_iommufd_pt { __u32 argsz; __u32 flags; +#define VFIO_DEVICE_DETACH_PASID (1 << 0) + __u32 pasid; };
#define VFIO_DEVICE_DETACH_IOMMUFD_PT _IO(VFIO_TYPE, VFIO_BASE + 20)
On Tue, 15 Oct 2024 19:07:52 +0800 Yi Liu yi.l.liu@intel.com wrote:
On 2024/10/14 23:49, Alex Williamson wrote:
On Sat, 12 Oct 2024 21:49:05 +0800 Yi Liu yi.l.liu@intel.com wrote:
On 2024/10/1 20:11, Jason Gunthorpe wrote:
On Mon, Sep 30, 2024 at 07:55:08AM +0000, Tian, Kevin wrote:
+struct vfio_device_pasid_attach_iommufd_pt {
- __u32 argsz;
- __u32 flags;
- __u32 pasid;
- __u32 pt_id;
+};
+#define VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT _IO(VFIO_TYPE, VFIO_BASE + 21)
Not sure whether this was discussed before. Does it make sense to reuse the existing VFIO_DEVICE_ATTACH_IOMMUFD_PT by introducing a new pasid field and a new flag bit?
Maybe? I don't have a strong feeling either way.
There is somewhat less code if you reuse the ioctl at least
I had a rough memory that I was suggested to add a separate ioctl for PASID. Let's see Alex's opinion.
I don't recall any previous arguments for separate ioctls, but it seems to make a lot of sense to me to extend the existing ioctls with a flag to indicate pasid cscope and id. Thanks,
thanks for the confirmation. How about the below?
diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c index bb1817bd4ff3..c78533bce3c6 100644 --- a/drivers/vfio/device_cdev.c +++ b/drivers/vfio/device_cdev.c @@ -162,21 +162,34 @@ void vfio_df_unbind_iommufd(struct vfio_device_file *df) int vfio_df_ioctl_attach_pt(struct vfio_device_file *df, struct vfio_device_attach_iommufd_pt __user *arg) {
- struct vfio_device *device = df->device;
- unsigned long minsz = offsetofend(
struct vfio_device_attach_iommufd_pt attach;struct vfio_device_attach_iommufd_pt, pt_id);
- unsigned long minsz;
- struct vfio_device *device = df->device;
- u32 user_size; int ret;
- minsz = offsetofend(struct vfio_device_attach_iommufd_pt, pt_id);
- ret = get_user(user_size, (u32 __user *)arg);
- if (ret)
return ret;
- if (copy_from_user(&attach, arg, minsz))
return -EFAULT;
- ret = copy_struct_from_user(&attach, sizeof(attach), arg, user_size);
- if (ret)
return ret;
I think this could break current users. For better or worse, we don't currently have any requirements for the remainder of the user buffer, whereas copy_struct_from_user() returns an error for non-zero trailing bytes. I think we need to monotonically increase the structure size, but maybe something more like below, using flags. The expectation would be that if we add another flag that extends the structure, we'd test that flag after PASID and clobber xend to a new value further into the new structure. We'd also add that flag to the flags mask, but we'd share the copy code.
if (attach.argsz < minsz) return -EINVAL;
if (attach.flags & (~VFIO_DEVICE_ATTACH_PASID)) return -EINVAL;
if (attach.flags & VFIO_DEVICE_ATTACH_PASID) xend = offsetofend(struct vfio_device_attach_iommufd_pt, pasid);
if (xend) { if (attach.argsz < xend) return -EINVAL; if (copy_from_user((void *)&attach + minsz, (void __user *)arg + minsz, xend - minsz)) return -EFAULT; }
Maybe there are still more elegant options available.
We also generally try to label flags with FLAGS in the name, but it does get rather unwieldy, so I'm open to suggestions. Thanks,
Alex
- if (attach.argsz < minsz || attach.flags)
if (attach.argsz < minsz || attach.flags & (~VFIO_DEVICE_ATTACH_PASID)) return -EINVAL;
if ((attach.flags & VFIO_DEVICE_ATTACH_PASID) &&
!device->ops->pasid_attach_ioas)
return -EOPNOTSUPP;
mutex_lock(&device->dev_set->lock);
- ret = device->ops->attach_ioas(device, &attach.pt_id);
- if (attach.flags & VFIO_DEVICE_ATTACH_PASID)
ret = device->ops->pasid_attach_ioas(device, attach.pasid,
&attach.pt_id);
- else
if (ret) goto out_unlock;ret = device->ops->attach_ioas(device, &attach.pt_id);
@@ -198,20 +211,33 @@ int vfio_df_ioctl_attach_pt(struct vfio_device_file *df, int vfio_df_ioctl_detach_pt(struct vfio_device_file *df, struct vfio_device_detach_iommufd_pt __user *arg) {
- struct vfio_device *device = df->device;
- unsigned long minsz =
struct vfio_device_detach_iommufd_pt detach;offsetofend(struct vfio_device_detach_iommufd_pt, flags);
- unsigned long minsz;
- struct vfio_device *device = df->device;
- u32 user_size;
- int ret;
- minsz = offsetofend(struct vfio_device_detach_iommufd_pt, flags);
- ret = get_user(user_size, (u32 __user *)arg);
- if (ret)
return ret;
- if (copy_from_user(&detach, arg, minsz))
return -EFAULT;
- ret = copy_struct_from_user(&detach, sizeof(detach), arg, user_size);
- if (ret)
return ret;
- if (detach.argsz < minsz || detach.flags)
if (detach.argsz < minsz || detach.flags & (~VFIO_DEVICE_DETACH_PASID)) return -EINVAL;
if ((detach.flags & VFIO_DEVICE_DETACH_PASID) &&
!device->ops->pasid_detach_ioas)
return -EOPNOTSUPP;
mutex_lock(&device->dev_set->lock);
- device->ops->detach_ioas(device);
if (detach.flags & VFIO_DEVICE_DETACH_PASID)
device->ops->pasid_detach_ioas(device, detach.pasid);
else
device->ops->detach_ioas(device);
mutex_unlock(&device->dev_set->lock);
return 0;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 2b68e6cdf190..40b414e642f5 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -931,29 +931,34 @@ struct vfio_device_bind_iommufd {
- VFIO_DEVICE_ATTACH_IOMMUFD_PT - _IOW(VFIO_TYPE, VFIO_BASE + 19,
struct vfio_device_attach_iommufd_pt)
- @argsz: User filled size of this data.
- @flags: Must be 0.
- @flags: Flags for attach.
- @pt_id: Input the target id which can represent an ioas or a hwpt
allocated via iommufd subsystem.
Output the input ioas id or the attached hwpt id which could
be the specified hwpt itself or a hwpt automatically created
for the specified ioas by kernel during the attachment.
- @pasid: The pasid to be attached, only meaningful when
VFIO_DEVICE_ATTACH_PASID is set in @flags
- Associate the device with an address space within the bound iommufd.
- Undo by VFIO_DEVICE_DETACH_IOMMUFD_PT or device fd close. This is only
- allowed on cdev fds.
- If a vfio device is currently attached to a valid hw_pagetable, without
doing
- a VFIO_DEVICE_DETACH_IOMMUFD_PT, a second VFIO_DEVICE_ATTACH_IOMMUFD_PT
ioctl
- passing in another hw_pagetable (hwpt) id is allowed. This action, also
known
- as a hw_pagetable replacement, will replace the device's currently attached
- hw_pagetable with a new hw_pagetable corresponding to the given pt_id.
- If a vfio device or a pasid of this device is currently attached to a valid
- hw_pagetable (hwpt), without doing a VFIO_DEVICE_DETACH_IOMMUFD_PT, a
second
- VFIO_DEVICE_ATTACH_IOMMUFD_PT ioctl passing in another hwpt id is allowed.
- This action, also known as a hw_pagetable replacement, will replace the
- currently attached hwpt of the device or the pasid of this device with
a new
*/ struct vfio_device_attach_iommufd_pt { __u32 argsz; __u32 flags;
- hwpt corresponding to the given pt_id.
- Return: 0 on success, -errno on failure.
+#define VFIO_DEVICE_ATTACH_PASID (1 << 0) __u32 pt_id;
__u32 pasid; };
#define VFIO_DEVICE_ATTACH_IOMMUFD_PT _IO(VFIO_TYPE, VFIO_BASE + 19)
@@ -962,17 +967,21 @@ struct vfio_device_attach_iommufd_pt {
- VFIO_DEVICE_DETACH_IOMMUFD_PT - _IOW(VFIO_TYPE, VFIO_BASE + 20,
struct vfio_device_detach_iommufd_pt)
- @argsz: User filled size of this data.
- @flags: Must be 0.
- @flags: Flags for detach.
- @pasid: The pasid to be detached, only meaningful when
VFIO_DEVICE_DETACH_PASID is set in @flags
- Remove the association of the device and its current associated address
- space. After it, the device should be in a blocking DMA state. This
is only
- allowed on cdev fds.
- Remove the association of the device or a pasid of the device and its
current
- associated address space. After it, the device or the pasid should be in a
*/ struct vfio_device_detach_iommufd_pt { __u32 argsz; __u32 flags;
- blocking DMA state. This is only allowed on cdev fds.
- Return: 0 on success, -errno on failure.
+#define VFIO_DEVICE_DETACH_PASID (1 << 0)
__u32 pasid; };
#define VFIO_DEVICE_DETACH_IOMMUFD_PT _IO(VFIO_TYPE, VFIO_BASE + 20)
On 2024/10/16 00:22, Alex Williamson wrote:
On Tue, 15 Oct 2024 19:07:52 +0800 Yi Liu yi.l.liu@intel.com wrote:
On 2024/10/14 23:49, Alex Williamson wrote:
On Sat, 12 Oct 2024 21:49:05 +0800 Yi Liu yi.l.liu@intel.com wrote:
On 2024/10/1 20:11, Jason Gunthorpe wrote:
On Mon, Sep 30, 2024 at 07:55:08AM +0000, Tian, Kevin wrote:
> +struct vfio_device_pasid_attach_iommufd_pt { > + __u32 argsz; > + __u32 flags; > + __u32 pasid; > + __u32 pt_id; > +}; > + > +#define VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT _IO(VFIO_TYPE, > VFIO_BASE + 21)
Not sure whether this was discussed before. Does it make sense to reuse the existing VFIO_DEVICE_ATTACH_IOMMUFD_PT by introducing a new pasid field and a new flag bit?
Maybe? I don't have a strong feeling either way.
There is somewhat less code if you reuse the ioctl at least
I had a rough memory that I was suggested to add a separate ioctl for PASID. Let's see Alex's opinion.
I don't recall any previous arguments for separate ioctls, but it seems to make a lot of sense to me to extend the existing ioctls with a flag to indicate pasid cscope and id. Thanks,
thanks for the confirmation. How about the below?
diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c index bb1817bd4ff3..c78533bce3c6 100644 --- a/drivers/vfio/device_cdev.c +++ b/drivers/vfio/device_cdev.c @@ -162,21 +162,34 @@ void vfio_df_unbind_iommufd(struct vfio_device_file *df) int vfio_df_ioctl_attach_pt(struct vfio_device_file *df, struct vfio_device_attach_iommufd_pt __user *arg) {
- struct vfio_device *device = df->device;
- unsigned long minsz = offsetofend(
struct vfio_device_attach_iommufd_pt attach;struct vfio_device_attach_iommufd_pt, pt_id);
- unsigned long minsz;
- struct vfio_device *device = df->device;
- u32 user_size; int ret;
- minsz = offsetofend(struct vfio_device_attach_iommufd_pt, pt_id);
- ret = get_user(user_size, (u32 __user *)arg);
- if (ret)
return ret;
- if (copy_from_user(&attach, arg, minsz))
return -EFAULT;
- ret = copy_struct_from_user(&attach, sizeof(attach), arg, user_size);
- if (ret)
return ret;
I think this could break current users.
not quite get here. My understanding is as the below:
If the current user (compiled with the existing uapi struct), it will provide less data that the new kernel knows. The copy_struct_from_user() would zero the trailing bytes. And such user won't set the new flag, so it should be fine.
So to me, the trailing bytes concern comes when user is compiled with the new uapi struct but running on an eld kernel (say <= 6.12).But the eld kernel uses copy_from_user(), so even there is non-zero trailing bytes in the user buffer, the eld kernel just ignores them. So this seems not an issue to me so far.
For better or worse, we don't currently have any requirements for the remainder of the user buffer, whereas copy_struct_from_user() returns an error for non-zero trailing bytes.
This seems to be a general requirement when using copy_struct_from_user(). User needs to zero the fields that it does not intend to use. If there is no such requirement for user, then the trailing bytes can be a concern in the future but not this time as the existing kernel uses copy_from_user().
I think we need to monotonically increase the structure size, but maybe something more like below, using flags. The expectation would be that if we add another flag that extends the structure, we'd test that flag after PASID and clobber xend to a new value further into the new structure. We'd also add that flag to the flags mask, but we'd share the copy code.
agree, this share code might be needed for other path as well. Some macros I guess.
if (attach.argsz < minsz) return -EINVAL;
if (attach.flags & (~VFIO_DEVICE_ATTACH_PASID)) return -EINVAL;
if (attach.flags & VFIO_DEVICE_ATTACH_PASID) xend = offsetofend(struct vfio_device_attach_iommufd_pt, pasid);
if (xend) { if (attach.argsz < xend) return -EINVAL; if (copy_from_user((void *)&attach + minsz, (void __user *)arg + minsz, xend - minsz)) return -EFAULT; }
I think it might need to zero the trailing bytes if the user does not set the extended flag. is it?
Maybe there are still more elegant options available.
We also generally try to label flags with FLAGS in the name, but it does get rather unwieldy, so I'm open to suggestions. Thanks,
There is already examples that new field added to a kernel-to-user uapi struct like the vfio_region_info::cap_offset and vfio_device_info::cap_offset. But it might be a little bit different with the case we face here as it's user-to-kernel struct. It's a time for you to set a rule for such extensions. :)
On Wed, 16 Oct 2024 11:35:05 +0800 Yi Liu yi.l.liu@intel.com wrote:
On 2024/10/16 00:22, Alex Williamson wrote:
On Tue, 15 Oct 2024 19:07:52 +0800 Yi Liu yi.l.liu@intel.com wrote:
On 2024/10/14 23:49, Alex Williamson wrote:
On Sat, 12 Oct 2024 21:49:05 +0800 Yi Liu yi.l.liu@intel.com wrote:
On 2024/10/1 20:11, Jason Gunthorpe wrote:
On Mon, Sep 30, 2024 at 07:55:08AM +0000, Tian, Kevin wrote: >> +struct vfio_device_pasid_attach_iommufd_pt { >> + __u32 argsz; >> + __u32 flags; >> + __u32 pasid; >> + __u32 pt_id; >> +}; >> + >> +#define VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT _IO(VFIO_TYPE, >> VFIO_BASE + 21) > > Not sure whether this was discussed before. Does it make sense > to reuse the existing VFIO_DEVICE_ATTACH_IOMMUFD_PT > by introducing a new pasid field and a new flag bit?
Maybe? I don't have a strong feeling either way.
There is somewhat less code if you reuse the ioctl at least
I had a rough memory that I was suggested to add a separate ioctl for PASID. Let's see Alex's opinion.
I don't recall any previous arguments for separate ioctls, but it seems to make a lot of sense to me to extend the existing ioctls with a flag to indicate pasid cscope and id. Thanks,
thanks for the confirmation. How about the below?
diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c index bb1817bd4ff3..c78533bce3c6 100644 --- a/drivers/vfio/device_cdev.c +++ b/drivers/vfio/device_cdev.c @@ -162,21 +162,34 @@ void vfio_df_unbind_iommufd(struct vfio_device_file *df) int vfio_df_ioctl_attach_pt(struct vfio_device_file *df, struct vfio_device_attach_iommufd_pt __user *arg) {
- struct vfio_device *device = df->device;
- unsigned long minsz = offsetofend(
struct vfio_device_attach_iommufd_pt attach;struct vfio_device_attach_iommufd_pt, pt_id);
- unsigned long minsz;
- struct vfio_device *device = df->device;
- u32 user_size; int ret;
- minsz = offsetofend(struct vfio_device_attach_iommufd_pt, pt_id);
- ret = get_user(user_size, (u32 __user *)arg);
- if (ret)
return ret;
- if (copy_from_user(&attach, arg, minsz))
return -EFAULT;
- ret = copy_struct_from_user(&attach, sizeof(attach), arg, user_size);
- if (ret)
return ret;
I think this could break current users.
not quite get here. My understanding is as the below:
If the current user (compiled with the existing uapi struct), it will provide less data that the new kernel knows. The copy_struct_from_user() would zero the trailing bytes. And such user won't set the new flag, so it should be fine.
You're making an assumption that the user is passing exactly the existing struct size as argsz, which is not a requirement. The user could allocate a buffer page, argsz might be 4K.
So to me, the trailing bytes concern comes when user is compiled with the new uapi struct but running on an eld kernel (say <= 6.12).But the eld kernel uses copy_from_user(), so even there is non-zero trailing bytes in the user buffer, the eld kernel just ignores them. So this seems not an issue to me so far.
That's new userspace, old kernel. I'm referencing an issue with old userspace, new kernel, where old userspace has no requirement that argsz is exactly the structure size, nor that the trailing bytes from the structure size to argsz are zero'd.
For better or worse, we don't currently have any requirements for the remainder of the user buffer, whereas copy_struct_from_user() returns an error for non-zero trailing bytes.
This seems to be a general requirement when using copy_struct_from_user(). User needs to zero the fields that it does not intend to use. If there is no such requirement for user, then the trailing bytes can be a concern in the future but not this time as the existing kernel uses copy_from_user().
Exactly, the current implementation makes no requirement on trailing non-zero bytes, copy_struct_from_user() does.
I think we need to monotonically increase the structure size, but maybe something more like below, using flags. The expectation would be that if we add another flag that extends the structure, we'd test that flag after PASID and clobber xend to a new value further into the new structure. We'd also add that flag to the flags mask, but we'd share the copy code.
agree, this share code might be needed for other path as well. Some macros I guess.
if (attach.argsz < minsz) return -EINVAL;
if (attach.flags & (~VFIO_DEVICE_ATTACH_PASID)) return -EINVAL;
if (attach.flags & VFIO_DEVICE_ATTACH_PASID) xend = offsetofend(struct vfio_device_attach_iommufd_pt, pasid);
if (xend) { if (attach.argsz < xend) return -EINVAL; if (copy_from_user((void *)&attach + minsz, (void __user *)arg + minsz, xend - minsz)) return -EFAULT; }
I think it might need to zero the trailing bytes if the user does not set the extended flag. is it?
We could zero initialize the attach structure for safety, but the kernel side code should also be driven by flags. We should never look at a field beyond the base structure that isn't indicated by flags and copied from the user.
Maybe there are still more elegant options available.
We also generally try to label flags with FLAGS in the name, but it does get rather unwieldy, so I'm open to suggestions. Thanks,
There is already examples that new field added to a kernel-to-user uapi struct like the vfio_region_info::cap_offset and vfio_device_info::cap_offset. But it might be a little bit different with the case we face here as it's user-to-kernel struct. It's a time for you to set a rule for such extensions. :)
That's a returned structure, so it's a bit different, but the same philosophy, we don't break userspace. In that case we used argsz as an output field and flags to indicate there are capabilities. Old userspace ignores the flag and argsz semantics and continues to work. New userspace checks the flag, reallocs the buffer with argsz and retries. This is however an example of userspace providing an argsz value that exceeds the ioctl structure, with no requirement to zero the buffer, though it is an output structure rather than the input structure here.
I think the same holds here, our policy is to not break userspace. We potentially do break userspace if we impose a requirement that the trailing buffer size must now be zero. We have the flags field so that we don't need to blindly base the structure version on the size of the user buffer. We should use the flags field to determine relevant and valid fields beyond the base structure without imposing new requirements to userspace. Thanks,
Alex
On 2024/10/17 00:11, Alex Williamson wrote:
On Wed, 16 Oct 2024 11:35:05 +0800 Yi Liu yi.l.liu@intel.com wrote:
On 2024/10/16 00:22, Alex Williamson wrote:
On Tue, 15 Oct 2024 19:07:52 +0800 Yi Liu yi.l.liu@intel.com wrote:
On 2024/10/14 23:49, Alex Williamson wrote:
On Sat, 12 Oct 2024 21:49:05 +0800 Yi Liu yi.l.liu@intel.com wrote:
On 2024/10/1 20:11, Jason Gunthorpe wrote: > On Mon, Sep 30, 2024 at 07:55:08AM +0000, Tian, Kevin wrote: > >>> +struct vfio_device_pasid_attach_iommufd_pt { >>> + __u32 argsz; >>> + __u32 flags; >>> + __u32 pasid; >>> + __u32 pt_id; >>> +}; >>> + >>> +#define VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT _IO(VFIO_TYPE, >>> VFIO_BASE + 21) >> >> Not sure whether this was discussed before. Does it make sense >> to reuse the existing VFIO_DEVICE_ATTACH_IOMMUFD_PT >> by introducing a new pasid field and a new flag bit? > > Maybe? I don't have a strong feeling either way. > > There is somewhat less code if you reuse the ioctl at least
I had a rough memory that I was suggested to add a separate ioctl for PASID. Let's see Alex's opinion.
I don't recall any previous arguments for separate ioctls, but it seems to make a lot of sense to me to extend the existing ioctls with a flag to indicate pasid cscope and id. Thanks,
thanks for the confirmation. How about the below?
diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c index bb1817bd4ff3..c78533bce3c6 100644 --- a/drivers/vfio/device_cdev.c +++ b/drivers/vfio/device_cdev.c @@ -162,21 +162,34 @@ void vfio_df_unbind_iommufd(struct vfio_device_file *df) int vfio_df_ioctl_attach_pt(struct vfio_device_file *df, struct vfio_device_attach_iommufd_pt __user *arg) {
- struct vfio_device *device = df->device;
- unsigned long minsz = offsetofend(
struct vfio_device_attach_iommufd_pt, pt_id); struct vfio_device_attach_iommufd_pt attach;
- unsigned long minsz;
- struct vfio_device *device = df->device;
- u32 user_size; int ret;
- minsz = offsetofend(struct vfio_device_attach_iommufd_pt, pt_id);
- ret = get_user(user_size, (u32 __user *)arg);
- if (ret)
return ret;
- if (copy_from_user(&attach, arg, minsz))
return -EFAULT;
- ret = copy_struct_from_user(&attach, sizeof(attach), arg, user_size);
- if (ret)
return ret;
I think this could break current users.
not quite get here. My understanding is as the below:
If the current user (compiled with the existing uapi struct), it will provide less data that the new kernel knows. The copy_struct_from_user() would zero the trailing bytes. And such user won't set the new flag, so it should be fine.
You're making an assumption that the user is passing exactly the existing struct size as argsz, which is not a requirement. The user could allocate a buffer page, argsz might be 4K.
I see. If the argsz is far larger than the existing struct size, it might be large than the size of the new struct as well. Then the trailing bytes would result in failure which does not fail with old user and old kernel.
So to me, the trailing bytes concern comes when user is compiled with the new uapi struct but running on an eld kernel (say <= 6.12).But the eld kernel uses copy_from_user(), so even there is non-zero trailing bytes in the user buffer, the eld kernel just ignores them. So this seems not an issue to me so far.
That's new userspace, old kernel. I'm referencing an issue with old userspace, new kernel, where old userspace has no requirement that argsz is exactly the structure size, nor that the trailing bytes from the structure size to argsz are zero'd.
got it.
For better or worse, we don't currently have any requirements for the remainder of the user buffer, whereas copy_struct_from_user() returns an error for non-zero trailing bytes.
This seems to be a general requirement when using copy_struct_from_user(). User needs to zero the fields that it does not intend to use. If there is no such requirement for user, then the trailing bytes can be a concern in the future but not this time as the existing kernel uses copy_from_user().
Exactly, the current implementation makes no requirement on trailing non-zero bytes, copy_struct_from_user() does.
got it.
I think we need to monotonically increase the structure size, but maybe something more like below, using flags. The expectation would be that if we add another flag that extends the structure, we'd test that flag after PASID and clobber xend to a new value further into the new structure. We'd also add that flag to the flags mask, but we'd share the copy code.
agree, this share code might be needed for other path as well. Some macros I guess.
if (attach.argsz < minsz) return -EINVAL;
if (attach.flags & (~VFIO_DEVICE_ATTACH_PASID)) return -EINVAL;
if (attach.flags & VFIO_DEVICE_ATTACH_PASID) xend = offsetofend(struct vfio_device_attach_iommufd_pt, pasid);
if (xend) { if (attach.argsz < xend) return -EINVAL; if (copy_from_user((void *)&attach + minsz, (void __user *)arg + minsz, xend - minsz)) return -EFAULT; }
I think it might need to zero the trailing bytes if the user does not set the extended flag. is it?
We could zero initialize the attach structure for safety, but the kernel side code should also be driven by flags. We should never look at a field beyond the base structure that isn't indicated by flags and copied from the user.
yes, zeroing the trailing bytes is for safety, and all the new fields should be indicated by flags.
Maybe there are still more elegant options available.
We also generally try to label flags with FLAGS in the name, but it does get rather unwieldy, so I'm open to suggestions. Thanks,
There is already examples that new field added to a kernel-to-user uapi struct like the vfio_region_info::cap_offset and vfio_device_info::cap_offset. But it might be a little bit different with the case we face here as it's user-to-kernel struct. It's a time for you to set a rule for such extensions. :)
That's a returned structure, so it's a bit different, but the same philosophy, we don't break userspace. In that case we used argsz as an output field and flags to indicate there are capabilities. Old userspace ignores the flag and argsz semantics and continues to work. New userspace checks the flag, reallocs the buffer with argsz and retries. This is however an example of userspace providing an argsz value that exceeds the ioctl structure, with no requirement to zero the buffer, though it is an output structure rather than the input structure here.
I think the same holds here, our policy is to not break userspace. We potentially do break userspace if we impose a requirement that the trailing buffer size must now be zero. We have the flags field so that we don't need to blindly base the structure version on the size of the user buffer. We should use the flags field to determine relevant and valid fields beyond the base structure without imposing new requirements to userspace. Thanks,
got it. let me send another version.
Hi Alex,
On 2024/10/18 13:40, Yi Liu wrote:
I think we need to monotonically increase the structure size, but maybe something more like below, using flags. The expectation would be that if we add another flag that extends the structure, we'd test that flag after PASID and clobber xend to a new value further into the new structure. We'd also add that flag to the flags mask, but we'd share the copy code.
agree, this share code might be needed for other path as well. Some macros I guess.
if (attach.argsz < minsz) return -EINVAL;
if (attach.flags & (~VFIO_DEVICE_ATTACH_PASID)) return -EINVAL;
if (attach.flags & VFIO_DEVICE_ATTACH_PASID) xend = offsetofend(struct vfio_device_attach_iommufd_pt, pasid);
if (xend) { if (attach.argsz < xend) return -EINVAL;
Need to check the future usage of 'xend'. In understanding, 'xend' should always be offsetofend(struct, the_last_field). A userspace that uses @pasid field would set argsz >= offsetofend(struct, pasid), most likely it would just set argsz==offsetofend(struct, pasid). If so, such userspace would be failed when running on a kernel that has added new fields behind @pasid.
Say two decades later, we add a new field (say @xyz) to this user struct, the 'xend' would be updated to be offsetofend(struct, xyz). This 'xend' would be larger than the argsz provided by the aforementioned userspace. Hence it would be failed in the above check. To make it work, I'm considering to make some changes to the code. When argsz < xend, we only copy extra data with size==argsz-minsz. Just as the below.
if (xend) { unsigned long size;
if (attach.argsz < xend) size = attach.argsz - minsz; else size = xend - minsz;
if (copy_from_user((void *)&attach + minsz, (void __user *)arg + minsz, size)) return -EFAULT; }
However, it seems to have another problem. If the userspace that uses @pasid set the argsz==offsetofend(struct, pasid) - 1, such userspace is not supposed to work and should be failed by kernel. is it? However, my above code cannot fail it. :(
Any suggestion about it?
if (copy_from_user((void *)&attach + minsz, (void __user *)arg + minsz, xend - minsz)) return -EFAULT; }
On Wed, 30 Oct 2024 20:54:09 +0800 Yi Liu yi.l.liu@intel.com wrote:
Hi Alex,
On 2024/10/18 13:40, Yi Liu wrote:
I think we need to monotonically increase the structure size, but maybe something more like below, using flags. The expectation would be that if we add another flag that extends the structure, we'd test that flag after PASID and clobber xend to a new value further into the new structure. We'd also add that flag to the flags mask, but we'd share the copy code.
agree, this share code might be needed for other path as well. Some macros I guess.
if (attach.argsz < minsz) return -EINVAL;
if (attach.flags & (~VFIO_DEVICE_ATTACH_PASID)) return -EINVAL;
if (attach.flags & VFIO_DEVICE_ATTACH_PASID) xend = offsetofend(struct vfio_device_attach_iommufd_pt, pasid);
if (xend) { if (attach.argsz < xend) return -EINVAL;
Need to check the future usage of 'xend'. In understanding, 'xend' should always be offsetofend(struct, the_last_field). A userspace that uses @pasid field would set argsz >= offsetofend(struct, pasid), most likely it would just set argsz==offsetofend(struct, pasid). If so, such userspace would be failed when running on a kernel that has added new fields behind @pasid.
No, xend denotes the end of the structure we need to satisfy the flags that are requested by the user.
Say two decades later, we add a new field (say @xyz) to this user struct, the 'xend' would be updated to be offsetofend(struct, xyz). This 'xend' would be larger than the argsz provided by the aforementioned userspace. Hence it would be failed in the above check.
New field xyz would require a new flag, VFIO_DEVICE_XYZ and we'd extend the above code as:
if (attach.argsz < minsz) return -EINVAL;
if (attach.flags & (~(VFIO_DEVICE_ATTACH_PASID | VFIO_DEVICE_XYZ))) return -EINVAL;
if (attach.flags & VFIO_DEVICE_ATTACH_PASID) xend = offsetofend(struct vfio_device_attach_iommufd_pt, pasid);
if (attach.flags & VFIO_DEVICE_XYZ) xend = offsetofend(struct vfio_device_attach_iommufd_pt, xyz);
if (xend) { if (attach.argsz < xend) return -EINVAL;
New userspace can provide argsz = offsetofend(, xyz), just as it could provide argsz = PAGE_SIZE now if it really wanted, but argsz > minsz is only required if the user sets any of these new flags. Therefore old userspace on new kernel continues to work.
To make it work, I'm considering to make some changes to the code. When argsz < xend, we only copy extra data with size==argsz-minsz. Just as the below.
if (xend) { unsigned long size;
if (attach.argsz < xend)
This is an -EINVAL condition, xend tracks the flags the user has set. The user must provide a sufficient buffer for the flags they've set.
size = attach.argsz - minsz; else size = xend - minsz;
This is the only correct copy size.
if (copy_from_user((void *)&attach + minsz, (void __user *)arg + minsz, size)) return -EFAULT;
}
However, it seems to have another problem. If the userspace that uses @pasid set the argsz==offsetofend(struct, pasid) - 1, such userspace is not supposed to work and should be failed by kernel. is it? However, my above code cannot fail it. :(
Any suggestion about it?
If a user sets the ATTACH_PASID flag and argsz is less than offsetofend(struct, pasid), we need to return -EINVAL as indicated above. Thanks,
Alex
if (copy_from_user((void *)&attach + minsz, (void __user *)arg + minsz, xend - minsz)) return -EFAULT; }
On 2024/10/31 01:51, Alex Williamson wrote:
On Wed, 30 Oct 2024 20:54:09 +0800 Yi Liu yi.l.liu@intel.com wrote:
Hi Alex,
On 2024/10/18 13:40, Yi Liu wrote:
I think we need to monotonically increase the structure size, but maybe something more like below, using flags. The expectation would be that if we add another flag that extends the structure, we'd test that flag after PASID and clobber xend to a new value further into the new structure. We'd also add that flag to the flags mask, but we'd share the copy code.
agree, this share code might be needed for other path as well. Some macros I guess.
if (attach.argsz < minsz) return -EINVAL;
if (attach.flags & (~VFIO_DEVICE_ATTACH_PASID)) return -EINVAL;
if (attach.flags & VFIO_DEVICE_ATTACH_PASID) xend = offsetofend(struct vfio_device_attach_iommufd_pt, pasid);
if (xend) { if (attach.argsz < xend) return -EINVAL;
Need to check the future usage of 'xend'. In understanding, 'xend' should always be offsetofend(struct, the_last_field). A userspace that uses @pasid field would set argsz >= offsetofend(struct, pasid), most likely it would just set argsz==offsetofend(struct, pasid). If so, such userspace would be failed when running on a kernel that has added new fields behind @pasid.
No, xend denotes the end of the structure we need to satisfy the flags that are requested by the user.
Say two decades later, we add a new field (say @xyz) to this user struct, the 'xend' would be updated to be offsetofend(struct, xyz). This 'xend' would be larger than the argsz provided by the aforementioned userspace. Hence it would be failed in the above check.
New field xyz would require a new flag, VFIO_DEVICE_XYZ and we'd extend the above code as:
if (attach.argsz < minsz) return -EINVAL;
if (attach.flags & (~(VFIO_DEVICE_ATTACH_PASID | VFIO_DEVICE_XYZ))) return -EINVAL;
if (attach.flags & VFIO_DEVICE_ATTACH_PASID) xend = offsetofend(struct vfio_device_attach_iommufd_pt, pasid);
if (attach.flags & VFIO_DEVICE_XYZ) xend = offsetofend(struct vfio_device_attach_iommufd_pt, xyz);
if (xend) { if (attach.argsz < xend) return -EINVAL;
New userspace can provide argsz = offsetofend(, xyz), just as it could provide argsz = PAGE_SIZE now if it really wanted, but argsz > minsz is only required if the user sets any of these new flags. Therefore old userspace on new kernel continues to work.
got it. This should work. thanks.:)
To make it work, I'm considering to make some changes to the code. When argsz < xend, we only copy extra data with size==argsz-minsz. Just as the below.
if (xend) { unsigned long size;
if (attach.argsz < xend)
This is an -EINVAL condition, xend tracks the flags the user has set. The user must provide a sufficient buffer for the flags they've set.
size = attach.argsz - minsz; else size = xend - minsz;
This is the only correct copy size.
if (copy_from_user((void *)&attach + minsz, (void __user *)arg + minsz, size)) return -EFAULT;
}
However, it seems to have another problem. If the userspace that uses @pasid set the argsz==offsetofend(struct, pasid) - 1, such userspace is not supposed to work and should be failed by kernel. is it? However, my above code cannot fail it. :(
Any suggestion about it?
If a user sets the ATTACH_PASID flag and argsz is less than offsetofend(struct, pasid), we need to return -EINVAL as indicated above. Thanks,
yep.
if (copy_from_user((void *)&attach + minsz, (void __user *)arg + minsz, xend - minsz)) return -EFAULT; }
PASID usage requires PASID support in both device and IOMMU. Since the iommu drivers always enable the PASID capability for the device if it is supported, so it is reasonable to extend the IOMMU_GET_HW_INFO to report the PASID capability to userspace.
Signed-off-by: Yi Liu yi.l.liu@intel.com --- drivers/iommu/iommufd/device.c | 27 ++++++++++++++++++++++++++- drivers/pci/ats.c | 32 ++++++++++++++++++++++++++++++++ include/linux/pci-ats.h | 3 +++ include/uapi/linux/iommufd.h | 14 +++++++++++++- 4 files changed, 74 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index 18f94aa462ea..6b7e3e5f4598 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -4,6 +4,8 @@ #include <linux/iommufd.h> #include <linux/slab.h> #include <linux/iommu.h> +#include <linux/pci.h> +#include <linux/pci-ats.h> #include <uapi/linux/iommufd.h> #include "../iommu-priv.h"
@@ -1185,7 +1187,8 @@ int iommufd_get_hw_info(struct iommufd_ucmd *ucmd) void *data; int rc;
- if (cmd->flags || cmd->__reserved) + if (cmd->flags || cmd->__reserved[0] || cmd->__reserved[1] || + cmd->__reserved[2]) return -EOPNOTSUPP;
idev = iommufd_get_device(ucmd, cmd->dev_id); @@ -1242,6 +1245,28 @@ int iommufd_get_hw_info(struct iommufd_ucmd *ucmd) if (device_iommu_capable(idev->dev, IOMMU_CAP_DIRTY_TRACKING)) cmd->out_capabilities |= IOMMU_HW_CAP_DIRTY_TRACKING;
+ cmd->out_max_pasid_log2 = 0; + + if (dev_is_pci(idev->dev)) { + struct pci_dev *pdev = to_pci_dev(idev->dev); + int ctrl; + + if (pdev->is_virtfn) + pdev = pci_physfn(pdev); + + ctrl = pci_pasid_ctrl_status(pdev); + if (ctrl >= 0 && (ctrl & PCI_PASID_CTRL_ENABLE)) { + cmd->out_max_pasid_log2 = + ilog2(idev->dev->iommu->max_pasids); + if (ctrl & PCI_PASID_CTRL_EXEC) + cmd->out_capabilities |= + IOMMU_HW_CAP_PCI_PASID_EXEC; + if (ctrl & PCI_PASID_CTRL_PRIV) + cmd->out_capabilities |= + IOMMU_HW_CAP_PCI_PASID_PRIV; + } + } + rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd)); out_free: kfree(data); diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c index c570892b2090..886f24e3999f 100644 --- a/drivers/pci/ats.c +++ b/drivers/pci/ats.c @@ -505,4 +505,36 @@ int pci_max_pasids(struct pci_dev *pdev) return (1 << FIELD_GET(PCI_PASID_CAP_WIDTH, supported)); } EXPORT_SYMBOL_GPL(pci_max_pasids); + +/** + * pci_pasid_ctrl_status - Check the PASID status + * @pdev: PCI device structure + * + * Returns a negative value when no PASI capability is present. + * Otherwise the value of the control register is returned. + * Status reported are: + * + * PCI_PASID_CTRL_ENABLE - PASID enabled + * PCI_PASID_CTRL_EXEC - Execute permission enabled + * PCI_PASID_CTRL_PRIV - Privileged mode enabled + */ +int pci_pasid_ctrl_status(struct pci_dev *pdev) +{ + u16 ctrl = 0; + int pasid; + + if (pdev->is_virtfn) + pdev = pci_physfn(pdev); + + pasid = pdev->pasid_cap; + if (!pasid) + return -EINVAL; + + pci_read_config_word(pdev, pasid + PCI_PASID_CTRL, &ctrl); + + ctrl &= PCI_PASID_CTRL_ENABLE | PCI_PASID_CTRL_EXEC | + PCI_PASID_CTRL_PRIV; + + return ctrl; +} #endif /* CONFIG_PCI_PASID */ diff --git a/include/linux/pci-ats.h b/include/linux/pci-ats.h index df54cd5b15db..5cee388752a0 100644 --- a/include/linux/pci-ats.h +++ b/include/linux/pci-ats.h @@ -39,6 +39,7 @@ int pci_enable_pasid(struct pci_dev *pdev, int features); void pci_disable_pasid(struct pci_dev *pdev); int pci_pasid_features(struct pci_dev *pdev); int pci_max_pasids(struct pci_dev *pdev); +int pci_pasid_ctrl_status(struct pci_dev *pdev); #else /* CONFIG_PCI_PASID */ static inline int pci_enable_pasid(struct pci_dev *pdev, int features) { return -EINVAL; } @@ -47,6 +48,8 @@ static inline int pci_pasid_features(struct pci_dev *pdev) { return -EINVAL; } static inline int pci_max_pasids(struct pci_dev *pdev) { return -EINVAL; } +static inline int pci_pasid_ctrl_status(struct pci_dev *pdev) +{ return -EINVAL; } #endif /* CONFIG_PCI_PASID */
#endif /* LINUX_PCI_ATS_H */ diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index 4dde745cfb7e..60eca4c73b25 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -504,9 +504,17 @@ enum iommu_hw_info_type { * IOMMU_HWPT_GET_DIRTY_BITMAP * IOMMU_HWPT_SET_DIRTY_TRACKING * + * @IOMMU_HW_CAP_PASID_EXEC: Execute Permission Supported, user ignores it + * when the struct iommu_hw_info::out_max_pasid_log2 + * is zero. + * @IOMMU_HW_CAP_PASID_PRIV: Privileged Mode Supported, user ignores it + * when the struct iommu_hw_info::out_max_pasid_log2 + * is zero. */ enum iommufd_hw_capabilities { IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0, + IOMMU_HW_CAP_PCI_PASID_EXEC = 1 << 1, + IOMMU_HW_CAP_PCI_PASID_PRIV = 1 << 2, };
/** @@ -522,6 +530,9 @@ enum iommufd_hw_capabilities { * iommu_hw_info_type. * @out_capabilities: Output the generic iommu capability info type as defined * in the enum iommu_hw_capabilities. + * @out_max_pasid_log2: Output the width of PASIDs. 0 means no PASID support. + * PCI devices turn to out_capabilities to check if the + * specific capabilities is supported or not. * @__reserved: Must be 0 * * Query an iommu type specific hardware information data from an iommu behind @@ -545,7 +556,8 @@ struct iommu_hw_info { __u32 data_len; __aligned_u64 data_uptr; __u32 out_data_type; - __u32 __reserved; + __u8 out_max_pasid_log2; + __u8 __reserved[3]; __aligned_u64 out_capabilities; }; #define IOMMU_GET_HW_INFO _IO(IOMMUFD_TYPE, IOMMUFD_CMD_GET_HW_INFO)
On Thu, Sep 12, 2024 at 06:17:29AM -0700, Yi Liu wrote:
@@ -1242,6 +1245,28 @@ int iommufd_get_hw_info(struct iommufd_ucmd *ucmd) if (device_iommu_capable(idev->dev, IOMMU_CAP_DIRTY_TRACKING)) cmd->out_capabilities |= IOMMU_HW_CAP_DIRTY_TRACKING;
- cmd->out_max_pasid_log2 = 0;
- if (dev_is_pci(idev->dev)) {
struct pci_dev *pdev = to_pci_dev(idev->dev);
int ctrl;
if (pdev->is_virtfn)
pdev = pci_physfn(pdev);
Don't we do this twice now?
Let's just keep it in the pci core?
It looks Ok otherwise
Reviewed-by: Jason Gunthorpe jgg@nvidia.com
Jason
On 2024/9/27 03:35, Jason Gunthorpe wrote:
On Thu, Sep 12, 2024 at 06:17:29AM -0700, Yi Liu wrote:
@@ -1242,6 +1245,28 @@ int iommufd_get_hw_info(struct iommufd_ucmd *ucmd) if (device_iommu_capable(idev->dev, IOMMU_CAP_DIRTY_TRACKING)) cmd->out_capabilities |= IOMMU_HW_CAP_DIRTY_TRACKING;
- cmd->out_max_pasid_log2 = 0;
- if (dev_is_pci(idev->dev)) {
struct pci_dev *pdev = to_pci_dev(idev->dev);
int ctrl;
if (pdev->is_virtfn)
pdev = pci_physfn(pdev);
Don't we do this twice now?
Let's just keep it in the pci core?
yes. let me drop it.
It looks Ok otherwise
Reviewed-by: Jason Gunthorpe jgg@nvidia.com
thanks.
From: Liu, Yi L yi.l.liu@intel.com Sent: Thursday, September 12, 2024 9:17 PM
PASID usage requires PASID support in both device and IOMMU. Since the iommu drivers always enable the PASID capability for the device if it is supported, so it is reasonable to extend the IOMMU_GET_HW_INFO to report the PASID capability to userspace.
Signed-off-by: Yi Liu yi.l.liu@intel.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
Hi Yi,
On 9/12/2024 6:47 PM, Yi Liu wrote:
PASID usage requires PASID support in both device and IOMMU. Since the iommu drivers always enable the PASID capability for the device if it is supported, so it is reasonable to extend the IOMMU_GET_HW_INFO to report the PASID capability to userspace.
Signed-off-by: Yi Liu yi.l.liu@intel.com
drivers/iommu/iommufd/device.c | 27 ++++++++++++++++++++++++++- drivers/pci/ats.c | 32 ++++++++++++++++++++++++++++++++ include/linux/pci-ats.h | 3 +++ include/uapi/linux/iommufd.h | 14 +++++++++++++- 4 files changed, 74 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index 18f94aa462ea..6b7e3e5f4598 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -4,6 +4,8 @@ #include <linux/iommufd.h> #include <linux/slab.h> #include <linux/iommu.h> +#include <linux/pci.h> +#include <linux/pci-ats.h> #include <uapi/linux/iommufd.h> #include "../iommu-priv.h" @@ -1185,7 +1187,8 @@ int iommufd_get_hw_info(struct iommufd_ucmd *ucmd) void *data; int rc;
- if (cmd->flags || cmd->__reserved)
- if (cmd->flags || cmd->__reserved[0] || cmd->__reserved[1] ||
return -EOPNOTSUPP;cmd->__reserved[2])
idev = iommufd_get_device(ucmd, cmd->dev_id); @@ -1242,6 +1245,28 @@ int iommufd_get_hw_info(struct iommufd_ucmd *ucmd) if (device_iommu_capable(idev->dev, IOMMU_CAP_DIRTY_TRACKING)) cmd->out_capabilities |= IOMMU_HW_CAP_DIRTY_TRACKING;
- cmd->out_max_pasid_log2 = 0;
- if (dev_is_pci(idev->dev)) {
struct pci_dev *pdev = to_pci_dev(idev->dev);
int ctrl;
if (pdev->is_virtfn)
pdev = pci_physfn(pdev);
ctrl = pci_pasid_ctrl_status(pdev);
if (ctrl >= 0 && (ctrl & PCI_PASID_CTRL_ENABLE)) {
cmd->out_max_pasid_log2 =
ilog2(idev->dev->iommu->max_pasids);
if (ctrl & PCI_PASID_CTRL_EXEC)
cmd->out_capabilities |=
IOMMU_HW_CAP_PCI_PASID_EXEC;
if (ctrl & PCI_PASID_CTRL_PRIV)
cmd->out_capabilities |=
IOMMU_HW_CAP_PCI_PASID_PRIV;
}
- }
- rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
out_free: kfree(data); diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c index c570892b2090..886f24e3999f 100644 --- a/drivers/pci/ats.c +++ b/drivers/pci/ats.c @@ -505,4 +505,36 @@ int pci_max_pasids(struct pci_dev *pdev) return (1 << FIELD_GET(PCI_PASID_CAP_WIDTH, supported)); } EXPORT_SYMBOL_GPL(pci_max_pasids);
+/**
- pci_pasid_ctrl_status - Check the PASID status
- @pdev: PCI device structure
- Returns a negative value when no PASI capability is present.
s/PASI/PASID/
- Otherwise the value of the control register is returned.
- Status reported are:
- PCI_PASID_CTRL_ENABLE - PASID enabled
- PCI_PASID_CTRL_EXEC - Execute permission enabled
- PCI_PASID_CTRL_PRIV - Privileged mode enabled
- */
+int pci_pasid_ctrl_status(struct pci_dev *pdev) +{
- u16 ctrl = 0;
No need to initialize ctrl.
-Vasant
On Thu, 12 Sept 2024 at 21:18, Yi Liu yi.l.liu@intel.com wrote:
PASID usage requires PASID support in both device and IOMMU. Since the iommu drivers always enable the PASID capability for the device if it is supported, so it is reasonable to extend the IOMMU_GET_HW_INFO to report the PASID capability to userspace.
Signed-off-by: Yi Liu yi.l.liu@intel.com
Thanks Yi
Have verified on aarch64 platform.
Tested-by: Zhangfei Gao zhangfei.gao@linaro.org
linux-kselftest-mirror@lists.linaro.org