This series introduces a new VIOMMU infrastructure and related ioctls.
IOMMUFD has been using the HWPT infrastructure for all cases, including nested IO page table support. Yet, an HWPT-based structure has limitations when supporting some advanced HW-accelerated features, such as CMDQV on NVIDIA Grace and the HW-accelerated vIOMMU on AMD. Even in a multi-IOMMU environment, it is not straightforward for nested HWPTs to share the same parent HWPT (stage-2 IO page table) with the HWPT infrastructure alone.
The new VIOMMU object is an additional layer, between the nested HWPT and its parent HWPT, giving both the IOMMUFD core and an IOMMU driver an additional structure to support HW-accelerated features:

                     ----------------------------
 ----------------    |         | paging_hwpt0   |
 | hwpt_nested0 |--->| viommu0 ------------------
 ----------------    |         | HW-accel feats |
                     ----------------------------
On a multi-IOMMU system, the VIOMMU object can be instantiated to match the number of vIOMMUs in a guest VM, while holding the same parent HWPT to share the stage-2 IO page table. Each VIOMMU then only needs to allocate its own VMID to attach the shared stage-2 IO page table to the physical IOMMU:

                     ----------------------------
 ----------------    |         | paging_hwpt0   |
 | hwpt_nested0 |--->| viommu0 ------------------
 ----------------    |         | VMID0          |
                     ----------------------------
                     ----------------------------
 ----------------    |         | paging_hwpt0   |
 | hwpt_nested1 |--->| viommu1 ------------------
 ----------------    |         | VMID1          |
                     ----------------------------
As an initial part-1, add ioctls to support a VIOMMU-based invalidation:
    IOMMUFD_CMD_VIOMMU_ALLOC to allocate a VIOMMU object
    IOMMUFD_CMD_VIOMMU_SET/UNSET_VDEV_ID to set/clear a device's virtual ID
    (Reuse IOMMUFD_CMD_HWPT_INVALIDATE for a VIOMMU object to flush cache
     by a given driver data)
Worth noting that the VDEV_ID maintains a per-VIOMMU device list, which drivers use to look up a device's physical instance from its virtual ID in a VM. It is essential for a VIOMMU-based invalidation, where the request contains a device's virtual ID for its device cache flush, e.g. an ATC invalidation.
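To illustrate the idea, here is a hedged sketch of such a lookup; the list field, lock, and helper names below are hypothetical illustrations, not code from this series:

/*
 * Hypothetical sketch of the per-vIOMMU virtual ID lookup described
 * above; structure layout and names are illustrative only.
 */
struct viommu_vdev_id {
	struct list_head node;
	struct device *dev;	/* the device's physical instance */
	u64 vdev_id;		/* the device's virtual ID in the VM */
};

/* Translate a virtual ID in an invalidation request to a device */
static struct device *
viommu_find_dev(struct iommufd_viommu *viommu, u64 vdev_id)
{
	struct viommu_vdev_id *cur;

	/* Assumes the caller holds the viommu's vdev_id list lock */
	list_for_each_entry(cur, &viommu->vdev_list, node)
		if (cur->vdev_id == vdev_id)
			return cur->dev;
	return NULL;
}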
As for the implementation of the series, add an IOMMU_VIOMMU_TYPE_DEFAULT type for a core-allocated, core-managed VIOMMU object, allowing drivers to simply hook up a default set of viommu ops for viommu-based invalidation alone. Also provide some viommu helpers to drivers for VDEV_ID translation and parent domain lookup. Add VIOMMU invalidation support to the ARM SMMUv3 driver for a real-world use case. This adds support for arm-smmu-v3's CMDQ_OP_ATC_INV and CMDQ_OP_CFGI_CD/ALL commands, supplementing HWPT-based invalidations.
In the future, drivers will also be able to choose a driver-managed type to hold their own structures by adding a new type to enum iommu_viommu_type. More VIOMMU-based structures and ioctls will be introduced in part-2/3 to support a driver-managed VIOMMU, e.g. a VQUEUE object for a HW-accelerated queue and a VIRQ (or VEVENT) object for IRQ injections. The VIOMMU object was repurposed from an earlier RFC discussion; for reference: https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/
This series is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p1-v2
Pairing QEMU branch for testing:
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_viommu_p1-v2
Changelog
v2
 * Limited vdev_id to one per idev
 * Added a rw_sem to protect the vdev_id list
 * Reworked driver-level APIs with proper locking
 * Added a new viommu_api file for the IOMMUFD_DRIVER config
 * Dropped useless iommu_dev pointer from the viommu structure
 * Added missing index numbers to new types in the uAPI header
 * Dropped the IOMMU_VIOMMU_INVALIDATE uAPI; instead, reuse the HWPT one
 * Reworked mock_viommu_cache_invalidate() using the new iommu helper
 * Reordered details of set/unset_vdev_id handlers for proper locking
 * Added the arm_smmu_cache_invalidate_user patch from Jason's nesting series
v1
 https://lore.kernel.org/all/cover.1723061377.git.nicolinc@nvidia.com/
Thanks! Nicolin
Jason Gunthorpe (3):
  iommu: Add iommu_copy_struct_from_full_user_array helper
  iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED
  iommu/arm-smmu-v3: Update comments about ATS and bypass
Nicolin Chen (16):
  iommufd: Reorder struct forward declarations
  iommufd/viommu: Add IOMMUFD_OBJ_VIOMMU and IOMMU_VIOMMU_ALLOC ioctl
  iommu: Pass in a viommu pointer to domain_alloc_user op
  iommufd: Allow pt_id to carry viommu_id for IOMMU_HWPT_ALLOC
  iommufd/selftest: Add IOMMU_VIOMMU_ALLOC test coverage
  iommufd/viommu: Add IOMMU_VIOMMU_SET/UNSET_VDEV_ID ioctl
  iommufd/selftest: Add IOMMU_VIOMMU_SET/UNSET_VDEV_ID test coverage
  iommufd/viommu: Add cache_invalidate for IOMMU_VIOMMU_TYPE_DEFAULT
  iommufd: Allow hwpt_id to carry viommu_id for IOMMU_HWPT_INVALIDATE
  iommufd/viommu: Add vdev_id helpers for IOMMU drivers
  iommufd/selftest: Add mock_viommu_invalidate_user op
  iommufd/selftest: Add IOMMU_TEST_OP_DEV_CHECK_CACHE test command
  iommufd/selftest: Add VIOMMU coverage for IOMMU_HWPT_INVALIDATE ioctl
  iommufd/viommu: Add iommufd_viommu_to_parent_domain helper
  iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user
  iommu/arm-smmu-v3: Add arm_smmu_viommu_cache_invalidate
 drivers/iommu/amd/iommu.c                     |   1 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 218 ++++++++++++++-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |   3 +
 drivers/iommu/intel/iommu.c                   |   1 +
 drivers/iommu/iommufd/Makefile                |   5 +-
 drivers/iommu/iommufd/device.c                |  12 +
 drivers/iommu/iommufd/hw_pagetable.c          |  59 +++-
 drivers/iommu/iommufd/iommufd_private.h       |  37 +++
 drivers/iommu/iommufd/iommufd_test.h          |  30 ++
 drivers/iommu/iommufd/main.c                  |  12 +
 drivers/iommu/iommufd/selftest.c              | 101 ++++++-
 drivers/iommu/iommufd/viommu.c                | 196 +++++++++++++
 drivers/iommu/iommufd/viommu_api.c            |  53 ++++
 include/linux/iommu.h                         |  56 +++-
 include/linux/iommufd.h                       |  51 +++-
 include/uapi/linux/iommufd.h                  | 117 +++++++-
 tools/testing/selftests/iommu/iommufd.c       | 259 +++++++++++++++++-
 tools/testing/selftests/iommu/iommufd_utils.h | 126 +++++++++
 18 files changed, 1299 insertions(+), 38 deletions(-)
 create mode 100644 drivers/iommu/iommufd/viommu.c
 create mode 100644 drivers/iommu/iommufd/viommu_api.c
Reorder struct forward declarations to alphabetical order to simplify maintenance, as upcoming patches will add more to the list.
No functional change intended.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 include/linux/iommufd.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index c2f2f6b9148e..30f832a60ccb 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -11,12 +11,12 @@
 #include <linux/types.h>

 struct device;
-struct iommufd_device;
-struct page;
-struct iommufd_ctx;
-struct iommufd_access;
 struct file;
 struct iommu_group;
+struct iommufd_access;
+struct iommufd_ctx;
+struct iommufd_device;
+struct page;

 struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx,
					    struct device *dev, u32 *id);
On Tue, Aug 27, 2024 at 09:59:38AM -0700, Nicolin Chen wrote:
I picked this one up
Thanks, Jason
Add a new IOMMUFD_OBJ_VIOMMU with an iommufd_viommu structure to represent a vIOMMU instance in user space, backed by a physical IOMMU for its HW-accelerated virtualization features, such as nested translation support for a multi-viommu-instance VM, the NVIDIA CMDQ-Virtualization extension for ARM SMMUv3, and the AMD Hardware Accelerated Virtualized IOMMU (vIOMMU).
Also, add a new ioctl for user space to do a viommu allocation. It must be based on a nested parent HWPT, so take its refcount.
As an initial version, support a viommu of IOMMU_VIOMMU_TYPE_DEFAULT type. IOMMUFD core can use this viommu to store a virtual device ID lookup table in a following patch.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/Makefile          |  3 +-
 drivers/iommu/iommufd/iommufd_private.h | 12 +++++
 drivers/iommu/iommufd/main.c            |  6 +++
 drivers/iommu/iommufd/viommu.c          | 72 +++++++++++++++++++++++++
 include/uapi/linux/iommufd.h            | 30 +++++++++++
 5 files changed, 122 insertions(+), 1 deletion(-)
 create mode 100644 drivers/iommu/iommufd/viommu.c
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index cf4605962bea..df490e836b30 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -7,7 +7,8 @@ iommufd-y := \
 	ioas.o \
 	main.o \
 	pages.o \
-	vfio_compat.o
+	vfio_compat.o \
+	viommu.o
iommufd-$(CONFIG_IOMMUFD_TEST) += selftest.o
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 5d3768d77099..154f7ba5f45c 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -131,6 +131,7 @@ enum iommufd_object_type {
 	IOMMUFD_OBJ_IOAS,
 	IOMMUFD_OBJ_ACCESS,
 	IOMMUFD_OBJ_FAULT,
+	IOMMUFD_OBJ_VIOMMU,
 #ifdef CONFIG_IOMMUFD_TEST
 	IOMMUFD_OBJ_SELFTEST,
 #endif
@@ -526,6 +527,17 @@ static inline int iommufd_hwpt_replace_device(struct iommufd_device *idev,
 	return iommu_group_replace_domain(idev->igroup->group, hwpt->domain);
 }

+struct iommufd_viommu {
+	struct iommufd_object obj;
+	struct iommufd_ctx *ictx;
+	struct iommufd_hwpt_paging *hwpt;
+
+	unsigned int type;
+};
+
+int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd);
+void iommufd_viommu_destroy(struct iommufd_object *obj);
+
 #ifdef CONFIG_IOMMUFD_TEST
 int iommufd_test(struct iommufd_ucmd *ucmd);
 void iommufd_selftest_destroy(struct iommufd_object *obj);
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index b5f5d27ee963..288ee51b6829 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -333,6 +333,7 @@ union ucmd_buffer {
 	struct iommu_ioas_unmap unmap;
 	struct iommu_option option;
 	struct iommu_vfio_ioas vfio_ioas;
+	struct iommu_viommu_alloc viommu;
 #ifdef CONFIG_IOMMUFD_TEST
 	struct iommu_test_cmd test;
 #endif
@@ -384,6 +385,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 val64),
 	IOCTL_OP(IOMMU_VFIO_IOAS, iommufd_vfio_ioas, struct iommu_vfio_ioas,
 		 __reserved),
+	IOCTL_OP(IOMMU_VIOMMU_ALLOC, iommufd_viommu_alloc_ioctl,
+		 struct iommu_viommu_alloc, out_viommu_id),
 #ifdef CONFIG_IOMMUFD_TEST
 	IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
 #endif
@@ -519,6 +522,9 @@ static const struct iommufd_object_ops iommufd_object_ops[] = {
 	[IOMMUFD_OBJ_FAULT] = {
 		.destroy = iommufd_fault_destroy,
 	},
+	[IOMMUFD_OBJ_VIOMMU] = {
+		.destroy = iommufd_viommu_destroy,
+	},
 #ifdef CONFIG_IOMMUFD_TEST
 	[IOMMUFD_OBJ_SELFTEST] = {
 		.destroy = iommufd_selftest_destroy,
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c
new file mode 100644
index 000000000000..200653a4bf57
--- /dev/null
+++ b/drivers/iommu/iommufd/viommu.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES
+ */
+
+#include "iommufd_private.h"
+
+void iommufd_viommu_destroy(struct iommufd_object *obj)
+{
+	struct iommufd_viommu *viommu =
+		container_of(obj, struct iommufd_viommu, obj);
+
+	refcount_dec(&viommu->hwpt->common.obj.users);
+}
+
+int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_viommu_alloc *cmd = ucmd->cmd;
+	struct iommufd_hwpt_paging *hwpt_paging;
+	struct iommufd_viommu *viommu;
+	struct iommufd_device *idev;
+	int rc;
+
+	if (cmd->flags)
+		return -EOPNOTSUPP;
+
+	idev = iommufd_get_device(ucmd, cmd->dev_id);
+	if (IS_ERR(idev))
+		return PTR_ERR(idev);
+
+	hwpt_paging = iommufd_get_hwpt_paging(ucmd, cmd->hwpt_id);
+	if (IS_ERR(hwpt_paging)) {
+		rc = PTR_ERR(hwpt_paging);
+		goto out_put_idev;
+	}
+
+	if (!hwpt_paging->nest_parent) {
+		rc = -EINVAL;
+		goto out_put_hwpt;
+	}
+
+	if (cmd->type != IOMMU_VIOMMU_TYPE_DEFAULT) {
+		rc = -EOPNOTSUPP;
+		goto out_put_hwpt;
+	}
+
+	viommu = iommufd_object_alloc(ucmd->ictx, viommu, IOMMUFD_OBJ_VIOMMU);
+	if (IS_ERR(viommu)) {
+		rc = PTR_ERR(viommu);
+		goto out_put_hwpt;
+	}
+
+	viommu->type = cmd->type;
+	viommu->ictx = ucmd->ictx;
+	viommu->hwpt = hwpt_paging;
+
+	refcount_inc(&viommu->hwpt->common.obj.users);
+
+	cmd->out_viommu_id = viommu->obj.id;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+	if (rc)
+		goto out_abort;
+	iommufd_object_finalize(ucmd->ictx, &viommu->obj);
+	goto out_put_hwpt;
+
+out_abort:
+	iommufd_object_abort_and_destroy(ucmd->ictx, &viommu->obj);
+out_put_hwpt:
+	iommufd_put_object(ucmd->ictx, &hwpt_paging->common.obj);
+out_put_idev:
+	iommufd_put_object(ucmd->ictx, &idev->obj);
+	return rc;
+}
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index cd4920886ad0..ac77903b5cc4 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -51,6 +51,7 @@ enum {
 	IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP = 0x8c,
 	IOMMUFD_CMD_HWPT_INVALIDATE = 0x8d,
 	IOMMUFD_CMD_FAULT_QUEUE_ALLOC = 0x8e,
+	IOMMUFD_CMD_VIOMMU_ALLOC = 0x8f,
 };

@@ -852,4 +853,33 @@ struct iommu_fault_alloc {
 	__u32 out_fault_fd;
 };
 #define IOMMU_FAULT_QUEUE_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_FAULT_QUEUE_ALLOC)
+
+/**
+ * enum iommu_viommu_type - Virtual IOMMU Type
+ * @IOMMU_VIOMMU_TYPE_DEFAULT: Core-managed VIOMMU type
+ */
+enum iommu_viommu_type {
+	IOMMU_VIOMMU_TYPE_DEFAULT = 0,
+};
+
+/**
+ * struct iommu_viommu_alloc - ioctl(IOMMU_VIOMMU_ALLOC)
+ * @size: sizeof(struct iommu_viommu_alloc)
+ * @flags: Must be 0
+ * @type: Type of the virtual IOMMU. Must be defined in enum iommu_viommu_type
+ * @dev_id: The device to allocate this virtual IOMMU for
+ * @hwpt_id: ID of a nesting parent HWPT to associate to
+ * @out_viommu_id: Output virtual IOMMU ID for the allocated object
+ *
+ * Allocate a virtual IOMMU object that holds a (shared) nesting parent HWPT
+ */
+struct iommu_viommu_alloc {
+	__u32 size;
+	__u32 flags;
+	__u32 type;
+	__u32 dev_id;
+	__u32 hwpt_id;
+	__u32 out_viommu_id;
+};
+#define IOMMU_VIOMMU_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_ALLOC)
 #endif
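For context, a minimal user-space call of this new ioctl could look like the hedged sketch below (error handling trimmed; the helper is hypothetical and assumes a nesting parent HWPT was created with IOMMU_HWPT_ALLOC_NEST_PARENT):

/* Hypothetical usage sketch, not part of the patch itself */
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int alloc_viommu(int iommufd, __u32 dev_id, __u32 parent_hwpt_id,
			__u32 *out_viommu_id)
{
	struct iommu_viommu_alloc cmd = {
		.size = sizeof(cmd),
		.type = IOMMU_VIOMMU_TYPE_DEFAULT,
		.dev_id = dev_id,		/* device whose pIOMMU backs the viommu */
		.hwpt_id = parent_hwpt_id,	/* shared stage-2 nesting parent */
	};
	int rc = ioctl(iommufd, IOMMU_VIOMMU_ALLOC, &cmd);

	if (!rc)
		*out_viommu_id = cmd.out_viommu_id;
	return rc;
}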
On 2024/8/28 0:59, Nicolin Chen wrote:
> +int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
> +{
> +	struct iommu_viommu_alloc *cmd = ucmd->cmd;
> +	struct iommufd_hwpt_paging *hwpt_paging;
> +	struct iommufd_viommu *viommu;
> +	struct iommufd_device *idev;
> +	int rc;
> +
> +	if (cmd->flags)
> +		return -EOPNOTSUPP;
> +
> +	idev = iommufd_get_device(ucmd, cmd->dev_id);
Why is a device reference count needed here? When is this reference count released after the VIOMMU is allocated?
Thanks, baolu
On Sun, Sep 01, 2024 at 10:39:17AM +0800, Baolu Lu wrote:
On 2024/8/28 0:59, Nicolin Chen wrote:
> > +	idev = iommufd_get_device(ucmd, cmd->dev_id);
Why is a device reference count needed here? When is this reference count released after the VIOMMU is allocated?
Hmm, it was used to get dev->iommu->iommu_dev to pin the VIOMMU to a physical IOMMU instance (in v1). Jason suggested removing that, yet I didn't realize that this idev is now completely unused.

With that being said, a parent HWPT could be shared across vIOMMUs allocated for the same VM. So, I think we do need a dev pointer to know which physical instance the VIOMMU is allocated for, especially for a driver-managed VIOMMU.
Perhaps we should add back the iommu_dev and properly refcount it.
Thanks Nicolin
On Sun, Sep 01, 2024 at 10:27:09PM -0700, Nicolin Chen wrote:
On Sun, Sep 01, 2024 at 10:39:17AM +0800, Baolu Lu wrote:
Why is a device reference count needed here? When is this reference count released after the VIOMMU is allocated?
Hmm, it was used to get dev->iommu->iommu_dev to pin the VIOMMU to a physical IOMMU instance (in v1). Jason suggested removing that, yet I didn't realize that this idev is now completely unused.

With that being said, a parent HWPT could be shared across vIOMMUs allocated for the same VM. So, I think we do need a dev pointer to know which physical instance the VIOMMU is allocated for, especially for a driver-managed VIOMMU.
Eventually you need a way to pin the physical iommu, without pinning any idevs. Not sure how best to do that
Jason
On Wed, Sep 04, 2024 at 01:26:21PM -0300, Jason Gunthorpe wrote:
Eventually you need a way to pin the physical iommu, without pinning any idevs. Not sure how best to do that
Just trying to clarify "without pinning any idevs", does it mean we shouldn't pass in an idev_id to get dev->iommu->iommu_dev?
Otherwise, iommu_probe_device_lock and iommu_device_lock in iommu.c are good enough to lock dev->iommu and iommu->list. And I think we just need an iommu helper refcounting the dev_iommu (or iommu_device) as we previously discussed.
Thanks Nicolin
On Wed, Sep 04, 2024 at 10:29:26AM -0700, Nicolin Chen wrote:
Just trying to clarify "without pinning any idevs", does it mean we shouldn't pass in an idev_id to get dev->iommu->iommu_dev?
From userspace we have no choice but to use an idev_id to locate the physical iommu
But since we want to support hotplug it is rather problematic if that idev is permanently locked down.
Otherwise, iommu_probe_device_lock and iommu_device_lock in iommu.c are good enough to lock dev->iommu and iommu->list. And I think we just need an iommu helper refcounting the dev_iommu (or iommu_device) as we previously discussed.
If you have a ref on an idev then the iommu_dev has to be stable, so you can just incr some refcount and then drop the idev stuff.
Jason
On Wed, Sep 04, 2024 at 08:37:07PM -0300, Jason Gunthorpe wrote:
From userspace we have no choice but to use an idev_id to locate the physical iommu
But since we want to support hotplug it is rather problematic if that idev is permanently locked down.
Agreed. Thanks for clarification.
Otherwise, iommu_probe_device_lock and iommu_device_lock in iommu.c are good enough to lock dev->iommu and iommu->list. And I think we just need an iommu helper refcounting the dev_iommu (or iommu_device) as we previously discussed.
If you have a ref on an idev then the iommu_dev has to be stable, so you can just incr some refcount and then drop the idev stuff.
Yes. The small routine would be like:
 (1) Lock/get idev, dev->iommu, and dev->iommu->list
 (2) Increase the refcount at dev_iommu
 (3) Save the dev_iommu to viommu
 (4) Unlock/put those in (1)
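A hedged sketch of such a helper, assuming a refcount were added to struct dev_iommu (the "users" field and the helper name below are hypothetical, as discussed):

/*
 * Hypothetical helper following steps (1)-(4) above. The "users"
 * refcount on struct dev_iommu does not exist today and is assumed
 * here purely for illustration.
 */
static struct dev_iommu *viommu_get_dev_iommu(struct device *dev)
{
	struct dev_iommu *param;

	mutex_lock(&iommu_probe_device_lock);	/* (1) stabilize dev->iommu */
	param = dev->iommu;
	if (param)
		refcount_inc(&param->users);	/* (2) pin the dev_iommu */
	mutex_unlock(&iommu_probe_device_lock);	/* (4) unlock */
	return param;				/* (3) caller saves it to the viommu */
}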
Thank you Nicolin
On Wed, Sep 04, 2024 at 08:37:07PM -0300, Jason Gunthorpe wrote:
On Wed, Sep 04, 2024 at 10:29:26AM -0700, Nicolin Chen wrote:
On Wed, Sep 04, 2024 at 01:26:21PM -0300, Jason Gunthorpe wrote:
On Sun, Sep 01, 2024 at 10:27:09PM -0700, Nicolin Chen wrote:
On Sun, Sep 01, 2024 at 10:39:17AM +0800, Baolu Lu wrote:
On 2024/8/28 0:59, Nicolin Chen wrote:
+int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd) +{
struct iommu_viommu_alloc *cmd = ucmd->cmd;
struct iommufd_hwpt_paging *hwpt_paging;
struct iommufd_viommu *viommu;
struct iommufd_device *idev;
int rc;
if (cmd->flags)
return -EOPNOTSUPP;
idev = iommufd_get_device(ucmd, cmd->dev_id);
Why does a device reference count is needed here? When is this reference count released after the VIOMMU is allocated?
Hmm, it was used to get dev->iommu->iommu_dev to pin the VIOMMU to a physical IOMMU instance (in v1). Jason suggested to remove that, yet I didn't realize that this idev is now completely useless.
With that being said, a parent HWPT could be shared across VIOMUs allocated for the same VM. So, I think we do need a dev pointer to know which physical instance the VIOMMU allocates for, especially for a driver-managed VIOMMU.
Eventually you need a way to pin the physical iommu, without pinning any idevs. Not sure how best to do that
Just trying to clarify "without pinning any idevs", does it mean we shouldn't pass in an idev_id to get dev->iommu->iommu_dev?
From userspace we have no choice but to use an idev_id to locate the physical iommu
But since we want to support hotplug it is rather problematic if that idev is permanently locked down.
Otherwise, iommu_probe_device_lock and iommu_device_lock in iommu.c are good enough to lock dev->iommu and iommu->list. And I think we just need an iommu helper refcounting the dev_iommu (or iommu_device) as we previously discussed.
If you have a ref on an idev then the iommu_dev has to be stable, so you can just incr some refcount and then drop the idev stuff.
Looks like a refcount could only WARN on an unbalanced iommu_dev in iommu_device_unregister() and iommu_device_unregister_bus(), both of which return void, so there is no way to retry. And their callers would also likely free the entire memory of the driver-level struct where the iommu_dev usually lives. I feel it gets less meaningful to add the refcount if the lifecycle cannot be guaranteed.
You mentioned that actually only the iommufd selftest might hit such a corner case, so perhaps we should do something in the selftest code vs. the iommu core. What do you think?
Thanks Nicolin
On Wed, Sep 11, 2024 at 08:39:57PM -0700, Nicolin Chen wrote:
You mentioned that actually only the iommufd selftest might hit such a corner case, so perhaps we should do something in the selftest code v.s. the iommu core. What do you think?
Maybe, if there were viommu allocation callbacks, those could pin the memory in the selftest.
Jason
On Tue, Aug 27, 2024 at 09:59:39AM -0700, Nicolin Chen wrote:
> +/**
> + * struct iommu_viommu_alloc - ioctl(IOMMU_VIOMMU_ALLOC)
> + * @size: sizeof(struct iommu_viommu_alloc)
> + * @flags: Must be 0
> + * @type: Type of the virtual IOMMU. Must be defined in enum iommu_viommu_type
> + * @dev_id: The device to allocate this virtual IOMMU for

@dev_id: The device's physical IOMMU will be used to back the vIOMMU

> + * @hwpt_id: ID of a nesting parent HWPT to associate to

A nesting parent HWPT that will provide translation for the vIOMMU DMA

> + * @out_viommu_id: Output virtual IOMMU ID for the allocated object
> + *
> + * Allocate a virtual IOMMU object that holds a (shared) nesting parent HWPT

Allocate a virtual IOMMU object that represents the underlying physical IOMMU's virtualization support. The vIOMMU object is a security-isolated slice of the physical IOMMU HW that is unique to a specific VM. Operations global to the IOMMU are connected to the vIOMMU, such as:
 - Security namespace for guest-owned IDs, e.g. guest-controlled cache tags
 - Virtualization of various platform IDs, like RIDs and others
 - Direct-assigned invalidation queues
 - Direct-assigned interrupts
 - Non-affiliated event reporting
 - Delivery of paravirtualized invalidation
Jason
On Thu, Sep 05, 2024 at 12:53:02PM -0300, Jason Gunthorpe wrote:
Allocate a virtual IOMMU object that represents the underlying physical IOMMU's virtualization support. The vIOMMU object is a security-isolated slice of the physical IOMMU HW that is unique to a specific VM. Operations global to the IOMMU are connected to the vIOMMU, such as:
 - Security namespace for guest-owned IDs, e.g. guest-controlled cache tags
 - Virtualization of various platform IDs, like RIDs and others
 - Direct-assigned invalidation queues
 - Direct-assigned interrupts
 - Non-affiliated event reporting
 - Delivery of paravirtualized invalidation
Ack.
Looks like you prefer using "vIOMMU" vs. "VIOMMU"? I would go through all the patches (QEMU included) to keep that aligned.
Thanks Nicolin
On Thu, Sep 05, 2024 at 10:10:38AM -0700, Nicolin Chen wrote:
Ack.
Also write something about the HWPT..
Looks like you prefer using "vIOMMU" vs. "VIOMMU"? I would go through all the patches (QEMU included) to keep that aligned.
Yeah, VIOMMU just for all-caps constants
Jason
On Thu, Sep 05, 2024 at 02:41:00PM -0300, Jason Gunthorpe wrote:
Also write something about the HWPT..
Assuming it's about sharing parent HWPT, ack.
Nicolin
With a viommu object wrapping a potentially shareable S2 domain, a nested domain should be allocated by associating to a viommu instead. Drivers can store this viommu pointer somewhere, so as to later use it when calling viommu helpers for virtual device ID lookup and viommu invalidation.

For drivers without viommu support, keep the parent domain input, which would otherwise be just viommu->hwpt->common.domain.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/amd/iommu.c                   | 1 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 1 +
 drivers/iommu/intel/iommu.c                 | 1 +
 drivers/iommu/iommufd/hw_pagetable.c        | 5 +++--
 drivers/iommu/iommufd/selftest.c            | 1 +
 include/linux/iommu.h                       | 2 ++
 6 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index b19e8c0f48fa..e31f7a5fc650 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2432,6 +2432,7 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned int type)
 static struct iommu_domain *
 amd_iommu_domain_alloc_user(struct device *dev, u32 flags,
 			    struct iommu_domain *parent,
+			    struct iommufd_viommu *viommu,
 			    const struct iommu_user_data *user_data)
 {
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index fa75372e1aa9..6d40f1e150cb 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3347,6 +3347,7 @@ arm_smmu_domain_alloc_nesting(struct device *dev, u32 flags,
 static struct iommu_domain *
 arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
 			   struct iommu_domain *parent,
+			   struct iommufd_viommu *viommu,
 			   const struct iommu_user_data *user_data)
 {
 	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 9ff8b83c19a3..0590528799d8 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -3703,6 +3703,7 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type)
 static struct iommu_domain *
 intel_iommu_domain_alloc_user(struct device *dev, u32 flags,
 			      struct iommu_domain *parent,
+			      struct iommufd_viommu *viommu,
 			      const struct iommu_user_data *user_data)
 {
 	struct device_domain_info *info = dev_iommu_priv_get(dev);
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index aefde4443671..c21bb59c4022 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -137,7 +137,7 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,

 	if (ops->domain_alloc_user) {
 		hwpt->domain = ops->domain_alloc_user(idev->dev, flags, NULL,
-						      user_data);
+						      NULL, user_data);
 		if (IS_ERR(hwpt->domain)) {
 			rc = PTR_ERR(hwpt->domain);
 			hwpt->domain = NULL;
@@ -239,7 +239,8 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,

 	hwpt->domain = ops->domain_alloc_user(idev->dev,
 					      flags & ~IOMMU_HWPT_FAULT_ID_VALID,
-					      parent->common.domain, user_data);
+					      parent->common.domain,
+					      NULL, user_data);
 	if (IS_ERR(hwpt->domain)) {
 		rc = PTR_ERR(hwpt->domain);
 		hwpt->domain = NULL;
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index b60687f57bef..4a23530ea027 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -318,6 +318,7 @@ __mock_domain_alloc_nested(struct mock_iommu_domain *mock_parent,
 static struct iommu_domain *
 mock_domain_alloc_user(struct device *dev, u32 flags,
 		       struct iommu_domain *parent,
+		       struct iommufd_viommu *viommu,
 		       const struct iommu_user_data *user_data)
 {
 	struct mock_iommu_domain *mock_parent;
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index c16ffc31ac70..f62aad8a9e75 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -42,6 +42,7 @@ struct notifier_block;
 struct iommu_sva;
 struct iommu_dma_cookie;
 struct iommu_fault_param;
+struct iommufd_viommu;

 #define IOMMU_FAULT_PERM_READ	(1 << 0) /* read */
 #define IOMMU_FAULT_PERM_WRITE	(1 << 1) /* write */
@@ -564,6 +565,7 @@ struct iommu_ops {
 	struct iommu_domain *(*domain_alloc)(unsigned iommu_domain_type);
 	struct iommu_domain *(*domain_alloc_user)(
 		struct device *dev, u32 flags, struct iommu_domain *parent,
+		struct iommufd_viommu *viommu,
 		const struct iommu_user_data *user_data);
 	struct iommu_domain *(*domain_alloc_paging)(struct device *dev);
 	struct iommu_domain *(*domain_alloc_sva)(struct device *dev,
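As a hedged illustration of what a viommu-aware driver might do with the new parameter (not code from this series; the structure and function below are simplified and hypothetical):

/*
 * Hypothetical driver-side sketch: a nested domain allocation that
 * records the viommu handle for later vdev_id lookup / invalidation.
 */
struct my_nested_domain {
	struct iommu_domain domain;
	struct iommu_domain *s2_parent;		/* stage-2 IO page table */
	struct iommufd_viommu *viommu;		/* may be NULL */
};

static struct iommu_domain *
my_domain_alloc_user(struct device *dev, u32 flags,
		     struct iommu_domain *parent,
		     struct iommufd_viommu *viommu,
		     const struct iommu_user_data *user_data)
{
	struct my_nested_domain *nested;

	if (!parent)	/* the paging-domain path is omitted in this sketch */
		return ERR_PTR(-EOPNOTSUPP);

	nested = kzalloc(sizeof(*nested), GFP_KERNEL);
	if (!nested)
		return ERR_PTR(-ENOMEM);

	/* With viommu support, parent equals viommu->hwpt->common.domain */
	nested->s2_parent = parent;
	nested->viommu = viommu;
	return &nested->domain;
}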
On Tue, Aug 27, 2024 at 09:59:40AM -0700, Nicolin Chen wrote:
I've been thinking of adding an op for nested allocation, since every driver immediately jumps to a special function for nested allocation anyhow without sharing any code.

Adding a new nested-only parameter seems like a good point to try to do that.
Jason
On Thu, Sep 05, 2024 at 12:54:54PM -0300, Jason Gunthorpe wrote:
I've been thinking of adding an op for nested allocation, since every driver immediately jumps to a special function for nested allocation anyhow without sharing any code.

Adding a new nested-only parameter seems like a good point to try to do that.
Yea, it makes sense to have a domain_alloc_nested, for hwpt_nested exclusively. Then domain_alloc_user would be for hwpt_paging only.
Thanks Nicolin
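A rough shape of that split might be the following (hypothetical; the eventual op name and signature may differ):

/* Hypothetical op-split sketch, not an agreed-upon interface */
struct iommu_ops {
	/* ... */
	/* hwpt_paging only: no parent or viommu arguments anymore */
	struct iommu_domain *(*domain_alloc_user)(
		struct device *dev, u32 flags,
		const struct iommu_user_data *user_data);
	/* hwpt_nested only: the S2 parent would come via the viommu */
	struct iommu_domain *(*domain_alloc_nested)(
		struct device *dev, struct iommufd_viommu *viommu, u32 flags,
		const struct iommu_user_data *user_data);
	/* ... */
};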
Now a VIOMMU can wrap a shareable nested parent HWPT. So, it can act like a nested parent HWPT to allocate a nested HWPT.
Support that in the IOMMU_HWPT_ALLOC ioctl handler, and update its kdoc.
Also, associate a viommu to an allocating nested HWPT.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/hw_pagetable.c    | 24 ++++++++++++++++++++++--
 drivers/iommu/iommufd/iommufd_private.h |  1 +
 include/uapi/linux/iommufd.h            | 12 ++++++------
 3 files changed, 29 insertions(+), 8 deletions(-)
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index c21bb59c4022..06adbcc304bc 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -57,6 +57,9 @@ void iommufd_hwpt_nested_destroy(struct iommufd_object *obj)
 		container_of(obj, struct iommufd_hwpt_nested, common.obj);

 	__iommufd_hwpt_destroy(&hwpt_nested->common);
+
+	if (hwpt_nested->viommu)
+		refcount_dec(&hwpt_nested->viommu->obj.users);
 	refcount_dec(&hwpt_nested->parent->common.obj.users);
 }

@@ -213,6 +216,7 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
  */
 static struct iommufd_hwpt_nested *
 iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
+			  struct iommufd_viommu *viommu,
 			  struct iommufd_hwpt_paging *parent,
 			  struct iommufd_device *idev, u32 flags,
 			  const struct iommu_user_data *user_data)
@@ -234,13 +238,16 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
 		return ERR_CAST(hwpt_nested);
 	hwpt = &hwpt_nested->common;

+	if (viommu)
+		refcount_inc(&viommu->obj.users);
+	hwpt_nested->viommu = viommu;
 	refcount_inc(&parent->common.obj.users);
 	hwpt_nested->parent = parent;

 	hwpt->domain = ops->domain_alloc_user(idev->dev,
 					      flags & ~IOMMU_HWPT_FAULT_ID_VALID,
 					      parent->common.domain,
-					      NULL, user_data);
+					      viommu, user_data);
 	if (IS_ERR(hwpt->domain)) {
 		rc = PTR_ERR(hwpt->domain);
 		hwpt->domain = NULL;
@@ -307,7 +314,7 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
 		struct iommufd_hwpt_nested *hwpt_nested;

 		hwpt_nested = iommufd_hwpt_nested_alloc(
-			ucmd->ictx,
+			ucmd->ictx, NULL,
 			container_of(pt_obj, struct iommufd_hwpt_paging,
 				     common.obj),
 			idev, cmd->flags, &user_data);
@@ -316,6 +323,19 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
 			goto out_unlock;
 		}
 		hwpt = &hwpt_nested->common;
+	} else if (pt_obj->type == IOMMUFD_OBJ_VIOMMU) {
+		struct iommufd_hwpt_nested *hwpt_nested;
+		struct iommufd_viommu *viommu;
+
+		viommu = container_of(pt_obj, struct iommufd_viommu, obj);
+		hwpt_nested = iommufd_hwpt_nested_alloc(
+			ucmd->ictx, viommu, viommu->hwpt, idev,
+			cmd->flags, &user_data);
+		if (IS_ERR(hwpt_nested)) {
+			rc = PTR_ERR(hwpt_nested);
+			goto out_unlock;
+		}
+		hwpt = &hwpt_nested->common;
 	} else {
 		rc = -EINVAL;
 		goto out_put_pt;
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 154f7ba5f45c..1f2a1c133b9a 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -313,6 +313,7 @@ struct iommufd_hwpt_paging {
 struct iommufd_hwpt_nested {
 	struct iommufd_hw_pagetable common;
 	struct iommufd_hwpt_paging *parent;
+	struct iommufd_viommu *viommu;
 };

 static inline bool hwpt_is_paging(struct iommufd_hw_pagetable *hwpt)
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index ac77903b5cc4..51ce6a019c34 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -430,7 +430,7 @@ enum iommu_hwpt_data_type {
  * @size: sizeof(struct iommu_hwpt_alloc)
  * @flags: Combination of enum iommufd_hwpt_alloc_flags
  * @dev_id: The device to allocate this HWPT for
- * @pt_id: The IOAS or HWPT to connect this HWPT to
+ * @pt_id: The IOAS or HWPT or VIOMMU to connect this HWPT to
  * @out_hwpt_id: The ID of the new HWPT
  * @__reserved: Must be 0
  * @data_type: One of enum iommu_hwpt_data_type
@@ -449,11 +449,11 @@ enum iommu_hwpt_data_type {
  * IOMMU_HWPT_DATA_NONE. The HWPT can be allocated as a parent HWPT for a
  * nesting configuration by passing IOMMU_HWPT_ALLOC_NEST_PARENT via @flags.
  *
- * A user-managed nested HWPT will be created from a given parent HWPT via
- * @pt_id, in which the parent HWPT must be allocated previously via the
- * same ioctl from a given IOAS (@pt_id). In this case, the @data_type
- * must be set to a pre-defined type corresponding to an I/O page table
- * type supported by the underlying IOMMU hardware.
+ * A user-managed nested HWPT will be created from a given VIOMMU (wrapping a
+ * parent HWPT) or a parent HWPT via @pt_id, in which the parent HWPT must be
+ * allocated previously via the same ioctl from a given IOAS (@pt_id). In this
+ * case, the @data_type must be set to a pre-defined type corresponding to an
+ * I/O page table type supported by the underlying IOMMU hardware.
 *
  * If the @data_type is set to IOMMU_HWPT_DATA_NONE, @data_len and
  * @data_uptr should be zero. Otherwise, both @data_len and @data_uptr
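For illustration, user space could then allocate a nested HWPT against a vIOMMU as in the hedged sketch below (DATA_TYPE and nested_data stand in for whatever driver-specific type and struct apply):

/* Hypothetical usage sketch: pt_id carries a vIOMMU ID here */
struct iommu_hwpt_alloc cmd = {
	.size = sizeof(cmd),
	.dev_id = dev_id,
	.pt_id = viommu_id,	/* a VIOMMU object instead of an IOAS/HWPT */
	.data_type = DATA_TYPE,	/* a driver-defined iommu_hwpt_data_type */
	.data_len = sizeof(nested_data),
	.data_uptr = (__u64)(uintptr_t)&nested_data,
};

if (!ioctl(iommufd, IOMMU_HWPT_ALLOC, &cmd))
	nested_hwpt_id = cmd.out_hwpt_id;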
On Tue, Aug 27, 2024 at 09:59:41AM -0700, Nicolin Chen wrote:
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
On 2024/8/28 00:59, Nicolin Chen wrote:
It is still not quite clear to me what the vIOMMU obj stands for. Here, it is a wrapper of the S2 HWPT, IIUC. But in the cover letter, a vIOMMU obj can be instanced per the vIOMMU units in a VM. Does it mean each vIOMMU of a VM can only have one S2 HWPT?
On Thu, Sep 26, 2024 at 04:50:46PM +0800, Yi Liu wrote:
It is still not quite clear to me what the vIOMMU obj stands for. Here, it is a wrapper of the S2 HWPT, IIUC. But in the cover letter, a vIOMMU obj can be instanced per the vIOMMU units in a VM.
Yea, the implementation in this version is merely a wrapper. I had a general introduction of vIOMMU in the other reply. And I will put something similar in the next version of the series, so the idea would be bigger than a wrapper.
Does it mean each vIOMMU of a VM can only have one S2 HWPT?
Giving some examples here:
 - If a VM has 1 vIOMMU, there will be 1 vIOMMU object in the kernel holding one S2 HWPT.
 - If a VM has 2 vIOMMUs, there will be 2 vIOMMU objects in the kernel that can hold two different S2 HWPTs, or share one S2 HWPT (saving memory).
Thanks Nic
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, September 27, 2024 4:11 AM
On Thu, Sep 26, 2024 at 04:50:46PM +0800, Yi Liu wrote:
It is still not quite clear to me what the vIOMMU obj stands for. Here, it is a wrapper of the S2 HWPT, IIUC. But in the cover letter, a vIOMMU obj can be instanced per the vIOMMU units in a VM.
Yea, the implementation in this version is merely a wrapper. I had a general introduction of vIOMMU in the other reply. And I will put something similar in the next version of the series, so the idea would be bigger than a wrapper.
Does it mean each vIOMMU of a VM can only have one S2 HWPT?
Giving some examples here:
- If a VM has 1 vIOMMU, there will be 1 vIOMMU object in the kernel holding one S2 HWPT.
- If a VM has 2 vIOMMUs, there will be 2 vIOMMU objects in the kernel that can hold two different S2 HWPTs, or share one S2 HWPT (saving memory).
This is not consistent with the previous discussion.

Even for 1 vIOMMU per VM, there could be multiple vIOMMU objects created in the kernel in case the devices connected to the VM-visible vIOMMU sit behind different physical SMMUs.
we don't expect one vIOMMU object to span multiple physical ones.
On Fri, Sep 27, 2024 at 12:43:16AM +0000, Tian, Kevin wrote:
This is not consistent with the previous discussion.

Even for 1 vIOMMU per VM, there could be multiple vIOMMU objects created in the kernel in case the devices connected to the VM-visible vIOMMU sit behind different physical SMMUs.
we don't expect one vIOMMU object to span multiple physical ones.
I think it's consistent, yet we had different perspectives on a virtual IOMMU instance in the VM: Jason's suggested design for a VM is to have a 1-to-1 mapping between virtual IOMMU instances and physical IOMMU instances. So, one vIOMMU is backed by one pIOMMU only, i.e. one vIOMMU object in the kernel.

Your case seems to be the model where a VM has one giant virtual IOMMU instance backed by multiple physical IOMMUs, in which case all the passthrough devices, regardless of their associated pIOMMUs, are connected to this shared virtual IOMMU. And yes, this shared virtual IOMMU can have multiple vIOMMU objects.
Regarding these two models, I had listed their pros/cons at (2): https://lore.kernel.org/qemu-devel/cover.1719361174.git.nicolinc@nvidia.com/
(Not 100% sure) VT-d might not have something like vCMDQ, so it can stay in the shared model to simplify certain things, though I feel it may face a similar situation, like mapping multiple physical MMIO regions to a single virtual region (undoable!), if some day Intel has a similar HW-accelerated feature?
Thanks Nic
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, September 27, 2024 9:26 AM
I think it's consistent, yet we had different perspectives on a virtual IOMMU instance in the VM: Jason's suggested design for a VM is to have a 1-to-1 mapping between virtual IOMMU instances and physical IOMMU instances. So, one vIOMMU is backed by one pIOMMU only, i.e. one vIOMMU object in the kernel.

Your case seems to be the model where a VM has one giant virtual IOMMU instance backed by multiple physical IOMMUs, in which case all the passthrough devices, regardless of their associated pIOMMUs, are connected to this shared virtual IOMMU. And yes, this shared virtual IOMMU can have multiple vIOMMU objects.
yes.
Sorry, I should not have used "inconsistent" in the last reply. It's more about completeness for what the design allows. 😊
Regarding these two models, I had listed their pros/cons at (2): https://lore.kernel.org/qemu-devel/cover.1719361174.git.nicolinc@nvidia.com/
(Not 100% sure) VT-d might not have something like vCMDQ, so it can stay in the shared model to simplify certain things, though I feel it may face a similar situation, like mapping multiple physical MMIO regions to a single virtual region (undoable!), if some day Intel has a similar HW-accelerated feature?
Yes, if VT-d has HW acceleration then it'd be similar to SMMU.
On Fri, Sep 27, 2024 at 02:23:16AM +0000, Tian, Kevin wrote:
yes.
Sorry, I should not have used "inconsistent" in the last reply. It's more about the completeness of what the design allows. 😊
No worries. I'll add more narratives to the next version, likely with another detailed update to the iommufd documentation. This discussion made me realize that we need to clearly write it down.
Thanks! Nic
On 2024/9/27 04:10, Nicolin Chen wrote:
On Thu, Sep 26, 2024 at 04:50:46PM +0800, Yi Liu wrote:
On 2024/8/28 00:59, Nicolin Chen wrote:
Now a VIOMMU can wrap a shareable nested parent HWPT. So, it can act like a nested parent HWPT to allocate a nested HWPT.
Support that in the IOMMU_HWPT_ALLOC ioctl handler, and update its kdoc.
Also, associate a viommu to an allocating nested HWPT.
It is still not quite clear to me what a vIOMMU obj stands for. Here, it is a wrapper of the s2 hwpt IIUC. But in the cover letter, a vIOMMU obj can be instanced per the vIOMMU units in a VM.
Yea, the implementation in this version is merely a wrapper. I had a general introduction of vIOMMU in the other reply. And I will put something similar in the next version of the series, so the idea would be bigger than a wrapper.
Yep, it would be good to see it. Otherwise, it is really confusing what a vIOMMU obj exactly means in concept. :)
Does it mean each vIOMMU of VM can only have one s2 HWPT?
Giving some examples here:
- If a VM has 1 vIOMMU, there will be 1 vIOMMU object in the kernel holding one S2 HWPT.
- If a VM has 2 vIOMMUs, there will be 2 vIOMMU objects in the kernel that can hold two different S2 HWPTs, or share one S2 HWPT (saving memory).
So if you have two devices assigned to a VM, then you may have two vIOMMUs or one vIOMMU exposed to the guest. This depends on whether the two devices are behind the same physical IOMMU. If it's two vIOMMUs, the two can share the s2 hwpt if their physical IOMMUs are compatible, is that right?
To achieve the above, you need to know the physical IOMMUs of the assigned devices, and hence be able to tell whether the physical IOMMUs are the same and whether they are compatible. How would userspace know such info?
On Fri, Sep 27, 2024 at 01:38:08PM +0800, Yi Liu wrote:
Does it mean each vIOMMU of VM can only have one s2 HWPT?
Giving some examples here:
- If a VM has 1 vIOMMU, there will be 1 vIOMMU object in the kernel holding one S2 HWPT.
- If a VM has 2 vIOMMUs, there will be 2 vIOMMU objects in the kernel that can hold two different S2 HWPTs, or share one S2 HWPT (saving memory).
So if you have two devices assigned to a VM, then you may have two vIOMMUs or one vIOMMU exposed to guest. This depends on whether the two devices are behind the same physical IOMMU. If it's two vIOMMUs, the two can share the s2 hwpt if their physical IOMMU is compatible. is it?
Yes.
To achieve the above, you need to know the physical IOMMUs of the assigned devices, and hence be able to tell whether the physical IOMMUs are the same and whether they are compatible. How would userspace know such info?
My draft implementation with QEMU does something like this:
- List all viommu-matched iommu nodes under /sys/class/iommu: LINKs
- Get PCI device's /sys/bus/pci/devices/0000:00:00.0/iommu: LINK0
- Compare the LINK0 against the LINKs
We so far don't have an ID for a physical IOMMU instance; otherwise, one could be returned via the hw_info call as an alternative.
QEMU then does the routing to assign PCI buses and IORT (or DT). This part is suggested now to move to libvirt though. So, I think at the end of the day, libvirt would run the sys check and assign a device to the corresponding pci bus backed by the correct IOMMU.
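For illustration, a minimal userspace sketch of that sysfs comparison (this assumes the standard /sys/bus/pci/devices/<bdf>/iommu symlinks; it is not code from the QEMU draft itself):

#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Return 1 if two PCI devices sit behind the same physical IOMMU,
 * 0 if not, -1 on error, by comparing their sysfs "iommu" symlinks.
 */
static int same_piommu(const char *bdf0, const char *bdf1)
{
	char path[PATH_MAX], tgt0[PATH_MAX], tgt1[PATH_MAX];
	ssize_t n;

	snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/iommu", bdf0);
	n = readlink(path, tgt0, sizeof(tgt0) - 1);
	if (n < 0)
		return -1;
	tgt0[n] = '\0';

	snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/iommu", bdf1);
	n = readlink(path, tgt1, sizeof(tgt1) - 1);
	if (n < 0)
		return -1;
	tgt1[n] = '\0';

	/* Same link target => same /sys/class/iommu node => same pIOMMU */
	return !strcmp(tgt0, tgt1);
}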
This gives an example showing two devices behind iommu0 and a third device behind iommu1 being assigned to a VM:
-device pxb-pcie,id=pcie.viommu0,bus=pcie.0... \        # bus for viommu0
-device pxb-pcie,id=pcie.viommu1,bus=pcie.0... \        # bus for viommu1
-device pcie-root-port,id=pcie.viommu0p0,bus=pcie.viommu0... \
-device pcie-root-port,id=pcie.viommu0p1,bus=pcie.viommu0... \
-device pcie-root-port,id=pcie.viommu1p0,bus=pcie.viommu1... \
-device vfio-pci,bus=pcie.viommu0p0... \                # connect to bus for viommu0
-device vfio-pci,bus=pcie.viommu0p1... \                # connect to bus for viommu0
-device vfio-pci,bus=pcie.viommu1p0...                  # connect to bus for viommu1
For compatibility to share a stage-2 HWPT, basically we would do a device attach to one of the stage-2 HWPT from the list that VMM should keep. This attach has all the compatibility test, down to the IOMMU driver. If it fails, just allocate a new stage-2 HWPT.
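A rough sketch of that fail-and-retry flow on the VMM side (the struct and the try_attach()/alloc_nest_parent() helpers are hypothetical wrappers around VFIO_DEVICE_ATTACH_IOMMUFD_PT and IOMMU_HWPT_ALLOC, named only for this example):

#include <stdint.h>

#define MAX_S2_HWPTS 8

/* Hypothetical VMM-side bookkeeping of nesting-parent HWPTs */
struct vmm_s2_list {
	uint32_t hwpt_ids[MAX_S2_HWPTS];
	int count;
};

/* try_attach() returns 0 on success; the kernel runs the full IOMMU
 * driver compatibility check on the attach path. alloc_nest_parent()
 * issues IOMMU_HWPT_ALLOC with IOMMU_HWPT_ALLOC_NEST_PARENT.
 */
int try_attach(uint32_t dev_id, uint32_t hwpt_id);
uint32_t alloc_nest_parent(uint32_t dev_id);

static uint32_t get_s2_hwpt(struct vmm_s2_list *list, uint32_t dev_id)
{
	int i;

	/* Reuse a compatible S2 HWPT if an existing one accepts the device */
	for (i = 0; i < list->count; i++)
		if (try_attach(dev_id, list->hwpt_ids[i]) == 0)
			return list->hwpt_ids[i];

	/* None compatible: allocate a fresh nesting parent for this device */
	list->hwpt_ids[list->count] = alloc_nest_parent(dev_id);
	return list->hwpt_ids[list->count++];
}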
Thanks Nic
On Thu, Sep 26, 2024 at 11:02:37PM -0700, Nicolin Chen wrote:
On Fri, Sep 27, 2024 at 01:38:08PM +0800, Yi Liu wrote:
Does it mean each vIOMMU of VM can only have one s2 HWPT?
Giving some examples here:
- If a VM has 1 vIOMMU, there will be 1 vIOMMU object in the kernel holding one S2 HWPT.
- If a VM has 2 vIOMMUs, there will be 2 vIOMMU objects in the kernel that can hold two different S2 HWPTs, or share one S2 HWPT (saving memory).
So if you have two devices assigned to a VM, then you may have two vIOMMUs or one vIOMMU exposed to guest. This depends on whether the two devices are behind the same physical IOMMU. If it's two vIOMMUs, the two can share the s2 hwpt if their physical IOMMU is compatible. is it?
Yes.
To achieve the above, you need to know the physical IOMMUs of the assigned devices, and hence be able to tell whether the physical IOMMUs are the same and whether they are compatible. How would userspace know such info?
My draft implementation with QEMU does something like this:
- List all viommu-matched iommu nodes under /sys/class/iommu: LINKs
- Get PCI device's /sys/bus/pci/devices/0000:00:00.0/iommu: LINK0
- Compare the LINK0 against the LINKs
We so far don't have an ID for physical IOMMU instance, which can be an alternative to return via the hw_info call, otherwise.
We could return the sys/class/iommu string from some get_info or something
For compatibility to share a stage-2 HWPT, basically we would do a device attach to one of the stage-2 HWPT from the list that VMM should keep. This attach has all the compatibility test, down to the IOMMU driver. If it fails, just allocate a new stage-2 HWPT.
Ideally just creating the viommu should validate the passed in hwpt is compatible without attaching.
Jason
On Fri, Sep 27, 2024 at 08:59:25AM -0300, Jason Gunthorpe wrote:
On Thu, Sep 26, 2024 at 11:02:37PM -0700, Nicolin Chen wrote:
On Fri, Sep 27, 2024 at 01:38:08PM +0800, Yi Liu wrote:
Does it mean each vIOMMU of VM can only have one s2 HWPT?
Giving some examples here:
- If a VM has 1 vIOMMU, there will be 1 vIOMMU object in the kernel holding one S2 HWPT.
- If a VM has 2 vIOMMUs, there will be 2 vIOMMU objects in the kernel that can hold two different S2 HWPTs, or share one S2 HWPT (saving memory).
So if you have two devices assigned to a VM, then you may have two vIOMMUs or one vIOMMU exposed to guest. This depends on whether the two devices are behind the same physical IOMMU. If it's two vIOMMUs, the two can share the s2 hwpt if their physical IOMMU is compatible. is it?
Yes.
To achieve the above, you need to know the physical IOMMUs of the assigned devices, and hence be able to tell whether the physical IOMMUs are the same and whether they are compatible. How would userspace know such info?
My draft implementation with QEMU does something like this:
- List all viommu-matched iommu nodes under /sys/class/iommu: LINKs
- Get PCI device's /sys/bus/pci/devices/0000:00:00.0/iommu: LINK0
- Compare the LINK0 against the LINKs
We so far don't have an ID for physical IOMMU instance, which can be an alternative to return via the hw_info call, otherwise.
We could return the sys/class/iommu string from some get_info or something
I had a patch doing an ida alloc for each iommu_dev and returning the ID via hw_info. It wasn't useful at that time, as we went for fail-n-retry for S2 HWPT allocations on multi-pIOMMU platforms.
Perhaps that could be cleaner than returning a string?
For compatibility to share a stage-2 HWPT, basically we would do a device attach to one of the stage-2 HWPT from the list that VMM should keep. This attach has all the compatibility test, down to the IOMMU driver. If it fails, just allocate a new stage-2 HWPT.
Ideally just creating the viommu should validate the passed in hwpt is compatible without attaching.
I think I should add a validation between hwpt->domain->owner and dev_iommu_ops(idev->dev) then!
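As a rough sketch, such a check in the viommu allocation path could look like the following (the helper name and exact placement are assumptions; dev_iommu_ops() and domain->owner are the existing kernel facilities named above):

/* Sketch: refuse to build a viommu when the nesting parent domain was
 * not allocated by the same IOMMU driver that owns the device.
 */
static int viommu_check_owner(struct iommufd_hwpt_paging *hwpt_paging,
			      struct iommufd_device *idev)
{
	if (hwpt_paging->common.domain->owner != dev_iommu_ops(idev->dev))
		return -EINVAL;
	return 0;
}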
Thanks Nicolin
On 2024/9/27 14:02, Nicolin Chen wrote:
On Fri, Sep 27, 2024 at 01:38:08PM +0800, Yi Liu wrote:
Does it mean each vIOMMU of VM can only have one s2 HWPT?
Giving some examples here:
- If a VM has 1 vIOMMU, there will be 1 vIOMMU object in the kernel holding one S2 HWPT.
- If a VM has 2 vIOMMUs, there will be 2 vIOMMU objects in the kernel that can hold two different S2 HWPTs, or share one S2 HWPT (saving memory).
So if you have two devices assigned to a VM, then you may have two vIOMMUs or one vIOMMU exposed to guest. This depends on whether the two devices are behind the same physical IOMMU. If it's two vIOMMUs, the two can share the s2 hwpt if their physical IOMMU is compatible. is it?
Yes.
To achieve the above, you need to know the physical IOMMUs of the assigned devices, and hence be able to tell whether the physical IOMMUs are the same and whether they are compatible. How would userspace know such info?
My draft implementation with QEMU does something like this:
- List all viommu-matched iommu nodes under /sys/class/iommu: LINKs
- Get PCI device's /sys/bus/pci/devices/0000:00:00.0/iommu: LINK0
- Compare the LINK0 against the LINKs
We so far don't have an ID for physical IOMMU instance, which can be an alternative to return via the hw_info call, otherwise.
The Intel platform has a kind of ID for the physical IOMMUs:

ls /sys/class/iommu/
dmar0   dmar1   dmar10  dmar11  dmar12  dmar13  dmar14  dmar15
dmar16  dmar17  dmar18  dmar19  dmar2   dmar3   dmar4   dmar5
dmar6   dmar7   dmar8   dmar9   iommufd_selftest_iommu.0
QEMU then does the routing to assign PCI buses and IORT (or DT). This part is suggested now to move to libvirt though. So, I think at the end of the day, libvirt would run the sys check and assign a device to the corresponding pci bus backed by the correct IOMMU.
and also give the correct viommu for the device.
This gives an example showing two devices behind iommu0 and a third device behind iommu1 being assigned to a VM:
-device pxb-pcie,id=pcie.viommu0,bus=pcie.0... \        # bus for viommu0
-device pxb-pcie,id=pcie.viommu1,bus=pcie.0... \        # bus for viommu1
-device pcie-root-port,id=pcie.viommu0p0,bus=pcie.viommu0... \
-device pcie-root-port,id=pcie.viommu0p1,bus=pcie.viommu0... \
-device pcie-root-port,id=pcie.viommu1p0,bus=pcie.viommu1... \
-device vfio-pci,bus=pcie.viommu0p0... \                # connect to bus for viommu0
-device vfio-pci,bus=pcie.viommu0p1... \                # connect to bus for viommu0
-device vfio-pci,bus=pcie.viommu1p0...                  # connect to bus for viommu1
is the viommu# an "-object" or just hints to describe the relationship between device and viommu and build the IORT?
I'm considering how it would look if the QEMU Intel vIOMMU is going to use the viommu obj. Currently, we only support one virtual VT-d due to some considerations like hot-plug. Per your conversation with Kevin, it seems to be supported. So there is no strict connection between a vIOMMU and a vIOMMU obj. But a vIOMMU obj can only be connected with one pIOMMU, right?
https://lore.kernel.org/linux-iommu/ZvYJl1AQWXWX0BQL@Asurada-Nvidia/
For compatibility to share a stage-2 HWPT, basically we would do a device attach to one of the stage-2 HWPT from the list that VMM should keep. This attach has all the compatibility test, down to the IOMMU driver. If it fails, just allocate a new stage-2 HWPT.
yeah. I think this was covered by Zhenzhong's QEMU series.
On Fri, Sep 27, 2024 at 08:12:40PM +0800, Yi Liu wrote:
External email: Use caution opening links or attachments
On 2024/9/27 14:02, Nicolin Chen wrote:
On Fri, Sep 27, 2024 at 01:38:08PM +0800, Yi Liu wrote:
Does it mean each vIOMMU of VM can only have one s2 HWPT?
Giving some examples here:
- If a VM has 1 vIOMMU, there will be 1 vIOMMU object in the kernel holding one S2 HWPT.
- If a VM has 2 vIOMMUs, there will be 2 vIOMMU objects in the kernel that can hold two different S2 HWPTs, or share one S2 HWPT (saving memory).
So if you have two devices assigned to a VM, then you may have two vIOMMUs or one vIOMMU exposed to guest. This depends on whether the two devices are behind the same physical IOMMU. If it's two vIOMMUs, the two can share the s2 hwpt if their physical IOMMU is compatible. is it?
Yes.
To achieve the above, you need to know the physical IOMMUs of the assigned devices, and hence be able to tell whether the physical IOMMUs are the same and whether they are compatible. How would userspace know such info?
My draft implementation with QEMU does something like this:
- List all viommu-matched iommu nodes under /sys/class/iommu: LINKs
- Get PCI device's /sys/bus/pci/devices/0000:00:00.0/iommu: LINK0
- Compare the LINK0 against the LINKs
We so far don't have an ID for physical IOMMU instance, which can be an alternative to return via the hw_info call, otherwise.
The Intel platform has a kind of ID for the physical IOMMUs:

ls /sys/class/iommu/
dmar0   dmar1   dmar10  dmar11  dmar12  dmar13  dmar14  dmar15
dmar16  dmar17  dmar18  dmar19  dmar2   dmar3   dmar4   dmar5
dmar6   dmar7   dmar8   dmar9   iommufd_selftest_iommu.0
Wow, that's a lot of IOMMU devices. I somehow had an impression that Intel uses one physical IOMMU..
Yea, we need something in the core. I had one patch previously: https://github.com/nicolinc/iommufd/commit/b7520901184fd9fa127abb88c1f0be16b...
QEMU then does the routing to assign PCI buses and IORT (or DT). This part is suggested now to move to libvirt though. So, I think at the end of the day, libvirt would run the sys check and assign a device to the corresponding pci bus backed by the correct IOMMU.
and also give the correct viommu for the device.
In this design, a pxb bus is exclusively created for a viommu instance, meaning so long as device is assigned to the correct bus number, it'll be linked to the correct viommu.
This gives an example showing two devices behind iommu0 and a third device behind iommu1 being assigned to a VM:
-device pxb-pcie,id=pcie.viommu0,bus=pcie.0... \        # bus for viommu0
-device pxb-pcie,id=pcie.viommu1,bus=pcie.0... \        # bus for viommu1
-device pcie-root-port,id=pcie.viommu0p0,bus=pcie.viommu0... \
-device pcie-root-port,id=pcie.viommu0p1,bus=pcie.viommu0... \
-device pcie-root-port,id=pcie.viommu1p0,bus=pcie.viommu1... \
-device vfio-pci,bus=pcie.viommu0p0... \                # connect to bus for viommu0
-device vfio-pci,bus=pcie.viommu0p1... \                # connect to bus for viommu0
-device vfio-pci,bus=pcie.viommu1p0...                  # connect to bus for viommu1
is the viommu# an "-object" or just hints to describe the relationship between device and viommu and build the IORT?
Yes. Eric actually suggested something better for the relationship between pxb-pcie and viommu:

-device pxb-pcie,bus_nr=100,id=pci.12,numa_node=0,bus=pcie.0,addr=0x3,iommu=<id>

from: https://lore.kernel.org/qemu-devel/9c3e95c2-1035-4a55-89a3-97165ef32f18@redh...
This would likely help the IORT or Device Tree building.
Currently, the ARM VIRT machine doesn't create a vSMMU via a "-device" string, i.e. it is not a pluggable module yet. I recall Intel does. So, you guys are one step ahead.
I'm considering how it would look if the QEMU Intel vIOMMU is going to use the viommu obj. Currently, we only support one virtual VT-d due to some considerations like hot-plug. Per your conversation with Kevin, it seems to be supported. So there is no strict connection between a vIOMMU and a vIOMMU obj. But a vIOMMU obj can only be connected with one pIOMMU, right?
Yes. Most of my earlier vSMMU versions did something similar, e.g. one shared vSMMU instance in QEMU holding a list of S2 hwpts. With this new iommufd viommu object, it would be a list of viommu objs. Eric suggested that HostIOMMUDevice could store any pIOMMU info. So, the compatibility check can be done with that (or the old-fashioned way of trying a device attach).
The invalidation, on the other hand, needs to identify each trapped invalidation request to distribute it to the correct viommu. This is also one of the cons of this shared viommu model: invalidation inefficiency -- there can be some cases where we fail to identify which viommu to distribute to, so we have to broadcast to all viommus. With a multi-viommu-instance model, invalidations are distributed naturally by the guest kernel.
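To illustrate the distribution problem in the shared model, a hedged VMM-side sketch (every struct and helper here is hypothetical, named only for this example):

#include <stdbool.h>
#include <stdint.h>

struct inv_cmd { bool has_vsid; uint32_t vsid; /* decoded guest command */ };
struct viommu;	/* VMM handle wrapping one kernel viommu object */
struct vsmmu { struct viommu **viommus; int num_viommus; };

bool viommu_has_vdev_id(struct viommu *v, uint32_t vsid);
void viommu_invalidate(struct viommu *v, struct inv_cmd *cmd);

/* Route a trapped invalidation: device-cache commands carry a vSID and
 * can be directed to one viommu; others must be broadcast to all.
 */
static void dispatch_inv(struct vsmmu *vsmmu, struct inv_cmd *cmd)
{
	int i;

	if (!cmd->has_vsid) {
		for (i = 0; i < vsmmu->num_viommus; i++)
			viommu_invalidate(vsmmu->viommus[i], cmd);
		return;
	}

	for (i = 0; i < vsmmu->num_viommus; i++) {
		if (viommu_has_vdev_id(vsmmu->viommus[i], cmd->vsid)) {
			viommu_invalidate(vsmmu->viommus[i], cmd);
			return;
		}
	}
}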
Thanks Nicolin
Use IOMMU_VIOMMU_TYPE_DEFAULT to cover the new IOMMU_VIOMMU_ALLOC ioctl.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 tools/testing/selftests/iommu/iommufd.c       | 35 +++++++++++++++++++
 tools/testing/selftests/iommu/iommufd_utils.h | 28 +++++++++++++++
 2 files changed, 63 insertions(+)
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 6343f4053bd4..5c770e94f299 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -554,6 +554,41 @@ TEST_F(iommufd_ioas, alloc_hwpt_nested)
 	}
 }
 
+TEST_F(iommufd_ioas, viommu_default)
+{
+	uint32_t dev_id = self->device_id;
+	uint32_t viommu_id = 0;
+	uint32_t hwpt_id = 0;
+
+	if (dev_id) {
+		/* Negative test -- invalid hwpt */
+		test_err_viommu_alloc(ENOENT, dev_id, hwpt_id,
+				      IOMMU_VIOMMU_TYPE_DEFAULT, &viommu_id);
+
+		/* Negative test -- not a nested parent hwpt */
+		test_cmd_hwpt_alloc(dev_id, self->ioas_id, 0, &hwpt_id);
+		test_err_viommu_alloc(EINVAL, dev_id, hwpt_id,
+				      IOMMU_VIOMMU_TYPE_DEFAULT, &viommu_id);
+		test_ioctl_destroy(hwpt_id);
+
+		/* Allocate a nested parent HWPT */
+		test_cmd_hwpt_alloc(dev_id, self->ioas_id,
+				    IOMMU_HWPT_ALLOC_NEST_PARENT,
+				    &hwpt_id);
+		/* Negative test -- unsupported viommu type */
+		test_err_viommu_alloc(EOPNOTSUPP, dev_id, hwpt_id,
+				      0xdead, &viommu_id);
+		/* Allocate a default type of viommu */
+		test_cmd_viommu_alloc(dev_id, hwpt_id,
+				      IOMMU_VIOMMU_TYPE_DEFAULT, &viommu_id);
+		test_ioctl_destroy(viommu_id);
+		test_ioctl_destroy(hwpt_id);
+	} else {
+		test_err_viommu_alloc(ENOENT, dev_id, hwpt_id,
+				      IOMMU_VIOMMU_TYPE_DEFAULT, &viommu_id);
+	}
+}
+
 TEST_F(iommufd_ioas, hwpt_attach)
 {
 	/* Create a device attached directly to a hwpt */
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index 40f6f14ce136..307d097db9dd 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -762,3 +762,31 @@ static int _test_cmd_trigger_iopf(int fd, __u32 device_id, __u32 fault_fd)
 
 #define test_cmd_trigger_iopf(device_id, fault_fd) \
 	ASSERT_EQ(0, _test_cmd_trigger_iopf(self->fd, device_id, fault_fd))
+
+static int _test_cmd_viommu_alloc(int fd, __u32 device_id, __u32 hwpt_id,
+				  __u32 type, __u32 flags, __u32 *viommu_id)
+{
+	struct iommu_viommu_alloc cmd = {
+		.size = sizeof(cmd),
+		.flags = flags,
+		.type = type,
+		.dev_id = device_id,
+		.hwpt_id = hwpt_id,
+	};
+	int ret;
+
+	ret = ioctl(fd, IOMMU_VIOMMU_ALLOC, &cmd);
+	if (ret)
+		return ret;
+	if (viommu_id)
+		*viommu_id = cmd.out_viommu_id;
+	return 0;
+}
+
+#define test_cmd_viommu_alloc(device_id, hwpt_id, type, viommu_id)       \
+	ASSERT_EQ(0, _test_cmd_viommu_alloc(self->fd, device_id, hwpt_id, \
+					    type, 0, viommu_id))
+#define test_err_viommu_alloc(_errno, device_id, hwpt_id, type, viommu_id) \
+	EXPECT_ERRNO(_errno, _test_cmd_viommu_alloc(self->fd, device_id,   \
+						    hwpt_id, type, 0,      \
+						    viommu_id))
Introduce a pair of new ioctls to set/unset a per-viommu virtual device id that should be linked to a physical device id via an idev pointer.
Continue supporting IOMMU_VIOMMU_TYPE_DEFAULT for a core-managed viommu. Provide a lookup function for drivers to load the device pointer by a virtual device id.
Add a rw_semaphore protection around the vdev_id list. Any future ioctl handlers that potentially access the list must grab the lock too.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/device.c          |  12 +++
 drivers/iommu/iommufd/iommufd_private.h |  21 ++++
 drivers/iommu/iommufd/main.c            |   6 ++
 drivers/iommu/iommufd/viommu.c          | 121 ++++++++++++++++++++++++
 include/uapi/linux/iommufd.h            |  40 ++++++++
 5 files changed, 200 insertions(+)

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 5fd3dd420290..3ad759971b32 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -136,6 +136,18 @@ void iommufd_device_destroy(struct iommufd_object *obj)
 	struct iommufd_device *idev =
 		container_of(obj, struct iommufd_device, obj);
 
+	/* Unlocked since there should be no race in a destroy() */
+	if (idev->vdev_id) {
+		struct iommufd_vdev_id *vdev_id = idev->vdev_id;
+		struct iommufd_viommu *viommu = vdev_id->viommu;
+		struct iommufd_vdev_id *old;
+
+		old = xa_cmpxchg(&viommu->vdev_ids, vdev_id->id, vdev_id, NULL,
+				 GFP_KERNEL);
+		WARN_ON(old != vdev_id);
+		kfree(vdev_id);
+		idev->vdev_id = NULL;
+	}
 	iommu_device_release_dma_owner(idev->dev);
 	iommufd_put_group(idev->igroup);
 	if (!iommufd_selftest_is_mock_dev(idev->dev))
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 1f2a1c133b9a..2c6e168c5300 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -416,6 +416,7 @@ struct iommufd_device {
 	struct iommufd_object obj;
 	struct iommufd_ctx *ictx;
 	struct iommufd_group *igroup;
+	struct iommufd_vdev_id *vdev_id;
 	struct list_head group_item;
 	/* always the physical device */
 	struct device *dev;
@@ -533,11 +534,31 @@ struct iommufd_viommu {
 	struct iommufd_ctx *ictx;
 	struct iommufd_hwpt_paging *hwpt;
 
+	/* The locking order is vdev_ids_rwsem -> igroup::lock */
+	struct rw_semaphore vdev_ids_rwsem;
+	struct xarray vdev_ids;
+
 	unsigned int type;
 };
 
+struct iommufd_vdev_id {
+	struct iommufd_viommu *viommu;
+	struct iommufd_device *idev;
+	u64 id;
+};
+
+static inline struct iommufd_viommu *
+iommufd_get_viommu(struct iommufd_ucmd *ucmd, u32 id)
+{
+	return container_of(iommufd_get_object(ucmd->ictx, id,
+					       IOMMUFD_OBJ_VIOMMU),
+			    struct iommufd_viommu, obj);
+}
+
 int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd);
 void iommufd_viommu_destroy(struct iommufd_object *obj);
+int iommufd_viommu_set_vdev_id(struct iommufd_ucmd *ucmd);
+int iommufd_viommu_unset_vdev_id(struct iommufd_ucmd *ucmd);
 
 #ifdef CONFIG_IOMMUFD_TEST
 int iommufd_test(struct iommufd_ucmd *ucmd);
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 288ee51b6829..199ad90fa36b 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -334,6 +334,8 @@ union ucmd_buffer {
 	struct iommu_option option;
 	struct iommu_vfio_ioas vfio_ioas;
 	struct iommu_viommu_alloc viommu;
+	struct iommu_viommu_set_vdev_id set_vdev_id;
+	struct iommu_viommu_unset_vdev_id unset_vdev_id;
 #ifdef CONFIG_IOMMUFD_TEST
 	struct iommu_test_cmd test;
 #endif
@@ -387,6 +389,10 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 __reserved),
 	IOCTL_OP(IOMMU_VIOMMU_ALLOC, iommufd_viommu_alloc_ioctl,
 		 struct iommu_viommu_alloc, out_viommu_id),
+	IOCTL_OP(IOMMU_VIOMMU_SET_VDEV_ID, iommufd_viommu_set_vdev_id,
+		 struct iommu_viommu_set_vdev_id, vdev_id),
+	IOCTL_OP(IOMMU_VIOMMU_UNSET_VDEV_ID, iommufd_viommu_unset_vdev_id,
+		 struct iommu_viommu_unset_vdev_id, vdev_id),
 #ifdef CONFIG_IOMMUFD_TEST
 	IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
 #endif
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c
index 200653a4bf57..8ffcd72b16b8 100644
--- a/drivers/iommu/iommufd/viommu.c
+++ b/drivers/iommu/iommufd/viommu.c
@@ -8,6 +8,15 @@ void iommufd_viommu_destroy(struct iommufd_object *obj)
 {
 	struct iommufd_viommu *viommu =
 		container_of(obj, struct iommufd_viommu, obj);
+	struct iommufd_vdev_id *vdev_id;
+	unsigned long index;
+
+	xa_for_each(&viommu->vdev_ids, index, vdev_id) {
+		/* Unlocked since there should be no race in a destroy() */
+		vdev_id->idev->vdev_id = NULL;
+		kfree(vdev_id);
+	}
+	xa_destroy(&viommu->vdev_ids);
 
 	refcount_dec(&viommu->hwpt->common.obj.users);
 }
@@ -53,6 +62,9 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
 	viommu->ictx = ucmd->ictx;
 	viommu->hwpt = hwpt_paging;
 
+	xa_init(&viommu->vdev_ids);
+	init_rwsem(&viommu->vdev_ids_rwsem);
+
 	refcount_inc(&viommu->hwpt->common.obj.users);
 
 	cmd->out_viommu_id = viommu->obj.id;
@@ -70,3 +82,112 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
 	iommufd_put_object(ucmd->ictx, &idev->obj);
 	return rc;
 }
+
+int iommufd_viommu_set_vdev_id(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_viommu_set_vdev_id *cmd = ucmd->cmd;
+	struct iommufd_vdev_id *vdev_id, *curr;
+	struct iommufd_viommu *viommu;
+	struct iommufd_device *idev;
+	int rc = 0;
+
+	if (cmd->vdev_id > ULONG_MAX)
+		return -EINVAL;
+
+	viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
+	if (IS_ERR(viommu))
+		return PTR_ERR(viommu);
+
+	idev = iommufd_get_device(ucmd, cmd->dev_id);
+	if (IS_ERR(idev)) {
+		rc = PTR_ERR(idev);
+		goto out_put_viommu;
+	}
+
+	down_write(&viommu->vdev_ids_rwsem);
+	mutex_lock(&idev->igroup->lock);
+	if (idev->vdev_id) {
+		rc = -EEXIST;
+		goto out_unlock_igroup;
+	}
+
+	vdev_id = kzalloc(sizeof(*vdev_id), GFP_KERNEL);
+	if (!vdev_id) {
+		rc = -ENOMEM;
+		goto out_unlock_igroup;
+	}
+
+	vdev_id->idev = idev;
+	vdev_id->viommu = viommu;
+	vdev_id->id = cmd->vdev_id;
+
+	curr = xa_cmpxchg(&viommu->vdev_ids, cmd->vdev_id, NULL, vdev_id,
+			  GFP_KERNEL);
+	if (curr) {
+		rc = xa_err(curr) ? : -EBUSY;
+		goto out_free;
+	}
+
+	idev->vdev_id = vdev_id;
+	goto out_unlock_igroup;
+
+out_free:
+	kfree(vdev_id);
+out_unlock_igroup:
+	mutex_unlock(&idev->igroup->lock);
+	up_write(&viommu->vdev_ids_rwsem);
+	iommufd_put_object(ucmd->ictx, &idev->obj);
+out_put_viommu:
+	iommufd_put_object(ucmd->ictx, &viommu->obj);
+	return rc;
+}
+
+int iommufd_viommu_unset_vdev_id(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_viommu_unset_vdev_id *cmd = ucmd->cmd;
+	struct iommufd_viommu *viommu;
+	struct iommufd_vdev_id *old;
+	struct iommufd_device *idev;
+	int rc = 0;
+
+	if (cmd->vdev_id > ULONG_MAX)
+		return -EINVAL;
+
+	viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
+	if (IS_ERR(viommu))
+		return PTR_ERR(viommu);
+
+	idev = iommufd_get_device(ucmd, cmd->dev_id);
+	if (IS_ERR(idev)) {
+		rc = PTR_ERR(idev);
+		goto out_put_viommu;
+	}
+
+	down_write(&viommu->vdev_ids_rwsem);
+	mutex_lock(&idev->igroup->lock);
+	if (!idev->vdev_id) {
+		rc = -ENOENT;
+		goto out_unlock_igroup;
+	}
+	if (idev->vdev_id->id != cmd->vdev_id) {
+		rc = -EINVAL;
+		goto out_unlock_igroup;
+	}
+
+	old = xa_cmpxchg(&viommu->vdev_ids, idev->vdev_id->id,
+			 idev->vdev_id, NULL, GFP_KERNEL);
+	if (xa_is_err(old)) {
+		rc = xa_err(old);
+		goto out_unlock_igroup;
+	}
+	kfree(old);
+	idev->vdev_id = NULL;
+
+out_unlock_igroup:
+	mutex_unlock(&idev->igroup->lock);
+	up_write(&viommu->vdev_ids_rwsem);
+	iommufd_put_object(ucmd->ictx, &idev->obj);
+out_put_viommu:
+	iommufd_put_object(ucmd->ictx, &viommu->obj);
+	return rc;
+}
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 51ce6a019c34..1816e89c922d 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -52,6 +52,8 @@ enum {
 	IOMMUFD_CMD_HWPT_INVALIDATE = 0x8d,
 	IOMMUFD_CMD_FAULT_QUEUE_ALLOC = 0x8e,
 	IOMMUFD_CMD_VIOMMU_ALLOC = 0x8f,
+	IOMMUFD_CMD_VIOMMU_SET_VDEV_ID = 0x90,
+	IOMMUFD_CMD_VIOMMU_UNSET_VDEV_ID = 0x91,
 };
 
 /**
@@ -882,4 +884,42 @@ struct iommu_viommu_alloc {
 	__u32 out_viommu_id;
 };
 #define IOMMU_VIOMMU_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_ALLOC)
+
+/**
+ * struct iommu_viommu_set_vdev_id - ioctl(IOMMU_VIOMMU_SET_VDEV_ID)
+ * @size: sizeof(struct iommu_viommu_set_vdev_id)
+ * @viommu_id: viommu ID to associate with the device to store its virtual ID
+ * @dev_id: device ID to set its virtual ID
+ * @__reserved: Must be 0
+ * @vdev_id: Virtual device ID
+ *
+ * Set a viommu-specific virtual ID of a device
+ */
+struct iommu_viommu_set_vdev_id {
+	__u32 size;
+	__u32 viommu_id;
+	__u32 dev_id;
+	__u32 __reserved;
+	__aligned_u64 vdev_id;
+};
+#define IOMMU_VIOMMU_SET_VDEV_ID _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_SET_VDEV_ID)
+
+/**
+ * struct iommu_viommu_unset_vdev_id - ioctl(IOMMU_VIOMMU_UNSET_VDEV_ID)
+ * @size: sizeof(struct iommu_viommu_unset_vdev_id)
+ * @viommu_id: viommu ID associated with the device to delete its virtual ID
+ * @dev_id: device ID to unset its virtual ID
+ * @__reserved: Must be 0
+ * @vdev_id: Virtual device ID (for verification)
+ *
+ * Unset a viommu-specific virtual ID of a device
+ */
+struct iommu_viommu_unset_vdev_id {
+	__u32 size;
+	__u32 viommu_id;
+	__u32 dev_id;
+	__u32 __reserved;
+	__aligned_u64 vdev_id;
+};
+#define IOMMU_VIOMMU_UNSET_VDEV_ID _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_UNSET_VDEV_ID)
 #endif
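The lookup function for drivers mentioned in the commit message is not visible in the quoted hunks; a plausible sketch given the structures above (the function name is an assumption):

/* Sketch: translate a guest virtual device ID to the physical device.
 * Callers are expected to hold viommu->vdev_ids_rwsem (read side).
 */
static struct device *
iommufd_viommu_find_dev(struct iommufd_viommu *viommu, u64 id)
{
	struct iommufd_vdev_id *vdev_id;

	lockdep_assert_held(&viommu->vdev_ids_rwsem);

	vdev_id = xa_load(&viommu->vdev_ids, (unsigned long)id);
	if (!vdev_id)
		return NULL;
	return vdev_id->idev->dev;
}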
On Tue, Aug 27, 2024 at 09:59:43AM -0700, Nicolin Chen wrote:
Introduce a pair of new ioctls to set/unset a per-viommu virtual device id that should be linked to a physical device id via an idev pointer.
Given some of the other discussions around CC I suspect we should rename these to 'create/destroy virtual device' with an eye that eventually they would be extended like other ops with per-CC platform data.
ie this would be the interface to tell the CC trusted world that a secure device is being added to a VM with some additional flags..
Right now it only conveys the vRID parameter of the virtual device being created.
A following question is if these objects should have their own IDs in the iommufd space too, and then unset is not unset but just a normal destroy object. If so then the thing you put in the ids xarray would also just be a normal object struct.
This is probably worth doing if this is going to grow more CC stuff later.
Jason
On Thu, Sep 05, 2024 at 01:03:53PM -0300, Jason Gunthorpe wrote:
On Tue, Aug 27, 2024 at 09:59:43AM -0700, Nicolin Chen wrote:
Introduce a pair of new ioctls to set/unset a per-viommu virtual device id that should be linked to a physical device id via an idev pointer.
Given some of the other discussions around CC I suspect we should rename these to 'create/destroy virtual device' with an eye that eventually they would be extended like other ops with per-CC platform data.
ie this would be the interface to tell the CC trusted world that a secure device is being added to a VM with some additional flags..
Right now it only conveys the vRID parameter of the virtual device being created.
A following question is if these objects should have their own IDs in the iommufd space too, and then unset is not unset but just a normal destroy object. If so then the thing you put in the ids xarray would also just be a normal object struct.
This is probably worth doing if this is going to grow more CC stuff later.
I have to admit that I have been struggling to find a better name than set_vdev_id. I also thought about something similar to that "create/destroy virtual device", yet was not that confident since we only have the virtual device ID in its data structure. Also, the virtual device sounds a bit confusing, given we already have idev.
That being said, if we have a clear picture that in the long term we would extend it to hold more information, I think it could be a smart move.
Perhaps virtual device can have its own "attach" to vIOMMU? Or would you still prefer attaching via proxy hwpt_nested?
Thanks Nicolin
On Thu, Sep 05, 2024 at 10:37:45AM -0700, Nicolin Chen wrote:
we only have virtual device ID in its data structure. Also, the virtual device sounds a bit confusing, given we already have idev.
idev is "iommufd device" which is the physical device
The virtual device is the host side handle of a device in a VM.
That being said, if we have a clear picture that in the long term we would extend it to hold more information, I think it could be a smart move.
Perhaps virtual device can have its own "attach" to vIOMMU? Or would you still prefer attaching via proxy hwpt_nested?
I was thinking just creating it against a vIOMMU is an effective "attach" and the virtual device is permanently tied to the vIOMMU at creation time.
Is there more to attach?
I think some CC stuff had a few more verbs in the lifecycle though
Jason
On Thu, Sep 05, 2024 at 02:43:26PM -0300, Jason Gunthorpe wrote:
On Thu, Sep 05, 2024 at 10:37:45AM -0700, Nicolin Chen wrote:
we only have virtual device ID in its data structure. Also, the virtual device sounds a bit confusing, given we already have idev.
idev is "iommufd device" which is the physical device
The virtual device is the host side handle of a device in a VM.
Yea, we need that narrative in kdoc to clearly separate them.
That being said, if we have a clear picture that in the long term we would extend it to hold more information, I think it could be a smart move.
Perhaps virtual device can have its own "attach" to vIOMMU? Or would you still prefer attaching via proxy hwpt_nested?
I was thinking just creating it against a vIOMMU is an effective "attach" and the virtual device is permanently tied to the vIOMMU at creation time.
Ah, right! The create is per-viommu, so it's being attached.
Nicolin
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, September 6, 2024 4:15 AM
On Thu, Sep 05, 2024 at 02:43:26PM -0300, Jason Gunthorpe wrote:
On Thu, Sep 05, 2024 at 10:37:45AM -0700, Nicolin Chen wrote:
That being said, if we have a clear picture that in the long term we would extend it to hold more information, I think it could be a smart move.
Perhaps virtual device can have its own "attach" to vIOMMU? Or would you still prefer attaching via proxy hwpt_nested?
I was thinking just creating it against a vIOMMU is an effective "attach" and the virtual device is permanently tied to the vIOMMU at creation time.
Ah, right! The create is per-viommu, so it's being attached.
presumably we also need to check compatibility between the idev which the virtual device is created against and the stage-2 pgtable, as a normal attach requires?
On Wed, Sep 11, 2024 at 06:19:10AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, September 6, 2024 4:15 AM
On Thu, Sep 05, 2024 at 02:43:26PM -0300, Jason Gunthorpe wrote:
On Thu, Sep 05, 2024 at 10:37:45AM -0700, Nicolin Chen wrote:
That being said, if we have a clear picture that in the long term we would extend it to hold more information, I think it could be a smart move.
Perhaps virtual device can have its own "attach" to vIOMMU? Or would you still prefer attaching via proxy hwpt_nested?
I was thinking just creating it against a vIOMMU is an effective "attach" and the virtual device is permanently tied to the vIOMMU at creation time.
Ah, right! The create is per-viommu, so it's being attached.
presumably we also need to check compatibility between the idev which the virtual device is created against and the stage-2 pgtable, as a normal attach requires?
If that's required, it can be a part of "create virtual device", where idev and viommu (holding s2 hwpt) would be all available?
Thanks Nicolin
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, September 11, 2024 3:12 PM
On Wed, Sep 11, 2024 at 06:19:10AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, September 6, 2024 4:15 AM
On Thu, Sep 05, 2024 at 02:43:26PM -0300, Jason Gunthorpe wrote:
On Thu, Sep 05, 2024 at 10:37:45AM -0700, Nicolin Chen wrote:
That being said, if we have a clear picture that in the long term we would extend it to hold more information, I think it could be a smart move.
Perhaps virtual device can have its own "attach" to vIOMMU? Or would you still prefer attaching via proxy hwpt_nested?
I was thinking just creating it against a vIOMMU is an effective "attach" and the virtual device is permanently tied to the vIOMMU at creation time.
Ah, right! The create is per-viommu, so it's being attached.
presumably we also need to check compatibility between the idev which the virtual device is created against and the stage-2 pgtable, as a normal attach requires?
If that's required, it can be a part of "create virtual device", where idev and viommu (holding s2 hwpt) would be all available?
yes
On Wed, Sep 11, 2024 at 07:18:52AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, September 11, 2024 3:12 PM
On Wed, Sep 11, 2024 at 06:19:10AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, September 6, 2024 4:15 AM
On Thu, Sep 05, 2024 at 02:43:26PM -0300, Jason Gunthorpe wrote:
On Thu, Sep 05, 2024 at 10:37:45AM -0700, Nicolin Chen wrote:
That being said, if we have a clear picture that in the long term we would extend it to hold more information, I think it could be a smart move.
Perhaps virtual device can have its own "attach" to vIOMMU? Or would you still prefer attaching via proxy hwpt_nested?
I was thinking just creating it against a vIOMMU is an effective "attach" and the virtual device is permanently tied to the vIOMMU at creation time.
Ah, right! The create is per-viommu, so it's being attached.
presumably we also need to check compatibility between the idev which the virtual device is created against and the stage-2 pgtable, as a normal attach requires?
If that's required, it can be a part of "create virtual device", where idev and viommu (holding s2 hwpt) would be all available?
yes
Oh, I misread your question actually. I think it's about a matching validation between dev->iommu->iommu_dev and vIOMMU->iommu_dev.
Thanks Nicolin
On Thu, Sep 05, 2024 at 10:38:23AM -0700, Nicolin Chen wrote:
On Thu, Sep 05, 2024 at 01:03:53PM -0300, Jason Gunthorpe wrote:
On Tue, Aug 27, 2024 at 09:59:43AM -0700, Nicolin Chen wrote:
Introduce a pair of new ioctls to set/unset a per-viommu virtual device id that should be linked to a physical device id via an idev pointer.
Given some of the other discussions around CC I suspect we should rename these to 'create/destroy virtual device' with an eye that eventually they would be extended like other ops with per-CC platform data.
ie this would be the interface to tell the CC trusted world that a secure device is being added to a VM with some additional flags..
Right now it only conveys the vRID parameter of the virtual device being created.
A following question is if these objects should have their own IDs in the iommufd space too, and then unset is not unset but just a normal destroy object. If so then the thing you put in the ids xarray would also just be a normal object struct.
I found that adding it as a new object makes things a lot easier since a vdevice can take refcounts of both viommu and idev. So neither destroy() callback would be bothered.
While confirming if I am missing something from the review comments, I am not quite sure what is "the thing you put in the ids xarray".. I only added a vRID xarray per viommu, yet that doesn't seem to be able to merge into the normal object struct. Mind elaborating?
Thanks Nicolin
This is probably worth doing if this is going to grow more CC stuff later.
Having to admit that I have been struggling to find a better name than set_vdev_id, I also thought about something similar to that "create/destroy virtual device', yet was not that confident since we only have virtual device ID in its data structure. Also, the virtual device sounds a bit confusing, given we already have idev.
That being said, if we have a clear picture that in the long term we would extend it to hold more information, I think it could be a smart move.
Perhaps virtual device can have its own "attach" to vIOMMU? Or would you still prefer attaching via proxy hwpt_nested?
Thanks Nicolin
On Tue, Oct 01, 2024 at 01:54:05AM -0700, Nicolin Chen wrote:
On Thu, Sep 05, 2024 at 10:38:23AM -0700, Nicolin Chen wrote:
On Thu, Sep 05, 2024 at 01:03:53PM -0300, Jason Gunthorpe wrote:
On Tue, Aug 27, 2024 at 09:59:43AM -0700, Nicolin Chen wrote:
Introduce a pair of new ioctls to set/unset a per-viommu virtual device id that should be linked to a physical device id via an idev pointer.
Given some of the other discussions around CC I suspect we should rename these to 'create/destroy virtual device' with an eye that eventually they would be extended like other ops with per-CC platform data.
ie this would be the interface to tell the CC trusted world that a secure device is being added to a VM with some additional flags..
Right now it only conveys the vRID parameter of the virtual device being created.
A following question is if these objects should have their own IDs in the iommufd space too, and then unset is not unset but just a normal destroy object. If so then the thing you put in the ids xarray would also just be a normal object struct.
I found that adding it as a new object makes things a lot easier since a vdevice can take refcounts of both viommu and idev. So neither destroy() callback would be bothered.
While confirming if I am missing something from the review comments, I am not quite sure what is "the thing you put in the ids xarray".. I only added a vRID xarray per viommu, yet that doesn't seem to be able to merge into the normal object struct. Mind elaborating?
I would think to point the vRID xarray directly to the new iommufd vdevice object
Jason
On Tue, Oct 01, 2024 at 10:46:20AM -0300, Jason Gunthorpe wrote:
On Tue, Oct 01, 2024 at 01:54:05AM -0700, Nicolin Chen wrote:
On Thu, Sep 05, 2024 at 10:38:23AM -0700, Nicolin Chen wrote:
On Thu, Sep 05, 2024 at 01:03:53PM -0300, Jason Gunthorpe wrote:
On Tue, Aug 27, 2024 at 09:59:43AM -0700, Nicolin Chen wrote:
Introduce a pair of new ioctls to set/unset a per-viommu virtual device id that should be linked to a physical device id via an idev pointer.
Given some of the other discussions around CC I suspect we should rename these to 'create/destroy virtual device' with an eye that eventually they would be extended like other ops with per-CC platform data.
ie this would be the interface to tell the CC trusted world that a secure device is being added to a VM with some additional flags..
Right now it only conveys the vRID parameter of the virtual device being created.
A following question is if these objects should have their own IDs in the iommufd space too, and then unset is not unset but just a normal destroy object. If so then the thing you put in the ids xarray would also just be a normal object struct.
I found that adding it as a new object makes things a lot easier since a vdevice can take refcounts of both viommu and idev. So neither destroy() callback would be bothered.
While confirming if I am missing something from the review comments, I am not quite sure what is "the thing you put in the ids xarray".. I only added a vRID xarray per viommu, yet that doesn't seem to be able to merge into the normal object struct. Mind elaborating?
I would think to point the vRID xarray directly to the new iommufd vdevice object
Oh, I think I already have that then:
  ictx->xarray:   objIds <-> { IOAS|HWPT|VIOMMU|VDEVICE|.. }->objs
  viommu->xarray: vRids  <-> { VDEVICE } pointers
Thanks Nicolin
Hi Nicolin,
On Tue, Aug 27, 2024 at 09:59:43AM -0700, Nicolin Chen wrote:
Introduce a pair of new ioctls to set/unset a per-viommu virtual device id that should be linked to a physical device id via an idev pointer.
Continue supporting IOMMU_VIOMMU_TYPE_DEFAULT for a core-managed viommu. Provide a lookup function for drivers to load the device pointer by a virtual device id.
Add a rw_semaphore protection around the vdev_id list. Any future ioctl handlers that potentially access the list must grab the lock too.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
 drivers/iommu/iommufd/device.c          |  12 +++
 drivers/iommu/iommufd/iommufd_private.h |  21 ++++
 drivers/iommu/iommufd/main.c            |   6 ++
 drivers/iommu/iommufd/viommu.c          | 121 ++++++++++++++++++++++++
 include/uapi/linux/iommufd.h            |  40 ++++++++
 5 files changed, 200 insertions(+)
[...]

+int iommufd_viommu_set_vdev_id(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_viommu_set_vdev_id *cmd = ucmd->cmd;
+	struct iommufd_vdev_id *vdev_id, *curr;
+	struct iommufd_viommu *viommu;
+	struct iommufd_device *idev;
+	int rc = 0;

[...]

+	vdev_id->idev = idev;
+	vdev_id->viommu = viommu;
+	vdev_id->id = cmd->vdev_id;
My understanding of IOMMUFD is very limited, but AFAICT, that means it's assumed that each device can only have one stream ID (RID)?
As I can see in patch 17 in arm_smmu_convert_viommu_vdev_id(), it converts the virtual ID to a physical one using master->streams[0].id.
Is that correct or am I missing something?
As I am looking at a similar problem for a paravirtual IOMMU with pKVM, where the UAPI would be something similar to:
GET_NUM_END_POINTS(dev) => nr_sids
SET_END_POINT_VSID(dev, sid_index, vsid)
Similar to what VFIO does with IRQs.
As a device can have many SIDs.
Thanks, Mostafa
On Fri, Sep 27, 2024 at 01:50:52PM +0000, Mostafa Saleh wrote:
My understanding of IOMMUFD is very little, but AFAICT, that means that it’s assumed that each device can only have one stream ID(RID)?
As I can see in patch 17 in arm_smmu_convert_viommu_vdev_id(), it converts the virtual ID to a physical one using master->streams[0].id.
Is that correct or am I missing something?
As I am looking at similar problem for paravirtual IOMMU with pKVM, where the UAPI would be something similar to:
GET_NUM_END_POINTS(dev) => nr_sids
SET_END_POINT_VSID(dev, sid_index, vsid)
Similar to what VFIO does with IRQs.
As a device can have many SIDs.
We don't support multi SID through this interface, at least in this version.
To do multi-sid you have to inform the VM of all the different pSIDs the device has and then setup the vSID/pSID translation to map them all to the HW invalidation logic.
Which is a lot more steps, and we have no use case right now. Multi-sid is also not something I expect to see in any modern PCI device, and this is VFIO PCI...
Jason
On Fri, Sep 27, 2024 at 11:01:41AM -0300, Jason Gunthorpe wrote:
On Fri, Sep 27, 2024 at 01:50:52PM +0000, Mostafa Saleh wrote:
My understanding of IOMMUFD is very little, but AFAICT, that means that it’s assumed that each device can only have one stream ID(RID)?
As I can see in patch 17 in arm_smmu_convert_viommu_vdev_id(), it converts the virtual ID to a physical one using master->streams[0].id.
Is that correct or am I missing something?
As I am looking at similar problem for paravirtual IOMMU with pKVM, where the UAPI would be something similar to:
GET_NUM_END_POINTS(dev) => nr_sids
SET_END_POINT_VSID(dev, sid_index, vsid)
Similar to what VFIO does with IRQs.
As a device can have many SIDs.
We don't support multi SID through this interface, at least in this version.
To do multi-sid you have to inform the VM of all the different pSIDs the device has and then setup the vSID/pSID translation to map them all to the HW invalidation logic.
Why would the VM need to know the pSID? The way I view this is quite close to how irq works: the VM only sees the GSI, which is the virtualized number. The VMM then would need to configure the vSID->pSID translation, also without knowing the actual pSID, just how many SIDs there are per device; very similar to how it configures IRQs through VFIO_DEVICE_GET_INFO/VFIO_DEVICE_SET_IRQS.
And as long as we only allow 1:1 vSID to pSID mapping, I guess it would be easy to implement.
Which is alot more steps, and we have no use case right now. Multi-sid is also not something I expect to see in any modern PCI device, and this is VFIO PCI...
Ah, I thought IOMMUFD would be used instead of VFIO_TYPE1*, which should cover platform devices (VFIO-platform), or am I missing something?
And multi-SID is common in platform devices, so this would be quite restricting; I was hoping to support the pKVM vIOMMU through the IOMMUFD interface.
If possible, can the UAPI be designed with this in mind, even if not implemented now?
Thanks, Mostafa
Jason
On Fri, Sep 27, 2024 at 02:22:20PM +0000, Mostafa Saleh wrote:
We don't support multi SID through this interface, at least in this version.
To do multi-sid you have to inform the VM of all the different pSIDs the device has and then setup the vSID/pSID translation to map them all to the HW invalidation logic.
Why would the VM need to know the pSID?
It doesn't need to know the pSID exactly, but it needs to know all the pSIDs that exist and have them be labeled with vSIDs.
With cmdq direct assignment the VM has to issue an ATS invalidation for each and every physical device using its vSID. There is no HW path to handle a 1:N v/p SID relationship.
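To illustrate why 1:N would need software help, a hedged sketch of the expansion some software layer would have to perform (the mapping table and issue_atc_inv() are hypothetical; with cmdq direct assignment the HW consumes guest commands directly, so there is no place to run this loop):

#include <stddef.h>
#include <stdint.h>

struct vsid_map {
	uint32_t vsid;
	uint32_t psids[4];	/* physical stream IDs labeled by this vSID */
	int num_psids;
};

void issue_atc_inv(uint32_t psid, uint64_t iova, size_t size); /* hypothetical */

/* Expand one guest ATC invalidation into one command per pSID */
static void expand_atc_inv(const struct vsid_map *map, uint64_t iova,
			   size_t size)
{
	int i;

	for (i = 0; i < map->num_psids; i++)
		issue_atc_inv(map->psids[i], iova, size);
}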
Ah, I thought IOMMUFD would be used instead of VFIO_TYPE1*, which should cover platform devices (VFIO-platform) or am I missing something?
It does work with platform, but AFAIK nobody is interested in that so it hasn't been a focus. Enabling multi-sid nesting, sub stream ids, etc. is some additional work, I think.
But what I mean is the really hard use case for the vSID/pSID mapping is ATS invalidation and you won't use ATS invalidation on platform so multi-sid likely doesn't matter.
STE/CD invalidation could possibly be pushed down through the per-domain ioctl and replicated to all domain attachments. We don't have code in the series to do that, but it could work from a uAPI perspective.
If possible, can the UAPI be designed with this in mind, even if not implemented now?
It is reasonable to ask. I think things are extensible enough. I'd imagine we can add a flag 'secondary ID' and then a new field 'secondary ID index' to the vdev operations when someone wants to take this on.
Jason
On Fri, Sep 27, 2024 at 3:58 PM Jason Gunthorpe jgg@nvidia.com wrote:
On Fri, Sep 27, 2024 at 02:22:20PM +0000, Mostafa Saleh wrote:
We don't support multi SID through this interface, at least in this version.
To do multi-sid you have to inform the VM of all the different pSIDs the device has and then setup the vSID/pSID translation to map them all to the HW invalidation logic.
Why would the VM need to know the pSID?
It doesn't need to know the pSID exactly, but it needs to know all the pSIDs that exist and have them be labeled with vSIDs.
With cmdq direct assignment the VM has to issue an ATS invalidation for each and every physical device using its vSID. There is no HW path to handle a 1:N v/p SID relationship.
I see, that's for the cmdq assignment.
Ah, I thought IOMMUFD would be used instead of VFIO_TYPE1*, which should cover platform devices (VFIO-platform) or am I missing something?
It does work with platform, but AFAIK nobody is interested in that so it hasn't been any focus. Enabling multi-sid nesting, sub stream ids, etc is some additional work I think.
But what I mean is the really hard use case for the vSID/pSID mapping is ATS invalidation and you won't use ATS invalidation on platform so multi-sid likely doesn't matter.
STE/CD invalidation could possibly be pushed down through the per-domain ioctl and replicated to all domain attachments. We don't have code in the series to do that, but it could work from a uAPI perspective.
If possible, can the UAPI be designed with this in mind, even if not implemented now?
It is reasonable to ask. I think things are extensible enough. I'd imagine we can add a flag 'secondary ID' and then a new field 'secondary ID index' to the vdev operations when someone wants to take this on.
Makes sense, I can take this when I start doing the pKVM work with IOMMUFD, in case it wasn't supported by then.
Thanks, Mostafa
Jason
On 28/8/24 02:59, Nicolin Chen wrote:
Introduce a pair of new ioctls to set/unset a per-viommu virtual device id that should be linked to a physical device id via an idev pointer.
Continue supporting IOMMU_VIOMMU_TYPE_DEFAULT for a core-managed viommu. Provide a lookup function for drivers to load a device pointer by a virtual device id.
Add a rw_semaphore protection around the vdev_id list. Any future ioctl handlers that potentially access the list must grab the lock too.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
 drivers/iommu/iommufd/device.c          |  12 +++
 drivers/iommu/iommufd/iommufd_private.h |  21 ++++
 drivers/iommu/iommufd/main.c            |   6 ++
 drivers/iommu/iommufd/viommu.c          | 121 ++++++++++++++++++++++++
 include/uapi/linux/iommufd.h            |  40 ++++++++
 5 files changed, 200 insertions(+)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 5fd3dd420290..3ad759971b32 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -136,6 +136,18 @@ void iommufd_device_destroy(struct iommufd_object *obj)
 	struct iommufd_device *idev =
 		container_of(obj, struct iommufd_device, obj);
 
+	/* Unlocked since there should be no race in a destroy() */
+	if (idev->vdev_id) {
+		struct iommufd_vdev_id *vdev_id = idev->vdev_id;
+		struct iommufd_viommu *viommu = vdev_id->viommu;
+		struct iommufd_vdev_id *old;
+
+		old = xa_cmpxchg(&viommu->vdev_ids, vdev_id->id, vdev_id,
+				 NULL, GFP_KERNEL);
+		WARN_ON(old != vdev_id);
+		kfree(vdev_id);
+		idev->vdev_id = NULL;
+	}
 	iommu_device_release_dma_owner(idev->dev);
 	iommufd_put_group(idev->igroup);
 	if (!iommufd_selftest_is_mock_dev(idev->dev))
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 1f2a1c133b9a..2c6e168c5300 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -416,6 +416,7 @@ struct iommufd_device {
 	struct iommufd_object obj;
 	struct iommufd_ctx *ictx;
 	struct iommufd_group *igroup;
+	struct iommufd_vdev_id *vdev_id;
 	struct list_head group_item;
 	/* always the physical device */
 	struct device *dev;
@@ -533,11 +534,31 @@ struct iommufd_viommu {
 	struct iommufd_ctx *ictx;
 	struct iommufd_hwpt_paging *hwpt;
 
+	/* The locking order is vdev_ids_rwsem -> igroup::lock */
+	struct rw_semaphore vdev_ids_rwsem;
+	struct xarray vdev_ids;
+
 	unsigned int type;
 };
+struct iommufd_vdev_id {
+	struct iommufd_viommu *viommu;
+	struct iommufd_device *idev;
+	u64 id;
+};
+
+static inline struct iommufd_viommu *
+iommufd_get_viommu(struct iommufd_ucmd *ucmd, u32 id)
+{
+	return container_of(iommufd_get_object(ucmd->ictx, id,
+					       IOMMUFD_OBJ_VIOMMU),
+			    struct iommufd_viommu, obj);
+}
+
 int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd);
 void iommufd_viommu_destroy(struct iommufd_object *obj);
+int iommufd_viommu_set_vdev_id(struct iommufd_ucmd *ucmd);
+int iommufd_viommu_unset_vdev_id(struct iommufd_ucmd *ucmd);
 
 #ifdef CONFIG_IOMMUFD_TEST
 int iommufd_test(struct iommufd_ucmd *ucmd);
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 288ee51b6829..199ad90fa36b 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -334,6 +334,8 @@ union ucmd_buffer {
 	struct iommu_option option;
 	struct iommu_vfio_ioas vfio_ioas;
 	struct iommu_viommu_alloc viommu;
+	struct iommu_viommu_set_vdev_id set_vdev_id;
+	struct iommu_viommu_unset_vdev_id unset_vdev_id;
 #ifdef CONFIG_IOMMUFD_TEST
 	struct iommu_test_cmd test;
 #endif
@@ -387,6 +389,10 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 __reserved),
 	IOCTL_OP(IOMMU_VIOMMU_ALLOC, iommufd_viommu_alloc_ioctl,
 		 struct iommu_viommu_alloc, out_viommu_id),
+	IOCTL_OP(IOMMU_VIOMMU_SET_VDEV_ID, iommufd_viommu_set_vdev_id,
+		 struct iommu_viommu_set_vdev_id, vdev_id),
+	IOCTL_OP(IOMMU_VIOMMU_UNSET_VDEV_ID, iommufd_viommu_unset_vdev_id,
+		 struct iommu_viommu_unset_vdev_id, vdev_id),
 #ifdef CONFIG_IOMMUFD_TEST
 	IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
 #endif
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c
index 200653a4bf57..8ffcd72b16b8 100644
--- a/drivers/iommu/iommufd/viommu.c
+++ b/drivers/iommu/iommufd/viommu.c
@@ -8,6 +8,15 @@ void iommufd_viommu_destroy(struct iommufd_object *obj)
 {
 	struct iommufd_viommu *viommu =
 		container_of(obj, struct iommufd_viommu, obj);
+	struct iommufd_vdev_id *vdev_id;
+	unsigned long index;
+
+	xa_for_each(&viommu->vdev_ids, index, vdev_id) {
+		/* Unlocked since there should be no race in a destroy() */
+		vdev_id->idev->vdev_id = NULL;
+		kfree(vdev_id);
+	}
+	xa_destroy(&viommu->vdev_ids);
 
 	refcount_dec(&viommu->hwpt->common.obj.users);
 }
@@ -53,6 +62,9 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
 	viommu->ictx = ucmd->ictx;
 	viommu->hwpt = hwpt_paging;
 
+	xa_init(&viommu->vdev_ids);
+	init_rwsem(&viommu->vdev_ids_rwsem);
+
 	refcount_inc(&viommu->hwpt->common.obj.users);
 
 	cmd->out_viommu_id = viommu->obj.id;
@@ -70,3 +82,112 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
 	iommufd_put_object(ucmd->ictx, &idev->obj);
 	return rc;
 }
+
+int iommufd_viommu_set_vdev_id(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_viommu_set_vdev_id *cmd = ucmd->cmd;
+	struct iommufd_vdev_id *vdev_id, *curr;
+	struct iommufd_viommu *viommu;
+	struct iommufd_device *idev;
+	int rc = 0;
+
+	if (cmd->vdev_id > ULONG_MAX)
+		return -EINVAL;
+
+	viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
+	if (IS_ERR(viommu))
+		return PTR_ERR(viommu);
+
+	idev = iommufd_get_device(ucmd, cmd->dev_id);
+	if (IS_ERR(idev)) {
+		rc = PTR_ERR(idev);
+		goto out_put_viommu;
+	}
+
+	down_write(&viommu->vdev_ids_rwsem);
+	mutex_lock(&idev->igroup->lock);
+	if (idev->vdev_id) {
+		rc = -EEXIST;
+		goto out_unlock_igroup;
+	}
+
+	vdev_id = kzalloc(sizeof(*vdev_id), GFP_KERNEL);
+	if (!vdev_id) {
+		rc = -ENOMEM;
+		goto out_unlock_igroup;
+	}
+
+	vdev_id->idev = idev;
+	vdev_id->viommu = viommu;
+	vdev_id->id = cmd->vdev_id;
+
+	curr = xa_cmpxchg(&viommu->vdev_ids, cmd->vdev_id, NULL, vdev_id,
+			  GFP_KERNEL);
+	if (curr) {
+		rc = xa_err(curr) ? : -EBUSY;
+		goto out_free;
+	}
+
+	idev->vdev_id = vdev_id;
+	goto out_unlock_igroup;
+
+out_free:
+	kfree(vdev_id);
+out_unlock_igroup:
+	mutex_unlock(&idev->igroup->lock);
+	up_write(&viommu->vdev_ids_rwsem);
+	iommufd_put_object(ucmd->ictx, &idev->obj);
+out_put_viommu:
+	iommufd_put_object(ucmd->ictx, &viommu->obj);
+	return rc;
+}
+
+int iommufd_viommu_unset_vdev_id(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_viommu_unset_vdev_id *cmd = ucmd->cmd;
+	struct iommufd_viommu *viommu;
+	struct iommufd_vdev_id *old;
+	struct iommufd_device *idev;
+	int rc = 0;
+
+	if (cmd->vdev_id > ULONG_MAX)
+		return -EINVAL;
+
+	viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
+	if (IS_ERR(viommu))
+		return PTR_ERR(viommu);
+
+	idev = iommufd_get_device(ucmd, cmd->dev_id);
+	if (IS_ERR(idev)) {
+		rc = PTR_ERR(idev);
+		goto out_put_viommu;
+	}
+
+	down_write(&viommu->vdev_ids_rwsem);
+	mutex_lock(&idev->igroup->lock);
+	if (!idev->vdev_id) {
+		rc = -ENOENT;
+		goto out_unlock_igroup;
+	}
+	if (idev->vdev_id->id != cmd->vdev_id) {
+		rc = -EINVAL;
+		goto out_unlock_igroup;
+	}
+
+	old = xa_cmpxchg(&viommu->vdev_ids, idev->vdev_id->id,
+			 idev->vdev_id, NULL, GFP_KERNEL);
+	if (xa_is_err(old)) {
+		rc = xa_err(old);
+		goto out_unlock_igroup;
+	}
+	kfree(old);
+	idev->vdev_id = NULL;
+
+out_unlock_igroup:
+	mutex_unlock(&idev->igroup->lock);
+	up_write(&viommu->vdev_ids_rwsem);
+	iommufd_put_object(ucmd->ictx, &idev->obj);
+out_put_viommu:
+	iommufd_put_object(ucmd->ictx, &viommu->obj);
+	return rc;
+}
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 51ce6a019c34..1816e89c922d 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -52,6 +52,8 @@ enum {
 	IOMMUFD_CMD_HWPT_INVALIDATE = 0x8d,
 	IOMMUFD_CMD_FAULT_QUEUE_ALLOC = 0x8e,
 	IOMMUFD_CMD_VIOMMU_ALLOC = 0x8f,
+	IOMMUFD_CMD_VIOMMU_SET_VDEV_ID = 0x90,
+	IOMMUFD_CMD_VIOMMU_UNSET_VDEV_ID = 0x91,
 };
 
 /**
@@ -882,4 +884,42 @@ struct iommu_viommu_alloc {
 	__u32 out_viommu_id;
 };
 #define IOMMU_VIOMMU_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_ALLOC)
+
+/**
+ * struct iommu_viommu_set_vdev_id - ioctl(IOMMU_VIOMMU_SET_VDEV_ID)
+ * @size: sizeof(struct iommu_viommu_set_vdev_id)
+ * @viommu_id: viommu ID to associate with the device to store its virtual ID
+ * @dev_id: device ID to set its virtual ID
+ * @__reserved: Must be 0
+ * @vdev_id: Virtual device ID
+ *
+ * Set a viommu-specific virtual ID of a device
+ */
+struct iommu_viommu_set_vdev_id {
+	__u32 size;
+	__u32 viommu_id;
+	__u32 dev_id;

Is this ID from vfio_device_bind_iommufd.out_devid?

+	__u32 __reserved;
+	__aligned_u64 vdev_id;

What is the nature of this id? It is not the guest's BDFn, is it? The code suggests it is ARM's "SID" == "stream ID", and "a device might be able to generate multiple StreamIDs" (how, why?) 🤯 And these streams seem to have nothing to do with PCIe IDE streams, right?

For my SEV-TIO exercise ("trusted IO"), I am looking for a kernel interface to pass the guest's BDFs for a specific host device (which is passed through), and nothing in the kernel has any knowledge of it atm. Is this the right place, or is another ioctl() needed here?

Sorry, I am too ignorant about ARM :)

+};
+#define IOMMU_VIOMMU_SET_VDEV_ID _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_SET_VDEV_ID)
+/**
+ * struct iommu_viommu_unset_vdev_id - ioctl(IOMMU_VIOMMU_UNSET_VDEV_ID)
+ * @size: sizeof(struct iommu_viommu_unset_vdev_id)
+ * @viommu_id: viommu ID associated with the device to delete its virtual ID
+ * @dev_id: device ID to unset its virtual ID
+ * @__reserved: Must be 0
+ * @vdev_id: Virtual device ID (for verification)
+ *
+ * Unset a viommu-specific virtual ID of a device
+ */
+struct iommu_viommu_unset_vdev_id {
+	__u32 size;
+	__u32 viommu_id;
+	__u32 dev_id;
+	__u32 __reserved;
+	__aligned_u64 vdev_id;
+};
+#define IOMMU_VIOMMU_UNSET_VDEV_ID _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_UNSET_VDEV_ID)
 #endif
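For a concrete picture of the flow, here is a minimal user-space sketch (an illustration, not code from this series) of how a VMM might drive these two ioctls; fd is an iommufd file descriptor, and viommu_id/dev_id are assumed to come from earlier IOMMU_VIOMMU_ALLOC and device-bind steps:

#include <sys/ioctl.h>
#include <linux/iommufd.h>

/* Assign a guest-visible virtual ID (e.g. a vStream ID) to a bound device */
static int viommu_assign_vdev_id(int fd, __u32 viommu_id, __u32 dev_id,
                                 __u64 vdev_id)
{
        struct iommu_viommu_set_vdev_id set = {
                .size = sizeof(set),
                .viommu_id = viommu_id,
                .dev_id = dev_id,
                .vdev_id = vdev_id,
        };

        return ioctl(fd, IOMMU_VIOMMU_SET_VDEV_ID, &set);
}

/* Remove the association; vdev_id must match the one set earlier */
static int viommu_clear_vdev_id(int fd, __u32 viommu_id, __u32 dev_id,
                                __u64 vdev_id)
{
        struct iommu_viommu_unset_vdev_id unset = {
                .size = sizeof(unset),
                .viommu_id = viommu_id,
                .dev_id = dev_id,
                .vdev_id = vdev_id,
        };

        return ioctl(fd, IOMMU_VIOMMU_UNSET_VDEV_ID, &unset);
}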
Nit: "git format-patch -O orderfile" makes patches nicer by putting the documentation first (.h before .c, in this case) with the "orderfile" looking like this:
===
*.txt
configure
*Makefile*
*.json
*.h
*.c
===
Thanks,
On Fri, Oct 04, 2024 at 02:32:28PM +1000, Alexey Kardashevskiy wrote:
+/**
+ * struct iommu_viommu_set_vdev_id - ioctl(IOMMU_VIOMMU_SET_VDEV_ID)
+ * @size: sizeof(struct iommu_viommu_set_vdev_id)
+ * @viommu_id: viommu ID to associate with the device to store its virtual ID
+ * @dev_id: device ID to set its virtual ID
+ * @__reserved: Must be 0
+ * @vdev_id: Virtual device ID
+ *
+ * Set a viommu-specific virtual ID of a device
+ */
+struct iommu_viommu_set_vdev_id {
+	__u32 size;
+	__u32 viommu_id;
+	__u32 dev_id;
Is this ID from vfio_device_bind_iommufd.out_devid?
Yes.
+	__u32 __reserved;
+	__aligned_u64 vdev_id;
What is the nature of this id? It is not the guest's BDFn, is it? The
Not exactly but certainly can be related. Explaining below..
code suggests it is ARM's "SID" == "stream ID" and "
Yes. That's the first use case of that.
a device might be able to generate multiple StreamIDs" (how, why?) 🤯 And these streams seem to have nothing to do with PCIe IDE streams, right?
A PCI device only has one Stream ID per SMMU.
So the Stream ID is more like a channel ID or client ID from the SMMU (IOMMU) view. A PCI device's Stream ID can be calculated from the BDF numbers + the Stream-ID base of that PCI bus.
That said, this is all about IOMMU. So, it is likely more natural to forward an IOMMU-specific ID (vStream ID for a vSMMU) v.s. BDF.
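As a rough illustration of that calculation (not from this series; the per-bus base normally comes from ACPI IORT or a devicetree "iommu-map", and smmu_sid_base below is a stand-in for that firmware data):

#include <linux/pci.h>

/* Hypothetical helper: derive a PCI device's SMMU Stream ID from its BDF */
static u32 pci_dev_to_stream_id(struct pci_dev *pdev, u32 smmu_sid_base)
{
        /* RID = bus[15:8] | devfn[7:0], which is what PCI_DEVID() packs */
        return smmu_sid_base + PCI_DEVID(pdev->bus->number, pdev->devfn);
}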
For my SEV-TIO exercise ("trusted IO"), I am looking for a kernel interface to pass the guest's BDFs for a specific host device (which is passed through) and nothing in the kernel has any knowledge of it atm, is this the right place, or another ioctl() is needed here?
Sorry, I am too ignorant about ARM :)
We are reworking this ioctl to an IOMMU_VDEVICE_ALLOC cmd, meaning a virtual device allocation. A virtual device is another bond, created when an iommufd_device connects to an iommufd_viommu in the VM. The names "vDEVICE" and "virtual device" still need to go through discussion, so they aren't finalized. But the idea here is to have a structure to gather all virtualization information at the intersection of the device and the vIOMMU in the VM.
On the other hand, BDF is very PCI specific yet IOMMU independent. E.g. it could exist for a PCI device even without a vIOMMU in the VM, i.e. there is no vDEVICE in such case. Right?
So, if your use case relies on IOMMU and it is even a part of the IOMMU virtualization features, I think you are looking at the right place. And we should discuss how to incorporate that. Otherwise, I feel the struct vfio_pci might be the one to extend?
+};
+#define IOMMU_VIOMMU_SET_VDEV_ID _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_SET_VDEV_ID)
+
+/**
+ * struct iommu_viommu_unset_vdev_id - ioctl(IOMMU_VIOMMU_UNSET_VDEV_ID)
+ * @size: sizeof(struct iommu_viommu_unset_vdev_id)
+ * @viommu_id: viommu ID associated with the device to delete its virtual ID
+ * @dev_id: device ID to unset its virtual ID
+ * @__reserved: Must be 0
+ * @vdev_id: Virtual device ID (for verification)
+ *
+ * Unset a viommu-specific virtual ID of a device
+ */
+struct iommu_viommu_unset_vdev_id {
+	__u32 size;
+	__u32 viommu_id;
+	__u32 dev_id;
+	__u32 __reserved;
+	__aligned_u64 vdev_id;
+};
+#define IOMMU_VIOMMU_UNSET_VDEV_ID _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_UNSET_VDEV_ID)
 #endif
Nit: "git format-patch -O orderfile" makes patches nicer by putting the documentation first (.h before .c, in this case) with the "orderfile" looking like this:
===
*.txt
configure
*Makefile*
*.json
*.h
*.c
===
Interesting :)
Will try it!
Thanks Nicolin
On Fri, Oct 04, 2024 at 02:32:28PM +1000, Alexey Kardashevskiy wrote:
- __u32 __reserved;
- __aligned_u64 vdev_id;
What is the nature of this id?
It should be the vIOMMU's HW representation for the virtual device.
On ARM it is the stream id, the index into the Stream Table
On AMD it would be the "DeviceID" the index in the Device Table
On Intel it is an index into the context table
The primary usage is to transform virtual invalidations from the guest into physical invalidations.
For my SEV-TIO exercise ("trusted IO"), I am looking for a kernel interface to pass the guest's BDFs for a specific host device (which is passed through) and nothing in the kernel has any knowledge of it atm, is this the right place, or another ioctl() is needed here?
We probably need to add the vRID as well to this struct for that reason.
The vdev_id is the iommu handle, and there is a platform specific transformation between Bus/Device/Function and the iommu handle. In some cases this is math, in some cases it is ACPI/DT tables or something.
So I think the kernel should not make an assumption about the relationship.
Jason
On Fri, Oct 04, 2024 at 08:41:47AM -0300, Jason Gunthorpe wrote:
On Fri, Oct 04, 2024 at 02:32:28PM +1000, Alexey Kardashevskiy wrote:
For my SEV-TIO exercise ("trusted IO"), I am looking for a kernel interface to pass the guest's BDFs for a specific host device (which is passed through) and nothing in the kernel has any knowledge of it atm, is this the right place, or another ioctl() is needed here?
We probably need to add the vRID as well to this struct for that reason.
"vRID"/"vBDF" doesn't sound very generic to me to put in this structure, though PCI devices are and very likely will be the only users of this Virtual Device for a while. Any good idea?
Also, I am wondering if the uAPI structure of Virtual Device should have a driver-specific data structure. And the vdev_id should be in the driver-specific struct. So, it could stay in corresponding naming, "Stream ID", "Device ID" or "Context ID" v.s. a generic "Virtual ID" in the top-level structure? Then, other info like CCA can be put in the driver-level structure of SMMU's.
The vdev_id is the iommu handle, and there is a platform specific transformation between Bus/Device/Function and the iommu handle. In some cases this is math, in some cases it is ACPI/DT tables or something.
So I think the kernel should not make an assumption about the relationship.
Agreed. That also implies that a vRID is quite independent of the IOMMU, right? So, I think the reason for adding a vRID to the virtual device uAPI/structure should be the IOMMU requiring it?
Thanks Nicolin
On Fri, Oct 04, 2024 at 11:13:46AM -0700, Nicolin Chen wrote:
On Fri, Oct 04, 2024 at 08:41:47AM -0300, Jason Gunthorpe wrote:
On Fri, Oct 04, 2024 at 02:32:28PM +1000, Alexey Kardashevskiy wrote:
For my SEV-TIO exercise ("trusted IO"), I am looking for a kernel interface to pass the guest's BDFs for a specific host device (which is passed through) and nothing in the kernel has any knowledge of it atm, is this the right place, or another ioctl() is needed here?
We probably need to add the vRID as well to this struct for that reason.
"vRID"/"vBDF" doesn't sound very generic to me to put in this structure, though PCI devices are and very likely will be the only users of this Virtual Device for a while. Any good idea?
It isn't necessarily bad to have a pci field as long as we can somehow understand when it is used.
Also, I am wondering if the uAPI structure of Virtual Device should have a driver-specific data structure. And the vdev_id should be in the driver-specific struct. So, it could stay in corresponding naming, "Stream ID", "Device ID" or "Context ID" v.s. a generic "Virtual ID" in the top-level structure? Then, other info like CCA can be put in the driver-level structure of SMMU's.
I'd like to avoid an iommu-driver specific structure here, but I fear we will have a "lowervisor" (sigh) specific structure for the widely varied CC/pkvm/etc world.
Agreed. That also implies that a vRID is quite independent of the IOMMU, right? So, I think the reason for adding a vRID to the virtual device uAPI/structure should be the IOMMU requiring it?
I would like to use this API to link in the CC/pkvm/etc world, and use it to create not just the vIOMMU components but link up to the "lowervisor" components as well, since it is all the same stuff basically.
Jason
On Fri, Oct 04, 2024 at 03:50:19PM -0300, Jason Gunthorpe wrote:
On Fri, Oct 04, 2024 at 11:13:46AM -0700, Nicolin Chen wrote:
On Fri, Oct 04, 2024 at 08:41:47AM -0300, Jason Gunthorpe wrote:
On Fri, Oct 04, 2024 at 02:32:28PM +1000, Alexey Kardashevskiy wrote:
For my SEV-TIO exercise ("trusted IO"), I am looking for a kernel interface to pass the guest's BDFs for a specific host device (which is passed through) and nothing in the kernel has any knowledge of it atm, is this the right place, or another ioctl() is needed here?
We probably need to add the vRID as well to this struct for that reason.
"vRID"/"vBDF" doesn't sound very generic to me to put in this structure, though PCI devices are and very likely will be the only users of this Virtual Device for a while. Any good idea?
It isn't necessarily bad to have a pci field as long as we can somehow understand when it is used.
OK.
Also, I am wondering if the uAPI structure of Virtual Device should have a driver-specific data structure. And the vdev_id should be in the driver-specific struct. So, it could stay in corresponding naming, "Stream ID", "Device ID" or "Context ID" v.s. a generic "Virtual ID" in the top-level structure? Then, other info like CCA can be put in the driver-level structure of SMMU's.
I'd like to avoid an iommu-driver specific structure here, but I fear we will have a "lowervisor" (sigh) specific structure for the widely varied CC/pkvm/etc world.
The design of the structure also impacts how we implement the API between iommufd and the drivers. Right now, forwarding the ID via a function parameter is fine, but we would need a user structure once we have more stuff to forward.
With that, I wonder what is better for the initial version of this structure, a generic virtual ID or a driver-named ID like "Stream ID"? The latter might be more understandable/flexible, so we won't need to justify a generic virtual ID along the way if something changes in the nature?
Agreed. That also implies that a vRID is quite independent of the IOMMU, right? So, I think the reason for adding a vRID to the virtual device uAPI/structure should be the IOMMU requiring it?
I would like to use this API to link in the CC/pkvm/etc world, and use it to create not just the vIOMMU components but link up to the "lowervisor" components as well, since it is all the same stuff basically.
That sounds wider than what I defined it for in my patch:

 * struct iommu_vdevice_alloc - ioctl(IOMMU_VDEVICE_ALLOC)
 * ...
 * Allocate a virtual device instance (for a physical device) against a vIOMMU.
 * This instance holds the device's information in a VM, related to its vIOMMU.
Would you please help rephrase it? It'd be also helpful for me to update the doc.
Though I feel slightly odd if we define it wider than "vIOMMU" since this is an iommufd header...
Thanks Nicolin
On Fri, Oct 04, 2024 at 12:25:19PM -0700, Nicolin Chen wrote:
With that, I wonder what is better for the initial version of this structure, a generic virtual ID or a driver-named ID like "Stream ID"? The latter might be more understandable/flexible, so we won't need to justify a generic virtual ID along the way if something changes in the nature?
I think the name could be a bit more specific "viommu_device_id" maybe? And elaborate in the kdoc that this is about the identifier that the iommu HW itself uses.
That sounds wider than what I defined it for in my patch:

 * struct iommu_vdevice_alloc - ioctl(IOMMU_VDEVICE_ALLOC)
 * ...
 * Allocate a virtual device instance (for a physical device) against a vIOMMU.
 * This instance holds the device's information in a VM, related to its vIOMMU.
Would you please help rephrase it? It'd be also helpful for me to update the doc.
I think that is still OK for the moment.
Though I feel slightly odd if we define it wider than "vIOMMU" since this is an iommufd header...
The notion I have is that vIOMMU would expand to encompass not just the physical hypervisor-controlled vIOMMU but also the vIOMMU controlled by the trusted "lowervisor" in a pkvm/cc/whatever world.
Alexey is working on vIOMMU support in CC which has the trusted world do some of the trusted vIOMMU components. I'm hoping the other people in this area will look at his design and make it fit nicely to everyone.
Jason
On Fri, Oct 04, 2024 at 05:17:46PM -0300, Jason Gunthorpe wrote:
On Fri, Oct 04, 2024 at 12:25:19PM -0700, Nicolin Chen wrote:
With that, I wonder what is better for the initial version of this structure, a generic virtual ID or a driver-named ID like "Stream ID"? The latter might be more understandable/flexible, so we won't need to justify a generic virtual ID along the way if something changes in the nature?
I think the name could be a bit more specific "viommu_device_id" maybe? And elaborate in the kdoc that this is about the identifier that the iommu HW itself uses.
A "vIOMMU Device ID" might sound a bit confusing with an ID of a vIOMMU itself :-/
At this moment, I named it "virt_id" in uAPI with a description: " * @virt_id: Virtual device ID per vIOMMU"
I could add to that (or just in the documentation): "e.g. ARM's Stream ID, AMD's DeviceID, Intel's Context Table ID" to clarify it further.
That sounds wider than what I defined it for in my patch:

 * struct iommu_vdevice_alloc - ioctl(IOMMU_VDEVICE_ALLOC)
 * ...
 * Allocate a virtual device instance (for a physical device) against a vIOMMU.
 * This instance holds the device's information in a VM, related to its vIOMMU.
Would you please help rephrase it? It'd be also helpful for me to update the doc.
I think that is still OK for the moment.
Though I feel slightly odd if we define it wider than "vIOMMU" since this is an iommufd header...
The notion I have is that vIOMMU would expand to encompass not just the physical hypervisor-controlled vIOMMU but also the vIOMMU controlled by the trusted "lowervisor" in a pkvm/cc/whatever world.
Alexey is working on vIOMMU support in CC which has the trusted world do some of the trusted vIOMMU components. I'm hoping the other people in this area will look at his design and make it fit nicely to everyone.
Oh, I didn't connect the dots that lowervisor must rely on the vIOMMU too -- I'd need to check the CC stuff in detail. In that case, having it in vIOMMU uAPI totally makes sense.
Thanks Nicolin
On Fri, Oct 04, 2024 at 01:33:33PM -0700, Nicolin Chen wrote:
On Fri, Oct 04, 2024 at 05:17:46PM -0300, Jason Gunthorpe wrote:
On Fri, Oct 04, 2024 at 12:25:19PM -0700, Nicolin Chen wrote:
With that, I wonder what is better for the initial version of this structure, a generic virtual ID or a driver-named ID like "Stream ID"? The latter might be more understandable/flexible, so we won't need to justify a generic virtual ID along the way if something changes in the nature?
I think the name could be a bit more specific "viommu_device_id" maybe? And elaborate in the kdoc that this is about the identifier that the iommu HW itself uses.
A "vIOMMU Device ID" might sound a bit confusing with an ID of a vIOMMU itself :-/
At this moment, I named it "virt_id" in uAPI with a description: " * @virt_id: Virtual device ID per vIOMMU"
I could add to that (or just in the documentation): "e.g. ARM's Stream ID, AMD's DeviceID, Intel's Context Table ID" to clarify it further.
Yeah probably best
Alexey is working on vIOMMU support in CC which has the trusted world do some of the trusted vIOMMU components. I'm hoping the other people in this area will look at his design and make it fit nicely to everyone.
Oh, I didn't connect the dots that lowervisor must rely on the vIOMMU too -- I'd need to check the CC stuff in detail. In that case, having it in vIOMMU uAPI totally makes sense.
I think we are still getting through this, and I do wonder how to manage it with so much stuff still in flux and sometimes private.
At least for AMD's CC case where there is an entanglement with the physical IOMMU it seems like it makes sense.
Even in the pKVM type case I think you end up ripping the translation away from the "physical" iommu and there should be some coordination to ensure this handover is clean and undone when iommufd is closed. But I wonder how the S2 works there..
Jason
A core-managed VIOMMU maintains an xarray to store virtual IDs mapped to mock_devs.
Add test cases to cover the new IOMMU_VIOMMU_SET/UNSET_VDEV_ID ioctls.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
---
 tools/testing/selftests/iommu/iommufd.c       | 27 +++++++++++-
 tools/testing/selftests/iommu/iommufd_utils.h | 42 +++++++++++++++++++
 2 files changed, 67 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 5c770e94f299..f383f3bc7c8b 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -556,9 +556,12 @@ TEST_F(iommufd_ioas, alloc_hwpt_nested)
 
 TEST_F(iommufd_ioas, viommu_default)
 {
+	struct iommu_hwpt_selftest data = {
+		.iotlb = IOMMU_TEST_IOTLB_DEFAULT,
+	};
+	uint32_t nested_hwpt_id = 0, hwpt_id = 0;
 	uint32_t dev_id = self->device_id;
 	uint32_t viommu_id = 0;
-	uint32_t hwpt_id = 0;
 
 	if (dev_id) {
 		/* Negative test -- invalid hwpt */
@@ -575,17 +578,37 @@ TEST_F(iommufd_ioas, viommu_default)
 		test_cmd_hwpt_alloc(dev_id, self->ioas_id,
 				    IOMMU_HWPT_ALLOC_NEST_PARENT, &hwpt_id);
+		test_cmd_mock_domain_replace(self->stdev_id, hwpt_id);
+
 		/* Negative test -- unsupported viommu type */
 		test_err_viommu_alloc(EOPNOTSUPP, dev_id, hwpt_id,
 				      0xdead, &viommu_id);
-		/* Allocate a default type of viommu */
+
+		/* Allocate a default type of viommu and a nested hwpt on top */
 		test_cmd_viommu_alloc(dev_id, hwpt_id,
 				      IOMMU_VIOMMU_TYPE_DEFAULT, &viommu_id);
+		test_cmd_hwpt_alloc_nested(self->device_id, viommu_id, 0,
+					   &nested_hwpt_id,
+					   IOMMU_HWPT_DATA_SELFTEST, &data,
+					   sizeof(data));
+		test_cmd_mock_domain_replace(self->stdev_id, nested_hwpt_id);
+
+		/* Set vdev_id to 0x99, unset it, and set to 0x88 */
+		test_cmd_viommu_set_vdev_id(viommu_id, dev_id, 0x99);
+		test_err_viommu_set_vdev_id(EEXIST, viommu_id, dev_id, 0x99);
+		test_err_viommu_unset_vdev_id(EINVAL, viommu_id, dev_id, 0x88);
+		test_cmd_viommu_unset_vdev_id(viommu_id, dev_id, 0x99);
+		test_cmd_viommu_set_vdev_id(viommu_id, dev_id, 0x88);
+
+		test_cmd_mock_domain_replace(self->stdev_id, hwpt_id);
+		test_ioctl_destroy(nested_hwpt_id);
+		test_cmd_mock_domain_replace(self->stdev_id, self->ioas_id);
 		test_ioctl_destroy(viommu_id);
 		test_ioctl_destroy(hwpt_id);
 	} else {
 		test_err_viommu_alloc(ENOENT, dev_id, hwpt_id,
 				      IOMMU_VIOMMU_TYPE_DEFAULT, &viommu_id);
+		test_err_viommu_set_vdev_id(ENOENT, viommu_id, dev_id, 0x99);
 	}
 }
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index 307d097db9dd..be722ea88358 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -790,3 +790,45 @@ static int _test_cmd_viommu_alloc(int fd, __u32 device_id, __u32 hwpt_id,
 	EXPECT_ERRNO(_errno, _test_cmd_viommu_alloc(self->fd, device_id,   \
 						    hwpt_id, type, 0,      \
 						    viommu_id))
+
+static int _test_cmd_viommu_set_vdev_id(int fd, __u32 viommu_id,
+					__u32 idev_id, __u64 vdev_id)
+{
+	struct iommu_viommu_set_vdev_id cmd = {
+		.size = sizeof(cmd),
+		.dev_id = idev_id,
+		.viommu_id = viommu_id,
+		.vdev_id = vdev_id,
+	};
+
+	return ioctl(fd, IOMMU_VIOMMU_SET_VDEV_ID, &cmd);
+}
+
+#define test_cmd_viommu_set_vdev_id(viommu_id, idev_id, vdev_id)           \
+	ASSERT_EQ(0, _test_cmd_viommu_set_vdev_id(self->fd, viommu_id,     \
+						  idev_id, vdev_id))
+#define test_err_viommu_set_vdev_id(_errno, viommu_id, idev_id, vdev_id)   \
+	EXPECT_ERRNO(_errno,                                                \
+		     _test_cmd_viommu_set_vdev_id(self->fd, viommu_id,      \
+						  idev_id, vdev_id))
+
+static int _test_cmd_viommu_unset_vdev_id(int fd, __u32 viommu_id,
+					  __u32 idev_id, __u64 vdev_id)
+{
+	struct iommu_viommu_unset_vdev_id cmd = {
+		.size = sizeof(cmd),
+		.dev_id = idev_id,
+		.viommu_id = viommu_id,
+		.vdev_id = vdev_id,
+	};
+
+	return ioctl(fd, IOMMU_VIOMMU_UNSET_VDEV_ID, &cmd);
+}
+
+#define test_cmd_viommu_unset_vdev_id(viommu_id, idev_id, vdev_id)         \
+	ASSERT_EQ(0, _test_cmd_viommu_unset_vdev_id(self->fd, viommu_id,   \
+						    idev_id, vdev_id))
+#define test_err_viommu_unset_vdev_id(_errno, viommu_id, idev_id, vdev_id) \
+	EXPECT_ERRNO(_errno,                                                \
+		     _test_cmd_viommu_unset_vdev_id(self->fd, viommu_id,    \
+						    idev_id, vdev_id))
Add a default_viommu_ops with a new op for cache invalidation, similar to the cache_invalidate_user op in struct iommu_domain_ops, but wider. An IOMMU driver that allocated a nested domain with a core-managed viommu is able to use the same viommu pointer for this cache invalidation API.
ARM SMMUv3 for example supports IOTLB and ATC device cache invalidations. The IOTLB invalidation is per-VMID, held currently by a parent S2 domain. The ATC invalidation is per device (Stream ID) and should be translated by a virtual device ID lookup table. Either case fits the viommu context.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
---
 drivers/iommu/iommufd/iommufd_private.h |  3 +++
 drivers/iommu/iommufd/viommu.c          |  3 +++
 include/linux/iommu.h                   |  5 +++++
 include/linux/iommufd.h                 | 19 +++++++++++++++++++
 4 files changed, 30 insertions(+)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 2c6e168c5300..7831b0ca6528 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -5,6 +5,7 @@
 #define __IOMMUFD_PRIVATE_H
 
 #include <linux/iommu.h>
+#include <linux/iommufd.h>
 #include <linux/iova_bitmap.h>
 #include <linux/refcount.h>
 #include <linux/rwsem.h>
@@ -538,6 +539,8 @@ struct iommufd_viommu {
 	struct rw_semaphore vdev_ids_rwsem;
 	struct xarray vdev_ids;
 
+	const struct iommufd_viommu_ops *ops;
+
 	unsigned int type;
 };
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c
index 8ffcd72b16b8..a4ba8bff4a26 100644
--- a/drivers/iommu/iommufd/viommu.c
+++ b/drivers/iommu/iommufd/viommu.c
@@ -27,6 +27,7 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
 	struct iommufd_hwpt_paging *hwpt_paging;
 	struct iommufd_viommu *viommu;
 	struct iommufd_device *idev;
+	struct iommu_domain *domain;
 	int rc;
 
 	if (cmd->flags)
@@ -46,6 +47,7 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
 		rc = -EINVAL;
 		goto out_put_hwpt;
 	}
+	domain = hwpt_paging->common.domain;
 
 	if (cmd->type != IOMMU_VIOMMU_TYPE_DEFAULT) {
 		rc = -EOPNOTSUPP;
@@ -61,6 +63,7 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
 	viommu->type = cmd->type;
 	viommu->ictx = ucmd->ictx;
 	viommu->hwpt = hwpt_paging;
+	viommu->ops = domain->ops->default_viommu_ops;
 
 	xa_init(&viommu->vdev_ids);
 	init_rwsem(&viommu->vdev_ids_rwsem);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index f62aad8a9e75..8c1034cc3f7e 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -43,6 +43,7 @@ struct iommu_sva;
 struct iommu_dma_cookie;
 struct iommu_fault_param;
 struct iommufd_viommu;
+struct iommufd_viommu_ops;
 
 #define IOMMU_FAULT_PERM_READ	(1 << 0) /* read */
 #define IOMMU_FAULT_PERM_WRITE	(1 << 1) /* write */
@@ -633,6 +634,8 @@ struct iommu_ops {
 *                          array->entry_num to report the number of handled
 *                          invalidation requests. The driver data structure
 *                          must be defined in include/uapi/linux/iommufd.h
+ * @default_viommu_ops: Driver can choose to use a default core-allocated core-
+ *                      managed viommu object by providing a default viommu ops.
 * @iova_to_phys: translate iova to physical address
 * @enforce_cache_coherency: Prevent any kind of DMA from bypassing IOMMU_CACHE,
 *                           including no-snoop TLPs on PCIe or other platform
@@ -665,6 +668,8 @@ struct iommu_domain_ops {
 	phys_addr_t (*iova_to_phys)(struct iommu_domain *domain,
 				    dma_addr_t iova);
 
+	const struct iommufd_viommu_ops *default_viommu_ops;
+
 	bool (*enforce_cache_coherency)(struct iommu_domain *domain);
 	int (*set_pgtable_quirks)(struct iommu_domain *domain,
 				  unsigned long quirks);
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 30f832a60ccb..85291b346348 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -13,9 +13,11 @@
 struct device;
 struct file;
 struct iommu_group;
+struct iommu_user_data_array;
 struct iommufd_access;
 struct iommufd_ctx;
 struct iommufd_device;
+struct iommufd_viommu;
 struct page;
 
 struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx,
@@ -54,6 +56,23 @@ void iommufd_access_detach(struct iommufd_access *access);
 
 void iommufd_ctx_get(struct iommufd_ctx *ictx);
 
+/**
+ * struct iommufd_viommu_ops - viommu specific operations
+ * @cache_invalidate: Flush hardware cache used by a viommu. It can be used for
+ *                    any IOMMU hardware specific cache as long as a viommu has
+ *                    enough information to identify it: for example, a VMID or
+ *                    a vdev_id lookup table.
+ *                    The @array passes in the cache invalidation requests, in
+ *                    form of a driver data structure. A driver must update the
+ *                    array->entry_num to report the number of handled requests.
+ *                    The data structure of the array entry must be defined in
+ *                    include/uapi/linux/iommufd.h
+ */
+struct iommufd_viommu_ops {
+	int (*cache_invalidate)(struct iommufd_viommu *viommu,
+				struct iommu_user_data_array *array);
+};
+
 #if IS_ENABLED(CONFIG_IOMMUFD)
 struct iommufd_ctx *iommufd_ctx_from_file(struct file *file);
 struct iommufd_ctx *iommufd_ctx_from_fd(int fd);
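To sketch how a driver would opt in (an illustration only; the mydrv_* names are invented, not part of the patch): the driver hangs a default_viommu_ops off the domain ops of its nesting parent (S2) domain, which the viommu allocation path above copies into viommu->ops:

/* Hypothetical driver-side cache_invalidate handler */
static int mydrv_viommu_cache_invalidate(struct iommufd_viommu *viommu,
                                         struct iommu_user_data_array *array)
{
        /*
         * Decode the driver-specific entries in @array, translating any
         * virtual device IDs via the viommu's vdev_id table, and update
         * array->entry_num to report how many requests were handled.
         */
        return 0;
}

static const struct iommufd_viommu_ops mydrv_default_viommu_ops = {
        .cache_invalidate = mydrv_viommu_cache_invalidate,
};

static const struct iommu_domain_ops mydrv_s2_domain_ops = {
        /* ...the driver's usual attach/map/unmap callbacks... */
        .default_viommu_ops = &mydrv_default_viommu_ops,
};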
On Tue, Aug 27, 2024 at 09:59:45AM -0700, Nicolin Chen wrote:
Add a default_viommu_ops with a new op for cache invalidation, similar to the cache_invalidate_user op in struct iommu_domain_ops, but wider. An IOMMU driver that allocated a nested domain with a core-managed viommu is able to use the same viommu pointer for this cache invalidation API.
ARM SMMUv3 for example supports IOTLB and ATC device cache invalidations. The IOTLB invalidation is per-VMID, held currently by a parent S2 domain. The ATC invalidation is per device (Stream ID) and should be translated by a virtual device ID lookup table. Either case fits the viommu context.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
 drivers/iommu/iommufd/iommufd_private.h |  3 +++
 drivers/iommu/iommufd/viommu.c          |  3 +++
 include/linux/iommu.h                   |  5 +++++
 include/linux/iommufd.h                 | 19 +++++++++++++++++++
 4 files changed, 30 insertions(+)
It looks OK
Reviewed-by: Jason Gunthorpe jgg@nvidia.com
Jason
With a VIOMMU object, user space can flush any IOMMU-related cache that can be directed via the viommu. It is similar to the IOMMU_HWPT_INVALIDATE uAPI, but can cover a wider range than the IOTLB, such as a device cache or a descriptor cache.
Allow hwpt_id of the iommu_hwpt_invalidate structure to carry a viommu_id, and reuse the IOMMU_HWPT_INVALIDATE uAPI for VIOMMU invalidations. Driver can define a different structure for VIOMMU invalidations v.s. HWPT ones.
Update the uAPI, kdoc, and selftest case accordingly.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
---
 drivers/iommu/iommufd/hw_pagetable.c    | 32 +++++++++++++++++++------
 include/uapi/linux/iommufd.h            |  9 ++++---
 tools/testing/selftests/iommu/iommufd.c |  4 ++--
 3 files changed, 33 insertions(+), 12 deletions(-)
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 06adbcc304bc..6aaec1b32abc 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -432,7 +432,7 @@ int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd)
 		.entry_len = cmd->entry_len,
 		.entry_num = cmd->entry_num,
 	};
-	struct iommufd_hw_pagetable *hwpt;
+	struct iommufd_object *pt_obj;
 	u32 done_num = 0;
 	int rc;
 
@@ -446,17 +446,35 @@ int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd)
 		goto out;
 	}
 
-	hwpt = iommufd_get_hwpt_nested(ucmd, cmd->hwpt_id);
-	if (IS_ERR(hwpt)) {
-		rc = PTR_ERR(hwpt);
+	pt_obj = iommufd_get_object(ucmd->ictx, cmd->hwpt_id, IOMMUFD_OBJ_ANY);
+	if (IS_ERR(pt_obj)) {
+		rc = PTR_ERR(pt_obj);
 		goto out;
 	}
+	if (pt_obj->type == IOMMUFD_OBJ_HWPT_NESTED) {
+		struct iommufd_hw_pagetable *hwpt =
+			container_of(pt_obj, struct iommufd_hw_pagetable, obj);
+
+		rc = hwpt->domain->ops->cache_invalidate_user(hwpt->domain,
+							      &data_array);
+	} else if (pt_obj->type == IOMMUFD_OBJ_VIOMMU) {
+		struct iommufd_viommu *viommu =
+			container_of(pt_obj, struct iommufd_viommu, obj);
+
+		if (!viommu->ops || !viommu->ops->cache_invalidate) {
+			rc = -EOPNOTSUPP;
+			goto out_put_pt;
+		}
+		rc = viommu->ops->cache_invalidate(viommu, &data_array);
+	} else {
+		rc = -EINVAL;
+		goto out_put_pt;
+	}
 
-	rc = hwpt->domain->ops->cache_invalidate_user(hwpt->domain,
-						      &data_array);
 	done_num = data_array.entry_num;
 
-	iommufd_put_object(ucmd->ictx, &hwpt->obj);
+out_put_pt:
+	iommufd_put_object(ucmd->ictx, pt_obj);
 out:
 	cmd->entry_num = done_num;
 	if (iommufd_ucmd_respond(ucmd, sizeof(*cmd)))
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 1816e89c922d..fd7d16fd441d 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -729,7 +729,7 @@ struct iommu_hwpt_vtd_s1_invalidate {
 /**
  * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
  * @size: sizeof(struct iommu_hwpt_invalidate)
- * @hwpt_id: ID of a nested HWPT for cache invalidation
+ * @hwpt_id: ID of a nested HWPT or VIOMMU, for cache invalidation
  * @data_uptr: User pointer to an array of driver-specific cache invalidation
  *             data.
  * @data_type: One of enum iommu_hwpt_invalidate_data_type, defining the data
@@ -740,8 +740,11 @@ struct iommu_hwpt_vtd_s1_invalidate {
  *             Output the number of requests successfully handled by kernel.
  * @__reserved: Must be 0.
  *
- * Invalidate the iommu cache for user-managed page table. Modifications on a
- * user-managed page table should be followed by this operation to sync cache.
+ * Invalidate iommu cache for user-managed page table or vIOMMU. Modifications
+ * on a user-managed page table should be followed by this operation, if a HWPT
+ * is passed in via @hwpt_id. Other caches, such as device cache or descriptor
+ * cache can be flushed if a VIOMMU is passed in via the @hwpt_id field.
+ *
 * Each ioctl can support one or more cache invalidation requests in the array
 * that has a total size of @entry_len * @entry_num.
 *
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index f383f3bc7c8b..12b5a8f78d4b 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -360,9 +360,9 @@ TEST_F(iommufd_ioas, alloc_hwpt_nested)
 	EXPECT_ERRNO(EBUSY,
 		     _test_ioctl_destroy(self->fd, parent_hwpt_id));
 
-	/* hwpt_invalidate only supports a user-managed hwpt (nested) */
+	/* hwpt_invalidate does not support a parent hwpt */
 	num_inv = 1;
-	test_err_hwpt_invalidate(ENOENT, parent_hwpt_id, inv_reqs,
+	test_err_hwpt_invalidate(EINVAL, parent_hwpt_id, inv_reqs,
 				 IOMMU_HWPT_INVALIDATE_DATA_SELFTEST,
 				 sizeof(*inv_reqs), &num_inv);
 	assert(!num_inv);
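For reference, a hedged user-space sketch (not from the patch) of directing an invalidation at a VIOMMU through the reused ioctl; reqs points at entry_num driver-specific request entries whose layout is selected by data_type:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int viommu_cache_invalidate(int fd, __u32 viommu_id, void *reqs,
                                   __u32 data_type, __u32 entry_len,
                                   __u32 *entry_num)
{
        struct iommu_hwpt_invalidate cmd = {
                .size = sizeof(cmd),
                .hwpt_id = viommu_id,   /* may now name a VIOMMU, not only a HWPT */
                .data_uptr = (uintptr_t)reqs,
                .data_type = data_type,
                .entry_len = entry_len,
                .entry_num = *entry_num,
        };
        int rc = ioctl(fd, IOMMU_HWPT_INVALIDATE, &cmd);

        *entry_num = cmd.entry_num;     /* number of requests the kernel handled */
        return rc;
}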
On Tue, Aug 27, 2024 at 09:59:46AM -0700, Nicolin Chen wrote:
With a VIOMMU object, user space can flush any IOMMU-related cache that can be directed via the viommu. It is similar to the IOMMU_HWPT_INVALIDATE uAPI, but can cover a wider range than the IOTLB, such as a device cache or a descriptor cache.
Allow hwpt_id of the iommu_hwpt_invalidate structure to carry a viommu_id, and reuse the IOMMU_HWPT_INVALIDATE uAPI for VIOMMU invalidations. Driver can define a different structure for VIOMMU invalidations v.s. HWPT ones.
Update the uAPI, kdoc, and selftest case accordingly.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
 drivers/iommu/iommufd/hw_pagetable.c    | 32 +++++++++++++++++++------
 include/uapi/linux/iommufd.h            |  9 ++++---
 tools/testing/selftests/iommu/iommufd.c |  4 ++--
 3 files changed, 33 insertions(+), 12 deletions(-)
Reviewed-by: Jason Gunthorpe jgg@nvidia.com
Jason
Drivers can call iommufd_viommu_find_device() to find a device pointer using its per-viommu virtual ID. The returned device must be protected by the pair of iommufd_viommu_lock/unlock_vdev_id() functions.
Put these three functions into a new viommu_api file, to build it with the IOMMUFD_DRIVER config.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
---
 drivers/iommu/iommufd/Makefile     |  2 +-
 drivers/iommu/iommufd/viommu_api.c | 39 ++++++++++++++++++++++++++++++
 include/linux/iommufd.h            | 16 ++++++++++++
 3 files changed, 56 insertions(+), 1 deletion(-)
 create mode 100644 drivers/iommu/iommufd/viommu_api.c
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index df490e836b30..288ef3e895e3 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -13,4 +13,4 @@ iommufd-y := \
 iommufd-$(CONFIG_IOMMUFD_TEST) += selftest.o
 
 obj-$(CONFIG_IOMMUFD) += iommufd.o
-obj-$(CONFIG_IOMMUFD_DRIVER) += iova_bitmap.o
+obj-$(CONFIG_IOMMUFD_DRIVER) += iova_bitmap.o viommu_api.o
diff --git a/drivers/iommu/iommufd/viommu_api.c b/drivers/iommu/iommufd/viommu_api.c
new file mode 100644
index 000000000000..e0ee592ce834
--- /dev/null
+++ b/drivers/iommu/iommufd/viommu_api.c
@@ -0,0 +1,39 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES
+ */
+
+#include "iommufd_private.h"
+
+void iommufd_viommu_lock_vdev_id(struct iommufd_viommu *viommu)
+{
+	down_read(&viommu->vdev_ids_rwsem);
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_viommu_lock_vdev_id, IOMMUFD);
+
+void iommufd_viommu_unlock_vdev_id(struct iommufd_viommu *viommu)
+{
+	up_read(&viommu->vdev_ids_rwsem);
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_viommu_unlock_vdev_id, IOMMUFD);
+
+/*
+ * Find a device attached to a VIOMMU object using a virtual device ID that was
+ * set via an IOMMUFD_CMD_VIOMMU_SET_VDEV_ID. Callers of this function must call
+ * iommufd_viommu_lock_vdev_id() prior and iommufd_viommu_unlock_vdev_id() after
+ *
+ * Return device or NULL.
+ */
+struct device *iommufd_viommu_find_device(struct iommufd_viommu *viommu, u64 id)
+{
+	struct iommufd_vdev_id *vdev_id;
+
+	lockdep_assert_held(&viommu->vdev_ids_rwsem);
+
+	xa_lock(&viommu->vdev_ids);
+	vdev_id = xa_load(&viommu->vdev_ids, (unsigned long)id);
+	xa_unlock(&viommu->vdev_ids);
+	if (!vdev_id || vdev_id->id != id)
+		return NULL;
+	return vdev_id->idev->dev;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_viommu_find_device, IOMMUFD);
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 85291b346348..364f151d281d 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -89,6 +89,9 @@ int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
 int iommufd_vfio_compat_ioas_get_id(struct iommufd_ctx *ictx, u32 *out_ioas_id);
 int iommufd_vfio_compat_ioas_create(struct iommufd_ctx *ictx);
 int iommufd_vfio_compat_set_no_iommu(struct iommufd_ctx *ictx);
+void iommufd_viommu_lock_vdev_id(struct iommufd_viommu *viommu);
+void iommufd_viommu_unlock_vdev_id(struct iommufd_viommu *viommu);
+struct device *iommufd_viommu_find_device(struct iommufd_viommu *viommu, u64 id);
 #else /* !CONFIG_IOMMUFD */
 static inline struct iommufd_ctx *iommufd_ctx_from_file(struct file *file)
 {
@@ -129,5 +132,18 @@ static inline int iommufd_vfio_compat_set_no_iommu(struct iommufd_ctx *ictx)
 {
 	return -EOPNOTSUPP;
 }
+
+void iommufd_viommu_lock_vdev_id(struct iommufd_viommu *viommu)
+{
+}
+
+void iommufd_viommu_unlock_vdev_id(struct iommufd_viommu *viommu)
+{
+}
+
+struct device *iommufd_viommu_find_device(struct iommufd_viommu *viommu, u64 id)
+{
+	return NULL;
+}
 #endif /* CONFIG_IOMMUFD */
 #endif
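An illustrative calling pattern for this API from an IOMMU driver (a hedged sketch; the mydrv_* name is invented) while handling a viommu invalidation request that carries a virtual device ID:

static int mydrv_invalidate_dev_cache(struct iommufd_viommu *viommu,
                                      u64 vdev_id)
{
        struct device *dev;
        int rc = -ENOENT;

        iommufd_viommu_lock_vdev_id(viommu);
        dev = iommufd_viommu_find_device(viommu, vdev_id);
        if (dev) {
                /*
                 * dev stays valid until the unlock below: issue the HW
                 * device-cache (e.g. ATC) invalidation for it here.
                 */
                rc = 0;
        }
        iommufd_viommu_unlock_vdev_id(viommu);
        return rc;
}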
On Tue, Aug 27, 2024 at 09:59:47AM -0700, Nicolin Chen wrote:
Driver can call the iommufd_viommu_find_device() to find a device pointer using its per-viommu virtual ID. The returned device must be protected by the pair of iommufd_viommu_lock/unlock_vdev_id() function.
Put these three functions into a new viommu_api file, to build it with the IOMMUFD_DRIVER config.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
 drivers/iommu/iommufd/Makefile     |  2 +-
 drivers/iommu/iommufd/viommu_api.c | 39 ++++++++++++++++++++++++++++++
 include/linux/iommufd.h            | 16 ++++++++++++
 3 files changed, 56 insertions(+), 1 deletion(-)
 create mode 100644 drivers/iommu/iommufd/viommu_api.c
I still think this is better to just share the struct content with the driver, eventually we want to do this anyhow as the driver will want to use container_of() techniques to reach its private data.
+/*
+ * Find a device attached to a VIOMMU object using a virtual device ID that was
+ * set via an IOMMUFD_CMD_VIOMMU_SET_VDEV_ID. Callers of this function must call
+ * iommufd_viommu_lock_vdev_id() prior and iommufd_viommu_unlock_vdev_id() after
+ *
+ * Return device or NULL.
+ */
+struct device *iommufd_viommu_find_device(struct iommufd_viommu *viommu, u64 id)
+{
+	struct iommufd_vdev_id *vdev_id;
+
+	lockdep_assert_held(&viommu->vdev_ids_rwsem);
+
+	xa_lock(&viommu->vdev_ids);
+	vdev_id = xa_load(&viommu->vdev_ids, (unsigned long)id);
+	xa_unlock(&viommu->vdev_ids);
No need for this lock, xa_load is rcu safe against concurrent writer
Jason
On Thu, Sep 05, 2024 at 01:14:15PM -0300, Jason Gunthorpe wrote:
On Tue, Aug 27, 2024 at 09:59:47AM -0700, Nicolin Chen wrote:
Driver can call the iommufd_viommu_find_device() to find a device pointer using its per-viommu virtual ID. The returned device must be protected by the pair of iommufd_viommu_lock/unlock_vdev_id() function.
Put these three functions into a new viommu_api file, to build it with the IOMMUFD_DRIVER config.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
drivers/iommu/iommufd/Makefile | 2 +- drivers/iommu/iommufd/viommu_api.c | 39 ++++++++++++++++++++++++++++++ include/linux/iommufd.h | 16 ++++++++++++ 3 files changed, 56 insertions(+), 1 deletion(-) create mode 100644 drivers/iommu/iommufd/viommu_api.c
I still think this is better to just share the struct content with the driver, eventually we want to do this anyhow as the driver will want to use container_of() techniques to reach its private data.
In my mind, exposing everything to the driver is something that we have to do (for driver-managed structures) v.s. something we want to do... Even in that case, a driver actually only needs to know the size of the core structure, without touching what's inside(?).
I am a bit worried that drivers would abuse the content in the core-level structure.. Providing a set of APIs would encourage them to keep the core structure intact, hopefully..
+/*
+ * Find a device attached to a VIOMMU object using a virtual device ID that was
+ * set via an IOMMUFD_CMD_VIOMMU_SET_VDEV_ID. Callers of this function must call
+ * iommufd_viommu_lock_vdev_id() prior and iommufd_viommu_unlock_vdev_id() after
+ *
+ * Return device or NULL.
+ */
+struct device *iommufd_viommu_find_device(struct iommufd_viommu *viommu, u64 id)
+{
+	struct iommufd_vdev_id *vdev_id;
+
+	lockdep_assert_held(&viommu->vdev_ids_rwsem);
+
+	xa_lock(&viommu->vdev_ids);
+	vdev_id = xa_load(&viommu->vdev_ids, (unsigned long)id);
+	xa_unlock(&viommu->vdev_ids);
No need for this lock, xa_load is rcu safe against concurrent writer
I see iommufd's device.c and main.c grab xa_lock before xa_load?
Thanks Nicolin
On Thu, Sep 05, 2024 at 10:53:31AM -0700, Nicolin Chen wrote:
On Thu, Sep 05, 2024 at 01:14:15PM -0300, Jason Gunthorpe wrote:
On Tue, Aug 27, 2024 at 09:59:47AM -0700, Nicolin Chen wrote:
Driver can call the iommufd_viommu_find_device() to find a device pointer using its per-viommu virtual ID. The returned device must be protected by the pair of iommufd_viommu_lock/unlock_vdev_id() function.
Put these three functions into a new viommu_api file, to build it with the IOMMUFD_DRIVER config.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
drivers/iommu/iommufd/Makefile | 2 +- drivers/iommu/iommufd/viommu_api.c | 39 ++++++++++++++++++++++++++++++ include/linux/iommufd.h | 16 ++++++++++++ 3 files changed, 56 insertions(+), 1 deletion(-) create mode 100644 drivers/iommu/iommufd/viommu_api.c
I still think this is better to just share the struct content with the driver, eventually we want to do this anyhow as the driver will want to use container_of() techniques to reach its private data.
In my mind, exposing everything to the driver is something that we have to (for driver-managed structures) v.s. we want to... Even in that case, a driver actually only need to know the size of the core structure, without touching what's inside(?).
I am a bit worried that drivers would abuse the content in the core-level structure.. Providing a set of API would encourage them to keep the core structure intact, hopefully..
This is always a tension in the kernel. If the core apis can be nice and tidy then it is a reasonable direction
But here I think we've crossed some threshold where the APIs are complex, want to be inlined, and really we just want to expose data, not APIs, to drivers.
No need for this lock, xa_load is rcu safe against concurrent writer
I see iommufd's device.c and main.c grab xa_lock before xa_load?
That is not to protect the xa_load; it is to protect the lifetime of the pointer it returns
Jason
On Wed, Sep 11, 2024 at 08:11:03PM -0300, Jason Gunthorpe wrote:
On Thu, Sep 05, 2024 at 10:53:31AM -0700, Nicolin Chen wrote:
On Thu, Sep 05, 2024 at 01:14:15PM -0300, Jason Gunthorpe wrote:
On Tue, Aug 27, 2024 at 09:59:47AM -0700, Nicolin Chen wrote:
Driver can call the iommufd_viommu_find_device() to find a device pointer using its per-viommu virtual ID. The returned device must be protected by the pair of iommufd_viommu_lock/unlock_vdev_id() function.
Put these three functions into a new viommu_api file, to build it with the IOMMUFD_DRIVER config.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
drivers/iommu/iommufd/Makefile | 2 +- drivers/iommu/iommufd/viommu_api.c | 39 ++++++++++++++++++++++++++++++ include/linux/iommufd.h | 16 ++++++++++++ 3 files changed, 56 insertions(+), 1 deletion(-) create mode 100644 drivers/iommu/iommufd/viommu_api.c
I still think this is better to just share the struct content with the driver, eventually we want to do this anyhow as the driver will want to use container_of() techniques to reach its private data.
In my mind, exposing everything to the driver is something that we have to (for driver-managed structures) v.s. we want to... Even in that case, a driver actually only need to know the size of the core structure, without touching what's inside(?).
I am a bit worried that drivers would abuse the content in the core-level structure.. Providing a set of API would encourage them to keep the core structure intact, hopefully..
This is always a tension in the kernel. If the core apis can be nice and tidy then it is a reasonable direction
But here I think we've crossed some threshold where the APIs are complex, want to be inlined, and really we just want to expose data, not APIs, to drivers.
OK. I'll think of a rework. And might need another justification for a DEFAULT type of vIOMMU object to fit in.
No need for this lock, xa_load is rcu safe against concurrent writer
I see iommufd's device.c and main.c grab xa_lock before xa_load?
That is not to protect the xa_load, it is to protect the lifetime of pointer it returns
I see. I'd drop it.
Thanks Nicolin
On Wed, Sep 11, 2024 at 08:17:22PM -0700, Nicolin Chen wrote:
On Wed, Sep 11, 2024 at 08:11:03PM -0300, Jason Gunthorpe wrote:
On Thu, Sep 05, 2024 at 10:53:31AM -0700, Nicolin Chen wrote:
On Thu, Sep 05, 2024 at 01:14:15PM -0300, Jason Gunthorpe wrote:
On Tue, Aug 27, 2024 at 09:59:47AM -0700, Nicolin Chen wrote:
Driver can call the iommufd_viommu_find_device() to find a device pointer using its per-viommu virtual ID. The returned device must be protected by the pair of iommufd_viommu_lock/unlock_vdev_id() function.
Put these three functions into a new viommu_api file, to build it with the IOMMUFD_DRIVER config.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
drivers/iommu/iommufd/Makefile | 2 +- drivers/iommu/iommufd/viommu_api.c | 39 ++++++++++++++++++++++++++++++ include/linux/iommufd.h | 16 ++++++++++++ 3 files changed, 56 insertions(+), 1 deletion(-) create mode 100644 drivers/iommu/iommufd/viommu_api.c
I still think it is better to just share the struct content with the driver; eventually we want to do this anyhow, as the driver will want to use container_of() techniques to reach its private data.
In my mind, exposing everything to the driver is something that we have to do (for driver-managed structures) vs. something we want to do... Even in that case, a driver actually only needs to know the size of the core structure, without touching what's inside(?).
I am a bit worried that drivers would abuse the content in the core-level structure.. Providing a set of APIs would encourage them to keep the core structure intact, hopefully..
This is always a tension in the kernel. If the core APIs can be nice and tidy, then it is a reasonable direction.
But here I think we've crossed some threshold where the APIs are complex, want to be inlined, and really we just want to expose data, not APIs, to drivers.
OK. I'll think of a rework. And might need another justification for a DEFAULT type of vIOMMU object to fit in.
I tried exposing the struct iommufd_viommu to drivers, and was able to drop a couple of helpers, except these two:
struct device *vdev_to_dev(struct iommufd_vdevice *vdev)
{
	return vdev ? vdev->idev->dev : NULL;
}
// Without it, we need to expose struct iommufd_device.

struct iommu_domain *
iommufd_viommu_to_parent_domain(struct iommufd_viommu *viommu)
{
	if (!viommu || !viommu->hwpt)
		return NULL;
	return viommu->hwpt->common.domain;
}
// Without it, we need to expose struct iommufd_hwpt_paging.
Thanks Nicolin
On Fri, Oct 04, 2024 at 10:19:43PM -0700, Nicolin Chen wrote:
I tried exposing the struct iommufd_viommu to drivers, and was able to drop a couple of helpers, except these two:
struct device *vdev_to_dev(struct iommufd_vdevice *vdev)
{
	return vdev ? vdev->idev->dev : NULL;
}
// Without it, we need to expose struct iommufd_device.

struct iommu_domain *
iommufd_viommu_to_parent_domain(struct iommufd_viommu *viommu)
{
	if (!viommu || !viommu->hwpt)
		return NULL;
	return viommu->hwpt->common.domain;
}
// Without it, we need to expose struct iommufd_hwpt_paging.
It seems OK, there aren't really locking entanglements or a performance path on this stuff?
Jason
On Mon, Oct 07, 2024 at 12:38:37PM -0300, Jason Gunthorpe wrote:
On Fri, Oct 04, 2024 at 10:19:43PM -0700, Nicolin Chen wrote:
I tried exposing the struct iommufd_viommu to drivers, and was able to drop a couple of helpers, except these two:
struct device *vdev_to_dev(struct iommufd_vdevice *vdev)
{
	return vdev ? vdev->idev->dev : NULL;
}
// Without it, we need to expose struct iommufd_device.

struct iommu_domain *
iommufd_viommu_to_parent_domain(struct iommufd_viommu *viommu)
{
	if (!viommu || !viommu->hwpt)
		return NULL;
	return viommu->hwpt->common.domain;
}
// Without it, we need to expose struct iommufd_hwpt_paging.
It seems OK, there aren't really locking entanglements or a performance path on this stuff?
-----
The typical use case of the first one is like:

	dev = vdev_to_dev(xa_load(&viommu->vdevs, (unsigned long)vdev_id));

so I am asking for:

	/* Caller should lock via viommu->vdevs_rwsem with proper permission */
-----
And for the second one:

	/*
	 * Convert a viommu to the encapsulated nesting parent domain. A caller must be
	 * aware of the life cycle of the viommu pointer: only call this function in a
	 * callback function of viommu_alloc or a viommu op.
	 */
-----
Thanks Nicolin
On Mon, Oct 07, 2024 at 09:36:18AM -0700, Nicolin Chen wrote:
On Mon, Oct 07, 2024 at 12:38:37PM -0300, Jason Gunthorpe wrote:
On Fri, Oct 04, 2024 at 10:19:43PM -0700, Nicolin Chen wrote:
I tried exposing the struct iommufd_viommu to drivers, and was able to drop a couple of helpers, except these two:
struct device *vdev_to_dev(struct iommufd_vdevice *vdev)
{
	return vdev ? vdev->idev->dev : NULL;
}
// Without it, we need to expose struct iommufd_device.

struct iommu_domain *
iommufd_viommu_to_parent_domain(struct iommufd_viommu *viommu)
{
	if (!viommu || !viommu->hwpt)
		return NULL;
	return viommu->hwpt->common.domain;
}
// Without it, we need to expose struct iommufd_hwpt_paging.
It seems OK, there aren't really locking entanglements or a performance path on this stuff?
The typical use case of the first one is like:

	dev = vdev_to_dev(xa_load(&viommu->vdevs, (unsigned long)vdev_id));

so I am asking for:

	/* Caller should lock via viommu->vdevs_rwsem with proper permission */
Why would vdev_to_dev need that locking? The viommu cannot change hwpt during its lifecycle?
Jason
On Mon, Oct 07, 2024 at 02:11:19PM -0300, Jason Gunthorpe wrote:
On Mon, Oct 07, 2024 at 09:36:18AM -0700, Nicolin Chen wrote:
On Mon, Oct 07, 2024 at 12:38:37PM -0300, Jason Gunthorpe wrote:
On Fri, Oct 04, 2024 at 10:19:43PM -0700, Nicolin Chen wrote:
I tried exposing the struct iommufd_viommu to drivers, and was able to drop a couple of helpers, except these two:
struct device *vdev_to_dev(struct iommufd_vdevice *vdev)
{
	return vdev ? vdev->idev->dev : NULL;
}
// Without it, we need to expose struct iommufd_device.

struct iommu_domain *
iommufd_viommu_to_parent_domain(struct iommufd_viommu *viommu)
{
	if (!viommu || !viommu->hwpt)
		return NULL;
	return viommu->hwpt->common.domain;
}
// Without it, we need to expose struct iommufd_hwpt_paging.
It seems OK, there aren't really locking entanglements or a performance path on this stuff?
The typical use case of the first one is like:

	dev = vdev_to_dev(xa_load(&viommu->vdevs, (unsigned long)vdev_id));

so I am asking for:

	/* Caller should lock via viommu->vdevs_rwsem with proper permission */
Why would vdev_to_dev need that locking? The viommu cannot change hwpt during its lifecycle?
This is for vdev/dev vs. hwpt. We need the lock for the viommu's vdev xarray.
Yet, on second thought, I feel the lock would be useless if a driver tries to refer to the returned vdev (via this helper) after the vdev object is destroyed..
We could only add a similar note that the caller must be aware of the life cycle of the vdev itself..
Nicolin
On Mon, Oct 07, 2024 at 10:25:01AM -0700, Nicolin Chen wrote:
This is for vdev/dev vs. hwpt. We need the lock for the viommu's vdev xarray.
Yet, on second thought, I feel the lock would be useless if a driver tries to refer to the returned vdev (via this helper) after the vdev object is destroyed..
We could only add a similar note that the caller must be aware of the life cycle of the vdev itself..
Yes, I imagined you'd use the xa_lock for this and it solves both problems at once.
Jason
On Mon, Oct 07, 2024 at 03:28:16PM -0300, Jason Gunthorpe wrote:
On Mon, Oct 07, 2024 at 10:25:01AM -0700, Nicolin Chen wrote:
This is for vdev/dev vs. hwpt. We need the lock for the viommu's vdev xarray.
Yet, on second thought, I feel the lock would be useless if a driver tries to refer to the returned vdev (via this helper) after the vdev object is destroyed..
We could only add a similar note that the caller must be aware of the life cycle of the vdev itself..
Yes, I imagined you'd use the xa_lock for this and it solves both problems at once.
Ah! We don't even need a rwsem then..
Thanks! Nicolin
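For reference, a sketch of the xa_lock-based lookup the exchange above converges on (hypothetical caller; xa_lock/xa_unlock are the standard xarray spinlock helpers, and do_something() is a placeholder):

	struct iommufd_vdevice *vdev;

	xa_lock(&viommu->vdevs);
	vdev = xa_load(&viommu->vdevs, (unsigned long)vdev_id);
	/* vdev (and vdev->idev->dev) is only stable while xa_lock is held */
	if (vdev)
		do_something(vdev->idev->dev);
	xa_unlock(&viommu->vdevs);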
From: Jason Gunthorpe jgg@nvidia.com
The iommu_copy_struct_from_user_array helper can be used to copy a single entry from a user array, which might not be efficient if the array is big.
Add a new iommu_copy_struct_from_full_user_array to copy the entire user array at once. Update the existing iommu_copy_struct_from_user_array kdoc accordingly.
Signed-off-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- include/linux/iommu.h | 49 ++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 48 insertions(+), 1 deletion(-)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 8c1034cc3f7e..556b6d6cf2a8 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -493,7 +493,9 @@ static inline int __iommu_copy_struct_from_user_array(
  * @index: Index to the location in the array to copy user data from
  * @min_last: The last member of the data structure @kdst points in the
  *            initial version.
- * Return 0 for success, otherwise -error.
+ *
+ * Copy a single entry from a user array. Return 0 for success, otherwise
+ * -error.
  */
 #define iommu_copy_struct_from_user_array(kdst, user_array, data_type, index, \
 					  min_last)                            \
@@ -501,6 +503,52 @@ static inline int __iommu_copy_struct_from_user_array(
 		kdst, user_array, data_type, index, sizeof(*(kdst)),          \
 		offsetofend(typeof(*(kdst)), min_last))
+
+/**
+ * iommu_copy_struct_from_full_user_array - Copy iommu driver specific user
+ *                                          space data from an iommu_user_data_array
+ * @kdst: Pointer to an iommu driver specific user data that is defined in
+ *        include/uapi/linux/iommufd.h
+ * @kdst_entry_size: sizeof(*kdst)
+ * @user_array: Pointer to a struct iommu_user_data_array for a user space
+ *              array
+ * @data_type: The data type of the @kdst. Must match with @user_array->type
+ *
+ * Copy the entire user array. kdst must have room for kdst_entry_size *
+ * user_array->entry_num bytes. Return 0 for success, otherwise -error.
+ */
+static inline int
+iommu_copy_struct_from_full_user_array(void *kdst, size_t kdst_entry_size,
+				       struct iommu_user_data_array *user_array,
+				       unsigned int data_type)
+{
+	unsigned int i;
+	int ret;
+
+	if (user_array->type != data_type)
+		return -EINVAL;
+	if (!user_array->entry_num)
+		return -EINVAL;
+	if (likely(user_array->entry_len == kdst_entry_size)) {
+		if (copy_from_user(kdst, user_array->uptr,
+				   user_array->entry_num *
+					   user_array->entry_len))
+			return -EFAULT;
+		return 0;
+	}
+
+	/* Copy item by item */
+	for (i = 0; i != user_array->entry_num; i++) {
+		ret = copy_struct_from_user(
+			kdst + kdst_entry_size * i, kdst_entry_size,
+			user_array->uptr + user_array->entry_len * i,
+			user_array->entry_len);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
 /**
  * struct iommu_ops - iommu ops and capabilities
  * @capable: check capability
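A hedged sketch of the expected driver-side calling pattern (it mirrors the mock and SMMUv3 users later in the series; the struct and data-type names are hypothetical placeholders):

	struct my_drv_invalidate *cmds;	/* hypothetical driver uAPI struct */
	int rc;

	cmds = kcalloc(array->entry_num, sizeof(*cmds), GFP_KERNEL);
	if (!cmds)
		return -ENOMEM;
	/* Pulls in all entries at once; also validates array->type */
	rc = iommu_copy_struct_from_full_user_array(cmds, sizeof(*cmds), array,
						    MY_DRV_INVALIDATE_DATA_TYPE);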
Similar to the coverage of cache_invalidate_user for iotlb invalidation, add a device cache and an invalidation op to test the IOMMU_VIOMMU_INVALIDATE ioctl.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/iommufd_test.h | 25 +++++++++ drivers/iommu/iommufd/selftest.c | 76 +++++++++++++++++++++++++++- 2 files changed, 100 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h index f4bc23a92f9a..368076da10ca 100644 --- a/drivers/iommu/iommufd/iommufd_test.h +++ b/drivers/iommu/iommufd/iommufd_test.h @@ -54,6 +54,11 @@ enum { MOCK_NESTED_DOMAIN_IOTLB_NUM = 4, };
+enum { + MOCK_DEV_CACHE_ID_MAX = 3, + MOCK_DEV_CACHE_NUM = 4, +}; + struct iommu_test_cmd { __u32 size; __u32 op; @@ -152,6 +157,7 @@ struct iommu_test_hw_info { /* Should not be equal to any defined value in enum iommu_hwpt_data_type */ #define IOMMU_HWPT_DATA_SELFTEST 0xdead #define IOMMU_TEST_IOTLB_DEFAULT 0xbadbeef +#define IOMMU_TEST_DEV_CACHE_DEFAULT 0xbaddad
/** * struct iommu_hwpt_selftest @@ -180,4 +186,23 @@ struct iommu_hwpt_invalidate_selftest { __u32 iotlb_id; };
+/* Should not be equal to any defined value in enum iommu_viommu_invalidate_data_type */ +#define IOMMU_VIOMMU_INVALIDATE_DATA_SELFTEST 0xdeadbeef +#define IOMMU_VIOMMU_INVALIDATE_DATA_SELFTEST_INVALID 0xdadbeef + +/** + * struct iommu_viommu_invalidate_selftest - Invalidation data for Mock VIOMMU + * (IOMMU_VIOMMU_INVALIDATE_DATA_SELFTEST) + * @flags: Invalidate flags + * @cache_id: Invalidate cache entry index + * + * If IOMMU_TEST_INVALIDATE_ALL is set in @flags, @cache_id will be ignored + */ +struct iommu_viommu_invalidate_selftest { +#define IOMMU_TEST_INVALIDATE_FLAG_ALL (1 << 0) + __u32 flags; + __u32 vdev_id; + __u32 cache_id; +}; + #endif diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index 4a23530ea027..8abffc7794c8 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -139,6 +139,7 @@ struct mock_dev { struct device dev; unsigned long flags; int id; + u32 cache[MOCK_DEV_CACHE_NUM]; };
struct selftest_obj { @@ -540,6 +541,74 @@ static int mock_dev_disable_feat(struct device *dev, enum iommu_dev_features fea return 0; }
+static int mock_viommu_cache_invalidate(struct iommufd_viommu *viommu, + struct iommu_user_data_array *array) +{ + struct iommu_viommu_invalidate_selftest *cmds; + struct iommu_viommu_invalidate_selftest *cur; + struct iommu_viommu_invalidate_selftest *end; + int rc; + + /* A zero-length array is allowed to validate the array type */ + if (array->entry_num == 0 && + array->type == IOMMU_VIOMMU_INVALIDATE_DATA_SELFTEST) { + array->entry_num = 0; + return 0; + } + + cmds = kcalloc(array->entry_num, sizeof(*cmds), GFP_KERNEL); + if (!cmds) + return -ENOMEM; + cur = cmds; + end = cmds + array->entry_num; + + static_assert(sizeof(*cmds) == 3 * sizeof(u32)); + rc = iommu_copy_struct_from_full_user_array( + cmds, sizeof(*cmds), array, + IOMMU_VIOMMU_INVALIDATE_DATA_SELFTEST); + if (rc) + goto out; + + iommufd_viommu_lock_vdev_id(viommu); + while (cur != end) { + struct mock_dev *mdev; + struct device *dev; + int i; + + if (cur->flags & ~IOMMU_TEST_INVALIDATE_FLAG_ALL) { + rc = -EOPNOTSUPP; + goto out; + } + + if (cur->cache_id > MOCK_DEV_CACHE_ID_MAX) { + rc = -EINVAL; + goto out; + } + + dev = iommufd_viommu_find_device(viommu, cur->vdev_id); + if (!dev) { + rc = -EINVAL; + goto out; + } + mdev = container_of(dev, struct mock_dev, dev); + + if (cur->flags & IOMMU_TEST_INVALIDATE_FLAG_ALL) { + /* Invalidate all cache entries and ignore cache_id */ + for (i = 0; i < MOCK_DEV_CACHE_NUM; i++) + mdev->cache[i] = 0; + } else { + mdev->cache[cur->cache_id] = 0; + } + + cur++; + } +out: + iommufd_viommu_unlock_vdev_id(viommu); + array->entry_num = cur - cmds; + kfree(cmds); + return rc; +} + static const struct iommu_ops mock_ops = { /* * IOMMU_DOMAIN_BLOCKED cannot be returned from def_domain_type() @@ -566,6 +635,9 @@ static const struct iommu_ops mock_ops = { .map_pages = mock_domain_map_pages, .unmap_pages = mock_domain_unmap_pages, .iova_to_phys = mock_domain_iova_to_phys, + .default_viommu_ops = &(struct iommufd_viommu_ops){ + .cache_invalidate = mock_viommu_cache_invalidate, + }, }, };
@@ -691,7 +763,7 @@ static void mock_dev_release(struct device *dev) static struct mock_dev *mock_dev_create(unsigned long dev_flags) { struct mock_dev *mdev; - int rc; + int rc, i;
if (dev_flags & ~(MOCK_FLAGS_DEVICE_NO_DIRTY | MOCK_FLAGS_DEVICE_HUGE_IOVA)) @@ -705,6 +777,8 @@ static struct mock_dev *mock_dev_create(unsigned long dev_flags) mdev->flags = dev_flags; mdev->dev.release = mock_dev_release; mdev->dev.bus = &iommufd_mock_bus_type.bus; + for (i = 0; i < MOCK_DEV_CACHE_NUM; i++) + mdev->cache[i] = IOMMU_TEST_DEV_CACHE_DEFAULT;
rc = ida_alloc(&mock_dev_ida, GFP_KERNEL); if (rc < 0)
Similar to IOMMU_TEST_OP_MD_CHECK_IOTLB verifying a mock_domain's iotlb, IOMMU_TEST_OP_DEV_CHECK_CACHE will be used to verify a mock_dev's cache.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/iommufd_test.h | 5 ++++ drivers/iommu/iommufd/selftest.c | 24 +++++++++++++++++++ tools/testing/selftests/iommu/iommufd.c | 7 +++++- tools/testing/selftests/iommu/iommufd_utils.h | 24 +++++++++++++++++++ 4 files changed, 59 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h index 368076da10ca..56bade6146ff 100644 --- a/drivers/iommu/iommufd/iommufd_test.h +++ b/drivers/iommu/iommufd/iommufd_test.h @@ -23,6 +23,7 @@ enum { IOMMU_TEST_OP_DIRTY, IOMMU_TEST_OP_MD_CHECK_IOTLB, IOMMU_TEST_OP_TRIGGER_IOPF, + IOMMU_TEST_OP_DEV_CHECK_CACHE, };
enum { @@ -140,6 +141,10 @@ struct iommu_test_cmd { __u32 perm; __u64 addr; } trigger_iopf; + struct { + __u32 id; + __u32 cache; + } check_dev_cache; }; __u32 last; }; diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index 8abffc7794c8..f512874105ac 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -1035,6 +1035,26 @@ static int iommufd_test_md_check_iotlb(struct iommufd_ucmd *ucmd, return rc; }
+static int iommufd_test_dev_check_cache(struct iommufd_ucmd *ucmd, + u32 idev_id, unsigned int cache_id, + u32 cache) +{ + struct iommufd_device *idev; + struct mock_dev *mdev; + int rc = 0; + + idev = iommufd_get_device(ucmd, idev_id); + if (IS_ERR(idev)) + return PTR_ERR(idev); + mdev = container_of(idev->dev, struct mock_dev, dev); + + if (cache_id > MOCK_DEV_CACHE_ID_MAX || + mdev->cache[cache_id] != cache) + rc = -EINVAL; + iommufd_put_object(ucmd->ictx, &idev->obj); + return rc; +} + struct selftest_access { struct iommufd_access *access; struct file *file; @@ -1545,6 +1565,10 @@ int iommufd_test(struct iommufd_ucmd *ucmd) return iommufd_test_md_check_iotlb(ucmd, cmd->id, cmd->check_iotlb.id, cmd->check_iotlb.iotlb); + case IOMMU_TEST_OP_DEV_CHECK_CACHE: + return iommufd_test_dev_check_cache(ucmd, cmd->id, + cmd->check_dev_cache.id, + cmd->check_dev_cache.cache); case IOMMU_TEST_OP_CREATE_ACCESS: return iommufd_test_create_access(ucmd, cmd->id, cmd->create_access.flags); diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c index 12b5a8f78d4b..1b45445dbd53 100644 --- a/tools/testing/selftests/iommu/iommufd.c +++ b/tools/testing/selftests/iommu/iommufd.c @@ -220,6 +220,8 @@ FIXTURE_SETUP(iommufd_ioas) for (i = 0; i != variant->mock_domains; i++) { test_cmd_mock_domain(self->ioas_id, &self->stdev_id, &self->hwpt_id, &self->device_id); + test_cmd_dev_check_cache_all(self->device_id, + IOMMU_TEST_DEV_CACHE_DEFAULT); self->base_iova = MOCK_APERTURE_START; } } @@ -1442,9 +1444,12 @@ FIXTURE_SETUP(iommufd_mock_domain)
ASSERT_GE(ARRAY_SIZE(self->hwpt_ids), variant->mock_domains);
-	for (i = 0; i != variant->mock_domains; i++)
+	for (i = 0; i != variant->mock_domains; i++) {
 		test_cmd_mock_domain(self->ioas_id, &self->stdev_ids[i],
 				     &self->hwpt_ids[i], &self->idev_ids[i]);
+		test_cmd_dev_check_cache_all(self->idev_ids[i],
+					     IOMMU_TEST_DEV_CACHE_DEFAULT);
+	}
 	self->hwpt_id = self->hwpt_ids[0];
self->mmap_flags = MAP_SHARED | MAP_ANONYMOUS; diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h index be722ea88358..d697a7aa55c9 100644 --- a/tools/testing/selftests/iommu/iommufd_utils.h +++ b/tools/testing/selftests/iommu/iommufd_utils.h @@ -234,6 +234,30 @@ static int _test_cmd_hwpt_alloc(int fd, __u32 device_id, __u32 pt_id, __u32 ft_i test_cmd_hwpt_check_iotlb(hwpt_id, i, expected); \ })
+#define test_cmd_dev_check_cache(device_id, cache_id, expected) \ + ({ \ + struct iommu_test_cmd test_cmd = { \ + .size = sizeof(test_cmd), \ + .op = IOMMU_TEST_OP_DEV_CHECK_CACHE, \ + .id = device_id, \ + .check_dev_cache = { \ + .id = cache_id, \ + .cache = expected, \ + }, \ + }; \ + ASSERT_EQ(0, \ + ioctl(self->fd, \ + _IOMMU_TEST_CMD(IOMMU_TEST_OP_DEV_CHECK_CACHE),\ + &test_cmd)); \ + }) + +#define test_cmd_dev_check_cache_all(device_id, expected) \ + ({ \ + int c; \ + for (c = 0; c < MOCK_DEV_CACHE_NUM; c++) \ + test_cmd_dev_check_cache(device_id, c, expected); \ + }) + static int _test_cmd_hwpt_invalidate(int fd, __u32 hwpt_id, void *reqs, uint32_t data_type, uint32_t lreq, uint32_t *nreqs)
Add a viommu_cache test function to cover VIOMMU invalidations using the updated IOMMU_VIOMMU_INVALIDATE ioctl, with similar positive and negative cases to the existing iotlb ones.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- tools/testing/selftests/iommu/iommufd.c | 190 ++++++++++++++++++ tools/testing/selftests/iommu/iommufd_utils.h | 32 +++ 2 files changed, 222 insertions(+)
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c index 1b45445dbd53..6f1014cc208b 100644 --- a/tools/testing/selftests/iommu/iommufd.c +++ b/tools/testing/selftests/iommu/iommufd.c @@ -614,6 +614,196 @@ TEST_F(iommufd_ioas, viommu_default) } }
+TEST_F(iommufd_ioas, viommu_dev_cache)
+{
+	struct iommu_viommu_invalidate_selftest inv_reqs[2] = {};
+	struct iommu_hwpt_selftest data = {
+		.iotlb = IOMMU_TEST_IOTLB_DEFAULT,
+	};
+	uint32_t nested_hwpt_id = 0, hwpt_id = 0;
+	uint32_t dev_id = self->device_id;
+	uint32_t viommu_id = 0;
+	uint32_t num_inv;
+
+	if (dev_id) {
+		test_cmd_hwpt_alloc(dev_id, self->ioas_id,
+				    IOMMU_HWPT_ALLOC_NEST_PARENT, &hwpt_id);
+		test_cmd_viommu_alloc(dev_id, hwpt_id,
+				      IOMMU_VIOMMU_TYPE_DEFAULT, &viommu_id);
+		test_cmd_hwpt_alloc_nested(self->device_id, viommu_id, 0,
+					   &nested_hwpt_id,
+					   IOMMU_HWPT_DATA_SELFTEST, &data,
+					   sizeof(data));
+		test_cmd_mock_domain_replace(self->stdev_id, nested_hwpt_id);
+		test_cmd_viommu_set_vdev_id(viommu_id, dev_id, 0x99);
+
+		test_cmd_dev_check_cache_all(dev_id,
+					     IOMMU_TEST_DEV_CACHE_DEFAULT);
+
+		/* Check data_type by passing zero-length array */
+		num_inv = 0;
+		test_cmd_viommu_invalidate(viommu_id, inv_reqs,
+					   sizeof(*inv_reqs), &num_inv);
+		assert(!num_inv);
+
+		/* Negative test: Invalid data_type */
+		num_inv = 1;
+		test_err_viommu_invalidate(EINVAL, viommu_id, inv_reqs,
+					   IOMMU_VIOMMU_INVALIDATE_DATA_SELFTEST_INVALID,
+					   sizeof(*inv_reqs), &num_inv);
+		assert(!num_inv);
+
+		/* Negative test: structure size sanity */
+		num_inv = 1;
+		test_err_viommu_invalidate(EINVAL, viommu_id, inv_reqs,
+					   IOMMU_VIOMMU_INVALIDATE_DATA_SELFTEST,
+					   sizeof(*inv_reqs) + 1, &num_inv);
+		assert(!num_inv);
+
+		num_inv = 1;
+		test_err_viommu_invalidate(EINVAL, viommu_id, inv_reqs,
+					   IOMMU_VIOMMU_INVALIDATE_DATA_SELFTEST,
+					   1, &num_inv);
+		assert(!num_inv);
+
+		/* Negative test: invalid flag is passed */
+		num_inv = 1;
+		inv_reqs[0].flags = 0xffffffff;
+		inv_reqs[0].vdev_id = 0x99;
+		test_err_viommu_invalidate(EOPNOTSUPP, viommu_id, inv_reqs,
+					   IOMMU_VIOMMU_INVALIDATE_DATA_SELFTEST,
+					   sizeof(*inv_reqs), &num_inv);
+		assert(!num_inv);
+
+		/* Negative test: invalid data_uptr when array is not empty */
+		num_inv = 1;
+		inv_reqs[0].flags = 0;
+		inv_reqs[0].vdev_id = 0x99;
+		test_err_viommu_invalidate(EINVAL, viommu_id, NULL,
+					   IOMMU_VIOMMU_INVALIDATE_DATA_SELFTEST,
+					   sizeof(*inv_reqs), &num_inv);
+		assert(!num_inv);
+
+		/* Negative test: invalid entry_len when array is not empty */
+		num_inv = 1;
+		inv_reqs[0].flags = 0;
+		inv_reqs[0].vdev_id = 0x99;
+		test_err_viommu_invalidate(EINVAL, viommu_id, inv_reqs,
+					   IOMMU_VIOMMU_INVALIDATE_DATA_SELFTEST,
+					   0, &num_inv);
+		assert(!num_inv);
+
+		/* Negative test: invalid cache_id */
+		num_inv = 1;
+		inv_reqs[0].flags = 0;
+		inv_reqs[0].vdev_id = 0x99;
+		inv_reqs[0].cache_id = MOCK_DEV_CACHE_ID_MAX + 1;
+		test_err_viommu_invalidate(EINVAL, viommu_id, inv_reqs,
+					   IOMMU_VIOMMU_INVALIDATE_DATA_SELFTEST,
+					   sizeof(*inv_reqs), &num_inv);
+		assert(!num_inv);
+
+		/* Negative test: invalid vdev_id */
+		num_inv = 1;
+		inv_reqs[0].flags = 0;
+		inv_reqs[0].vdev_id = 0x9;
+		inv_reqs[0].cache_id = 0;
+		test_err_viommu_invalidate(EINVAL, viommu_id, inv_reqs,
+					   IOMMU_VIOMMU_INVALIDATE_DATA_SELFTEST,
+					   sizeof(*inv_reqs), &num_inv);
+		assert(!num_inv);
+
+		/*
+		 * Invalidate the 1st cache entry but fail the 2nd request
+		 * due to invalid flags configuration in the 2nd request.
+		 */
+		num_inv = 2;
+		inv_reqs[0].flags = 0;
+		inv_reqs[0].vdev_id = 0x99;
+		inv_reqs[0].cache_id = 0;
+		inv_reqs[1].flags = 0xffffffff;
+		inv_reqs[1].vdev_id = 0x99;
+		inv_reqs[1].cache_id = 1;
+		test_err_viommu_invalidate(EOPNOTSUPP, viommu_id, inv_reqs,
+					   IOMMU_VIOMMU_INVALIDATE_DATA_SELFTEST,
+					   sizeof(*inv_reqs), &num_inv);
+		assert(num_inv == 1);
+		test_cmd_dev_check_cache(dev_id, 0, 0);
+		test_cmd_dev_check_cache(dev_id, 1,
+					 IOMMU_TEST_DEV_CACHE_DEFAULT);
+		test_cmd_dev_check_cache(dev_id, 2,
+					 IOMMU_TEST_DEV_CACHE_DEFAULT);
+		test_cmd_dev_check_cache(dev_id, 3,
+					 IOMMU_TEST_DEV_CACHE_DEFAULT);
+
+		/*
+		 * Invalidate the 1st cache entry but fail the 2nd request
+		 * due to invalid cache_id configuration in the 2nd request.
+		 */
+		num_inv = 2;
+		inv_reqs[0].flags = 0;
+		inv_reqs[0].vdev_id = 0x99;
+		inv_reqs[0].cache_id = 0;
+		inv_reqs[1].flags = 0;
+		inv_reqs[1].vdev_id = 0x99;
+		inv_reqs[1].cache_id = MOCK_DEV_CACHE_ID_MAX + 1;
+		test_err_viommu_invalidate(EINVAL, viommu_id, inv_reqs,
+					   IOMMU_VIOMMU_INVALIDATE_DATA_SELFTEST,
+					   sizeof(*inv_reqs), &num_inv);
+		assert(num_inv == 1);
+		test_cmd_dev_check_cache(dev_id, 0, 0);
+		test_cmd_dev_check_cache(dev_id, 1,
+					 IOMMU_TEST_DEV_CACHE_DEFAULT);
+		test_cmd_dev_check_cache(dev_id, 2,
+					 IOMMU_TEST_DEV_CACHE_DEFAULT);
+		test_cmd_dev_check_cache(dev_id, 3,
+					 IOMMU_TEST_DEV_CACHE_DEFAULT);
+
+		/* Invalidate the 2nd cache entry and verify */
+		num_inv = 1;
+		inv_reqs[0].flags = 0;
+		inv_reqs[0].vdev_id = 0x99;
+		inv_reqs[0].cache_id = 1;
+		test_cmd_viommu_invalidate(viommu_id, inv_reqs,
+					   sizeof(*inv_reqs), &num_inv);
+		assert(num_inv == 1);
+		test_cmd_dev_check_cache(dev_id, 0, 0);
+		test_cmd_dev_check_cache(dev_id, 1, 0);
+		test_cmd_dev_check_cache(dev_id, 2,
+					 IOMMU_TEST_DEV_CACHE_DEFAULT);
+		test_cmd_dev_check_cache(dev_id, 3,
+					 IOMMU_TEST_DEV_CACHE_DEFAULT);
+
+		/* Invalidate the 3rd and 4th cache entries and verify */
+		num_inv = 2;
+		inv_reqs[0].flags = 0;
+		inv_reqs[0].vdev_id = 0x99;
+		inv_reqs[0].cache_id = 2;
+		inv_reqs[1].flags = 0;
+		inv_reqs[1].vdev_id = 0x99;
+		inv_reqs[1].cache_id = 3;
+		test_cmd_viommu_invalidate(viommu_id, inv_reqs,
+					   sizeof(*inv_reqs), &num_inv);
+		assert(num_inv == 2);
+		test_cmd_dev_check_cache_all(dev_id, 0);
+
+		/* Invalidate all cache entries for dev_id and verify */
+		num_inv = 1;
+		inv_reqs[0].vdev_id = 0x99;
+		inv_reqs[0].flags = IOMMU_TEST_INVALIDATE_FLAG_ALL;
+		test_cmd_viommu_invalidate(viommu_id, inv_reqs,
+					   sizeof(*inv_reqs), &num_inv);
+		assert(num_inv == 1);
+		test_cmd_dev_check_cache_all(dev_id, 0);
+
+		test_cmd_mock_domain_replace(self->stdev_id, hwpt_id);
+		test_ioctl_destroy(nested_hwpt_id);
+		test_cmd_mock_domain_replace(self->stdev_id, self->ioas_id);
+		test_ioctl_destroy(viommu_id);
+		test_ioctl_destroy(hwpt_id);
+	}
+}
+
 TEST_F(iommufd_ioas, hwpt_attach)
 {
 	/* Create a device attached directly to a hwpt */
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index d697a7aa55c9..0a81827b903f 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -289,6 +289,38 @@ static int _test_cmd_hwpt_invalidate(int fd, __u32 hwpt_id, void *reqs,
 					      data_type, lreq, nreqs));       \
 	})
+static int _test_cmd_viommu_invalidate(int fd, __u32 viommu_id, void *reqs, + uint32_t data_type, uint32_t lreq, + uint32_t *nreqs) +{ + struct iommu_hwpt_invalidate cmd = { + .size = sizeof(cmd), + .hwpt_id = viommu_id, + .data_type = data_type, + .data_uptr = (uint64_t)reqs, + .entry_len = lreq, + .entry_num = *nreqs, + }; + int rc = ioctl(fd, IOMMU_HWPT_INVALIDATE, &cmd); + *nreqs = cmd.entry_num; + return rc; +} + +#define test_cmd_viommu_invalidate(viommu, reqs, lreq, nreqs) \ + ({ \ + ASSERT_EQ(0, \ + _test_cmd_viommu_invalidate(self->fd, viommu, reqs, \ + IOMMU_VIOMMU_INVALIDATE_DATA_SELFTEST, \ + lreq, nreqs)); \ + }) +#define test_err_viommu_invalidate(_errno, viommu_id, reqs, data_type, lreq, \ + nreqs) \ + ({ \ + EXPECT_ERRNO(_errno, _test_cmd_viommu_invalidate( \ + self->fd, viommu_id, reqs, \ + data_type, lreq, nreqs)); \ + }) + static int _test_cmd_access_replace_ioas(int fd, __u32 access_id, unsigned int ioas_id) {
Allow an IOMMU driver to convert a core-managed viommu to its nested parent domain, to access the info that the domain holds.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/iommufd/viommu_api.c | 14 ++++++++++++++ include/linux/iommufd.h | 8 ++++++++ 2 files changed, 22 insertions(+)
diff --git a/drivers/iommu/iommufd/viommu_api.c b/drivers/iommu/iommufd/viommu_api.c
index e0ee592ce834..3772a5892a6c 100644
--- a/drivers/iommu/iommufd/viommu_api.c
+++ b/drivers/iommu/iommufd/viommu_api.c
@@ -37,3 +37,17 @@ struct device *iommufd_viommu_find_device(struct iommufd_viommu *viommu, u64 id)
 	return vdev_id->idev->dev;
 }
 EXPORT_SYMBOL_NS_GPL(iommufd_viommu_find_device, IOMMUFD);
+
+/*
+ * Convert a viommu to its encapsulated nested parent domain. Caller must be
+ * aware of the lifecycle of the viommu pointer. Only call this function in a
+ * callback function where viommu is passed in by the iommu/iommufd core.
+ */
+struct iommu_domain *
+iommufd_viommu_to_parent_domain(struct iommufd_viommu *viommu)
+{
+	if (!viommu || !viommu->hwpt)
+		return NULL;
+	return viommu->hwpt->common.domain;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_viommu_to_parent_domain, IOMMUFD);
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 364f151d281d..f7c265c6de7c 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -92,6 +92,8 @@ int iommufd_vfio_compat_set_no_iommu(struct iommufd_ctx *ictx);
 void iommufd_viommu_lock_vdev_id(struct iommufd_viommu *viommu);
 void iommufd_viommu_unlock_vdev_id(struct iommufd_viommu *viommu);
 struct device *iommufd_viommu_find_device(struct iommufd_viommu *viommu, u64 id);
+struct iommu_domain *
+iommufd_viommu_to_parent_domain(struct iommufd_viommu *viommu);
 #else /* !CONFIG_IOMMUFD */
 static inline struct iommufd_ctx *iommufd_ctx_from_file(struct file *file)
 {
@@ -145,5 +147,11 @@ struct device *iommufd_viommu_find_device(struct iommufd_viommu *viommu, u64 id)
 {
 	return NULL;
 }
+
+static inline struct iommu_domain *
+iommufd_viommu_to_parent_domain(struct iommufd_viommu *viommu)
+{
+	return NULL;
+}
 #endif /* CONFIG_IOMMUFD */
 #endif
Add arm_smmu_cache_invalidate_user() function for user space to invalidate IOTLB entries that are still cached by the hardware.
Add struct iommu_hwpt_arm_smmuv3_invalidate defining an invalidation entry that is simply the native format of a 128-bit TLBI command. Scan commands against the permitted command list and fix their VMID fields.
Co-developed-by: Eric Auger eric.auger@redhat.com Signed-off-by: Eric Auger eric.auger@redhat.com Co-developed-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 115 ++++++++++++++++++++ drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 + include/uapi/linux/iommufd.h | 21 ++++ 3 files changed, 137 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 6d40f1e150cb..a2af693bc7b2 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -3267,10 +3267,117 @@ static void arm_smmu_domain_nested_free(struct iommu_domain *domain) kfree(container_of(domain, struct arm_smmu_nested_domain, domain)); }
+/* + * Convert, in place, the raw invalidation command into an internal format that + * can be passed to arm_smmu_cmdq_issue_cmdlist(). Internally commands are + * stored in CPU endian. + * + * Enforce the VMID on the command. + */ +static int +arm_smmu_convert_user_cmd(struct arm_smmu_domain *s2_parent, + struct iommu_hwpt_arm_smmuv3_invalidate *cmd) +{ + u16 vmid = s2_parent->s2_cfg.vmid; + + cmd->cmd[0] = le64_to_cpu(cmd->cmd[0]); + cmd->cmd[1] = le64_to_cpu(cmd->cmd[1]); + + switch (cmd->cmd[0] & CMDQ_0_OP) { + case CMDQ_OP_TLBI_NSNH_ALL: + /* Convert to NH_ALL */ + cmd->cmd[0] = CMDQ_OP_TLBI_NH_ALL | + FIELD_PREP(CMDQ_TLBI_0_VMID, vmid); + cmd->cmd[1] = 0; + break; + case CMDQ_OP_TLBI_NH_VA: + case CMDQ_OP_TLBI_NH_VAA: + case CMDQ_OP_TLBI_NH_ALL: + case CMDQ_OP_TLBI_NH_ASID: + cmd->cmd[0] &= ~CMDQ_TLBI_0_VMID; + cmd->cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, vmid); + break; + default: + return -EIO; + } + return 0; +} + +static int __arm_smmu_cache_invalidate_user(struct arm_smmu_domain *s2_parent, + struct iommu_user_data_array *array) +{ + struct arm_smmu_device *smmu = s2_parent->smmu; + struct iommu_hwpt_arm_smmuv3_invalidate *last_batch; + struct iommu_hwpt_arm_smmuv3_invalidate *cmds; + struct iommu_hwpt_arm_smmuv3_invalidate *cur; + struct iommu_hwpt_arm_smmuv3_invalidate *end; + struct arm_smmu_cmdq_ent ent; + struct arm_smmu_cmdq *cmdq; + int ret; + + /* A zero-length array is allowed to validate the array type */ + if (array->entry_num == 0 && + array->type == IOMMU_HWPT_INVALIDATE_DATA_ARM_SMMUV3) { + array->entry_num = 0; + return 0; + } + + cmds = kcalloc(array->entry_num, sizeof(*cmds), GFP_KERNEL); + if (!cmds) + return -ENOMEM; + cur = cmds; + end = cmds + array->entry_num; + + static_assert(sizeof(*cmds) == 2 * sizeof(u64)); + ret = iommu_copy_struct_from_full_user_array( + cmds, sizeof(*cmds), array, + IOMMU_HWPT_INVALIDATE_DATA_ARM_SMMUV3); + if (ret) + goto out; + + ent.opcode = cmds->cmd[0] & CMDQ_0_OP; + cmdq = arm_smmu_get_cmdq(smmu, &ent); + + last_batch = cmds; + while (cur != end) { + ret = arm_smmu_convert_user_cmd(s2_parent, cur); + if (ret) + goto out; + + /* FIXME work in blocks of CMDQ_BATCH_ENTRIES and copy each block? */ + cur++; + if (cur != end && (cur - last_batch) != CMDQ_BATCH_ENTRIES - 1) + continue; + + ret = arm_smmu_cmdq_issue_cmdlist(smmu, cmdq, last_batch->cmd, + cur - last_batch, true); + if (ret) { + cur--; + goto out; + } + last_batch = cur; + } +out: + array->entry_num = cur - cmds; + kfree(cmds); + return ret; +} + +static int arm_smmu_cache_invalidate_user(struct iommu_domain *domain, + struct iommu_user_data_array *array) +{ + struct arm_smmu_nested_domain *nested_domain = + container_of(domain, struct arm_smmu_nested_domain, domain); + + return __arm_smmu_cache_invalidate_user( + nested_domain->s2_parent, array); +} + static const struct iommu_domain_ops arm_smmu_nested_ops = { .get_msi_mapping_domain = arm_smmu_get_msi_mapping_domain, .attach_dev = arm_smmu_attach_dev_nested, .free = arm_smmu_domain_nested_free, + .cache_invalidate_user = arm_smmu_cache_invalidate_user, };
static struct iommu_domain * @@ -3298,6 +3405,14 @@ arm_smmu_domain_alloc_nesting(struct device *dev, u32 flags, !(master->smmu->features & ARM_SMMU_FEAT_S2FWB)) return ERR_PTR(-EOPNOTSUPP);
+ /* + * FORCE_SYNC is not set with FEAT_NESTING. Some study of the exact HW + * defect is needed to determine if arm_smmu_cache_invalidate_user() + * needs any change to remove this. + */ + if (WARN_ON(master->smmu->options & ARM_SMMU_OPT_CMDQ_FORCE_SYNC)) + return ERR_PTR(-EOPNOTSUPP); + ret = iommu_copy_struct_from_user(&arg, user_data, IOMMU_HWPT_DATA_ARM_SMMUV3, ste); if (ret) diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index 79afaef18906..6c8ae70c90fe 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -523,6 +523,7 @@ struct arm_smmu_cmdq_ent { #define CMDQ_OP_TLBI_NH_ALL 0x10 #define CMDQ_OP_TLBI_NH_ASID 0x11 #define CMDQ_OP_TLBI_NH_VA 0x12 + #define CMDQ_OP_TLBI_NH_VAA 0x13 #define CMDQ_OP_TLBI_EL2_ALL 0x20 #define CMDQ_OP_TLBI_EL2_ASID 0x21 #define CMDQ_OP_TLBI_EL2_VA 0x22 diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index fd7d16fd441d..f3aefb11f681 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -685,9 +685,11 @@ struct iommu_hwpt_get_dirty_bitmap { * enum iommu_hwpt_invalidate_data_type - IOMMU HWPT Cache Invalidation * Data Type * @IOMMU_HWPT_INVALIDATE_DATA_VTD_S1: Invalidation data for VTD_S1 + * @IOMMU_HWPT_INVALIDATE_DATA_ARM_SMMUV3: Invalidation data for ARM SMMUv3 */ enum iommu_hwpt_invalidate_data_type { IOMMU_HWPT_INVALIDATE_DATA_VTD_S1 = 0, + IOMMU_HWPT_INVALIDATE_DATA_ARM_SMMUV3 = 1, };
/** @@ -726,6 +728,25 @@ struct iommu_hwpt_vtd_s1_invalidate { __u32 __reserved; };
+/**
+ * struct iommu_hwpt_arm_smmuv3_invalidate - ARM SMMUv3 cache invalidation
+ *                                           (IOMMU_HWPT_INVALIDATE_DATA_ARM_SMMUV3)
+ * @cmd: 128-bit cache invalidation command that runs in SMMU CMDQ.
+ *       Must be little-endian.
+ *
+ * Supported command list:
+ *     CMDQ_OP_TLBI_NSNH_ALL
+ *     CMDQ_OP_TLBI_NH_VA
+ *     CMDQ_OP_TLBI_NH_VAA
+ *     CMDQ_OP_TLBI_NH_ALL
+ *     CMDQ_OP_TLBI_NH_ASID
+ *
+ * -EIO will be returned if the command is not supported.
+ */
+struct iommu_hwpt_arm_smmuv3_invalidate {
+	__aligned_u64 cmd[2];
+};
+
 /**
  * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
  * @size: sizeof(struct iommu_hwpt_invalidate)
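To illustrate the uAPI flow end to end, a hypothetical userspace snippet that pushes two raw TLBI commands through IOMMU_HWPT_INVALIDATE (field usage follows the structs above; nested_hwpt_id and the command encodings are assumed placeholders):

	struct iommu_hwpt_arm_smmuv3_invalidate cmds[2] = {};
	struct iommu_hwpt_invalidate cmd = {
		.size = sizeof(cmd),
		.hwpt_id = nested_hwpt_id,	/* assumed to already exist */
		.data_type = IOMMU_HWPT_INVALIDATE_DATA_ARM_SMMUV3,
		.data_uptr = (uint64_t)cmds,
		.entry_len = sizeof(cmds[0]),
		.entry_num = 2,
	};

	/* cmds[i].cmd[0/1] hold little-endian CMDQ_OP_TLBI_* encodings */
	if (ioctl(iommufd, IOMMU_HWPT_INVALIDATE, &cmd))
		/* on error, cmd.entry_num reports how many entries were handled */
		perror("IOMMU_HWPT_INVALIDATE");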
On Tue, Aug 27, 2024 at 09:59:53AM -0700, Nicolin Chen wrote:
static const struct iommu_domain_ops arm_smmu_nested_ops = { .get_msi_mapping_domain = arm_smmu_get_msi_mapping_domain, .attach_dev = arm_smmu_attach_dev_nested, .free = arm_smmu_domain_nested_free,
- .cache_invalidate_user = arm_smmu_cache_invalidate_user,
};
I think we should drop this op. The original intention was to do things in parts to split up the patches, but it turns out this is functionally useless, so let's not even expose it to userspace.
So the patch can maybe be split differently and combined with the next patch
Jason
On Thu, Sep 05, 2024 at 01:23:17PM -0300, Jason Gunthorpe wrote:
On Tue, Aug 27, 2024 at 09:59:53AM -0700, Nicolin Chen wrote:
static const struct iommu_domain_ops arm_smmu_nested_ops = { .get_msi_mapping_domain = arm_smmu_get_msi_mapping_domain, .attach_dev = arm_smmu_attach_dev_nested, .free = arm_smmu_domain_nested_free,
- .cache_invalidate_user = arm_smmu_cache_invalidate_user,
};
I think we should drop this op. The original intention was to do things in parts to split up the patches, but it turns out this is functionally useless, so let's not even expose it to userspace.
So the patch can maybe be split differently and combined with the next patch
Ack. I will see what I can do to submit it cleanly.
Thanks Nicolin
Add an arm_smmu_viommu_cache_invalidate() function for user space to issue cache invalidation commands via viommu.
The viommu invalidation takes the same native format of a 128-bit command as the hwpt invalidation. Thus, reuse the same driver data structure, but extend it to accept CMDQ_OP_ATC_INV and CMDQ_OP_CFGI_CD{,_ALL}.
Scan the commands against the supported list and fix the VMIDs and SIDs.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 76 ++++++++++++++++++++- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 + include/uapi/linux/iommufd.h | 7 +- 3 files changed, 80 insertions(+), 4 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index a2af693bc7b2..bddbb98da414 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -3267,15 +3267,32 @@ static void arm_smmu_domain_nested_free(struct iommu_domain *domain) kfree(container_of(domain, struct arm_smmu_nested_domain, domain)); }
+static int arm_smmu_convert_viommu_vdev_id(struct iommufd_viommu *viommu, + u32 vdev_id, u32 *sid) +{ + struct arm_smmu_master *master; + struct device *dev; + + dev = iommufd_viommu_find_device(viommu, vdev_id); + if (!dev) + return -EIO; + master = dev_iommu_priv_get(dev); + + if (sid) + *sid = master->streams[0].id; + return 0; +} + /* * Convert, in place, the raw invalidation command into an internal format that * can be passed to arm_smmu_cmdq_issue_cmdlist(). Internally commands are * stored in CPU endian. * - * Enforce the VMID on the command. + * Enforce the VMID or the SID on the command. */ static int arm_smmu_convert_user_cmd(struct arm_smmu_domain *s2_parent, + struct iommufd_viommu *viommu, struct iommu_hwpt_arm_smmuv3_invalidate *cmd) { u16 vmid = s2_parent->s2_cfg.vmid; @@ -3297,13 +3314,46 @@ arm_smmu_convert_user_cmd(struct arm_smmu_domain *s2_parent, cmd->cmd[0] &= ~CMDQ_TLBI_0_VMID; cmd->cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, vmid); break; + case CMDQ_OP_ATC_INV: + case CMDQ_OP_CFGI_CD: + case CMDQ_OP_CFGI_CD_ALL: + if (viommu) { + u32 sid, vsid = FIELD_GET(CMDQ_CFGI_0_SID, cmd->cmd[0]); + + if (arm_smmu_convert_viommu_vdev_id(viommu, vsid, &sid)) + return -EIO; + cmd->cmd[0] &= ~CMDQ_CFGI_0_SID; + cmd->cmd[0] |= FIELD_PREP(CMDQ_CFGI_0_SID, sid); + break; + } + fallthrough; default: return -EIO; } return 0; }
+static inline bool +arm_smmu_must_lock_vdev_id(struct iommu_hwpt_arm_smmuv3_invalidate *cmds, + unsigned int num_cmds) +{ + int i; + + for (i = 0; i < num_cmds; i++) { + switch (cmds[i].cmd[0] & CMDQ_0_OP) { + case CMDQ_OP_ATC_INV: + case CMDQ_OP_CFGI_CD: + case CMDQ_OP_CFGI_CD_ALL: + return true; + default: + continue; + } + } + return false; +} + static int __arm_smmu_cache_invalidate_user(struct arm_smmu_domain *s2_parent, + struct iommufd_viommu *viommu, struct iommu_user_data_array *array) { struct arm_smmu_device *smmu = s2_parent->smmu; @@ -3313,6 +3363,7 @@ static int __arm_smmu_cache_invalidate_user(struct arm_smmu_domain *s2_parent, struct iommu_hwpt_arm_smmuv3_invalidate *end; struct arm_smmu_cmdq_ent ent; struct arm_smmu_cmdq *cmdq; + bool must_lock = false; int ret;
/* A zero-length array is allowed to validate the array type */ @@ -3335,12 +3386,17 @@ static int __arm_smmu_cache_invalidate_user(struct arm_smmu_domain *s2_parent, if (ret) goto out;
+ if (viommu) + must_lock = arm_smmu_must_lock_vdev_id(cmds, array->entry_num); + if (must_lock) + iommufd_viommu_lock_vdev_id(viommu); + ent.opcode = cmds->cmd[0] & CMDQ_0_OP; cmdq = arm_smmu_get_cmdq(smmu, &ent);
last_batch = cmds; while (cur != end) { - ret = arm_smmu_convert_user_cmd(s2_parent, cur); + ret = arm_smmu_convert_user_cmd(s2_parent, viommu, cur); if (ret) goto out;
@@ -3358,6 +3414,8 @@ static int __arm_smmu_cache_invalidate_user(struct arm_smmu_domain *s2_parent, last_batch = cur; } out: + if (must_lock) + iommufd_viommu_unlock_vdev_id(viommu); array->entry_num = cur - cmds; kfree(cmds); return ret; @@ -3370,7 +3428,7 @@ static int arm_smmu_cache_invalidate_user(struct iommu_domain *domain, container_of(domain, struct arm_smmu_nested_domain, domain);
return __arm_smmu_cache_invalidate_user( - nested_domain->s2_parent, array); + nested_domain->s2_parent, NULL, array); }
static const struct iommu_domain_ops arm_smmu_nested_ops = { @@ -3863,6 +3921,15 @@ static int arm_smmu_def_domain_type(struct device *dev) return 0; }
+static int arm_smmu_viommu_cache_invalidate(struct iommufd_viommu *viommu, + struct iommu_user_data_array *array) +{ + struct iommu_domain *domain = iommufd_viommu_to_parent_domain(viommu); + + return __arm_smmu_cache_invalidate_user( + to_smmu_domain(domain), viommu, array); +} + static struct iommu_ops arm_smmu_ops = { .identity_domain = &arm_smmu_identity_domain, .blocked_domain = &arm_smmu_blocked_domain, @@ -3893,6 +3960,9 @@ static struct iommu_ops arm_smmu_ops = { .iotlb_sync = arm_smmu_iotlb_sync, .iova_to_phys = arm_smmu_iova_to_phys, .free = arm_smmu_domain_free_paging, + .default_viommu_ops = &(const struct iommufd_viommu_ops) { + .cache_invalidate = arm_smmu_viommu_cache_invalidate, + } } };
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index 6c8ae70c90fe..e7f6e9194a9e 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -10,6 +10,7 @@
#include <linux/bitfield.h> #include <linux/iommu.h> +#include <linux/iommufd.h> #include <linux/kernel.h> #include <linux/mmzone.h> #include <linux/sizes.h> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index f3aefb11f681..0d973486b604 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -734,13 +734,18 @@ struct iommu_hwpt_vtd_s1_invalidate { * @cmd: 128-bit cache invalidation command that runs in SMMU CMDQ. * Must be little-endian. * - * Supported command list: + * Supported command list when passing in a HWPT via @hwpt_id: * CMDQ_OP_TLBI_NSNH_ALL * CMDQ_OP_TLBI_NH_VA * CMDQ_OP_TLBI_NH_VAA * CMDQ_OP_TLBI_NH_ALL * CMDQ_OP_TLBI_NH_ASID * + * Additional to the list above, when passing in a VIOMMU via @hwpt_id: + * CMDQ_OP_ATC_INV + * CMDQ_OP_CFGI_CD + * CMDQ_OP_CFGI_CD_ALL + * * -EIO will be returned if the command is not supported. */ struct iommu_hwpt_arm_smmuv3_invalidate {
On Tue, Aug 27, 2024 at 09:59:54AM -0700, Nicolin Chen wrote:
+static int arm_smmu_viommu_cache_invalidate(struct iommufd_viommu *viommu,
+					    struct iommu_user_data_array *array)
+{
+	struct iommu_domain *domain = iommufd_viommu_to_parent_domain(viommu);
+
+	return __arm_smmu_cache_invalidate_user(
+			to_smmu_domain(domain), viommu, array);
I'd like to have the viommu struct directly hold the VMID. The nested parent should be sharable between multiple viommus; it doesn't make any sense that it would hold the vmid.
This is struggling because it is trying too hard to not have the driver allocate the viommu, and I think we should just go ahead and do that. Store the vmid, today copied from the nesting parent, in the viommu private struct. No need for iommufd_viommu_to_parent_domain(); just rework the APIs to pass the vmid down, not a domain.
Jason
On Thu, Sep 05, 2024 at 01:20:39PM -0300, Jason Gunthorpe wrote:
On Tue, Aug 27, 2024 at 09:59:54AM -0700, Nicolin Chen wrote:
+static int arm_smmu_viommu_cache_invalidate(struct iommufd_viommu *viommu,
+					    struct iommu_user_data_array *array)
+{
+	struct iommu_domain *domain = iommufd_viommu_to_parent_domain(viommu);
+
+	return __arm_smmu_cache_invalidate_user(
+			to_smmu_domain(domain), viommu, array);
I'd like to have the viommu struct directly hold the VMID. The nested parent should be sharable between multiple viommus; it doesn't make any sense that it would hold the vmid.
This is struggling because it is trying too hard to not have the driver allocate the viommu, and I think we should just go ahead and do that. Store the vmid, today copied from the nesting parent, in the viommu private struct. No need for iommufd_viommu_to_parent_domain(); just rework the APIs to pass the vmid down, not a domain.
OK. When I designed all this stuff, we still hadn't made up our minds about sharing the s2 domain, i.e. moving the VMID, which might need a couple more patches to achieve.
I will try making some changes for that.
Thanks Nicolin
On Thu, Sep 05, 2024 at 11:00:49AM -0700, Nicolin Chen wrote:
On Thu, Sep 05, 2024 at 01:20:39PM -0300, Jason Gunthorpe wrote:
On Tue, Aug 27, 2024 at 09:59:54AM -0700, Nicolin Chen wrote:
+static int arm_smmu_viommu_cache_invalidate(struct iommufd_viommu *viommu,
+					    struct iommu_user_data_array *array)
+{
+	struct iommu_domain *domain = iommufd_viommu_to_parent_domain(viommu);
+
+	return __arm_smmu_cache_invalidate_user(
+			to_smmu_domain(domain), viommu, array);
I'd like to have the viommu struct directly hold the VMID. The nested parent should be sharable between multiple viommus; it doesn't make any sense that it would hold the vmid.
This is struggling because it is trying too hard to not have the driver allocate the viommu, and I think we should just go ahead and do that. Store the vmid, today copied from the nesting parent, in the viommu private struct. No need for iommufd_viommu_to_parent_domain(); just rework the APIs to pass the vmid down, not a domain.
OK. When I designed all this stuff, we still hadn't made up our minds about sharing the s2 domain, i.e. moving the VMID, which might need a couple more patches to achieve.
Yes, many more patches, and don't try to do it now.. But we can copy the vmid from the s2 and place it in the viommu struct during allocation time.
Jason
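A rough sketch of the shape this exchange converges on (struct and field names are hypothetical; the core struct gets embedded so the driver can container_of() back to its private data, and the VMID is copied from the S2 at allocation time):

struct arm_smmu_viommu {		/* hypothetical driver-allocated type */
	struct iommufd_viommu core;
	u16 vmid;	/* copied from s2_parent->s2_cfg.vmid at alloc time */
};

With this, the invalidation path can take the vmid straight from the viommu object instead of converting back to the parent domain.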
From: Jason Gunthorpe jgg@nvidia.com Sent: Friday, September 6, 2024 2:22 AM
On Thu, Sep 05, 2024 at 11:00:49AM -0700, Nicolin Chen wrote:
On Thu, Sep 05, 2024 at 01:20:39PM -0300, Jason Gunthorpe wrote:
On Tue, Aug 27, 2024 at 09:59:54AM -0700, Nicolin Chen wrote:
+static int arm_smmu_viommu_cache_invalidate(struct iommufd_viommu *viommu,
+					    struct iommu_user_data_array *array)
+{
+	struct iommu_domain *domain = iommufd_viommu_to_parent_domain(viommu);
+
+	return __arm_smmu_cache_invalidate_user(
+			to_smmu_domain(domain), viommu, array);
I'd like to have the viommu struct directly hold the VMID. The nested parent should be sharable between multiple viommus; it doesn't make any sense that it would hold the vmid.
This is struggling because it is trying too hard to not have the driver allocate the viommu, and I think we should just go ahead and do that. Store the vmid, today copied from the nesting parent, in the viommu private struct. No need for iommufd_viommu_to_parent_domain(); just rework the APIs to pass the vmid down, not a domain.
OK. When I designed all this stuff, we still hadn't made up our minds about sharing the s2 domain, i.e. moving the VMID, which might need a couple more patches to achieve.
Yes, many more patches, and don't try to do it now.. But we can copy the vmid from the s2 and place it in the viommu struct during allocation time.
does it assume that a viommu object cannot span multiple physical IOMMUs so there is only one vmid per viommu?
On Wed, Sep 11, 2024 at 06:25:16AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Friday, September 6, 2024 2:22 AM
On Thu, Sep 05, 2024 at 11:00:49AM -0700, Nicolin Chen wrote:
On Thu, Sep 05, 2024 at 01:20:39PM -0300, Jason Gunthorpe wrote:
On Tue, Aug 27, 2024 at 09:59:54AM -0700, Nicolin Chen wrote:
+static int arm_smmu_viommu_cache_invalidate(struct iommufd_viommu *viommu,
+					    struct iommu_user_data_array *array)
+{
+	struct iommu_domain *domain = iommufd_viommu_to_parent_domain(viommu);
+
+	return __arm_smmu_cache_invalidate_user(
+			to_smmu_domain(domain), viommu, array);
I'd like to have the viommu struct directly hold the VMID. The nested parent should be sharable between multiple viommus; it doesn't make any sense that it would hold the vmid.
This is struggling because it is trying too hard to not have the driver allocate the viommu, and I think we should just go ahead and do that. Store the vmid, today copied from the nesting parent, in the viommu private struct. No need for iommufd_viommu_to_parent_domain(); just rework the APIs to pass the vmid down, not a domain.
OK. When I designed all this stuff, we still hadn't made up our minds about sharing the s2 domain, i.e. moving the VMID, which might need a couple more patches to achieve.
Yes, many more patches, and don't try to do it now.. But we can copy the vmid from the s2 and place it in the viommu struct during allocation time.
does it assume that a viommu object cannot span multiple physical IOMMUs so there is only one vmid per viommu?
I think so. One of the reasons for introducing vIOMMU is to maintain the shareability across physical IOMMUs at the s2 HWPT_PAGING.
Thanks Nicolin
On 2024/9/11 15:20, Nicolin Chen wrote:
On Wed, Sep 11, 2024 at 06:25:16AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Friday, September 6, 2024 2:22 AM
On Thu, Sep 05, 2024 at 11:00:49AM -0700, Nicolin Chen wrote:
On Thu, Sep 05, 2024 at 01:20:39PM -0300, Jason Gunthorpe wrote:
On Tue, Aug 27, 2024 at 09:59:54AM -0700, Nicolin Chen wrote:
+static int arm_smmu_viommu_cache_invalidate(struct iommufd_viommu *viommu,
+					    struct iommu_user_data_array *array)
+{
+	struct iommu_domain *domain = iommufd_viommu_to_parent_domain(viommu);
+
+	return __arm_smmu_cache_invalidate_user(
+			to_smmu_domain(domain), viommu, array);
I'd like to have the viommu struct directly hold the VMID. The nested parent should be sharable between multiple viommus; it doesn't make any sense that it would hold the vmid.
This is struggling because it is trying too hard to not have the driver allocate the viommu, and I think we should just go ahead and do that. Store the vmid, today copied from the nesting parent, in the viommu private struct. No need for iommufd_viommu_to_parent_domain(); just rework the APIs to pass the vmid down, not a domain.
OK. When I designed all this stuff, we still hadn't made up our minds about sharing the s2 domain, i.e. moving the VMID, which might need a couple more patches to achieve.
Yes, many more patches, and don't try to do it now.. But we can copy the vmid from the s2 and place it in the viommu struct during allocation time.
does it assume that a viommu object cannot span multiple physical IOMMUs so there is only one vmid per viommu?
I think so. One of the reasons for introducing vIOMMU is to maintain the shareability across physical IOMMUs at the s2 HWPT_PAGING.
My understanding of VMID is something like the domain ID in the x86 arch. Is my understanding correct?
If a VMID for an S2 hwpt is valid on physical IOMMU A but has already been allocated for another purpose on physical IOMMU B, how can it be shared across both IOMMUs? Or is the VMID allocated globally?
Thanks, baolu
From: Baolu Lu baolu.lu@linux.intel.com Sent: Wednesday, September 11, 2024 3:51 PM
On 2024/9/11 15:20, Nicolin Chen wrote:
On Wed, Sep 11, 2024 at 06:25:16AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Friday, September 6, 2024 2:22 AM
On Thu, Sep 05, 2024 at 11:00:49AM -0700, Nicolin Chen wrote:
On Thu, Sep 05, 2024 at 01:20:39PM -0300, Jason Gunthorpe wrote:
On Tue, Aug 27, 2024 at 09:59:54AM -0700, Nicolin Chen wrote:
> +static int arm_smmu_viommu_cache_invalidate(struct iommufd_viommu *viommu,
> +					     struct iommu_user_data_array *array)
> +{
> +	struct iommu_domain *domain = iommufd_viommu_to_parent_domain(viommu);
> +
> +	return __arm_smmu_cache_invalidate_user(
> +			to_smmu_domain(domain), viommu, array);
I'd like to have the viommu struct directly hold the VMID. The nested parent should be sharable between multiple viommus; it doesn't make any sense that it would hold the vmid.
This is struggling because it is trying too hard to not have the driver allocate the viommu, and I think we should just go ahead and do that. Store the vmid, today copied from the nesting parent, in the viommu private struct. No need for iommufd_viommu_to_parent_domain(); just rework the APIs to pass the vmid down, not a domain.
OK. When I designed all this stuff, we still hadn't made up our minds about sharing the s2 domain, i.e. moving the VMID, which might need a couple more patches to achieve.
Yes, many more patches, and don't try to do it now.. But we can copy the vmid from the s2 and place it in the viommu struct during allocation time.
does it assume that a viommu object cannot span multiple physical IOMMUs so there is only one vmid per viommu?
I think so. One of the reasons for introducing vIOMMU is to maintain the shareability across physical IOMMUs at the s2 HWPT_PAGING.
My understanding of VMID is something like the domain ID in the x86 arch. Is my understanding correct?
yes
If a VMID for an S2 hwpt is valid on physical IOMMU A but has already been allocated for another purpose on physical IOMMU B, how can it be shared across both IOMMUs? Or is the VMID allocated globally?
I'm not sure that's a problem. The point is that each vIOMMU object will get a VMID from the SMMU to which it's associated (assuming one vIOMMU cannot span multiple SMMUs). Whether that VMID is globally allocated or per-SMMU is the policy of the SMMU driver.
It's the driver's responsibility to ensure it does not use a conflicting VMID when creating a vIOMMU instance.
On 2024/9/11 16:17, Tian, Kevin wrote:
If a VMID for an S2 hwpt is valid on physical IOMMU A but has already been allocated for another purpose on physical IOMMU B, how can it be shared across both IOMMUs? Or is the VMID allocated globally?
I'm not sure that's a problem. The point is that each vIOMMU object will get a VMID from the SMMU to which it's associated (assuming one vIOMMU cannot span multiple SMMUs). Whether that VMID is globally allocated or per-SMMU is the policy of the SMMU driver.
It's the driver's responsibility to ensure it does not use a conflicting VMID when creating a vIOMMU instance.
Make sense.
Thanks, baolu
On Wed, Sep 11, 2024 at 08:17:23AM +0000, Tian, Kevin wrote:
My understanding of VMID is something like the domain ID in the x86 arch. Is my understanding correct?
yes
If a VMID for an S2 hwpt is valid on physical IOMMU A but has already been allocated for another purpose on physical IOMMU B, how can it be shared across both IOMMUs? Or is the VMID allocated globally?
I'm not sure that's a problem. The point is that each vIOMMU object will get a VMID from the SMMU to which it's associated (assuming one vIOMMU cannot span multiple SMMUs). Whether that VMID is globally allocated or per-SMMU is the policy of the SMMU driver.
It's the driver's responsibility to ensure it does not use a conflicting VMID when creating a vIOMMU instance.
It can happen to be the same VMID across all physical SMMUs, but it does not necessarily have to be the same, i.e. two SMMUs might have two VMIDs with different ID values, allocated from their own VMID pools, since cache entries in their own TLBs can be tagged with their own VMIDs.
Does domain id for intel-iommu have to be the same? I recall there is only one iommu instance on intel chips at this moment?
Thanks Nicolin
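For reference, a sketch of the per-SMMU VMID pool being described (based on the vmid_map IDA the SMMUv3 driver already keeps per device; the wrapper name is hypothetical):

/* Each pSMMU allocates VMIDs from its own pool, so the same S2 domain
 * can be tagged with different VMID values on different SMMUs. */
static int hypothetical_alloc_vmid(struct arm_smmu_device *smmu)
{
	return ida_alloc_range(&smmu->vmid_map, 1,
			       (1 << smmu->vmid_bits) - 1, GFP_KERNEL);
}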
On 9/12/24 5:08 AM, Nicolin Chen wrote:
On Wed, Sep 11, 2024 at 08:17:23AM +0000, Tian, Kevin wrote:
My understanding of VMID is something like the domain ID in the x86 arch. Is my understanding correct?
yes
If a VMID for an S2 hwpt is valid on physical IOMMU A but has already been allocated for another purpose on physical IOMMU B, how can it be shared across both IOMMUs? Or is the VMID allocated globally?
I'm not sure that's a problem. The point is that each vIOMMU object will get a VMID from the SMMU to which it's associated (assuming one vIOMMU cannot span multiple SMMUs). Whether that VMID is globally allocated or per-SMMU is the policy of the SMMU driver.
It's the driver's responsibility to ensure it does not use a conflicting VMID when creating a vIOMMU instance.
It can happen to be the same VMID across all physical SMMUs, but it does not necessarily have to be the same, i.e. two SMMUs might have two VMIDs with different ID values, allocated from their own VMID pools, since cache entries in their own TLBs can be tagged with their own VMIDs.
Does domain id for intel-iommu have to be the same?
No. A paging domain may have different domain IDs on different IOMMUs in the Intel iommu driver.
I recall there is only one iommu instance on intel chips at this moment?
No. There might be multiple iommu instances on a chip, but they share common iommu driver ops.
Thanks, baolu
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, September 11, 2024 3:21 PM
On Wed, Sep 11, 2024 at 06:25:16AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Friday, September 6, 2024 2:22 AM
On Thu, Sep 05, 2024 at 11:00:49AM -0700, Nicolin Chen wrote:
On Thu, Sep 05, 2024 at 01:20:39PM -0300, Jason Gunthorpe wrote:
On Tue, Aug 27, 2024 at 09:59:54AM -0700, Nicolin Chen wrote:
+static int arm_smmu_viommu_cache_invalidate(struct iommufd_viommu *viommu,
+					    struct iommu_user_data_array *array)
+{
+	struct iommu_domain *domain = iommufd_viommu_to_parent_domain(viommu);
+
+	return __arm_smmu_cache_invalidate_user(
+			to_smmu_domain(domain), viommu, array);
I'd like to have the viommu struct directly hold the VMID. The nested parent should be sharable between multiple viommus, it doesn't make any sense that it would hold the vmid.
This is struggling because it is trying too hard to not have the driver allocate the viommu, and I think we should just go ahead and do that. Store the vmid, today copied from the nesting parent in the vmid private struct. No need for iommufd_viommu_to_parent_domain(), just rework the APIs to pass the vmid down not a domain.
OK. When I designed all this stuff, we still hadn't made up our minds about sharing the s2 domain, i.e. moving the VMID, which might need a couple more patches to achieve.
Yes, many more patches, and don't try to do it now.. But we can copy the vmid from the s2 and place it in the viommu struct during allocation time.
does it assume that a viommu object cannot span multiple physical IOMMUs so there is only one vmid per viommu?
I think so. One of the reasons for introducing vIOMMU is to maintain the shareability across physical IOMMUs at the s2 HWPT_PAGING.
I don't quite get it. E.g. for intel-iommu the S2 domain itself can be shared across physical IOMMUs, then what is the problem preventing a vIOMMU object using that S2 to span multiple IOMMUs?
Probably there is a good reason e.g. for simplification or better aligned with hw accel stuff. But it's not explained clearly so far.
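A minimal sketch of the allocation-time VMID copy Jason describes above, with hypothetical field names (the series may structure this differently):

    /* Hypothetical driver-level vIOMMU that holds a copied VMID */
    struct example_vsmmu {
            struct iommufd_viommu core;     /* core-managed part */
            struct arm_smmu_device *smmu;   /* the pSMMU this vIOMMU slices */
            u16 vmid;                       /* copied from the S2 parent at alloc time */
    };

With this, the invalidation path can take the VMID straight from the vIOMMU object instead of dereferencing the parent domain.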
On Wed, Sep 11, 2024 at 08:13:01AM +0000, Tian, Kevin wrote:
Yes, many more patches, and don't try to do it now.. But we can copy the vmid from the s2 and place it in the viommu struct during allocation time.
does it assume that a viommu object cannot span multiple physical IOMMUs so there is only one vmid per viommu?
I think so. One of the reasons for introducing vIOMMU is to maintain the shareability across physical IOMMUs at the s2 HWPT_PAGING.
I don't quite get it. e.g. for intel-iommu the S2 domain itself can be shared across physical IOMMUs
SMMU does the same, but needs a VMID per pSMMU to tag that S2 domain:
vIOMMU0 (VMIDx of pSMMU0) -> shared S2
vIOMMU1 (VMIDy of pSMMU1) -> shared S2
Note: x and y might be different.
then what is the problem preventing a vIOMMU object using that S2 to span multiple IOMMUs?
Jason previously suggested implementing multi-vIOMMU in a VMM as one vIOMMU object representing one vIOMMU instance (of a physical IOMMU) in the VM. So, there'd be only one VMID per vIOMMU object.
Sharing one vIOMMU object on the other hand needs the vIOMMU to hold a list of VMIDs for all (or attached?) physical IOMMUs. This would change what a vIOMMU object represents.
Probably there is a good reason e.g. for simplification or better aligned with hw accel stuff. But it's not explained clearly so far.
I will try emphasizing that in the next version, likely in the rst file that I am patching for HWPT_PAGING/NESTED at this point.
Thanks Nicolin
On Wed, Sep 11, 2024 at 08:13:01AM +0000, Tian, Kevin wrote:
Probably there is a good reason e.g. for simplification or better aligned with hw accel stuff. But it's not explained clearly so far.
Probably the most concrete thing is if you have a direct assignment invalidation queue (ie DMA'd directly by HW) then it only applies to a single pIOMMU and invalidation commands placed there are unavoidably limited in scope.
This creates a representation problem: if we have a vIOMMU that spans many pIOMMUs but invalidations cover only some subset, how do we model that? Just saying the vIOMMU is linked to the pIOMMU solves this nicely.
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Thursday, September 12, 2024 7:08 AM
On Wed, Sep 11, 2024 at 08:13:01AM +0000, Tian, Kevin wrote:
Probably there is a good reason e.g. for simplification or better aligned with hw accel stuff. But it's not explained clearly so far.
Probably the most concrete thing is if you have a direct assignment invalidation queue (ie DMA'd directly by HW) then it only applies to a single pIOMMU and invalidation commands placed there are unavoidably limited in scope.
This creates a representation problem: if we have a vIOMMU that spans many pIOMMUs but invalidations cover only some subset, how do we model that? Just saying the vIOMMU is linked to the pIOMMU solves this nicely.
yes that is a good reason.
btw do we expect the VMM to try-and-fail when deciding whether a new vIOMMU object is required when creating a new vdev?
On Fri, Sep 13, 2024 at 02:33:59AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Thursday, September 12, 2024 7:08 AM
On Wed, Sep 11, 2024 at 08:13:01AM +0000, Tian, Kevin wrote:
Probably there is a good reason e.g. for simplification or better aligned with hw accel stuff. But it's not explained clearly so far.
Probably the most concrete thing is if you have a direct assignment invalidation queue (ie DMA'd directly by HW) then it only applies to a single pIOMMU and invalidation commands placed there are unavoidably limited in scope.
This creates a representation problem: if we have a vIOMMU that spans many pIOMMUs but invalidations cover only some subset, how do we model that? Just saying the vIOMMU is linked to the pIOMMU solves this nicely.
yes that is a good reason.
btw do we expect the VMM to try-and-fail when deciding whether a new vIOMMU object is required when creating a new vdev?
I think there was some suggestion the getinfo could return this, but also I think qemu needs to have a command line that matches physical so maybe it needs some sysfs?
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Saturday, September 14, 2024 10:51 PM
On Fri, Sep 13, 2024 at 02:33:59AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Thursday, September 12, 2024 7:08 AM
On Wed, Sep 11, 2024 at 08:13:01AM +0000, Tian, Kevin wrote:
Probably there is a good reason e.g. for simplification or better aligned with hw accel stuff. But it's not explained clearly so far.
Probably the most concrete thing is if you have a direct assignment invalidation queue (ie DMA'd directly by HW) then it only applies to a single pIOMMU and invalidation commands placed there are unavoidably limited in scope.
This creates a representation problem: if we have a vIOMMU that spans many pIOMMUs but invalidations cover only some subset, how do we model that? Just saying the vIOMMU is linked to the pIOMMU solves this nicely.
yes that is a good reason.
btw do we expect the VMM to try-and-fail when deciding whether a new vIOMMU object is required when creating a new vdev?
I think there was some suggestion the getinfo could return this, but also I think qemu needs to have a command line that matches physical so maybe it needs some sysfs?
My impression was that Qemu is moving away from directly accessing sysfs (e.g. as the reason behind allowing Libvirt to pass in an opened cdev fd to Qemu). So probably getinfo makes more sense...
On Wed, Sep 18, 2024 at 08:10:52AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Saturday, September 14, 2024 10:51 PM
On Fri, Sep 13, 2024 at 02:33:59AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Thursday, September 12, 2024 7:08 AM
On Wed, Sep 11, 2024 at 08:13:01AM +0000, Tian, Kevin wrote:
Probably there is a good reason e.g. for simplification or better aligned with hw accel stuff. But it's not explained clearly so far.
Probably the most concrete thing is if you have a direct assignment invalidation queue (ie DMA'd directly by HW) then it only applies to a single pIOMMU and invalidation commands placed there are unavoidably limited in scope.
This creates a representation problem: if we have a vIOMMU that spans many pIOMMUs but invalidations cover only some subset, how do we model that? Just saying the vIOMMU is linked to the pIOMMU solves this nicely.
yes that is a good reason.
btw do we expect the VMM to try-and-fail when deciding whether a new vIOMMU object is required when creating a new vdev?
I think there was some suggestion the getinfo could return this, but also I think qemu needs to have a command line that matches physical so maybe it needs some sysfs?
My impression was that Qemu is moving away from directly accessing sysfs (e.g. as the reason behind allowing Libvirt to pass in an opened cdev fd to Qemu). So probably getinfo makes more sense...
Yes, but I think libvirt needs this information before it invokes qemu..
The physical and virtual iommus need to sort of match, something should figure this out automatically I would guess.
Jason
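Purely as an illustration of the getinfo option discussed above (these fields do not exist in this series or upstream), the IOMMU_GET_HW_INFO output could in principle advertise what the pIOMMU behind a device supports, letting the VMM or libvirt decide without try-and-fail:

    /* Hypothetical sketch only: extra output fields for IOMMU_GET_HW_INFO */
    struct iommu_hw_info_viommu_example {
            __u32 out_viommu_types; /* hypothetical bitmap of enum iommu_viommu_type */
            __u32 out_max_viommus;  /* hypothetical per-pIOMMU instance limit */
    };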
On Wed, Sep 11, 2024 at 06:25:16AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Friday, September 6, 2024 2:22 AM
On Thu, Sep 05, 2024 at 11:00:49AM -0700, Nicolin Chen wrote:
On Thu, Sep 05, 2024 at 01:20:39PM -0300, Jason Gunthorpe wrote:
On Tue, Aug 27, 2024 at 09:59:54AM -0700, Nicolin Chen wrote:
+static int arm_smmu_viommu_cache_invalidate(struct iommufd_viommu *viommu,
+					    struct iommu_user_data_array *array)
+{
+	struct iommu_domain *domain = iommufd_viommu_to_parent_domain(viommu);
+
+	return __arm_smmu_cache_invalidate_user(
+			to_smmu_domain(domain), viommu, array);
I'd like to have the viommu struct directly hold the VMID. The nested parent should be sharable between multiple viommus, it doesn't make any sense that it would hold the vmid.
This is struggling because it is trying too hard to not have the driver allocate the viommu, and I think we should just go ahead and do that. Store the vmid, today copied from the nesting parent in the vmid private struct. No need for iommufd_viommu_to_parent_domain(), just rework the APIs to pass the vmid down not a domain.
OK. When I designed all this stuff, we still hadn't made up our minds about sharing the s2 domain, i.e. moving the VMID, which might need a couple more patches to achieve.
Yes, many more patches, and don't try to do it now.. But we can copy the vmid from the s2 and place it in the viommu struct during allocation time.
does it assume that a viommu object cannot span multiple physical IOMMUs so there is only one vmid per viommu?
Yes, the viommu is not intended to cross physical iommus, it is intended to contain objects, like invalidation queues, that are tied to a single piommu only.
If someone does want to make a vIOMMU that unifies multiple pIOMMUs, and they have the feature set that would make that possible, then they will need multiple IOMMUFD vIOMMU objects and will have to divide up the nested domains appropriately.
Jason
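To illustrate, a hedged userspace sketch (ioctl and type names as described in the cover letter; the exact field layout is an assumption, the series' uAPI header is authoritative) of a VMM creating one vIOMMU object per pSMMU while sharing one S2 parent HWPT:

    /* Sketch: one vIOMMU per pSMMU, both sharing the S2 parent (fields assumed) */
    struct iommu_viommu_alloc alloc = {
            .size = sizeof(alloc),
            .type = IOMMU_VIOMMU_TYPE_DEFAULT,
            .dev_id = idev_behind_psmmu0,   /* selects the pSMMU via this device */
            .hwpt_id = shared_s2_hwpt_id,   /* the same parent HWPT each time */
    };
    ioctl(iommufd, IOMMU_VIOMMU_ALLOC, &alloc); /* -> viommu0, VMID from pSMMU0 */
    /* repeat with a device behind pSMMU1 -> viommu1, VMID from pSMMU1 */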
From: Jason Gunthorpe jgg@nvidia.com
Now, ATC invalidation can be done with the VIOMMU invalidation op. A guest owned IOMMU_DOMAIN_NESTED can do an ATS too. Allow it to pass in the EATS field via the vSTE words.
Signed-off-by: Jason Gunthorpe jgg@nvidia.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 15 ++++++++++++---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
 2 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index bddbb98da414..6627ab87a697 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3237,8 +3237,6 @@ static int arm_smmu_attach_dev_nested(struct iommu_domain *domain,
 		.master = master,
 		.old_domain = iommu_get_domain_for_dev(dev),
 		.ssid = IOMMU_NO_PASID,
-		/* Currently invalidation of ATC is not supported */
-		.disable_ats = true,
 	};
 	struct arm_smmu_ste ste;
 	int ret;
@@ -3248,6 +3246,15 @@ static int arm_smmu_attach_dev_nested(struct iommu_domain *domain,
 		return -EINVAL;

 	mutex_lock(&arm_smmu_asid_lock);
+	/*
+	 * The VM has to control the actual ATS state at the PCI device because
+	 * we forward the invalidations directly from the VM. If the VM doesn't
+	 * think ATS is on it will not generate ATC flushes and the ATC will
+	 * become incoherent. Since we can't access the actual virtual PCI ATS
+	 * config bit here base this off the EATS value in the STE. If the EATS
+	 * is set then the VM must generate ATC flushes.
+	 */
+	state.disable_ats = !nested_domain->enable_ats;
 	ret = arm_smmu_attach_prepare(&state, domain);
 	if (ret) {
 		mutex_unlock(&arm_smmu_asid_lock);
@@ -3497,8 +3504,9 @@ arm_smmu_domain_alloc_nesting(struct device *dev, u32 flags,
 	    cfg != STRTAB_STE_0_CFG_S1_TRANS)
 		return ERR_PTR(-EIO);

+	/* Only Full ATS or ATS UR is supported */
 	eats = FIELD_GET(STRTAB_STE_1_EATS, le64_to_cpu(arg.ste[1]));
-	if (eats != STRTAB_STE_1_EATS_ABT)
+	if (eats != STRTAB_STE_1_EATS_ABT && eats != STRTAB_STE_1_EATS_TRANS)
 		return ERR_PTR(-EIO);

 	if (cfg != STRTAB_STE_0_CFG_S1_TRANS)
@@ -3511,6 +3519,7 @@ arm_smmu_domain_alloc_nesting(struct device *dev, u32 flags,
 	nested_domain->domain.type = IOMMU_DOMAIN_NESTED;
 	nested_domain->domain.ops = &arm_smmu_nested_ops;
 	nested_domain->s2_parent = smmu_parent;
+	nested_domain->enable_ats = eats == STRTAB_STE_1_EATS_TRANS;
 	nested_domain->ste[0] = arg.ste[0];
 	nested_domain->ste[1] = arg.ste[1] & ~cpu_to_le64(STRTAB_STE_1_EATS);

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index e7f6e9194a9e..6930810b85cb 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -832,6 +832,7 @@ struct arm_smmu_domain {
 struct arm_smmu_nested_domain {
 	struct iommu_domain domain;
 	struct arm_smmu_domain *s2_parent;
+	u8 enable_ats : 1;

 	__le64 ste[2];
 };
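For context, a sketch of the userspace side this patch enables, written with kernel macro notation for brevity (the vSTE struct name and the exact bit helpers used here are assumptions; only the EATS behavior is taken from the patch):

    /* Sketch: guest STE words passed when allocating the nested domain */
    struct iommu_hwpt_arm_smmuv3 vste = {
            .ste[0] = cpu_to_le64(FIELD_PREP(STRTAB_STE_0_CFG,
                                             STRTAB_STE_0_CFG_S1_TRANS)),
            .ste[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
                                             STRTAB_STE_1_EATS_TRANS)),
    };
    /*
     * Per the patch, the kernel accepts only EATS_ABT or EATS_TRANS, records
     * the choice in nested_domain->enable_ats, and masks the EATS bits out of
     * the stored ste[1].
     */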
From: Jason Gunthorpe jgg@nvidia.com
The SMMUv3 spec has a note that BYPASS and ATS don't work together under the STE EATS field definition. However there is another section "13.6.4 Full ATS skipping stage 1" that explains under certain conditions BYPASS and ATS do work together if the STE is using S1DSS to select BYPASS and the CD table has the possibility for a substream.
When these comments were written the understanding was that all forms of BYPASS just didn't work and this was to be a future problem to solve.
It turns out that ATS and IDENTITY will always work just fine:
- If STE.Config = BYPASS then the PCI ATS is disabled
- If a PASID domain is attached then S1DSS = BYPASS and ATS will be enabled. This meets the requirements of 13.6.4 to automatically generate 1:1 ATS replies on the RID.
Update the comments to reflect this.
Signed-off-by: Jason Gunthorpe jgg@nvidia.com
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 6627ab87a697..ad43351145d0 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2830,9 +2830,14 @@ static int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
 		 * Translation Requests and Translated transactions are denied
 		 * as though ATS is disabled for the stream (STE.EATS == 0b00),
 		 * causing F_BAD_ATS_TREQ and F_TRANSL_FORBIDDEN events
-		 * (IHI0070Ea 5.2 Stream Table Entry). Thus ATS can only be
-		 * enabled if we have arm_smmu_domain, those always have page
-		 * tables.
+		 * (IHI0070Ea 5.2 Stream Table Entry).
+		 *
+		 * However, if we have installed a CD table and are using S1DSS
+		 * then ATS will work in S1DSS bypass. See "13.6.4 Full ATS
+		 * skipping stage 1".
+		 *
+		 * Disable ATS if we are going to create a normal 0b100 bypass
+		 * STE.
 		 */
 		state->ats_enabled = !state->disable_ats &&
 				     arm_smmu_ats_supported(master);
@@ -3157,8 +3162,10 @@ static void arm_smmu_attach_dev_ste(struct iommu_domain *domain,
 	if (arm_smmu_ssids_in_use(&master->cd_table)) {
 		/*
 		 * If a CD table has to be present then we need to run with ATS
-		 * on even though the RID will fail ATS queries with UR. This is
-		 * because we have no idea what the PASID's need.
+		 * on because we have to assume a PASID is using ATS. For
+		 * IDENTITY this will setup things so that S1DSS=bypass which
+		 * follows the explanation in "13.6.4 Full ATS skipping stage 1"
+		 * and allows for ATS on the RID to work.
 		 */
 		state.cd_needs_ats = true;
 		arm_smmu_attach_prepare(&state, domain);
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, August 28, 2024 1:00 AM
[...]
On a multi-IOMMU system, the VIOMMU object can be instanced to the number of vIOMMUs in a guest VM, while holding the same parent HWPT to share the
Is there a restriction that multiple vIOMMU objects can only be created on a multi-IOMMU system?
stage-2 IO pagetable. Each VIOMMU then just need to only allocate its own VMID to attach the shared stage-2 IO pagetable to the physical IOMMU:
this reads like 'VMID' is a virtual ID allocated by vIOMMU. But from the entire context it actually means the physical 'VMID' allocated on the associated physical IOMMU, correct?
On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, August 28, 2024 1:00 AM
[...]
On a multi-IOMMU system, the VIOMMU object can be instanced to the number of vIOMMUs in a guest VM, while holding the same parent HWPT to share the
Is there a restriction that multiple vIOMMU objects can only be created on a multi-IOMMU system?
I think it should be generally restricted to the number of pIOMMUs, although likely (not 100% sure) we could do multiple vIOMMUs on a single-pIOMMU system. Any reason for doing that?
stage-2 IO pagetable. Each VIOMMU then just need to only allocate its own VMID to attach the shared stage-2 IO pagetable to the physical IOMMU:
this reads like 'VMID' is a virtual ID allocated by vIOMMU. But from the entire context it actually means the physical 'VMID' allocated on the associated physical IOMMU, correct?
Quoting Jason's narratives, a VMID is a "Security namespace for guest owned ID". The allocation, using SMMU as an example, should be a part of vIOMMU instance allocation in the host SMMU driver. Then, this VMID will be used to mark the cache tags. So, it is still a software allocated ID, while HW would use it too.
Thanks Nicolin
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, September 11, 2024 3:08 PM
On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, August 28, 2024 1:00 AM
[...]
On a multi-IOMMU system, the VIOMMU object can be instanced to the number of vIOMMUs in a guest VM, while holding the same parent HWPT to share the
Is there a restriction that multiple vIOMMU objects can only be created on a multi-IOMMU system?
I think it should be generally restricted to the number of pIOMMUs, although likely (not 100% sure) we could do multiple vIOMMUs on a single-pIOMMU system. Any reason for doing that?
No idea. But if you stated so then there will be code to enforce it e.g. failing the attempt to create a vIOMMU object on a pIOMMU to which another vIOMMU object is already linked?
stage-2 IO pagetable. Each VIOMMU then just need to only allocate its own VMID to attach the shared stage-2 IO pagetable to the physical IOMMU:
this reads like 'VMID' is a virtual ID allocated by vIOMMU. But from the entire context it actually means the physical 'VMID' allocated on the associated physical IOMMU, correct?
Quoting Jason's narratives, a VMID is a "Security namespace for guest owned ID". The allocation, using SMMU as an example, should
the VMID alone is not a namespace. It's one ID to tag another namespace.
be a part of vIOMMU instance allocation in the host SMMU driver. Then, this VMID will be used to mark the cache tags. So, it is still a software allocated ID, while HW would use it too.
VMIDs are physical resource belonging to the host SMMU driver.
but I got your original point that each vIOMMU gets a unique VMID from the host SMMU driver, not exactly that each vIOMMU maintains its own VMID namespace. That'd be a different concept.
On Wed, Sep 11, 2024 at 07:18:10AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, September 11, 2024 3:08 PM
On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, August 28, 2024 1:00 AM
[...]
On a multi-IOMMU system, the VIOMMU object can be instanced to the number of vIOMMUs in a guest VM, while holding the same parent HWPT to share the
Is there a restriction that multiple vIOMMU objects can only be created on a multi-IOMMU system?
I think it should be generally restricted to the number of pIOMMUs, although likely (not 100% sure) we could do multiple vIOMMUs on a single-pIOMMU system. Any reason for doing that?
No idea. But if you stated so then there will be code to enforce it e.g. failing the attempt to create a vIOMMU object on a pIOMMU to which another vIOMMU object is already linked?
Yea, I can do that.
stage-2 IO pagetable. Each VIOMMU then just need to only allocate its own VMID to attach the shared stage-2 IO pagetable to the physical IOMMU:
this reads like 'VMID' is a virtual ID allocated by vIOMMU. But from the entire context it actually means the physical 'VMID' allocated on the associated physical IOMMU, correct?
Quoting Jason's narratives, a VMID is a "Security namespace for guest owned ID". The allocation, using SMMU as an example, should
the VMID alone is not a namespace. It's one ID to tag another namespace.
be a part of vIOMMU instance allocation in the host SMMU driver. Then, this VMID will be used to mark the cache tags. So, it is still a software allocated ID, while HW would use it too.
VMIDs are physical resource belonging to the host SMMU driver.
Yes. Just the lifecycle of a VMID is controlled by a vIOMMU, i.e. the guest.
but I got your original point that each vIOMMU gets a unique VMID from the host SMMU driver, not exactly that each vIOMMU maintains its own VMID namespace. That'd be a different concept.
What's a VMID namespace actually? Please educate me :)
Thanks Nicolin
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, September 11, 2024 3:41 PM
On Wed, Sep 11, 2024 at 07:18:10AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, September 11, 2024 3:08 PM
On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, August 28, 2024 1:00 AM
stage-2 IO pagetable. Each VIOMMU then just need to only allocate its own VMID to attach the shared stage-2 IO pagetable to the physical IOMMU:
this reads like 'VMID' is a virtual ID allocated by vIOMMU. But from the entire context it actually means the physical 'VMID' allocated on the associated physical IOMMU, correct?
Quoting Jason's narratives, a VMID is a "Security namespace for guest owned ID". The allocation, using SMMU as an example, should
the VMID alone is not a namespace. It's one ID to tag another namespace.
be a part of vIOMMU instance allocation in the host SMMU driver. Then, this VMID will be used to mark the cache tags. So, it is still a software allocated ID, while HW would use it too.
VMIDs are physical resource belonging to the host SMMU driver.
Yes. Just the lifecycle of a VMID is controlled by a vIOMMU, i.e. the guest.
but I got your original point that each vIOMMU gets a unique VMID from the host SMMU driver, not exactly that each vIOMMU maintains its own VMID namespace. That'd be a different concept.
What's a VMID namespace actually? Please educate me :)
I meant the 16bit VMID pool under each SMMU.
On Wed, Sep 11, 2024 at 08:08:04AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, September 11, 2024 3:41 PM
On Wed, Sep 11, 2024 at 07:18:10AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, September 11, 2024 3:08 PM
On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, August 28, 2024 1:00 AM
stage-2 IO pagetable. Each VIOMMU then just need to only allocate its own VMID to attach the shared stage-2 IO pagetable to the physical IOMMU:
this reads like 'VMID' is a virtual ID allocated by vIOMMU. But from the entire context it actually means the physical 'VMID' allocated on the associated physical IOMMU, correct?
Quoting Jason's narratives, a VMID is a "Security namespace for guest owned ID". The allocation, using SMMU as an example, should
the VMID alone is not a namespace. It's one ID to tag another namespace.
be a part of vIOMMU instance allocation in the host SMMU driver. Then, this VMID will be used to mark the cache tags. So, it is still a software allocated ID, while HW would use it too.
VMIDs are physical resource belonging to the host SMMU driver.
Yes. Just the lifecycle of a VMID is controlled by a vIOMMU, i.e. the guest.
but I got your original point that each vIOMMU gets a unique VMID from the host SMMU driver, not exactly that each vIOMMU maintains its own VMID namespace. That'd be a different concept.
What's a VMID namespace actually? Please educate me :)
I meant the 16bit VMID pool under each SMMU.
I see. Makes sense now.
Thanks Nicolin
On 11/9/24 17:08, Nicolin Chen wrote:
On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, August 28, 2024 1:00 AM
[...]
On a multi-IOMMU system, the VIOMMU object can be instanced to the number of vIOMMUs in a guest VM, while holding the same parent HWPT to share the
Is there a restriction that multiple vIOMMU objects can only be created on a multi-IOMMU system?
I think it should be generally restricted to the number of pIOMMUs, although likely (not 100% sure) we could do multiple vIOMMUs on a single-pIOMMU system. Any reason for doing that?
Just to clarify the terminology here - what are pIOMMU and vIOMMU exactly?
On AMD, IOMMU is a pretend-pcie device, one per rootport; it manages a DT (device table) with one entry per BDFn, and each entry owns a queue. A slice of that can be passed to a VM (== queues mapped directly to the VM, and such an IOMMU appears in the VM as a pretend-pcie device too). So what is [pv]IOMMU here? Thanks,
stage-2 IO pagetable. Each VIOMMU then just need to only allocate its own VMID to attach the shared stage-2 IO pagetable to the physical IOMMU:
this reads like 'VMID' is a virtual ID allocated by vIOMMU. But from the entire context it actually means the physical 'VMID' allocated on the associated physical IOMMU, correct?
Quoting Jason's narratives, a VMID is a "Security namespace for guest owned ID". The allocation, using SMMU as an example, should be a part of vIOMMU instance allocation in the host SMMU driver. Then, this VMID will be used to mark the cache tags. So, it is still a software allocated ID, while HW would use it too.
Thanks Nicolin
On Tue, Oct 01, 2024 at 11:55:59AM +1000, Alexey Kardashevskiy wrote:
On 11/9/24 17:08, Nicolin Chen wrote:
On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, August 28, 2024 1:00 AM
[...]
On a multi-IOMMU system, the VIOMMU object can be instanced to the number of vIOMMUs in a guest VM, while holding the same parent HWPT to share the
Is there a restriction that multiple vIOMMU objects can only be created on a multi-IOMMU system?
I think it should be generally restricted to the number of pIOMMUs, although likely (not 100% sure) we could do multiple vIOMMUs on a single-pIOMMU system. Any reason for doing that?
Just to clarify the terminology here - what are pIOMMU and vIOMMU exactly?
On AMD, IOMMU is a pretend-pcie device, one per rootport; it manages a DT (device table) with one entry per BDFn, and each entry owns a queue. A slice of that can be passed to a VM (== queues mapped directly to the VM, and such an IOMMU appears in the VM as a pretend-pcie device too). So what is [pv]IOMMU here? Thanks,
The "p" stands for physical: the entire IOMMU unit/instance. In the IOMMU subsystem terminology, it's a struct iommu_device. It sounds like AMD would register one iommu device per rootport?
The "v" stands for virtual: a slice of the pIOMMU that could be shared or passed through to a VM:
- Intel IOMMU doesn't have passthrough queues, so it uses a shared queue (for invalidation). In this case, vIOMMU will be a pure SW structure for HW queue sharing (with the host machine and other VMs). That said, I think the channel (or the port) that Intel VT-d uses internally for a device to do a two-stage translation can be seen as a "passthrough" feature, held by a vIOMMU.
- AMD IOMMU can assign passthrough queues to VMs, in which case, vIOMMU will be a structure holding all passthrough resources (of the pIOMMU) assigned to a VM. If there is a shared resource, it can be packed into the vIOMMU struct too. FYI, vQUEUE (future series) on the other hand will represent each passthrough queue in a vIOMMU struct. The VM then, per that specific pIOMMU (rootport?), will have one vIOMMU holding a number of vQUEUEs.
- ARM SMMU is sort of in the middle, depending on the impls. vIOMMU will be a structure holding both passthrough and shared resources. It can define vQUEUEs, if the impl has passthrough queues like AMD does.
Allowing a vIOMMU to hold shared resources makes it a bit of an upgraded model for IOMMU virtualization, from the existing HWPT model that now looks like a subset of the vIOMMU model.
Thanks Nicolin
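To restate the slice model above in code form, a rough conceptual sketch (member names assumed, not the series' exact definition):

    /* Conceptual sketch: a vIOMMU is a slice of one pIOMMU exposed to one VM */
    struct iommufd_viommu_sketch {
            struct iommufd_object obj;              /* core-managed object */
            struct iommu_device *iommu_dev;         /* the pIOMMU being sliced */
            struct iommufd_hw_pagetable *hwpt;      /* the shared S2 parent */
            struct list_head vdev_ids;              /* virtual ID -> device lookups */
            /* driver part: VMID/DID, passthrough queues (vQUEUE), vIRQ, ... */
    };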
On 1/10/24 13:36, Nicolin Chen wrote:
On Tue, Oct 01, 2024 at 11:55:59AM +1000, Alexey Kardashevskiy wrote:
On 11/9/24 17:08, Nicolin Chen wrote:
On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, August 28, 2024 1:00 AM
[...]
On a multi-IOMMU system, the VIOMMU object can be instanced to the number of vIOMMUs in a guest VM, while holding the same parent HWPT to share the
Is there a restriction that multiple vIOMMU objects can only be created on a multi-IOMMU system?
I think it should be generally restricted to the number of pIOMMUs, although likely (not 100% sure) we could do multiple vIOMMUs on a single-pIOMMU system. Any reason for doing that?
Just to clarify the terminology here - what are pIOMMU and vIOMMU exactly?
On AMD, IOMMU is a pretend-pcie device, one per rootport; it manages a DT (device table) with one entry per BDFn, and each entry owns a queue. A slice of that can be passed to a VM (== queues mapped directly to the VM, and such an IOMMU appears in the VM as a pretend-pcie device too). So what is [pv]IOMMU here? Thanks,
The "p" stands for physical: the entire IOMMU unit/instance. In the IOMMU subsystem terminology, it's a struct iommu_device. It sounds like AMD would register one iommu device per rootport?
Yup, my test machine has 4 of these.
The "v" stands for virtual: a slice of the pIOMMU that could be shared or passed through to a VM:
- Intel IOMMU doesn't have passthrough queues, so it uses a shared queue (for invalidation). In this case, vIOMMU will be a pure SW structure for HW queue sharing (with the host machine and other VMs). That said, I think the channel (or the port) that Intel VT-d uses internally for a device to do a two-stage translation can be seen as a "passthrough" feature, held by a vIOMMU.
- AMD IOMMU can assign passthrough queues to VMs, in which case, vIOMMU will be a structure holding all passthrough resources (of the pIOMMU) assigned to a VM. If there is a shared resource, it can be packed into the vIOMMU struct too. FYI, vQUEUE (future series) on the other hand will represent each passthrough queue in a vIOMMU struct. The VM then, per that specific pIOMMU (rootport?), will have one vIOMMU holding a number of vQUEUEs.
- ARM SMMU is sort of in the middle, depending on the impls. vIOMMU will be a structure holding both passthrough and shared resources. It can define vQUEUEs, if the impl has passthrough queues like AMD does.
Allowing a vIOMMU to hold shared resources makes it a bit of an upgraded model for IOMMU virtualization, from the existing HWPT model that now looks like a subset of the vIOMMU model.
Thanks for confirming.
I've just read in this thread that "it should be generally restricted to the number of pIOMMUs, although likely (not 100% sure) we could do multiple vIOMMUs on a single-pIOMMU system. Any reason for doing that?" and thought "we have every reason to do that, unless p means something different", so I decided to ask :) Thanks,
Thanks Nicolin
On Tue, Oct 01, 2024 at 03:06:57PM +1000, Alexey Kardashevskiy wrote:
I've just read in this thread that "it should be generally restricted to the number of pIOMMUs, although likely (not 100% sure) we could do multiple vIOMMUs on a single-pIOMMU system. Any reason for doing that?"? thought "we have every reason to do that, unless p means something different", so I decided to ask :) Thanks,
I think that was intended as "multiple vIOMMUs per pIOMMU within a single VM".
There would always be multiple vIOMMUs per pIOMMU across VMs/etc.
Jason
Hi Nic,
On 2024/8/28 00:59, Nicolin Chen wrote:
This series introduces a new VIOMMU infrastructure and related ioctls.
IOMMUFD has been using the HWPT infrastructure for all cases, including a nested IO page table support. Yet, there're limitations for an HWPT-based structure to support some advanced HW-accelerated features, such as CMDQV on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU environment, it is not straightforward for nested HWPTs to share the same parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone.
could you elaborate a bit for the last sentence in the above paragraph?
The new VIOMMU object is an additional layer, between the nested HWPT and its parent HWPT, to give to both the IOMMUFD core and an IOMMU driver an additional structure to support HW-accelerated feature:

                          ----------------------------
    ----------------      | viommu0                  |
    | hwpt_nested0 |--->  |   paging_hwpt0 (parent)  |
    ----------------      |   HW-accel feats         |
                          ----------------------------
On a multi-IOMMU system, the VIOMMU object can be instanced to the number of vIOMMUs in a guest VM, while holding the same parent HWPT to share the stage-2 IO pagetable. Each VIOMMU then just need to only allocate its own VMID to attach the shared stage-2 IO pagetable to the physical IOMMU:

                          ----------------------------
    ----------------      | viommu0                  |
    | hwpt_nested0 |--->  |   paging_hwpt0 (shared)  |
    ----------------      |   VMID0                  |
                          ----------------------------
                          ----------------------------
    ----------------      | viommu1                  |
    | hwpt_nested1 |--->  |   paging_hwpt0 (shared)  |
    ----------------      |   VMID1                  |
                          ----------------------------
As an initial part-1, add ioctls to support a VIOMMU-based invalidation:
  IOMMUFD_CMD_VIOMMU_ALLOC to allocate a VIOMMU object
  IOMMUFD_CMD_VIOMMU_SET/UNSET_VDEV_ID to set/clear device's virtual ID
  (Reuse IOMMUFD_CMD_HWPT_INVALIDATE for a VIOMMU object to flush cache by a given driver data)
Worth noting that the VDEV_ID is for a per-VIOMMU device list for drivers to look up the device's physical instance from its virtual ID in a VM. It is essential for a VIOMMU-based invalidation where the request contains a device's virtual ID for its device cache flush, e.g. ATC invalidation.
As for the implementation of the series, add an IOMMU_VIOMMU_TYPE_DEFAULT type for a core-allocated-core-managed VIOMMU object, allowing drivers to simply hook a default viommu ops for viommu-based invalidation alone. And provide some viommu helpers to drivers for VDEV_ID translation and parent domain lookup. Add VIOMMU invalidation support to ARM SMMUv3 driver for a real world use case. This adds supports of arm-smmuv-v3's CMDQ_OP_ATC_INV and CMDQ_OP_CFGI_CD/ALL commands, supplementing HWPT-based invalidations.
In the future, drivers will also be able to choose a driver-managed type to hold its own structure by adding a new type to enum iommu_viommu_type. More VIOMMU-based structures and ioctls will be introduced in part-2/3 to support a driver-managed VIOMMU, e.g. VQUEUE object for a HW accelerated queue, VIRQ (or VEVENT) object for IRQ injections. Although we repurposed the VIOMMU object from an earlier RFC discussion, for a reference: https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/
This series is on Github: https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p1-v2
Pairing QEMU branch for testing: https://github.com/nicolinc/qemu/commits/wip/for_iommufd_viommu_p1-v2
Changelog v2
- Limited vdev_id to one per idev
- Added a rw_sem to protect the vdev_id list
- Reworked driver-level APIs with proper lockings
- Added a new viommu_api file for IOMMUFD_DRIVER config
- Dropped useless iommu_dev point from the viommu structure
- Added missing index numbers to new types in the uAPI header
- Dropped IOMMU_VIOMMU_INVALIDATE uAPI; Instead, reuse the HWPT one
- Reworked mock_viommu_cache_invalidate() using the new iommu helper
- Reordered details of set/unset_vdev_id handlers for proper lockings
- Added arm_smmu_cache_invalidate_user patch from Jason's nesting series
v1 https://lore.kernel.org/all/cover.1723061377.git.nicolinc@nvidia.com/
Thanks! Nicolin
Jason Gunthorpe (3):
  iommu: Add iommu_copy_struct_from_full_user_array helper
  iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED
  iommu/arm-smmu-v3: Update comments about ATS and bypass

Nicolin Chen (16):
  iommufd: Reorder struct forward declarations
  iommufd/viommu: Add IOMMUFD_OBJ_VIOMMU and IOMMU_VIOMMU_ALLOC ioctl
  iommu: Pass in a viommu pointer to domain_alloc_user op
  iommufd: Allow pt_id to carry viommu_id for IOMMU_HWPT_ALLOC
  iommufd/selftest: Add IOMMU_VIOMMU_ALLOC test coverage
  iommufd/viommu: Add IOMMU_VIOMMU_SET/UNSET_VDEV_ID ioctl
  iommufd/selftest: Add IOMMU_VIOMMU_SET/UNSET_VDEV_ID test coverage
  iommufd/viommu: Add cache_invalidate for IOMMU_VIOMMU_TYPE_DEFAULT
  iommufd: Allow hwpt_id to carry viommu_id for IOMMU_HWPT_INVALIDATE
  iommufd/viommu: Add vdev_id helpers for IOMMU drivers
  iommufd/selftest: Add mock_viommu_invalidate_user op
  iommufd/selftest: Add IOMMU_TEST_OP_DEV_CHECK_CACHE test command
  iommufd/selftest: Add VIOMMU coverage for IOMMU_HWPT_INVALIDATE ioctl
  iommufd/viommu: Add iommufd_viommu_to_parent_domain helper
  iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user
  iommu/arm-smmu-v3: Add arm_smmu_viommu_cache_invalidate

 drivers/iommu/amd/iommu.c                     |   1 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 218 ++++++++++++++-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |   3 +
 drivers/iommu/intel/iommu.c                   |   1 +
 drivers/iommu/iommufd/Makefile                |   5 +-
 drivers/iommu/iommufd/device.c                |  12 +
 drivers/iommu/iommufd/hw_pagetable.c          |  59 +++-
 drivers/iommu/iommufd/iommufd_private.h       |  37 +++
 drivers/iommu/iommufd/iommufd_test.h          |  30 ++
 drivers/iommu/iommufd/main.c                  |  12 +
 drivers/iommu/iommufd/selftest.c              | 101 ++++++-
 drivers/iommu/iommufd/viommu.c                | 196 +++++++++++++
 drivers/iommu/iommufd/viommu_api.c            |  53 ++++
 include/linux/iommu.h                         |  56 +++-
 include/linux/iommufd.h                       |  51 +++-
 include/uapi/linux/iommufd.h                  | 117 +++++++-
 tools/testing/selftests/iommu/iommufd.c       | 259 +++++++++++++++++-
 tools/testing/selftests/iommu/iommufd_utils.h | 126 +++++++++
 18 files changed, 1299 insertions(+), 38 deletions(-)
 create mode 100644 drivers/iommu/iommufd/viommu.c
 create mode 100644 drivers/iommu/iommufd/viommu_api.c
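For the IOMMUFD_CMD_VIOMMU_SET/UNSET_VDEV_ID ioctls quoted above, a hedged userspace sketch (the field layout is assumed from the cover letter's description; the series' uAPI header is authoritative):

    /* Sketch: bind the guest-visible device ID (e.g. vSID) to the vIOMMU */
    struct iommu_viommu_set_vdev_id set = {
            .size = sizeof(set),
            .viommu_id = viommu_id,
            .dev_id = idev_id,              /* iommufd handle of the physical device */
            .vdev_id = guest_stream_id,     /* the ID the VM's invalidations carry */
    };
    ioctl(iommufd, IOMMU_VIOMMU_SET_VDEV_ID, &set);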
On Wed, Sep 25, 2024 at 06:30:20PM +0800, Yi Liu wrote:
Hi Nic,
On 2024/8/28 00:59, Nicolin Chen wrote:
This series introduces a new VIOMMU infrastructure and related ioctls.
IOMMUFD has been using the HWPT infrastructure for all cases, including a nested IO page table support. Yet, there're limitations for an HWPT-based structure to support some advanced HW-accelerated features, such as CMDQV on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU environment, it is not straightforward for nested HWPTs to share the same parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone.
could you elaborate a bit for the last sentence in the above paragraph?
Stage-2 HWPT/domain on ARM holds a VMID. If we share the parent domain across IOMMU instances, we'd have to make sure that VMID is available on all IOMMU instances. There comes the limitation and potential resource starvation, so it's not ideal.
Baolu told me that Intel may have the same: different domain IDs on different IOMMUs; multiple IOMMU instances on one chip: https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@lin... So, I think we are having the same situation here.
Adding another vIOMMU wrapper on the other hand can allow us to allocate different VMIDs/DIDs for different IOMMUs.
Thanks Nic
On 2024/9/26 02:55, Nicolin Chen wrote:
On Wed, Sep 25, 2024 at 06:30:20PM +0800, Yi Liu wrote:
Hi Nic,
On 2024/8/28 00:59, Nicolin Chen wrote:
This series introduces a new VIOMMU infrastructure and related ioctls.
IOMMUFD has been using the HWPT infrastructure for all cases, including a nested IO page table support. Yet, there're limitations for an HWPT-based structure to support some advanced HW-accelerated features, such as CMDQV on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU environment, it is not straightforward for nested HWPTs to share the same parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone.
could you elaborate a bit for the last sentence in the above paragraph?
Stage-2 HWPT/domain on ARM holds a VMID. If we share the parent domain across IOMMU instances, we'd have to make sure that VMID is available on all IOMMU instances. There comes the limitation and potential resource starvation, so it's not ideal.
got it.
Baolu told me that Intel may have the same: different domain IDs on different IOMMUs; multiple IOMMU instances on one chip: https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@lin... So, I think we are having the same situation here.
yes, it's called iommu unit or dmar. A typical Intel server can have multiple iommu units. But like Baolu mentioned in that thread, the intel iommu driver maintains separate domain ID spaces for iommu units, which means a given iommu domain has different DIDs when associated with different iommu units. So intel side is not suffering from this so far.
Adding another vIOMMU wrapper on the other hand can allow us to allocate different VMIDs/DIDs for different IOMMUs.
that looks like generalizing the association of the iommu domain and the iommu units?
On Thu, Sep 26, 2024 at 04:47:02PM +0800, Yi Liu wrote:
On 2024/9/26 02:55, Nicolin Chen wrote:
On Wed, Sep 25, 2024 at 06:30:20PM +0800, Yi Liu wrote:
Hi Nic,
On 2024/8/28 00:59, Nicolin Chen wrote:
This series introduces a new VIOMMU infrastructure and related ioctls.
IOMMUFD has been using the HWPT infrastructure for all cases, including a nested IO page table support. Yet, there're limitations for an HWPT-based structure to support some advanced HW-accelerated features, such as CMDQV on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU environment, it is not straightforward for nested HWPTs to share the same parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone.
could you elaborate a bit for the last sentence in the above paragraph?
Stage-2 HWPT/domain on ARM holds a VMID. If we share the parent domain across IOMMU instances, we'd have to make sure that VMID is available on all IOMMU instances. There comes the limitation and potential resource starvation, so it's not ideal.
got it.
Baolu told me that Intel may have the same: different domain IDs on different IOMMUs; multiple IOMMU instances on one chip: https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@lin... So, I think we are having the same situation here.
yes, it's called iommu unit or dmar. A typical Intel server can have multiple iommu units. But like Baolu mentioned in that thread, the intel iommu driver maintains separate domain ID spaces for iommu units, which means a given iommu domain has different DIDs when associated with different iommu units. So intel side is not suffering from this so far.
An ARM SMMU has its own VMID pool as well. The suffering comes from associating VMIDs to one shared parent S2 domain.
Is a DID per S1 nested domain or per parent S2? If it is per S2, I think the same suffering applies when we share the S2 across IOMMU instances?
Adding another vIOMMU wrapper on the other hand can allow us to allocate different VMIDs/DIDs for different IOMMUs.
that looks like generalizing the association of the iommu domain and the iommu units?
A vIOMMU is a presentation/object of a physical IOMMU instance in a VM. This presentation gives a VMM some capability to take advantage of some of the HW resources of the physical IOMMU:
- a VMID is a small HW resource to tag the cache;
- a vIOMMU invalidation allows to access device cache that's not straightforwardly done via an S1 HWPT invalidation;
- a virtual device presentation of a physical device in a VM, related to the vIOMMU in the VM, which contains some VM-level info: virtual device ID, security level (ARM CCA), etc.;
- Non-PRI IRQ forwarding to the guest VM;
- HW-accelerated virtualization resource: vCMDQ, AMD VIOMMU;
Thanks Nicolin
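Tying the invalidation bullet above to code, a hedged sketch of a driver-side handler that uses the per-vIOMMU vdev_id list (the helper name and the entry layout are assumptions based on the cover letter's description):

    /* Sketch: vIOMMU-based device cache (e.g. ATC) invalidation handler */
    static int example_viommu_cache_invalidate(struct iommufd_viommu *viommu,
                                               struct iommu_user_data_array *array)
    {
            struct example_inv_entry inv;   /* hypothetical per-entry layout */
            struct device *dev;

            /* ... copy one entry from 'array' into 'inv' ... */

            /* Translate the guest's virtual device ID via the vdev_id list */
            dev = iommufd_viommu_find_dev(viommu, inv.vsid); /* assumed helper */
            if (!dev)
                    return -EINVAL;
            /* ... queue CMDQ_OP_ATC_INV for the physical device behind 'dev' ... */
            return 0;
    }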
On 9/27/24 4:03 AM, Nicolin Chen wrote:
On Thu, Sep 26, 2024 at 04:47:02PM +0800, Yi Liu wrote:
On 2024/9/26 02:55, Nicolin Chen wrote:
On Wed, Sep 25, 2024 at 06:30:20PM +0800, Yi Liu wrote:
Hi Nic,
On 2024/8/28 00:59, Nicolin Chen wrote:
This series introduces a new VIOMMU infrastructure and related ioctls.
IOMMUFD has been using the HWPT infrastructure for all cases, including a nested IO page table support. Yet, there're limitations for an HWPT-based structure to support some advanced HW-accelerated features, such as CMDQV on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU environment, it is not straightforward for nested HWPTs to share the same parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone.
could you elaborate a bit for the last sentence in the above paragraph?
Stage-2 HWPT/domain on ARM holds a VMID. If we share the parent domain across IOMMU instances, we'd have to make sure that VMID is available on all IOMMU instances. There comes the limitation and potential resource starvation, so it's not ideal.
got it.
Baolu told me that Intel may have the same: different domain IDs on different IOMMUs; multiple IOMMU instances on one chip: https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@lin... So, I think we are having the same situation here.
yes, it's called iommu unit or dmar. A typical Intel server can have multiple iommu units. But like Baolu mentioned in that thread, the intel iommu driver maintains separate domain ID spaces for iommu units, which means a given iommu domain has different DIDs when associated with different iommu units. So intel side is not suffering from this so far.
An ARM SMMU has its own VMID pool as well. The suffering comes from associating VMIDs to one shared parent S2 domain.
Is a DID per S1 nested domain or per parent S2? If it is per S2, I think the same suffering applies when we share the S2 across IOMMU instances?
It's per S1 nested domain in current VT-d design. It's simple but lacks sharing of DID within a VM. We probably will change this later.
Thanks, baolu
On 2024/9/27 10:05, Baolu Lu wrote:
On 9/27/24 4:03 AM, Nicolin Chen wrote:
On Thu, Sep 26, 2024 at 04:47:02PM +0800, Yi Liu wrote:
On 2024/9/26 02:55, Nicolin Chen wrote:
On Wed, Sep 25, 2024 at 06:30:20PM +0800, Yi Liu wrote:
Hi Nic,
On 2024/8/28 00:59, Nicolin Chen wrote:
This series introduces a new VIOMMU infrastructure and related ioctls.
IOMMUFD has been using the HWPT infrastructure for all cases, including a nested IO page table support. Yet, there're limitations for an HWPT-based structure to support some advanced HW-accelerated features, such as CMDQV on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU environment, it is not straightforward for nested HWPTs to share the same parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone.
could you elaborate a bit for the last sentence in the above paragraph?
Stage-2 HWPT/domain on ARM holds a VMID. If we share the parent domain across IOMMU instances, we'd have to make sure that VMID is available on all IOMMU instances. There comes the limitation and potential resource starvation, so it's not ideal.
got it.
Baolu told me that Intel may have the same: different domain IDs on different IOMMUs; multiple IOMMU instances on one chip: https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@lin... So, I think we are having the same situation here.
yes, it's called iommu unit or dmar. A typical Intel server can have multiple iommu units. But like Baolu mentioned in that thread, the intel iommu driver maintains separate domain ID spaces for iommu units, which means a given iommu domain has different DIDs when associated with different iommu units. So intel side is not suffering from this so far.
An ARM SMMU has its own VMID pool as well. The suffering comes from associating VMIDs to one shared parent S2 domain.
Is a DID per S1 nested domain or per parent S2? If it is per S2, I think the same suffering applies when we share the S2 across IOMMU instances?
It's per S1 nested domain in current VT-d design. It's simple but lacks sharing of DID within a VM. We probably will change this later.
Could you share a bit more about this? I hope it is not going to share the DID if the S1 nested domains share the same S2 hwpt. For first-stage caches, the tag is PASID, DID and address. If both PASID and DID are the same, then there is a cache conflict. And the typical scenario is the gIOVA case, which uses the RIDPASID. :)
On 2024/9/27 04:03, Nicolin Chen wrote:
On Thu, Sep 26, 2024 at 04:47:02PM +0800, Yi Liu wrote:
On 2024/9/26 02:55, Nicolin Chen wrote:
On Wed, Sep 25, 2024 at 06:30:20PM +0800, Yi Liu wrote:
Hi Nic,
On 2024/8/28 00:59, Nicolin Chen wrote:
This series introduces a new VIOMMU infrastructure and related ioctls.
IOMMUFD has been using the HWPT infrastructure for all cases, including a nested IO page table support. Yet, there're limitations for an HWPT-based structure to support some advanced HW-accelerated features, such as CMDQV on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU environment, it is not straightforward for nested HWPTs to share the same parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone.
could you elaborate a bit for the last sentence in the above paragraph?
Stage-2 HWPT/domain on ARM holds a VMID. If we share the parent domain across IOMMU instances, we'd have to make sure that VMID is available on all IOMMU instances. There comes the limitation and potential resource starvation, so it's not ideal.
got it.
Baolu told me that Intel may have the same: different domain IDs on different IOMMUs; multiple IOMMU instances on one chip: https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@lin... So, I think we are having the same situation here.
yes, it's called iommu unit or dmar. A typical Intel server can have multiple iommu units. But like Baolu mentioned in that thread, the intel iommu driver maintains separate domain ID spaces for iommu units, which means a given iommu domain has different DIDs when associated with different iommu units. So intel side is not suffering from this so far.
An ARM SMMU has its own VMID pool as well. The suffering comes from associating VMIDs to one shared parent S2 domain.
Is this because the VMID is tied to an S2 domain?
Is a DID per S1 nested domain or per parent S2? If it is per S2, I think the same suffering applies when we share the S2 across IOMMU instances?
per S1 I think. The iotlb efficiency is low as S2 caches would be tagged with different DIDs even if the page table is the same. :)
Adding another vIOMMU wrapper on the other hand can allow us to allocate different VMIDs/DIDs for different IOMMUs.
that looks like generalizing the association of the iommu domain and the iommu units?
A vIOMMU is a presentation/object of a physical IOMMU instance in a VM.
a slice of a physical IOMMU. is it? and you treat S2 hwpt as a resource of the physical IOMMU as well.
This presentation gives a VMM some capability to take advantage of some of HW resource of the physical IOMMU:
- a VMID is a small HW resource to tag the cache;
- a vIOMMU invalidation allows to access device cache that's not straightforwardly done via an S1 HWPT invalidation;
- a virtual device presentation of a physical device in a VM, related to the vIOMMU in the VM, which contains some VM-level info: virtual device ID, security level (ARM CCA), and etc;
- Non-PRI IRQ forwarding to the guest VM;
- HW-accelerated virtualization resource: vCMDQ, AMD VIOMMU;
might be helpful to draw a diagram to show what the vIOMMU obj contains.:)
On Fri, Sep 27, 2024 at 01:54:45PM +0800, Yi Liu wrote:
Baolu told me that Intel may have the same: different domain IDs on different IOMMUs; multiple IOMMU instances on one chip: https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@lin... So, I think we are having the same situation here.
yes, it's called iommu unit or dmar. A typical Intel server can have multiple iommu units. But like Baolu mentioned in that thread, the intel iommu driver maintains separate domain ID spaces for iommu units, which means a given iommu domain has different DIDs when associated with different iommu units. So intel side is not suffering from this so far.
An ARM SMMU has its own VMID pool as well. The suffering comes from associating VMIDs to one shared parent S2 domain.
Is this because the VMID is tied to an S2 domain?
On ARM, yes. VMID is a part of S2 domain stuff.
Is a DID per S1 nested domain or per parent S2? If it is per S2, I think the same suffering applies when we share the S2 across IOMMU instances?
per S1 I think. The iotlb efficiency is low as S2 caches would be tagged with different DIDs even if the page table is the same. :)
On ARM, the stage-1 is tagged with an ASID (Address Space ID) while the stage-2 is tagged with a VMID. Then an invalidation for a nested S1 domain must require the VMID from the S2. The ASID may be also required if the invalidation is specific to that address space (otherwise, broadcast per VMID.)
I feel these two might act somehow similarly to the two DIDs during nested translations?
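To make the tagging concrete, a small illustrative sketch using the existing arm-smmu-v3 command encoding:

    /* Broadcast: invalidate everything tagged with this VMID (covers the S2) */
    struct arm_smmu_cmdq_ent s2_cmd = {
            .opcode = CMDQ_OP_TLBI_S12_VMALL,
            .tlbi   = { .vmid = vmid },
    };

    /* Scoped: invalidate one guest address space, tagged VMID + ASID */
    struct arm_smmu_cmdq_ent s1_cmd = {
            .opcode = CMDQ_OP_TLBI_NH_ASID,
            .tlbi   = { .vmid = vmid, .asid = asid },
    };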
Adding another vIOMMU wrapper on the other hand can allow us to allocate different VMIDs/DIDs for different IOMMUs.
that looks like generalizing the association of the iommu domain and the iommu units?
A vIOMMU is a presentation/object of a physical IOMMU instance in a VM.
a slice of a physical IOMMU. is it?
Yes. When multiple nested translations happen at the same time, IOMMU (just like a CPU) is shared by these slices. And so is an invalidation queue executing multiple requests.
Perhaps calling it a slice sounds more accurate, as I guess all the confusion comes from the name "vIOMMU" that might be thought to be a user space object/instance that likely holds all virtual stuff like stage-1 HWPT or so?
and you treat S2 hwpt as a resource of the physical IOMMU as well.
Yes. A parent HWPT (in the old days, we called it a "kernel-managed" HWPT) is not a user space thing. This belongs to a kernel owned object.
This presentation gives a VMM some capability to take advantage of some of HW resource of the physical IOMMU:
- a VMID is a small HW resource to tag the cache;
- a vIOMMU invalidation allows to access device cache that's not straightforwardly done via an S1 HWPT invalidation;
- a virtual device presentation of a physical device in a VM, related to the vIOMMU in the VM, which contains some VM-level info: virtual device ID, security level (ARM CCA), and etc;
- Non-PRI IRQ forwarding to the guest VM;
- HW-accelerated virtualization resource: vCMDQ, AMD VIOMMU;
might be helpful to draw a diagram to show what the vIOMMU obj contains.:)
That's what I plan to. Basically looks like:
device---->stage1--->[ viommu [s2_hwpt, vmid, virq, HW-acc, etc.] ]
Thanks Nic
On 2024/9/27 14:32, Nicolin Chen wrote:
Looks like the nested s1 caches are tagged with both ASID and VMID.
not quite the same. Is it possible for the ASID to be the same for stage-1? The Intel VT-d side can have the same pasid: for gIOVA, all devices use the same ridpasid. In the scenario I replied to Baolu with [1], we choose to use different DIDs to differentiate the caches for the two devices.
[1] https://lore.kernel.org/linux-iommu/4bc9bd20-5aae-440d-84fd-f530d0747c23@int...
yeah. Maybe this confusion partly comes from starting it with the cache invalidation as well. I failed to get why an S2 hwpt needs to be part of the vIOMMU obj at first glance.
ok. let's see your new doc.
On Fri, Sep 27, 2024 at 08:12:20PM +0800, Yi Liu wrote:
Both AMD and ARM have direct-to-VM queues for the iommu, and these queues have their DMA translated by the S2.
So their viommu HW concepts come along with a requirement that there be a fixed translation for the VM, which we model by attaching an S2 HWPT to the VIOMMU object, which gets linked into the IOMMU HW as the translation for the queue memory.
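A hedged userspace sketch of that modeling, using the uAPI names proposed in this series (the exact fields may differ in the final version; dev_id and s2_hwpt_id are placeholder IDs obtained earlier):

    struct iommu_viommu_alloc alloc = {
            .size    = sizeof(alloc),
            .type    = IOMMU_VIOMMU_TYPE_DEFAULT,
            .dev_id  = dev_id,      /* a device behind this physical IOMMU */
            .hwpt_id = s2_hwpt_id,  /* the fixed S2 translation for the VM */
    };

    if (ioctl(iommufd, IOMMU_VIOMMU_ALLOC, &alloc))
            err(1, "IOMMU_VIOMMU_ALLOC");
    /* Queue memory the vIOMMU HW touches is now translated by this S2 */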
Jason
On 2024/9/27 20:20, Jason Gunthorpe wrote:
ok, this explains why the S2 should be part of the vIOMMU obj.
Is the mapping of the S2 static? Or can it be unmapped by userspace?
On Sun, Sep 29, 2024 at 03:19:42PM +0800, Yi Liu wrote:
In principle it should be dynamic, but I think the vCMDQ stuff will struggle to do that.
Jason
On Tue, Oct 01, 2024 at 10:48:15AM -0300, Jason Gunthorpe wrote:
Yea. vCMDQ HW requires setting the physical address of the base of a queue in the VM's RAM space. If the S2 mapping changes (resulting in a different queue location in physical memory), the VMM should notify the kernel for a HW reconfiguration.
I wonder what all the use cases are that can cause a shifting of S2 mappings? VM migration? Any others?
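A hypothetical sketch of the constraint above: the queue base is resolved through the S2 once at setup time, so a later S2 remap of the queue pages would require redoing this and re-programming the HW. struct vcmdq and VCMDQ_BASE are made-up names; iommu_iova_to_phys() is the existing kernel API:

    static int vcmdq_set_base(struct vcmdq *vq, u64 queue_ipa)
    {
            /* Resolve the guest queue base through the S2 once */
            phys_addr_t pa = iommu_iova_to_phys(vq->s2_domain, queue_ipa);

            if (!pa)
                    return -ENOENT;
            writeq(pa, vq->regs + VCMDQ_BASE);  /* fixed until reprogrammed */
            return 0;
    }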
Thanks Nicolin
On Fri, Sep 27, 2024 at 08:12:20PM +0800, Yi Liu wrote:
Yea, my understanding is similar. If both stages are enabled for a nested translation, VMID is tagged for S1 cache too.
On ARM, each S1 domain (either a normal stage-1 PASID=0 domain or an SVA PASID>0 domain) has a unique ASID. So it's unlikely to have two identical ASIDs on the same vIOMMU, because the ASID pool is per IOMMU instance (whether physical or virtual).
With two vIOMMU instances, there might be the same ASIDs but they will be tagged with different VMIDs.
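So the effective cache tag is the (VMID, ASID) tuple; for example, under two vIOMMUs sharing one S2:

    vIOMMU0:  (VMID0, ASID1) -> S1 domain A
    vIOMMU1:  (VMID1, ASID1) -> S1 domain B   /* same ASID, no collision */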
Is "gIOVA" a type of invalidation that only uses "address" out of "PASID, DID and address"? I.e. PASID and DID are not provided via the invalidation request, so it's going to broadcast all viommus?
Thanks Nicolin
On 2024/9/28 04:44, Nicolin Chen wrote:
I see. Looks like ASID is not the PASID.
gIOVA is just a term vs. vSVA. Just want to differentiate it from vSVA. :) PASID and DID are still provided in the invalidation.
On Sun, Sep 29, 2024 at 03:16:55PM +0800, Yi Liu wrote:
It's not. PASID is called Substream ID in SMMU terms. It's used to index the PASID table. For cache invalidations, a PASID (ssid) is for ATC (dev cache) or PASID table entry invalidation only.
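For instance, the PASID table entry invalidation mentioned above maps to a CFGI_CD command in the arm-smmu-v3 driver's encoding; sid and ssid here are placeholders:

    struct arm_smmu_cmdq_ent cmd = {
            .opcode = CMDQ_OP_CFGI_CD,  /* invalidate one CD (PASID table entry) */
            .cfgi = {
                    .sid  = sid,        /* physical Stream ID of the device */
                    .ssid = ssid,       /* the PASID, i.e. Substream ID */
                    .leaf = true,
            },
    };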
I am still not getting this gIOVA. What does it do exactly vs. vSVA? And should RIDPASID be IOMMU_NO_PASID?
Nicolin
On 2024/10/1 05:59, Nicolin Chen wrote:
sure. Is there any relationship between PASID and ASID? Per the link below, ASID is used to tag the TLB entries of an application, so it's used in the SVA case, right?
https://developer.arm.com/documentation/102142/0100/Stage-2-translation
gIOVA is the IOVA in the guest. vSVA is just SVA in the guest. Maybe the confusion is about why we don't use vIOVA instead of gIOVA, is it? I think you are clear about IOVA vs. SVA. :)
yes, RIDPASID is IOMMU_NO_PASID, although the VT-d arch allows it to be non-IOMMU_NO_PASID.
On Wed, Oct 09, 2024 at 03:20:57PM +0800, Yi Liu wrote:
Unlike Intel and AMD, on ARM the IOTLB tag is entirely controlled by software. So the HW will look up the PASID and retrieve an ASID, then use that as a cache tag.
Intel and AMD will use the PASID as the cache tag.
As we've talked about several times, using the PASID directly as a cache tag robs the SW of optimization possibilities in some cases.
The extra ASID indirection allows the SW to always tag the same page table top pointer with the same ASID, regardless of what PASID it is assigned to, and guarantees IOTLB sharing.
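A minimal sketch of that indirection, assuming an xarray keyed by the page-table top pointer; this is illustrative, not the actual driver logic:

    static DEFINE_XARRAY(asid_by_pgd);
    static DEFINE_IDA(asid_ida);

    static int asid_for_pgtable(unsigned long pgd_phys)
    {
            void *entry = xa_load(&asid_by_pgd, pgd_phys);
            int asid;

            /* Same page table top: reuse the ASID so IOTLB entries are shared */
            if (entry)
                    return xa_to_value(entry);

            asid = ida_alloc_max(&asid_ida, 0xffff, GFP_KERNEL);
            if (asid < 0)
                    return asid;
            xa_store(&asid_by_pgd, pgd_phys, xa_mk_value(asid), GFP_KERNEL);
            return asid;
    }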
Jason