Nested translation is a hardware feature supported by many modern IOMMUs. It uses two stages of address translation (stage-1, stage-2) to reach the physical address. The stage-1 translation table is owned by userspace (e.g. by a guest OS), while the stage-2 table is owned by the kernel. Changes to the stage-1 translation table must be followed by an IOTLB invalidation.
Take Intel VT-d as an example: the stage-1 translation table is the I/O page table. As the diagram below shows, the guest I/O page table pointer in GPA (guest physical address) is passed to the host and used to perform the stage-1 address translation. Accordingly, modifications to present mappings in the guest I/O page table must be followed by an IOTLB invalidation.
    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest I/O page table      |
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush --+
    '-------------'                        |
    |             |                        V
    |             |           I/O page table pointer in GPA
    '-------------'
Guest
------| Shadow |---------------------------|--------
      v        v                           v
Host
    .-------------.  .------------------------.
    |   pIOMMU    |  | FS for GIOVA->GPA      |
    |             |  '------------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------.----------------------------------.
    |             |  | SS for GPA->HPA, unmanaged domain|
    |             |  '----------------------------------'
    '-------------'
Where:
 - FS = First stage page tables
 - SS = Second stage page tables

              <Intel VT-d Nested translation>
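To make the lookup order in the diagram concrete, here is a tiny, purely illustrative C model (not code from this series); it only shows that the guest-owned stage-1 output feeds the kernel-owned stage-2, while on real hardware the stage-1 table walk itself also goes through stage-2 since its pointers are GPAs:

/*
 * Illustration only -- not code from this series.
 */
typedef unsigned long long iova_t, gpa_t, hpa_t;

/* stand-ins for the real page table walks */
static gpa_t stage1_lookup(iova_t iova) { return (gpa_t)iova; }	/* guest FS walk */
static hpa_t stage2_lookup(gpa_t gpa)   { return (hpa_t)gpa; }	/* host SS walk  */

static hpa_t nested_lookup(iova_t iova)
{
	return stage2_lookup(stage1_lookup(iova));
}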
In IOMMUFD, every translation table is tracked by a hw_pagetable (hwpt), and each hwpt has an iommu_domain allocated from the iommu driver. So in this series hw_pagetable and iommu_domain mean the same thing unless noted otherwise. IOMMUFD already supports allocating a hw_pagetable that is linked with an IOAS. However, nesting requires IOMMUFD to allow allocating a hw_pagetable with driver-specific parameters, plus an interface to sync the stage-1 IOTLB, since the user owns the stage-1 translation table.
This series is based on the iommu hw info reporting series [1]. It first introduces a new iommu op for allocating domains with user data and an op for invalidating the stage-1 IOTLB, then extends the IOMMUFD internal infrastructure to accept user_data and a parent hwpt and relay them to the iommu core to allocate a user iommu_domain. After that, it extends the IOMMU_HWPT_ALLOC ioctl to accept user data and a stage-2 hwpt ID for hwpt allocation. Along with it, the IOMMU_HWPT_INVALIDATE ioctl is added to invalidate the stage-1 IOTLB, which is needed for user-managed hwpts. Selftests are added as well to cover the new ioctls.
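For orientation, below is a rough sketch of the resulting userspace flow. The wrapper helpers and the vendor-specific structs/constants are hypothetical stand-ins; the real ioctl layouts are the ones added to include/uapi/linux/iommufd.h by this series.

#include <linux/types.h>

/* Hypothetical wrappers around the IOMMU_HWPT_ALLOC / IOMMU_HWPT_INVALIDATE
 * ioctls added by this series (left unimplemented here). */
int hwpt_alloc(int iommufd, __u32 dev_id, __u32 pt_id, __u32 hwpt_type,
	       void *data, __u32 data_len, __u32 *out_hwpt_id);
int hwpt_invalidate(int iommufd, __u32 hwpt_id, void *data, __u32 data_len);

/* Made-up driver-specific payloads, for illustration only. */
struct vendor_hwpt_s1_data { __aligned_u64 gpgd; };	/* e.g. guest table GPA */
struct vendor_invalidate_data { __aligned_u64 addr; __aligned_u64 npages; };
#define VENDOR_HWPT_TYPE 1	/* stands in for a driver value of enum iommu_hwpt_type */

static void setup_nesting(int iommufd, __u32 dev_id, __u32 ioas_id)
{
	struct vendor_hwpt_s1_data s1 = {};
	struct vendor_invalidate_data inv = {};
	__u32 s2_hwpt, s1_hwpt;

	/* stage-2: kernel-managed hwpt backed by the IOAS */
	hwpt_alloc(iommufd, dev_id, ioas_id, 0 /* IOMMU_HWPT_TYPE_DEFAULT */,
		   NULL, 0, &s2_hwpt);

	/* stage-1: user-managed hwpt nested on the stage-2 hwpt */
	hwpt_alloc(iommufd, dev_id, s2_hwpt, VENDOR_HWPT_TYPE, &s1,
		   sizeof(s1), &s1_hwpt);

	/* ...attach the device to s1_hwpt and run the guest... */

	/* after the guest changes its stage-1 table: flush the stage-1 IOTLB */
	hwpt_invalidate(iommufd, s1_hwpt, &inv, sizeof(inv));
}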
Complete code can be found in [2]; QEMU code can be found in [3].
Lastly, this is a team effort together with Nicolin Chen and Lu Baolu. Thanks to them for the help. ^_^ Looking forward to your feedback.
[1] https://lore.kernel.org/linux-iommu/20230724105936.107042-1-yi.l.liu@intel.c... [2] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting [3] https://github.com/yiliu1765/qemu/tree/wip/iommufd_rfcv4_nesting
Change log:
v3:
 - Add new uAPI things in alphabetical order
 - Pass in "enum iommu_hwpt_type hwpt_type" to op->domain_alloc_user for
   sanity, replacing the previous op->domain_alloc_user_data_len solution
 - Return ERR_PTR from domain_alloc_user instead of NULL
 - Only add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation (Kevin)
 - Add IOMMU_RESV_IOVA_RANGES to report resv iova ranges to userspace hence
   userspace is able to exclude the ranges in the stage-1 HWPT (e.g. guest
   I/O page table). (Kevin)
 - Add selftest coverage for the new IOMMU_RESV_IOVA_RANGES ioctl
 - Minor changes per Kevin's inputs
v2: https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.co...
 - Add union iommu_domain_user_data to include all user data structures to
   avoid passing void * in kernel APIs.
 - Add iommu op to return user data length for user domain allocation
 - Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
 - Store the invalidation data length in
   iommu_domain_ops::cache_invalidate_user_data_len
 - Convert cache_invalidate_user op to be int instead of void
 - Remove @data_type in struct iommu_hwpt_invalidate
 - Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08 of v1
v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.c...
Thanks, Yi Liu
Lu Baolu (2):
  iommu: Add new iommu op to create domains owned by userspace
  iommu: Add nested domain support

Nicolin Chen (6):
  iommufd/hw_pagetable: Do not populate user-managed hw_pagetables
  iommufd: Only enforce IOMMU_RESV_SW_MSI when attaching user-managed HWPT
  iommufd/selftest: Add domain_alloc_user() support in iommu mock
  iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with user data
  iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
  iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl

Yi Liu (9):
  iommufd/hw_pagetable: Use domain_alloc_user op for domain allocation
  iommufd: Pass in hwpt_type/parent/user_data to iommufd_hw_pagetable_alloc()
  iommufd: Add IOMMU_RESV_IOVA_RANGES
  iommufd: IOMMU_HWPT_ALLOC allocation with user data
  iommufd: Add IOMMU_HWPT_INVALIDATE
  iommufd/selftest: Add a helper to get test device
  iommufd/selftest: Add IOMMU_TEST_OP_DEV_[ADD|DEL]_RESERVED to add/del reserved regions to selftest device
  iommufd/selftest: Add .get_resv_regions() for mock_dev
  iommufd/selftest: Add coverage for IOMMU_RESV_IOVA_RANGES
 drivers/iommu/iommufd/device.c                |   9 +-
 drivers/iommu/iommufd/hw_pagetable.c          | 181 +++++++++++-
 drivers/iommu/iommufd/io_pagetable.c          |   5 +-
 drivers/iommu/iommufd/iommufd_private.h       |  20 +-
 drivers/iommu/iommufd/iommufd_test.h          |  36 +++
 drivers/iommu/iommufd/main.c                  |  59 +++-
 drivers/iommu/iommufd/selftest.c              | 266 ++++++++++++++++--
 include/linux/iommu.h                         |  34 +++
 include/uapi/linux/iommufd.h                  |  96 ++++++-
 tools/testing/selftests/iommu/iommufd.c       | 224 ++++++++++++++-
 tools/testing/selftests/iommu/iommufd_utils.h |  70 +++++
 11 files changed, 958 insertions(+), 42 deletions(-)
From: Lu Baolu baolu.lu@linux.intel.com
Introduce a new iommu_domain op to create domains owned by userspace, e.g. through iommufd. These domains have a few different properties compared to kernel-owned domains:
- They may be UNMANAGED domains, but created with special parameters. For instance aperture size changes/number of levels, different IOPTE formats, or other things necessary to make a vIOMMU work
- We have to track all the memory allocations with GFP_KERNEL_ACCOUNT to make the cgroup sandbox stronger
- Device-specialty domains, such as NESTED domains can be created by iommufd.
The new op clearly says the domain is being created by IOMMUFD, that the domain is intended for userspace use, and it provides a way to pass a driver specific uAPI structure to customize the created domain to exactly what the vIOMMU userspace driver requires.
iommu drivers that cannot support VFIO/IOMMUFD should not support this op. This includes any driver that cannot provide a fully functional UNMANAGED domain.
This op chooses to make the special parameters opaque to the core. This suits the current usage model where accessing any of the IOMMU device special parameters does require a userspace driver that matches the kernel driver. If a need for common parameters, implemented similarly by several drivers, arises then there is room in the design to grow a generic parameter set as well.
This new op for now is only supposed to be used by iommufd, hence no wrapper for it. iommufd would call the callback directly. As for domain free, iommufd would use iommu_domain_free().
enum iommu_hwpt_type is defined to differentiate the user parameters used by different usages. For allocations that don't require a user parameter, IOMMU_HWPT_TYPE_DEFAULT is defined for backward compatibility. Other types will be added by future iommu vendor driver extensions.
Suggested-by: Jason Gunthorpe jgg@nvidia.com
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Co-developed-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 include/linux/iommu.h        | 20 ++++++++++++++++++++
 include/uapi/linux/iommufd.h |  8 ++++++++
 2 files changed, 28 insertions(+)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 4199e13b34e6..ecbec2627b63 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -226,6 +226,15 @@ struct iommu_iotlb_gather {
 	bool			queued;
 };
 
+/*
+ * The user data to allocate a specific type user iommu domain
+ *
+ * This includes the corresponding driver data structures in
+ * include/uapi/linux/iommufd.h.
+ */
+union iommu_domain_user_data {
+};
+
 /**
  * struct iommu_ops - iommu ops and capabilities
  * @capable: check capability
@@ -235,6 +244,13 @@ struct iommu_iotlb_gather {
  *            after use. Return the data buffer if success, or ERR_PTR on
  *            failure.
  * @domain_alloc: allocate iommu domain
+ * @domain_alloc_user: allocate a user iommu domain corresponding to the input
+ *                     @hwpt_type that is defined as enum iommu_hwpt_type in the
+ *                     include/uapi/linux/iommufd.h. A returning domain will be
+ *                     set to an IOMMU_DOMAIN_NESTED type, upon valid @user_data
+ *                     and @parent that is a kernel-managed domain. Otherwise,
+ *                     it will be set to an IOMMU_DOMAIN_UNMANAGED type. Return
+ *                     ERR_PTR on allocation failure.
  * @probe_device: Add device to iommu driver handling
  * @release_device: Remove device from iommu driver handling
  * @probe_finalize: Do final setup work after the device is added to an IOMMU
@@ -275,6 +291,10 @@ struct iommu_ops {
 
 	/* Domain allocation and freeing by the iommu driver */
 	struct iommu_domain *(*domain_alloc)(unsigned iommu_domain_type);
+	struct iommu_domain *(*domain_alloc_user)(struct device *dev,
+						   enum iommu_hwpt_type hwpt_type,
+						   struct iommu_domain *parent,
+						   const union iommu_domain_user_data *user_data);
 
 	struct iommu_device *(*probe_device)(struct device *dev);
 	void (*release_device)(struct device *dev);
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 4295362e7b44..d38bc54fd5f2 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -347,6 +347,14 @@ struct iommu_vfio_ioas {
 };
 #define IOMMU_VFIO_IOAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VFIO_IOAS)
 
+/**
+ * enum iommu_hwpt_type - IOMMU HWPT Type
+ * @IOMMU_HWPT_TYPE_DEFAULT: default
+ */
+enum iommu_hwpt_type {
+	IOMMU_HWPT_TYPE_DEFAULT,
+};
+
 /**
  * struct iommu_hwpt_alloc - ioctl(IOMMU_HWPT_ALLOC)
  * @size: sizeof(struct iommu_hwpt_alloc)
From: Yi Liu yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:04 PM
- @domain_alloc_user: allocate a user iommu domain corresponding to
the input
@hwpt_type that is defined as enum iommu_hwpt_type in the
include/uapi/linux/iommufd.h. A returning domain will be
set to an IOMMU_DOMAIN_NESTED type, upon valid
@user_data
and @parent that is a kernel-managed domain. Otherwise,
it will be set to an IOMMU_DOMAIN_UNMANAGED type.
Return
ERR_PTR on allocation failure.
"If @user_data is valid and @parent points to a kernel-managed domain, the returning domain is set to IOMMU_DOMAIN_NESTED type. Otherwise it is set to IOMMU_DOMAIN_UNMANAGED type."
Reviewed-by: Kevin Tian kevin.tian@intel.com
On Fri, Jul 28, 2023 at 09:37:21AM +0000, Tian, Kevin wrote:
From: Yi Liu yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:04 PM
- @domain_alloc_user: allocate a user iommu domain corresponding to
the input
@hwpt_type that is defined as enum iommu_hwpt_type in the
include/uapi/linux/iommufd.h. A returning domain will be
set to an IOMMU_DOMAIN_NESTED type, upon valid
@user_data
and @parent that is a kernel-managed domain. Otherwise,
it will be set to an IOMMU_DOMAIN_UNMANAGED type.
Return
ERR_PTR on allocation failure.
"If @user_data is valid and @parent points to a kernel-managed domain, the returning domain is set to IOMMU_DOMAIN_NESTED type. Otherwise it is set to IOMMU_DOMAIN_UNMANAGED type."
"If @user_data is valid and @parent points to a kernel-managed domain, then the returned domain must be the IOMMU_DOMAIN_NESTED type. Otherwise the returned domain is IOMMU_DOMAIN_UNMANAGED."
Notice the detail that this API expects the driver to set the type and fully initialize the domain, including the generic iommu_domain struct, which is different than alloc_domain.
When we implement this in drivers we should tidy this so all the alloc flows fully initialize the domain internally.
Jason
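To make the "fully initialize" expectation concrete, here is a sketch of a driver-side implementation; the driver struct and ops below are invented, and IOMMU_DOMAIN_NESTED comes from the next patch. The point is only that the callback picks the type and fills in the generic iommu_domain before returning:

/* Illustrative only -- not from any real driver. */
struct mydrv_domain {
	struct iommu_domain domain;
	/* ...driver-specific page table state... */
};

static const struct iommu_domain_ops mydrv_domain_ops;	/* hypothetical */

static struct iommu_domain *
mydrv_domain_alloc_user(struct device *dev, enum iommu_hwpt_type hwpt_type,
			struct iommu_domain *parent,
			const union iommu_domain_user_data *user_data)
{
	struct mydrv_domain *dom;

	if (hwpt_type != IOMMU_HWPT_TYPE_DEFAULT && !user_data)
		return ERR_PTR(-EINVAL);

	dom = kzalloc(sizeof(*dom), GFP_KERNEL_ACCOUNT);
	if (!dom)
		return ERR_PTR(-ENOMEM);

	/* Unlike ->domain_alloc, the driver fills in the generic part too. */
	if (user_data && parent && parent->type == IOMMU_DOMAIN_UNMANAGED)
		dom->domain.type = IOMMU_DOMAIN_NESTED;
	else
		dom->domain.type = IOMMU_DOMAIN_UNMANAGED;
	dom->domain.ops = &mydrv_domain_ops;
	/* ...consume user_data / parent to set up the driver state... */

	return &dom->domain;
}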
From: Jason Gunthorpe jgg@nvidia.com Sent: Saturday, July 29, 2023 12:56 AM
On Fri, Jul 28, 2023 at 09:37:21AM +0000, Tian, Kevin wrote:
From: Yi Liu yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:04 PM
- @domain_alloc_user: allocate a user iommu domain corresponding to
the input
@hwpt_type that is defined as enum iommu_hwpt_type in the
include/uapi/linux/iommufd.h. A returning domain will be
set to an IOMMU_DOMAIN_NESTED type, upon valid
@user_data
and @parent that is a kernel-managed domain. Otherwise,
it will be set to an IOMMU_DOMAIN_UNMANAGED type.
Return
ERR_PTR on allocation failure.
"If @user_data is valid and @parent points to a kernel-managed domain, the returning domain is set to IOMMU_DOMAIN_NESTED type. Otherwise it is set to IOMMU_DOMAIN_UNMANAGED type."
"If @user_data is valid and @parent points to a kernel-managed domain, then the returned domain must be the IOMMU_DOMAIN_NESTED type. Otherwise the returned domain is IOMMU_DOMAIN_UNMANAGED."
Notice the detail that this API expects the driver to set the type and fully initialize the domain, including the generic iommu_domain struct, which is different than alloc_domain.
When we implement this in drivers we should tidy this so all the alloc flows fully initialize the domain internally.
Yes. this should be documented in the kdoc. Is it?
Regards, Yi Liu
On Mon, Jul 31, 2023 at 12:44:25PM +0000, Liu, Yi L wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Saturday, July 29, 2023 12:56 AM
On Fri, Jul 28, 2023 at 09:37:21AM +0000, Tian, Kevin wrote:
From: Yi Liu yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:04 PM
- @domain_alloc_user: allocate a user iommu domain corresponding to
the input
@hwpt_type that is defined as enum iommu_hwpt_type in the
include/uapi/linux/iommufd.h. A returning domain will be
set to an IOMMU_DOMAIN_NESTED type, upon valid
@user_data
and @parent that is a kernel-managed domain. Otherwise,
it will be set to an IOMMU_DOMAIN_UNMANAGED type.
Return
ERR_PTR on allocation failure.
"If @user_data is valid and @parent points to a kernel-managed domain, the returning domain is set to IOMMU_DOMAIN_NESTED type. Otherwise it is set to IOMMU_DOMAIN_UNMANAGED type."
"If @user_data is valid and @parent points to a kernel-managed domain, then the returned domain must be the IOMMU_DOMAIN_NESTED type. Otherwise the returned domain is IOMMU_DOMAIN_UNMANAGED."
Notice the detail that this API expects the driver to set the type and fully initialize the domain, including the generic iommu_domain struct, which is different than alloc_domain.
When we implement this in drivers we should tidy this so all the alloc flows fully initialize the domain internally.
Yes. this should be documented in the kdoc. Is it?
Yeah, maybe it should be mentioned
Jason
On Fri, Jul 28, 2023 at 01:56:05PM -0300, Jason Gunthorpe wrote:
On Fri, Jul 28, 2023 at 09:37:21AM +0000, Tian, Kevin wrote:
From: Yi Liu yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:04 PM
- @domain_alloc_user: allocate a user iommu domain corresponding to
the input
@hwpt_type that is defined as enum iommu_hwpt_type in the
include/uapi/linux/iommufd.h. A returning domain will be
set to an IOMMU_DOMAIN_NESTED type, upon valid
@user_data
and @parent that is a kernel-managed domain. Otherwise,
it will be set to an IOMMU_DOMAIN_UNMANAGED type.
Return
ERR_PTR on allocation failure.
"If @user_data is valid and @parent points to a kernel-managed domain, the returning domain is set to IOMMU_DOMAIN_NESTED type. Otherwise it is set to IOMMU_DOMAIN_UNMANAGED type."
"If @user_data is valid and @parent points to a kernel-managed domain, then the returned domain must be the IOMMU_DOMAIN_NESTED type. Otherwise the returned domain is IOMMU_DOMAIN_UNMANAGED."
Notice the detail that this API expects the driver to set the type and fully initialize the domain, including the generic iommu_domain struct, which is different than alloc_domain.
When we implement this in drivers we should tidy this so all the alloc flows fully initialize the domain internally.
Changed to:
+ * @domain_alloc_user: allocate a user iommu domain corresponding to the input
+ *                     @hwpt_type that is defined as enum iommu_hwpt_type in the
+ *                     include/uapi/linux/iommufd.h. Different from domain_alloc
+ *                     it requires iommu driver to fully initialize a new domain
+ *                     including the generic iommu_domain struct. Upon success,
+ *                     if the @user_data is valid and the @parent points to a
+ *                     kernel-managed domain, the type of the new domain must be
+ *                     IOMMU_DOMAIN_NESTED, otherwise be IOMMU_DOMAIN_UNMANAGED.
+ *                     Upon failure, ERR_PTR must be returned.
Thanks Nic
From: Lu Baolu baolu.lu@linux.intel.com
Introduce a new domain type for a user I/O page table, which is nested on top of another address space represented by an UNMANAGED domain. The mappings of a nested domain are managed by user space software, so there is no need for map/unmap callbacks. But updates to the PTEs in the nested domain's page table must be propagated to the caches on both the IOMMU (IOTLB) and the devices (DevTLB).
The nested domain is allocated by the domain_alloc_user op, and attached to the device through the existing iommu_attach_device/group() interfaces.
A new domain op, named cache_invalidate_user is added for the userspace to flush the hardware caches for a nested domain through iommufd. No wrapper for it, as it's only supposed to be used by iommufd.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Co-developed-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 include/linux/iommu.h | 11 +++++++++++
 1 file changed, 11 insertions(+)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index ecbec2627b63..b8f09330b64e 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -66,6 +66,9 @@ struct iommu_domain_geometry {
 
 #define __IOMMU_DOMAIN_SVA	(1U << 4)  /* Shared process address space */
 
+#define __IOMMU_DOMAIN_NESTED	(1U << 5)  /* User-managed address space nested
+					      on a stage-2 translation */
+
 #define IOMMU_DOMAIN_ALLOC_FLAGS ~__IOMMU_DOMAIN_DMA_FQ
 /*
  * This are the possible domain-types
@@ -92,6 +95,7 @@ struct iommu_domain_geometry {
 					 __IOMMU_DOMAIN_DMA_API |	\
 					 __IOMMU_DOMAIN_DMA_FQ)
 #define IOMMU_DOMAIN_SVA	(__IOMMU_DOMAIN_SVA)
+#define IOMMU_DOMAIN_NESTED	(__IOMMU_DOMAIN_NESTED)
 
 struct iommu_domain {
 	unsigned type;
@@ -350,6 +354,10 @@ struct iommu_ops {
  * @iotlb_sync_map: Sync mappings created recently using @map to the hardware
  * @iotlb_sync: Flush all queued ranges from the hardware TLBs and empty flush
  *              queue
+ * @cache_invalidate_user: Flush hardware TLBs caching user space IO mappings
+ * @cache_invalidate_user_data_len: Defined length of input user data for the
+ *                                  cache_invalidate_user op, being sizeof the
+ *                                  structure in include/uapi/linux/iommufd.h
  * @iova_to_phys: translate iova to physical address
  * @enforce_cache_coherency: Prevent any kind of DMA from bypassing IOMMU_CACHE,
  *                           including no-snoop TLPs on PCIe or other platform
@@ -379,6 +387,9 @@ struct iommu_domain_ops {
 			      size_t size);
 	void (*iotlb_sync)(struct iommu_domain *domain,
 			   struct iommu_iotlb_gather *iotlb_gather);
+	int (*cache_invalidate_user)(struct iommu_domain *domain,
+				     void *user_data);
+	const size_t cache_invalidate_user_data_len;
 
 	phys_addr_t (*iova_to_phys)(struct iommu_domain *domain,
 				    dma_addr_t iova);
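For context, a sketch of how iommufd might drive the new op; the iommufd-side helper is illustrative, and only the op and the cache_invalidate_user_data_len constant come from the patch above:

/* Illustrative only: relay a user invalidation request to the driver. */
static int hwpt_cache_invalidate_user(struct iommufd_hw_pagetable *hwpt,
				      void __user *user_ptr)
{
	const struct iommu_domain_ops *ops = hwpt->domain->ops;
	size_t len = ops->cache_invalidate_user_data_len;
	u8 data[128];	/* in the series this is a union of per-driver structs */

	if (!ops->cache_invalidate_user || len > sizeof(data))
		return -EOPNOTSUPP;

	if (copy_from_user(data, user_ptr, len))
		return -EFAULT;

	return ops->cache_invalidate_user(hwpt->domain, data);
}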
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:04 PM
From: Lu Baolu baolu.lu@linux.intel.com
Introduce a new domain type for a user I/O page table, which is nested on top of another user space address represented by a UNMANAGED domain. The mappings of a nested domain are managed by user space software, therefore it's unnecessary to have map/unmap callbacks. But the updates of the PTEs in the nested domain page table must be propagated to the caches on both IOMMU (IOTLB) and devices (DevTLB).
The nested domain is allocated by the domain_alloc_user op, and attached to the device through the existing iommu_attach_device/group() interfaces.
A new domain op, named cache_invalidate_user is added for the userspace to flush the hardware caches for a nested domain through iommufd. No wrapper for it, as it's only supposed to be used by iommufd.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com Co-developed-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
On Mon, Jul 24, 2023 at 04:03:51AM -0700, Yi Liu wrote:
@@ -350,6 +354,10 @@ struct iommu_ops {
- @iotlb_sync_map: Sync mappings created recently using @map to the hardware
- @iotlb_sync: Flush all queued ranges from the hardware TLBs and empty flush
queue
- @cache_invalidate_user: Flush hardware TLBs caching user space IO mappings
- @cache_invalidate_user_data_len: Defined length of input user data for the
cache_invalidate_user op, being sizeof the
structure in include/uapi/linux/iommufd.h
- @iova_to_phys: translate iova to physical address
- @enforce_cache_coherency: Prevent any kind of DMA from bypassing IOMMU_CACHE,
including no-snoop TLPs on PCIe or other platform
@@ -379,6 +387,9 @@ struct iommu_domain_ops { size_t size); void (*iotlb_sync)(struct iommu_domain *domain, struct iommu_iotlb_gather *iotlb_gather);
- int (*cache_invalidate_user)(struct iommu_domain *domain,
void *user_data);
If we are doing const unions, then this void * should also be a const union.
Jason
On Fri, Jul 28, 2023 at 01:59:28PM -0300, Jason Gunthorpe wrote:
On Mon, Jul 24, 2023 at 04:03:51AM -0700, Yi Liu wrote:
@@ -350,6 +354,10 @@ struct iommu_ops {
- @iotlb_sync_map: Sync mappings created recently using @map to the hardware
- @iotlb_sync: Flush all queued ranges from the hardware TLBs and empty flush
queue
- @cache_invalidate_user: Flush hardware TLBs caching user space IO mappings
- @cache_invalidate_user_data_len: Defined length of input user data for the
cache_invalidate_user op, being sizeof the
structure in include/uapi/linux/iommufd.h
- @iova_to_phys: translate iova to physical address
- @enforce_cache_coherency: Prevent any kind of DMA from bypassing IOMMU_CACHE,
including no-snoop TLPs on PCIe or other platform
@@ -379,6 +387,9 @@ struct iommu_domain_ops { size_t size); void (*iotlb_sync)(struct iommu_domain *domain, struct iommu_iotlb_gather *iotlb_gather);
- int (*cache_invalidate_user)(struct iommu_domain *domain,
void *user_data);
If we are doing const unions, then this void * should also be a const union.
Unlike iommu_domain_user_data, which is a union on its own, all invalidate user data structures are added to union ucmd_buffer. It feels a bit weird to cross-reference "union ucmd_buffer" and to pass the name "ucmd_buffer" into this cache_invalidate_user.
Any suggestion?
Thanks Nic
From: Nicolin Chen nicolinc@nvidia.com Sent: Thursday, August 3, 2023 10:37 AM
On Fri, Jul 28, 2023 at 01:59:28PM -0300, Jason Gunthorpe wrote:
On Mon, Jul 24, 2023 at 04:03:51AM -0700, Yi Liu wrote:
@@ -350,6 +354,10 @@ struct iommu_ops {
- @iotlb_sync_map: Sync mappings created recently using @map to the hardware
- @iotlb_sync: Flush all queued ranges from the hardware TLBs and empty flush
queue
- @cache_invalidate_user: Flush hardware TLBs caching user space IO mappings
- @cache_invalidate_user_data_len: Defined length of input user data for the
cache_invalidate_user op, being sizeof the
structure in include/uapi/linux/iommufd.h
- @iova_to_phys: translate iova to physical address
- @enforce_cache_coherency: Prevent any kind of DMA from bypassing
IOMMU_CACHE,
including no-snoop TLPs on PCIe or other platform
@@ -379,6 +387,9 @@ struct iommu_domain_ops { size_t size); void (*iotlb_sync)(struct iommu_domain *domain, struct iommu_iotlb_gather *iotlb_gather);
- int (*cache_invalidate_user)(struct iommu_domain *domain,
void *user_data);
If we are doing const unions, then this void * should also be a const union.
Unlike iommu_domain_user_data is a union on its own, all invalidate user data structures are added to union ucmd_buffer. It feels a bit weird to cross reference "union ucmd_buffer" and to pass the naming "ucmd_buffer" in this cache_invalidate_user.
Any suggestion?
I think we can have a union like iommu_user_cache_invalidate; every new data structure should be put in this union, and this union is put in the ucmd_buffer.
Regards, Yi Liu
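A sketch of that shape (the per-driver structure and its fields below are placeholders; only the wrapping idea is from the suggestion above):

/* Placeholder payload for one driver; real names would come from the uAPI. */
struct iommu_hwpt_vtd_s1_invalidate {
	__aligned_u64 addr;
	__aligned_u64 npages;
};

/* Every new per-driver invalidation structure gets added to this union... */
union iommu_user_cache_invalidate {
	struct iommu_hwpt_vtd_s1_invalidate vtd;
};

/* ...the op then takes the const union instead of a void pointer... */
struct iommu_domain_ops {
	/* ... */
	int (*cache_invalidate_user)(struct iommu_domain *domain,
				     const union iommu_user_cache_invalidate *user_data);
	const size_t cache_invalidate_user_data_len;
	/* ... */
};

/* ...and union ucmd_buffer simply embeds union iommu_user_cache_invalidate. */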
On Thu, Aug 03, 2023 at 02:53:34AM +0000, Liu, Yi L wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Thursday, August 3, 2023 10:37 AM
On Fri, Jul 28, 2023 at 01:59:28PM -0300, Jason Gunthorpe wrote:
On Mon, Jul 24, 2023 at 04:03:51AM -0700, Yi Liu wrote:
@@ -350,6 +354,10 @@ struct iommu_ops {
- @iotlb_sync_map: Sync mappings created recently using @map to the hardware
- @iotlb_sync: Flush all queued ranges from the hardware TLBs and empty flush
queue
- @cache_invalidate_user: Flush hardware TLBs caching user space IO mappings
- @cache_invalidate_user_data_len: Defined length of input user data for the
cache_invalidate_user op, being sizeof the
structure in include/uapi/linux/iommufd.h
- @iova_to_phys: translate iova to physical address
- @enforce_cache_coherency: Prevent any kind of DMA from bypassing
IOMMU_CACHE,
including no-snoop TLPs on PCIe or other platform
@@ -379,6 +387,9 @@ struct iommu_domain_ops { size_t size); void (*iotlb_sync)(struct iommu_domain *domain, struct iommu_iotlb_gather *iotlb_gather);
- int (*cache_invalidate_user)(struct iommu_domain *domain,
void *user_data);
If we are doing const unions, then this void * should also be a const union.
Unlike iommu_domain_user_data is a union on its own, all invalidate user data structures are added to union ucmd_buffer. It feels a bit weird to cross reference "union ucmd_buffer" and to pass the naming "ucmd_buffer" in this cache_invalidate_user.
Any suggestion?
I think we can have a union like iommu_user_cache_invalidate, every new data structures should be put in this union, and this union is put in the ucmd_buffer.
Ah, that should do the job.
Thanks!
This makes IOMMUFD use the domain_alloc_user op for iommu_domain creation, as IOMMUFD needs to support iommu_domain allocation with parameters from userspace for nesting support. If the iommu driver doesn't provide the domain_alloc_user callback, it falls back to iommu_domain_alloc().
Suggested-by: Jason Gunthorpe jgg@nvidia.com
Reviewed-by: Lu Baolu baolu.lu@linux.intel.com
Co-developed-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/iommufd/hw_pagetable.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index cf2c1504e20d..1371e96653b2 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -5,6 +5,7 @@
 #include <linux/iommu.h>
 #include <uapi/linux/iommufd.h>
 
+#include "../iommu-priv.h"
 #include "iommufd_private.h"
 
 void iommufd_hw_pagetable_destroy(struct iommufd_object *obj)
@@ -74,6 +75,7 @@ struct iommufd_hw_pagetable *
 iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 			   struct iommufd_device *idev, bool immediate_attach)
 {
+	const struct iommu_ops *ops = dev_iommu_ops(idev->dev);
 	struct iommufd_hw_pagetable *hwpt;
 	int rc;
 
@@ -88,10 +90,21 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 	refcount_inc(&ioas->obj.users);
 	hwpt->ioas = ioas;
 
-	hwpt->domain = iommu_domain_alloc(idev->dev->bus);
-	if (!hwpt->domain) {
-		rc = -ENOMEM;
-		goto out_abort;
+	if (ops->domain_alloc_user) {
+		hwpt->domain = ops->domain_alloc_user(idev->dev,
+						      IOMMU_HWPT_TYPE_DEFAULT,
+						      NULL, NULL);
+		if (IS_ERR(hwpt->domain)) {
+			rc = PTR_ERR(hwpt->domain);
+			hwpt->domain = NULL;
+			goto out_abort;
+		}
+	} else {
+		hwpt->domain = iommu_domain_alloc(idev->dev->bus);
+		if (!hwpt->domain) {
+			rc = -ENOMEM;
+			goto out_abort;
+		}
 	}
 
 	/*
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:04 PM
This makes IOMMUFD to use iommu_domain_alloc_user() for iommu_domain creation as IOMMUFD needs to support iommu_domain allocation with parameters from userspace in nesting support. If the iommu driver doesn't provide domain_alloc_user callback then it falls back to use iommu_domain_alloc().
Suggested-by: Jason Gunthorpe jgg@nvidia.com Reviewed-by: Lu Baolu baolu.lu@linux.intel.com Co-developed-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
Nested translation has stage-1 and stage-2 page tables. A stage-1 page table is managed by user space, and it needs to work with a stage-2 page table, which is a parent hwpt for the stage-1 hwpt.
The iommu core already supports accepting a parent iommu_domain and user_data to allocate an iommu_domain. This makes iommufd_hw_pagetable_alloc() accept the parent hwpt and user_data and relay them to the iommu core, to prepare for supporting hw_pagetable allocation with user_data.
Also, add a parent pointer in struct iommufd_hw_pagetable for taking and releasing its refcount.
Co-developed-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com --- drivers/iommu/iommufd/device.c | 3 +- drivers/iommu/iommufd/hw_pagetable.c | 40 +++++++++++++++++++++---- drivers/iommu/iommufd/iommufd_private.h | 8 ++++- 3 files changed, 44 insertions(+), 7 deletions(-)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index 0deb2a2ec01a..97e4e5f5aca0 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -542,7 +542,8 @@ iommufd_device_auto_get_domain(struct iommufd_device *idev, }
hwpt = iommufd_hw_pagetable_alloc(idev->ictx, ioas, idev, - immediate_attach); + IOMMU_HWPT_TYPE_DEFAULT, + NULL, NULL, immediate_attach); if (IS_ERR(hwpt)) { destroy_hwpt = ERR_CAST(hwpt); goto out_unlock; diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c index 1371e96653b2..28122a49b529 100644 --- a/drivers/iommu/iommufd/hw_pagetable.c +++ b/drivers/iommu/iommufd/hw_pagetable.c @@ -24,6 +24,8 @@ void iommufd_hw_pagetable_destroy(struct iommufd_object *obj) if (hwpt->domain) iommu_domain_free(hwpt->domain);
+ if (hwpt->parent) + refcount_dec(&hwpt->parent->obj.users); refcount_dec(&hwpt->ioas->obj.users); }
@@ -61,6 +63,9 @@ int iommufd_hw_pagetable_enforce_cc(struct iommufd_hw_pagetable *hwpt) * @ictx: iommufd context * @ioas: IOAS to associate the domain with * @idev: Device to get an iommu_domain for + * @hwpt_type: Requested type of hw_pagetable + * @parent: Optional parent HWPT to associate with + * @user_data: Optional user_data pointer * @immediate_attach: True if idev should be attached to the hwpt * * Allocate a new iommu_domain and return it as a hw_pagetable. The HWPT @@ -73,14 +78,24 @@ int iommufd_hw_pagetable_enforce_cc(struct iommufd_hw_pagetable *hwpt) */ struct iommufd_hw_pagetable * iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas, - struct iommufd_device *idev, bool immediate_attach) + struct iommufd_device *idev, + enum iommu_hwpt_type hwpt_type, + struct iommufd_hw_pagetable *parent, + union iommu_domain_user_data *user_data, + bool immediate_attach) { const struct iommu_ops *ops = dev_iommu_ops(idev->dev); + struct iommu_domain *parent_domain = NULL; struct iommufd_hw_pagetable *hwpt; int rc;
lockdep_assert_held(&ioas->mutex);
+ if (parent && !user_data) + return ERR_PTR(-EINVAL); + if (user_data && !ops->domain_alloc_user) + return ERR_PTR(-EOPNOTSUPP); + hwpt = iommufd_object_alloc(ictx, hwpt, IOMMUFD_OBJ_HW_PAGETABLE); if (IS_ERR(hwpt)) return hwpt; @@ -89,11 +104,15 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas, /* Pairs with iommufd_hw_pagetable_destroy() */ refcount_inc(&ioas->obj.users); hwpt->ioas = ioas; + if (parent) { + hwpt->parent = parent; + parent_domain = parent->domain; + refcount_inc(&parent->obj.users); + }
if (ops->domain_alloc_user) { - hwpt->domain = ops->domain_alloc_user(idev->dev, - IOMMU_HWPT_TYPE_DEFAULT, - NULL, NULL); + hwpt->domain = ops->domain_alloc_user(idev->dev, hwpt_type, + parent_domain, user_data); if (IS_ERR(hwpt->domain)) { rc = PTR_ERR(hwpt->domain); hwpt->domain = NULL; @@ -107,6 +126,15 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas, } }
+ /* It must be either NESTED or UNMANAGED, depending on parent_domain */ + if (WARN_ON_ONCE((parent_domain && + hwpt->domain->type != IOMMU_DOMAIN_NESTED) || + (!parent_domain && + hwpt->domain->type != IOMMU_DOMAIN_UNMANAGED))) { + rc = -EINVAL; + goto out_abort; + } + /* * Set the coherency mode before we do iopt_table_add_domain() as some * iommus have a per-PTE bit that controls it and need to decide before @@ -168,7 +196,9 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd) }
mutex_lock(&ioas->mutex); - hwpt = iommufd_hw_pagetable_alloc(ucmd->ictx, ioas, idev, false); + hwpt = iommufd_hw_pagetable_alloc(ucmd->ictx, ioas, idev, + IOMMU_HWPT_TYPE_DEFAULT, + NULL, NULL, false); if (IS_ERR(hwpt)) { rc = PTR_ERR(hwpt); goto out_unlock; diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index dba730129b8c..90dcf4041530 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -8,6 +8,7 @@ #include <linux/xarray.h> #include <linux/refcount.h> #include <linux/uaccess.h> +#include <linux/iommu.h>
struct iommu_domain; struct iommu_group; @@ -243,6 +244,7 @@ int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd); */ struct iommufd_hw_pagetable { struct iommufd_object obj; + struct iommufd_hw_pagetable *parent; struct iommufd_ioas *ioas; struct iommu_domain *domain; bool auto_domain : 1; @@ -254,7 +256,11 @@ struct iommufd_hw_pagetable {
struct iommufd_hw_pagetable * iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas, - struct iommufd_device *idev, bool immediate_attach); + struct iommufd_device *idev, + enum iommu_hwpt_type hwpt_type, + struct iommufd_hw_pagetable *parent, + union iommu_domain_user_data *user_data, + bool immediate_attach); int iommufd_hw_pagetable_enforce_cc(struct iommufd_hw_pagetable *hwpt); int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt, struct iommufd_device *idev);
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:04 PM
Nested translation has stage-1 and stage-2 page tables. A stage-1 page table is managed by user space, and it needs to work with a stage-2 page table, which is a parent hwpt for the stage-1 hwpt.
iommu core already supports accepting parent iommu_domain and user_data to allocate an iommu_domain. This makes iommufd_hw_pagetable_alloc() to accept the parent hwpt and user_data, and relays them to iommu core, to prepare for supporting hw_pagetable allocation with user_data.
Also, add a parent pointer in struct iommufd_hw_pagetable for taking and releasing its refcount.
Co-developed-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
From: Nicolin Chen nicolinc@nvidia.com
A user-managed hw_pagetable does not need to get populated, since it is managed by a guest OS. Move the iopt_table_add_domain and list_add_tail calls into a helper, where the hwpt pointer will be redirected to its hwpt->parent if it's available.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/iommufd/hw_pagetable.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 28122a49b529..c7301cf0e85a 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -58,6 +58,24 @@ int iommufd_hw_pagetable_enforce_cc(struct iommufd_hw_pagetable *hwpt)
 	return 0;
 }
 
+static int iommufd_hw_pagetable_link_ioas(struct iommufd_hw_pagetable *hwpt)
+{
+	int rc;
+
+	/*
+	 * Only a parent hwpt needs to be linked to the IOAS. And a hwpt->parent
+	 * must be linked to the IOAS already, when it's being allocated.
+	 */
+	if (hwpt->parent)
+		return 0;
+
+	rc = iopt_table_add_domain(&hwpt->ioas->iopt, hwpt->domain);
+	if (rc)
+		return rc;
+	list_add_tail(&hwpt->hwpt_item, &hwpt->ioas->hwpt_list);
+	return 0;
+}
+
 /**
  * iommufd_hw_pagetable_alloc() - Get an iommu_domain for a device
  * @ictx: iommufd context
@@ -160,10 +178,9 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 		goto out_abort;
 	}
 
-	rc = iopt_table_add_domain(&hwpt->ioas->iopt, hwpt->domain);
+	rc = iommufd_hw_pagetable_link_ioas(hwpt);
 	if (rc)
 		goto out_detach;
-	list_add_tail(&hwpt->hwpt_item, &hwpt->ioas->hwpt_list);
 	return hwpt;
 
 out_detach:
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:04 PM
From: Nicolin Chen nicolinc@nvidia.com
A user-managed hw_pagetable does not need to get populated, since it is managed by a guest OS. Move the iopt_table_add_domain and list_add_tail calls into a helper, where the hwpt pointer will be redirected to its hwpt->parent if it's available.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
From: Nicolin Chen nicolinc@nvidia.com
Reserved IOVA regions, except for the IOMMU_RESV_DIRECT_RELAXABLE type, should be excluded from the device's DMA address space. For a single-stage translation configuration, such reserved regions are excluded in the attaching HWPT, which is managed by the kernel. For a nested translation configuration there are two HWPTs (stage-1 and stage-2), and the reserved regions should be excluded in the stage-1 HWPT, which is managed by userspace. The current code always excludes the reserved regions in the kernel-managed stage-2 HWPT. This is incorrect. Instead, the reserved regions should be reported to userspace and excluded in the stage-1 HWPT.

Besides the above, the IOMMU_RESV_SW_MSI type region needs to be excluded in the stage-2 HWPT even in the nested translation configuration on ARM. So the IOMMU_RESV_SW_MSI type region should always be excluded in the kernel-managed HWPT.

This adds a boolean argument to enforce IOMMU_RESV_SW_MSI only. If the attaching HWPT is a user-managed one, hwpt->parent must be valid, so "!!hwpt->parent" evaluates to true and only the IOMMU_RESV_SW_MSI region is added to the stage-2 HWPT's reserved iovas.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com --- drivers/iommu/iommufd/device.c | 6 ++++-- drivers/iommu/iommufd/io_pagetable.c | 5 ++++- drivers/iommu/iommufd/iommufd_private.h | 3 ++- 3 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index 97e4e5f5aca0..c0917683097f 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -347,7 +347,8 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt, }
rc = iopt_table_enforce_dev_resv_regions(&hwpt->ioas->iopt, idev->dev, - &idev->igroup->sw_msi_start); + &idev->igroup->sw_msi_start, + !!hwpt->parent); if (rc) goto err_unlock;
@@ -445,7 +446,8 @@ iommufd_device_do_replace(struct iommufd_device *idev, if (hwpt->ioas != old_hwpt->ioas) { list_for_each_entry(cur, &igroup->device_list, group_item) { rc = iopt_table_enforce_dev_resv_regions( - &hwpt->ioas->iopt, cur->dev, NULL); + &hwpt->ioas->iopt, cur->dev, NULL, + !!hwpt->parent); if (rc) goto err_unresv; } diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c index 4d095115c2d0..e658261f44ed 100644 --- a/drivers/iommu/iommufd/io_pagetable.c +++ b/drivers/iommu/iommufd/io_pagetable.c @@ -1172,7 +1172,8 @@ void iopt_remove_access(struct io_pagetable *iopt, /* Narrow the valid_iova_itree to include reserved ranges from a device. */ int iopt_table_enforce_dev_resv_regions(struct io_pagetable *iopt, struct device *dev, - phys_addr_t *sw_msi_start) + phys_addr_t *sw_msi_start, + bool sw_msi_only) { struct iommu_resv_region *resv; LIST_HEAD(resv_regions); @@ -1198,6 +1199,8 @@ int iopt_table_enforce_dev_resv_regions(struct io_pagetable *iopt, num_sw_msi++; }
+ if (sw_msi_only && resv->type != IOMMU_RESV_SW_MSI) + continue; rc = iopt_reserve_iova(iopt, resv->start, resv->length - 1 + resv->start, dev); if (rc) diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 90dcf4041530..268ae0d5ae12 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -79,7 +79,8 @@ void iopt_table_remove_domain(struct io_pagetable *iopt, struct iommu_domain *domain); int iopt_table_enforce_dev_resv_regions(struct io_pagetable *iopt, struct device *dev, - phys_addr_t *sw_msi_start); + phys_addr_t *sw_msi_start, + bool sw_msi_only); int iopt_set_allow_iova(struct io_pagetable *iopt, struct rb_root_cached *allowed_iova); int iopt_reserve_iova(struct io_pagetable *iopt, unsigned long start,
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:04 PM

 	}
 
 	rc = iopt_table_enforce_dev_resv_regions(&hwpt->ioas->iopt, idev->dev,
-						 &idev->igroup->sw_msi_start);
+						 &idev->igroup->sw_msi_start,
+						 !!hwpt->parent);
 	if (rc)
 		goto err_unlock;
I prefer not to set the parent ioas on the hwpt in iommufd_hw_pagetable_alloc().

Then here the ioas can be retrieved from hwpt->parent, and it'd be pretty clear that in the nested case the sw_msi reservation happens in the parent, instead of pretending the stage-1 hwpt has an ioas too.
On Fri, Jul 28, 2023 at 10:02:36AM +0000, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:04 PM

 	}
 
 	rc = iopt_table_enforce_dev_resv_regions(&hwpt->ioas->iopt, idev->dev,
-						 &idev->igroup->sw_msi_start);
+						 &idev->igroup->sw_msi_start,
+						 !!hwpt->parent);
 	if (rc)
 		goto err_unlock;
I prefer to not setting parent ioas to hwpt in iommufd_hw_pagetable_alloc().
Yes, the prior patch didn't add it to the iopt, so it shouldn't have an ioas set. The NESTED domains don't have an IOAS almost by definition.
then here ioas can be retrieved from hwpt->parent and then it'd be pretty clear that in nested case the sw_msi reservation happens in the parent instead of pretending the stage-1 hwpt has an ioas too.
Yeah, I'm confused by this patch as well. Since there should be no IOAS for the NESTED why are we messing with resv_regions?
Jason
This reports a device's reserved IOVA regions to userspace. This is needed for nested translation, as userspace owns the stage-1 HWPT and needs to exclude the reserved IOVA regions in the stage-1 HWPT, hence excluding them from the device's DMA address space.
This can also be used to figure out allowed IOVAs of an IOAS.
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/iommufd/main.c | 54 ++++++++++++++++++++++++++++++++++++
 include/uapi/linux/iommufd.h | 46 ++++++++++++++++++++++++++++++
 2 files changed, 100 insertions(+)
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index bd3efc1d8509..510db114fc61 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -250,6 +250,57 @@ static int iommufd_get_hw_info(struct iommufd_ucmd *ucmd)
 	return rc;
 }
 
+static int iommufd_resv_iova_ranges(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_resv_iova_ranges *cmd = ucmd->cmd;
+	struct iommu_resv_iova_range __user *ranges;
+	struct iommu_resv_region *resv;
+	struct iommufd_device *idev;
+	LIST_HEAD(resv_regions);
+	u32 max_iovas;
+	int rc;
+
+	if (cmd->__reserved)
+		return -EOPNOTSUPP;
+
+	idev = iommufd_get_device(ucmd, cmd->dev_id);
+	if (IS_ERR(idev))
+		return PTR_ERR(idev);
+
+	max_iovas = cmd->num_iovas;
+	ranges = u64_to_user_ptr(cmd->resv_iovas);
+	cmd->num_iovas = 0;
+
+	iommu_get_resv_regions(idev->dev, &resv_regions);
+
+	list_for_each_entry(resv, &resv_regions, list) {
+		if (resv->type == IOMMU_RESV_DIRECT_RELAXABLE)
+			continue;
+		if (cmd->num_iovas < max_iovas) {
+			struct iommu_resv_iova_range elm = {
+				.start = resv->start,
+				.last = resv->length - 1 + resv->start,
+			};
+
+			if (copy_to_user(&ranges[cmd->num_iovas], &elm,
+					 sizeof(elm))) {
+				rc = -EFAULT;
+				goto out_put;
+			}
+		}
+		cmd->num_iovas++;
+	}
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+	if (rc)
+		goto out_put;
+	if (cmd->num_iovas > max_iovas)
+		rc = -EMSGSIZE;
+out_put:
+	iommu_put_resv_regions(idev->dev, &resv_regions);
+	iommufd_put_object(&idev->obj);
+	return rc;
+}
+
 static int iommufd_fops_open(struct inode *inode, struct file *filp)
 {
 	struct iommufd_ctx *ictx;
@@ -347,6 +398,7 @@ union ucmd_buffer {
 	struct iommu_ioas_map map;
 	struct iommu_ioas_unmap unmap;
 	struct iommu_option option;
+	struct iommu_resv_iova_ranges resv_ranges;
 	struct iommu_vfio_ioas vfio_ioas;
 #ifdef CONFIG_IOMMUFD_TEST
 	struct iommu_test_cmd test;
@@ -389,6 +441,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 length),
 	IOCTL_OP(IOMMU_OPTION, iommufd_option, struct iommu_option,
 		 val64),
+	IOCTL_OP(IOMMU_RESV_IOVA_RANGES, iommufd_resv_iova_ranges,
+		 struct iommu_resv_iova_ranges, resv_iovas),
 	IOCTL_OP(IOMMU_VFIO_IOAS, iommufd_vfio_ioas, struct iommu_vfio_ioas,
 		 __reserved),
 #ifdef CONFIG_IOMMUFD_TEST
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index d38bc54fd5f2..f2026cde2d64 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -47,6 +47,7 @@ enum {
 	IOMMUFD_CMD_VFIO_IOAS,
 	IOMMUFD_CMD_HWPT_ALLOC,
 	IOMMUFD_CMD_GET_HW_INFO,
+	IOMMUFD_CMD_RESV_IOVA_RANGES,
 };
 
 /**
@@ -422,4 +423,49 @@ struct iommu_hw_info {
 	__u32 __reserved;
 };
 #define IOMMU_GET_HW_INFO _IO(IOMMUFD_TYPE, IOMMUFD_CMD_GET_HW_INFO)
+
+/**
+ * struct iommu_resv_iova_range - ioctl(IOMMU_RESV_IOVA_RANGE)
+ * @start: First IOVA
+ * @last: Inclusive last IOVA
+ *
+ * An interval in IOVA space.
+ */
+struct iommu_resv_iova_range {
+	__aligned_u64 start;
+	__aligned_u64 last;
+};
+
+/**
+ * struct iommu_resv_iova_ranges - ioctl(IOMMU_RESV_IOVA_RANGES)
+ * @size: sizeof(struct iommu_resv_iova_ranges)
+ * @dev_id: device to read resv iova ranges for
+ * @num_iovas: Input/Output total number of resv ranges for the device
+ * @__reserved: Must be 0
+ * @resv_iovas: Pointer to the output array of struct iommu_resv_iova_range
+ *
+ * Query a device for ranges of reserved IOVAs. num_iovas will be set to the
+ * total number of iovas and the resv_iovas[] will be filled in as space
+ * permits.
+ *
+ * On input num_iovas is the length of the resv_iovas array. On output it is
+ * the total number of iovas filled in. The ioctl will return -EMSGSIZE and
+ * set num_iovas to the required value if num_iovas is too small. In this
+ * case the caller should allocate a larger output array and re-issue the
+ * ioctl.
+ *
+ * Under nested translation, userspace should query the reserved IOVAs for a
+ * given device, and report it to the stage-1 I/O page table owner to exclude
+ * the reserved IOVAs. The reserved IOVAs can also be used to figure out the
+ * allowed IOVA ranges for the IOAS that the device is attached to. For detail
+ * see ioctl IOMMU_IOAS_IOVA_RANGES.
+ */
+struct iommu_resv_iova_ranges {
+	__u32 size;
+	__u32 dev_id;
+	__u32 num_iovas;
+	__u32 __reserved;
+	__aligned_u64 resv_iovas;
+};
+#define IOMMU_RESV_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_RESV_IOVA_RANGES)
 #endif
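As a usage note, here is a minimal userspace sketch of the two-pass query protocol documented in the kdoc above; it assumes the uapi exactly as posted in this patch:

#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>	/* with this series applied */

/* Returns a malloc'ed array of 'nr' reserved ranges for dev_id, or NULL. */
static struct iommu_resv_iova_range *
get_resv_ranges(int iommufd, __u32 dev_id, __u32 *nr)
{
	struct iommu_resv_iova_ranges cmd = {
		.size = sizeof(cmd),
		.dev_id = dev_id,
		.num_iovas = 0,		/* first pass: only ask for the count */
	};
	struct iommu_resv_iova_range *ranges;

	if (ioctl(iommufd, IOMMU_RESV_IOVA_RANGES, &cmd) && errno != EMSGSIZE)
		return NULL;
	*nr = cmd.num_iovas;
	if (!*nr)
		return NULL;		/* no reserved ranges at all */

	ranges = calloc(cmd.num_iovas, sizeof(*ranges));
	if (!ranges)
		return NULL;

	/* second pass: cmd.num_iovas already holds the required array length */
	cmd.resv_iovas = (__u64)(uintptr_t)ranges;
	if (ioctl(iommufd, IOMMU_RESV_IOVA_RANGES, &cmd)) {
		free(ranges);
		return NULL;
	}
	*nr = cmd.num_iovas;
	return ranges;
}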
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:04 PM
This reports device's reserved IOVA regions to userspace. This is needed in the nested translation as userspace owns stage-1 HWPT, and userspace needs to exclude the reserved IOVA regions in the stage-1 HWPT hence exclude them in the device's DMA address space.
This can also be used to figure out allowed IOVAs of an IOAS.
We may need a special type to mark SW_MSI since it requires identity mapping in stage-1 instead of being reserved.
On Fri, Jul 28, 2023 at 10:07:58AM +0000, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:04 PM
This reports device's reserved IOVA regions to userspace. This is needed in the nested translation as userspace owns stage-1 HWPT, and userspace needs to exclude the reserved IOVA regions in the stage-1 HWPT hence exclude them in the device's DMA address space.
This can also be used to figure out allowed IOVAs of an IOAS.
We may need a special type to mark SW_MSI since it requires identity mapping in stage-1 instead of being reserved.
Only the kernel can do this, so there is no action for user space to take beyond knowing that it is not mappable IOVA.
The merit for "SW_MSI" may be to inform the rest of the system about the IOVA of the ITS page, but with the current situation that isn't required since only the kernel needs that information.
I think the long term way forward is to somehow arrange for the SW_MSI to not become mapped when creating the parent HWPT and instead cause the ITS page to be mapped through some explicit IOCTL.
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Saturday, July 29, 2023 1:17 AM
On Fri, Jul 28, 2023 at 10:07:58AM +0000, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:04 PM
This reports device's reserved IOVA regions to userspace. This is needed in the nested translation as userspace owns stage-1 HWPT, and
userspace
needs to exclude the reserved IOVA regions in the stage-1 HWPT hence exclude them in the device's DMA address space.
This can also be used to figure out allowed IOVAs of an IOAS.
We may need a special type to mark SW_MSI since it requires identity mapping in stage-1 instead of being reserved.
Only the kernel can do this, so there is no action for user space to take beyond knowing that is is not mappable IOVA.
The merit for "SW_MSI" may be to inform the rest of the system about the IOVA of the ITS page, but with the current situation that isn't required since only the kernel needs that information.
IIUC the guest kernel needs to know the "SW_MSI" region and then set up a 1:1 mapping for it in S1.
So Qemu needs to know and pass this information to the guest?
I think the long term way forward is to somehow arrange for the SW_MSI to not become mapped when creating the parent HWPT and instead cause the ITS page to be mapped through some explicit IOCTL.
yes this is a cleaner approach. Qemu selects the intermediate address of the vITS page and maps it to the physical ITS page in S2. Then the guest kernel just picks whatever "SW_MSI" address in S1 for the vITS as it does today on bare metal.
On Mon, Jul 31, 2023 at 06:14:50AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Saturday, July 29, 2023 1:17 AM
On Fri, Jul 28, 2023 at 10:07:58AM +0000, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:04 PM
This reports device's reserved IOVA regions to userspace. This is needed in the nested translation as userspace owns stage-1 HWPT, and
userspace
needs to exclude the reserved IOVA regions in the stage-1 HWPT hence exclude them in the device's DMA address space.
This can also be used to figure out allowed IOVAs of an IOAS.
We may need a special type to mark SW_MSI since it requires identity mapping in stage-1 instead of being reserved.
Only the kernel can do this, so there is no action for user space to take beyond knowing that is is not mappable IOVA.
The merit for "SW_MSI" may be to inform the rest of the system about the IOVA of the ITS page, but with the current situation that isn't required since only the kernel needs that information.
IIUC guest kernel needs to know the "SW_MSI" region and then setup an 1:1 mapping for it in S1.
Yes, but qemu hardcodes this and for some reason people thought that was a good idea back when.
I think the long term way forward is to somehow arrange for the SW_MSI to not become mapped when creating the parent HWPT and instead cause the ITS page to be mapped through some explicit IOCTL.
yes this is a cleaner approach. Qemu selects the intermediate address of vITS page and maps it to physical ITS page in S2. Then the guest kernel just pick whatever "SW_MSI" address in S1 to vITS as it does today on bare metal.
Right, so I've been inclined to minimize the amount of special stuff created for this way of doing the MSI and hope we can reach a better way sooner than later
Jason
On Mon, Jul 24, 2023 at 04:03:56AM -0700, Yi Liu wrote:
+/**
- struct iommu_resv_iova_ranges - ioctl(IOMMU_RESV_IOVA_RANGES)
- @size: sizeof(struct iommu_resv_iova_ranges)
- @dev_id: device to read resv iova ranges for
- @num_iovas: Input/Output total number of resv ranges for the device
- @__reserved: Must be 0
- @resv_iovas: Pointer to the output array of struct iommu_resv_iova_range
- Query a device for ranges of reserved IOVAs. num_iovas will be set to the
- total number of iovas and the resv_iovas[] will be filled in as space
- permits.
- On input num_iovas is the length of the resv_iovas array. On output it is
- the total number of iovas filled in. The ioctl will return -EMSGSIZE and
- set num_iovas to the required value if num_iovas is too small. In this
- case the caller should allocate a larger output array and re-issue the
- ioctl.
- Under nested translation, userspace should query the reserved IOVAs for a
- given device, and report it to the stage-1 I/O page table owner to exclude
- the reserved IOVAs. The reserved IOVAs can also be used to figure out the
- allowed IOVA ranges for the IOAS that the device is attached to. For detail
- see ioctl IOMMU_IOAS_IOVA_RANGES.
I'm not sure I like this, the other APIs here work with the *allowed* IOVAs, which is the inverse of this one which works with the *disallowed* IOVAs.
It means we can't take the output of this API and feed it into IOMMUFD_CMD_IOAS_ALLOW_IOVAS.. Though I suppose qemu isn't going to do this anyhow.
On the other hand, it is kind of hard to intersect an allowed list of multiple idevs into a single master list.
As it is, userspace will have to aggregate the list, sort it, merge adjacent overlapping reserved ranges then invert the list to get an allowed list. This is not entirely simple..
Did you already write an algorithm to do this in qemu someplace?
Anyhow, this should be split out from this series. It seems simple enough to merge it now if someone can confirm what qemu needs.
Jason
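For reference, one way userspace could do that aggregation: sort the reserved ranges collected from all devices, then sweep once to emit the complement as the allowed list (the sweep also absorbs overlapping ranges, so no separate merge pass is needed). This is purely an illustrative helper, not something from QEMU or this series:

#include <stdint.h>
#include <stdlib.h>

struct range { uint64_t start, last; };	/* inclusive, like iommu_resv_iova_range */

static int cmp_start(const void *a, const void *b)
{
	const struct range *x = a, *y = b;

	return x->start < y->start ? -1 : x->start > y->start;
}

/*
 * 'resv' holds the reserved ranges collected from all devices (sorted in
 * place). 'allowed' must have room for n + 1 entries and receives the
 * complement over [0, UINT64_MAX]. Returns the number of allowed ranges.
 */
static size_t invert_reserved(struct range *resv, size_t n, struct range *allowed)
{
	uint64_t cursor = 0;
	size_t i, out = 0;

	qsort(resv, n, sizeof(*resv), cmp_start);

	for (i = 0; i < n; i++) {
		if (resv[i].start > cursor) {
			/* the gap below this reserved range is allowed */
			allowed[out].start = cursor;
			allowed[out].last = resv[i].start - 1;
			out++;
		}
		if (resv[i].last == UINT64_MAX)
			return out;	/* nothing usable above this point */
		if (resv[i].last + 1 > cursor)
			cursor = resv[i].last + 1;
	}
	allowed[out].start = cursor;
	allowed[out].last = UINT64_MAX;
	return out + 1;
}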
From: Jason Gunthorpe jgg@nvidia.com Sent: Saturday, July 29, 2023 1:45 AM
On Mon, Jul 24, 2023 at 04:03:56AM -0700, Yi Liu wrote:
+/**
- struct iommu_resv_iova_ranges - ioctl(IOMMU_RESV_IOVA_RANGES)
- @size: sizeof(struct iommu_resv_iova_ranges)
- @dev_id: device to read resv iova ranges for
- @num_iovas: Input/Output total number of resv ranges for the device
- @__reserved: Must be 0
- @resv_iovas: Pointer to the output array of struct
iommu_resv_iova_range
- Query a device for ranges of reserved IOVAs. num_iovas will be set to
the
- total number of iovas and the resv_iovas[] will be filled in as space
- permits.
- On input num_iovas is the length of the resv_iovas array. On output it is
- the total number of iovas filled in. The ioctl will return -EMSGSIZE and
- set num_iovas to the required value if num_iovas is too small. In this
- case the caller should allocate a larger output array and re-issue the
- ioctl.
- Under nested translation, userspace should query the reserved IOVAs
for a
- given device, and report it to the stage-1 I/O page table owner to
exclude
- the reserved IOVAs. The reserved IOVAs can also be used to figure out
the
- allowed IOVA ranges for the IOAS that the device is attached to. For
detail
- see ioctl IOMMU_IOAS_IOVA_RANGES.
I'm not sure I like this, the other APIs here work with the *allowed* IOVAs, which is the inverse of this one which works with the *disallowed* IOVAs.
It means we can't take the output of this API and feed it into IOMMUFD_CMD_IOAS_ALLOW_IOVAS.. Though I suppose qemu isn't going to do this anyhow.
On the other hand, it is kind of hard to intersect an allowed list of multiple idevs into a single master list.
As it is, userspace will have to aggregate the list, sort it, merge adjacent overlapping reserved ranges then invert the list to get an allowed list. This is not entirely simple..
Did you already write an algorithm to do this in qemu someplace?
It's optional for Qemu to aggregate it for S2, given IOMMU_IOAS_IOVA_RANGES is still being used. If the only purpose of using this new cmd is to report per-device reserved ranges to the guest, then aggregation is not required.
Arguably IOMMU_IOAS_IOVA_RANGES becomes redundant with this new cmd. But it's already there and as you said it's actually more convenient to be used if the user doesn't care about per-device reserved ranges...
Anyhow, this should be split out from this series. It seems simple enough to merge it now if someone can confirm what qemu needs.
Jason
From: Tian, Kevin kevin.tian@intel.com Sent: Monday, July 31, 2023 2:22 PM
From: Jason Gunthorpe jgg@nvidia.com Sent: Saturday, July 29, 2023 1:45 AM
On Mon, Jul 24, 2023 at 04:03:56AM -0700, Yi Liu wrote:
+/**
- struct iommu_resv_iova_ranges - ioctl(IOMMU_RESV_IOVA_RANGES)
- @size: sizeof(struct iommu_resv_iova_ranges)
- @dev_id: device to read resv iova ranges for
- @num_iovas: Input/Output total number of resv ranges for the device
- @__reserved: Must be 0
- @resv_iovas: Pointer to the output array of struct
iommu_resv_iova_range
- Query a device for ranges of reserved IOVAs. num_iovas will be set to
the
- total number of iovas and the resv_iovas[] will be filled in as space
- permits.
- On input num_iovas is the length of the resv_iovas array. On output it is
- the total number of iovas filled in. The ioctl will return -EMSGSIZE and
- set num_iovas to the required value if num_iovas is too small. In this
- case the caller should allocate a larger output array and re-issue the
- ioctl.
- Under nested translation, userspace should query the reserved IOVAs
for a
- given device, and report it to the stage-1 I/O page table owner to
exclude
- the reserved IOVAs. The reserved IOVAs can also be used to figure out
the
- allowed IOVA ranges for the IOAS that the device is attached to. For
detail
- see ioctl IOMMU_IOAS_IOVA_RANGES.
I'm not sure I like this, the other APIs here work with the *allowed* IOVAs, which is the inverse of this one which works with the *disallowed* IOVAs.
It means we can't take the output of this API and feed it into IOMMUFD_CMD_IOAS_ALLOW_IOVAS.. Though I suppose qemu isn't going to do this anyhow.
On the other hand, it is kind of hard to intersect an allowed list of multiple idevs into a single master list.
As it is, userspace will have to aggregate the list, sort it, merge adjacent overlapping reserved ranges then invert the list to get an allowed list. This is not entirely simple..
Did you already write an algorithm to do this in qemu someplace?
Qemu is optional to aggregate it for S2 given IOMMU_IOAS_IOVA_RANGES is still being used. If the only purpose of using this new cmd is to report per-device reserved ranges to the guest then aggregation is not required.
Arguably IOMMU_IOAS_IOVA_RANGES becomes redundant with this new cmd. But it's already there and as you said it's actually more convenient to be used if the user doesn't care about per-device reserved ranges...
Yes, it's not simple, as userspace needs to do a lot of work to get the allowed iovas if multiple devices are attached to an IOAS.
Anyhow, this should be split out from this series. It seems simple enough to merge it now if someone can confirm what qemu needs.
Ok, so the reason is that this new ioctl can be used to figure out the allowed iovas. Right?
Regards, Yi Liu
On Mon, Jul 31, 2023 at 06:21:44AM +0000, Tian, Kevin wrote:
As it is, userspace will have to aggregate the list, sort it, merge adjacent overlapping reserved ranges then invert the list to get an allowed list. This is not entirely simple..
Did you already write an algorithm to do this in qemu someplace?
Qemu is optional to aggregate it for S2 given IOMMU_IOAS_IOVA_RANGES is still being used. If the only purpose of using this new cmd is to report per-device reserved ranges to the guest then aggregation is not required.
I don't think it is entirely optional.. If qemu doesn't track this, then it will have failures when attaching the S2 to the device. It needs to make sure it punches the right holes in the guest memory map to be compatible with the VFIO HW.
I suppose in reality the reserved regions are fairly predictable and probably always match the existing qemu memory map so you can ignore this and still work.
Plus most qemu cases don't deal with hotplug so you can build up the identity ioas with all the devices and then use IOMMU_IOAS_IOVA_RANGES as you say and still work.
Arguably IOMMU_IOAS_IOVA_RANGES becomes redundant with this new cmd. But it's already there and as you said it's actually more convenient to be used if the user doesn't care about per-device reserved ranges...
Yes and yes
So, I guess we should leave it like this?
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Monday, July 31, 2023 9:24 PM
On Mon, Jul 31, 2023 at 06:21:44AM +0000, Tian, Kevin wrote:
As it is, userspace will have to aggregate the list, sort it, merge adjacent overlapping reserved ranges then invert the list to get an allowed list. This is not entirely simple..
Did you already write an algorithm to do this in qemu someplace?
Qemu is optional to aggregate it for S2 given IOMMU_IOAS_IOVA_RANGES is still being used. If the only purpose of using this new cmd is to report per-device reserved ranges to the guest then aggregation is not required.
I don't think it is entirely optional.. If qemu doesn't track this, then it will have failures when attaching the S2 to the device. It needs to make sure it punches the right holes in the guest memory map to be compatible with the VFIO HW.
I suppose in reality the reserved regions are fairly predictable and probably always match the existing qemu memory map so you can ignore this and still work.
Plus most qemu cases don't deal with hotplug so you can build up the identity ioas with all the devices and then use IOMMU_IOAS_IOVA_RANGES as you say and still work.
Arguably IOMMU_IOAS_IOVA_RANGES becomes redundant with this new cmd. But it's already there and as you said it's actually more convenient to be used if the user doesn't care about per-device reserved ranges...
Yes and yes
So, I guess we should leave it like this?
Yes. Along with this discussion (including what you explained for sw_msi), let's abandon this new cmd and leave things as they are today.
On Tue, Aug 01, 2023 at 02:40:44AM +0000, Tian, Kevin wrote:
So, I guess we should leave it like this?
Yes. Along with this discussion (including what you explained for sw_msi) let's abandon this new cmd and leave it as today.
You sure? This makes it basically impossible to write a "correct" vmm that is aware of what the physical memory map must be early on
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Wednesday, August 2, 2023 2:23 AM
On Tue, Aug 01, 2023 at 02:40:44AM +0000, Tian, Kevin wrote:
So, I guess we should leave it like this?
Yes. Along with this discussion (including what you explained for sw_msi) let's abandon this new cmd and leave it as today.
You sure? This makes it basically impossible to write a "correct" vmm that is aware of what the physical memory map must be early on
emmm... I thought that's what you meant by "leave it like this", plus the fact that the existing VMM's memory layout happens to match the reserved regions. Nobody has complained about the lack of such an interface for years, so we may postpone supporting it until it's really required.
btw even if we add this new cmd now, getting the Qemu support to use the aggregated list when creating the guest memory map is not a simple task, given that currently vfio only passively acts on change notifications in the guest memory layout. It would require a new mechanism to enforce strict ordering (probe all vfio devices before creating the memory layout) and then inject vfio reserved regions into the layout.
Preferably let's not make it a hard dependency for this series.
On Wed, Aug 02, 2023 at 01:09:28AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Wednesday, August 2, 2023 2:23 AM
On Tue, Aug 01, 2023 at 02:40:44AM +0000, Tian, Kevin wrote:
So, I guess we should leave it like this?
Yes. Along with this discussion (including what you explained for sw_msi) let's abandon this new cmd and leave it as today.
You sure? This makes it basically impossible to write a "correct" vmm that is aware of what the physical memory map must be early on
emmm... I thought it's what you meant by "leave it like this" and the fact that existing VMM's memory layout happens to match the reserved regions. Nobody complains lacking of such a interface for years then we may postpone supporting it until it's really required.
btw even if we add this new cmd now, getting the Qemu support to use the aggregated list when creating the guest memory map is not a simple task given currently vfio only passively acts on change notifications in the guest memory layout. It requires a new mechanism to enforce strict order (probe all vfio devices before creating the memory layout) and then injects vfio reserved regions into the layout.
Preferably let's not making it a hard dependency for this series.
I see, if qemu won't use it then lets not do it
Jason
On Wed, Aug 02, 2023 at 01:09:28AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Wednesday, August 2, 2023 2:23 AM
On Tue, Aug 01, 2023 at 02:40:44AM +0000, Tian, Kevin wrote:
So, I guess we should leave it like this?
Yes. Along with this discussion (including what you explained for sw_msi) let's abandon this new cmd and leave it as today.
You sure? This makes it basically impossible to write a "correct" vmm that is aware of what the physical memory map must be early on
emmm... I thought it's what you meant by "leave it like this" and the fact that existing VMM's memory layout happens to match the reserved regions. Nobody complains lacking of such a interface for years then we may postpone supporting it until it's really required.
btw even if we add this new cmd now, getting the Qemu support to use the aggregated list when creating the guest memory map is not a simple task given currently vfio only passively acts on change notifications in the guest memory layout. It requires a new mechanism to enforce strict order (probe all vfio devices before creating the memory layout) and then injects vfio reserved regions into the layout.
Preferably let's not making it a hard dependency for this series.
Should we drop this and its selftest patch from this series?
Thanks Nic
From: Nicolin Chen nicolinc@nvidia.com Sent: Thursday, August 3, 2023 9:23 AM
On Wed, Aug 02, 2023 at 01:09:28AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Wednesday, August 2, 2023 2:23 AM
On Tue, Aug 01, 2023 at 02:40:44AM +0000, Tian, Kevin wrote:
So, I guess we should leave it like this?
Yes. Along with this discussion (including what you explained for
sw_msi)
let's abandon this new cmd and leave it as today.
You sure? This makes it basically impossible to write a "correct" vmm that is aware of what the physical memory map must be early on
emmm... I thought it's what you meant by "leave it like this" and the fact that existing VMM's memory layout happens to match the reserved regions. Nobody complains lacking of such a interface for years then we may postpone supporting it until it's really required.
btw even if we add this new cmd now, getting the Qemu support to use the aggregated list when creating the guest memory map is not a simple task given currently vfio only passively acts on change notifications in the guest memory layout. It requires a new mechanism to enforce strict order (probe all vfio devices before creating the memory layout) and then injects vfio reserved regions into the layout.
Preferably let's not making it a hard dependency for this series.
Should we drop this and its selftest patch from this series?
Yes, let's drop it.
On Thu, Aug 03, 2023 at 01:25:37AM +0000, Tian, Kevin wrote:
Preferably let's not making it a hard dependency for this series.
Should we drop this and its selftest patch from this series?
Yes. let's drop it.
Ack. Thanks!
IOMMU_HWPT_ALLOC already supports iommu_domain allocation for userspace, but it can only allocate hw_pagetables linked with an IOAS. There is a need to support hw_pagetable allocation with user-specified parameters. For example, in nested translation, the user needs to allocate a hw_pagetable for the stage-1 translation (e.g. a single I/O page table or a set of I/O page tables) with user data. It also needs to provide a stage-2 hw_pagetable which is linked to the GPA IOAS.
This extends IOMMU_HWPT_ALLOC to accept user-specified parameters and a hwpt ID in the @pt_id field. It can be used to allocate a user-managed stage-1 hwpt, which requires a parent hwpt pointing to the stage-2 translation.
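For illustration, a hedged userspace sketch of the two-step flow this enables: a kernel-managed stage-2 HWPT allocated from the GPA IOAS, then a user-managed stage-1 HWPT whose @pt_id is that stage-2 HWPT. The stage-1 data struct and the non-default hwpt type name are placeholders for whatever the driver-specific patches define; iommufd, dev_id and ioas_id come from earlier setup.

--------------------------------------------------------------------------
	struct iommu_hwpt_alloc s2 = {
		.size = sizeof(s2),
		.dev_id = dev_id,
		.pt_id = ioas_id,			/* IOAS -> kernel-managed HWPT */
		.hwpt_type = IOMMU_HWPT_TYPE_DEFAULT,	/* no user data needed */
	};
	if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &s2))
		return -errno;

	/* driver-defined fields, e.g. guest I/O page table pointer */
	struct iommu_hwpt_vtd_s1 s1_data = { 0 };	/* placeholder struct name */
	struct iommu_hwpt_alloc s1 = {
		.size = sizeof(s1),
		.dev_id = dev_id,
		.pt_id = s2.out_hwpt_id,		/* parent HWPT -> user-managed HWPT */
		.hwpt_type = IOMMU_HWPT_TYPE_VTD_S1,	/* placeholder type name */
		.data_len = sizeof(s1_data),
		.data_uptr = (uintptr_t)&s1_data,
	};
	if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &s1))
		return -errno;
--------------------------------------------------------------------------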
Co-developed-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com --- drivers/iommu/iommufd/hw_pagetable.c | 65 ++++++++++++++++++++++++---- drivers/iommu/iommufd/main.c | 2 +- include/uapi/linux/iommufd.h | 20 ++++++++- 3 files changed, 77 insertions(+), 10 deletions(-)
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c index c7301cf0e85a..97e4114226de 100644 --- a/drivers/iommu/iommufd/hw_pagetable.c +++ b/drivers/iommu/iommufd/hw_pagetable.c @@ -193,29 +193,75 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd) { + struct iommufd_hw_pagetable *hwpt, *parent = NULL; + union iommu_domain_user_data *data = NULL; struct iommu_hwpt_alloc *cmd = ucmd->cmd; - struct iommufd_hw_pagetable *hwpt; + struct iommufd_object *pt_obj; struct iommufd_device *idev; struct iommufd_ioas *ioas; - int rc; + int rc = 0;
if (cmd->flags || cmd->__reserved) return -EOPNOTSUPP; + if (!cmd->data_len && cmd->hwpt_type != IOMMU_HWPT_TYPE_DEFAULT) + return -EINVAL;
idev = iommufd_get_device(ucmd, cmd->dev_id); if (IS_ERR(idev)) return PTR_ERR(idev);
- ioas = iommufd_get_ioas(ucmd->ictx, cmd->pt_id); - if (IS_ERR(ioas)) { - rc = PTR_ERR(ioas); + pt_obj = iommufd_get_object(ucmd->ictx, cmd->pt_id, IOMMUFD_OBJ_ANY); + if (IS_ERR(pt_obj)) { + rc = -EINVAL; goto out_put_idev; }
+ switch (pt_obj->type) { + case IOMMUFD_OBJ_IOAS: + ioas = container_of(pt_obj, struct iommufd_ioas, obj); + break; + case IOMMUFD_OBJ_HW_PAGETABLE: + /* pt_id points HWPT only when hwpt_type is !IOMMU_HWPT_TYPE_DEFAULT */ + if (cmd->hwpt_type == IOMMU_HWPT_TYPE_DEFAULT) { + rc = -EINVAL; + goto out_put_pt; + } + + parent = container_of(pt_obj, struct iommufd_hw_pagetable, obj); + /* + * Cannot allocate user-managed hwpt linking to auto_created + * hwpt. If the parent hwpt is already a user-managed hwpt, + * don't allocate another user-managed hwpt linking to it. + */ + if (parent->auto_domain || parent->parent) { + rc = -EINVAL; + goto out_put_pt; + } + ioas = parent->ioas; + break; + default: + rc = -EINVAL; + goto out_put_pt; + } + + if (cmd->data_len) { + data = kzalloc(sizeof(*data), GFP_KERNEL); + if (!data) { + rc = -ENOMEM; + goto out_put_pt; + } + + rc = copy_struct_from_user(data, sizeof(*data), + u64_to_user_ptr(cmd->data_uptr), + cmd->data_len); + if (rc) + goto out_free_data; + } + mutex_lock(&ioas->mutex); hwpt = iommufd_hw_pagetable_alloc(ucmd->ictx, ioas, idev, - IOMMU_HWPT_TYPE_DEFAULT, - NULL, NULL, false); + cmd->hwpt_type, + parent, data, false); if (IS_ERR(hwpt)) { rc = PTR_ERR(hwpt); goto out_unlock; @@ -232,7 +278,10 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd) iommufd_object_abort_and_destroy(ucmd->ictx, &hwpt->obj); out_unlock: mutex_unlock(&ioas->mutex); - iommufd_put_object(&ioas->obj); +out_free_data: + kfree(data); +out_put_pt: + iommufd_put_object(pt_obj); out_put_idev: iommufd_put_object(&idev->obj); return rc; diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 510db114fc61..5f4420626421 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -426,7 +426,7 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = { IOCTL_OP(IOMMU_GET_HW_INFO, iommufd_get_hw_info, struct iommu_hw_info, __reserved), IOCTL_OP(IOMMU_HWPT_ALLOC, iommufd_hwpt_alloc, struct iommu_hwpt_alloc, - __reserved), + data_uptr), IOCTL_OP(IOMMU_IOAS_ALLOC, iommufd_ioas_alloc_ioctl, struct iommu_ioas_alloc, out_ioas_id), IOCTL_OP(IOMMU_IOAS_ALLOW_IOVAS, iommufd_ioas_allow_iovas, diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index f2026cde2d64..73bf9af91e99 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -364,12 +364,27 @@ enum iommu_hwpt_type { * @pt_id: The IOAS to connect this HWPT to * @out_hwpt_id: The ID of the new HWPT * @__reserved: Must be 0 + * @hwpt_type: One of enum iommu_hwpt_type + * @data_len: Length of the type specific data + * @data_uptr: User pointer to the type specific data * * Explicitly allocate a hardware page table object. This is the same object * type that is returned by iommufd_device_attach() and represents the * underlying iommu driver's iommu_domain kernel object. * - * A HWPT will be created with the IOVA mappings from the given IOAS. + * A kernel-managed HWPT will be created with the mappings from the given + * IOAS via the @pt_id. The @hwpt_type for this allocation can be set to + * either IOMMU_HWPT_TYPE_DEFAULT or a pre-defined type corresponding to + * an I/O page table type supported by the underlying IOMMU hardware. + * + * A user-managed HWPT will be created from a given parent HWPT via the + * @pt_id, in which the parent HWPT must be allocated previously via the + * same ioctl from a given IOAS (@pt_id). 
In this case, the @hwpt_type + * must be set to a pre-defined type corresponding to an I/O page table + * type supported by the underlying IOMMU hardware. + * + * If the @hwpt_type is set to IOMMU_HWPT_TYPE_DEFAULT, both the @data_len + * and the @data_uptr will be ignored. Otherwise, both must be given. */ struct iommu_hwpt_alloc { __u32 size; @@ -378,6 +393,9 @@ struct iommu_hwpt_alloc { __u32 pt_id; __u32 out_hwpt_id; __u32 __reserved; + __u32 hwpt_type; + __u32 data_len; + __aligned_u64 data_uptr; }; #define IOMMU_HWPT_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_ALLOC)
On Mon, Jul 24, 2023 at 04:03:57AM -0700, Yi Liu wrote:
- switch (pt_obj->type) {
- case IOMMUFD_OBJ_IOAS:
ioas = container_of(pt_obj, struct iommufd_ioas, obj);
break;
- case IOMMUFD_OBJ_HW_PAGETABLE:
/* pt_id points HWPT only when hwpt_type is !IOMMU_HWPT_TYPE_DEFAULT */
if (cmd->hwpt_type == IOMMU_HWPT_TYPE_DEFAULT) {
rc = -EINVAL;
goto out_put_pt;
}
parent = container_of(pt_obj, struct iommufd_hw_pagetable, obj);
/*
* Cannot allocate user-managed hwpt linking to auto_created
* hwpt. If the parent hwpt is already a user-managed hwpt,
* don't allocate another user-managed hwpt linking to it.
*/
if (parent->auto_domain || parent->parent) {
rc = -EINVAL;
goto out_put_pt;
}
ioas = parent->ioas;
Why do we set ioas here? I would think it should be NULL.
I think it is looking like a mistake to try and re-use iommufd_hw_pagetable_alloc() directly for the nested case. It should not have an IOAS argument, and it should not do enforce_cc or call the iopt_* functions.
So most of the function is removed. Probably better to make a new ioas-less function for the nesting case.
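For concreteness, one hedged shape such an ioas-less helper could take; the function name, the fields it touches and the domain_alloc_user() call signature are assumptions about this series, not merged code:

--------------------------------------------------------------------------
/* Hedged sketch: allocate a nested HWPT with no IOAS, no enforce_cc and
 * no iopt_* involvement. Names and signatures are assumptions.
 */
static struct iommufd_hw_pagetable *
iommufd_hw_pagetable_alloc_nested(struct iommufd_ctx *ictx,
				  struct iommufd_hw_pagetable *parent,
				  struct iommufd_device *idev,
				  enum iommu_hwpt_type hwpt_type,
				  union iommu_domain_user_data *user_data)
{
	const struct iommu_ops *ops = dev_iommu_ops(idev->dev);
	struct iommufd_hw_pagetable *hwpt;
	int rc;

	hwpt = iommufd_object_alloc(ictx, hwpt, IOMMUFD_OBJ_HW_PAGETABLE);
	if (IS_ERR(hwpt))
		return hwpt;

	hwpt->domain = ops->domain_alloc_user(idev->dev, hwpt_type,
					      parent->domain, user_data);
	if (IS_ERR(hwpt->domain)) {
		rc = PTR_ERR(hwpt->domain);
		hwpt->domain = NULL;
		goto out_abort;
	}

	/* The destroy path is assumed to drop this parent reference */
	refcount_inc(&parent->obj.users);
	hwpt->parent = parent;
	return hwpt;

out_abort:
	iommufd_object_abort_and_destroy(ictx, &hwpt->obj);
	return ERR_PTR(rc);
}
--------------------------------------------------------------------------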
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 510db114fc61..5f4420626421 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -426,7 +426,7 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = { IOCTL_OP(IOMMU_GET_HW_INFO, iommufd_get_hw_info, struct iommu_hw_info, __reserved), IOCTL_OP(IOMMU_HWPT_ALLOC, iommufd_hwpt_alloc, struct iommu_hwpt_alloc,
__reserved),
data_uptr),
Nono, these can never change once we put them in. It specifies the hard minimum size that userspace must provide. If userspace gives less than this then the ioctl always fails. Changing it breaks all the existing software.
The core code ensures that the trailing part of the cmd struct is zero'd; the extended implementation must cope with zero'd values, which this does.
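Concretely, the core dispatch being referred to works roughly like this (a paraphrased sketch, not the exact source):

--------------------------------------------------------------------------
	/* Paraphrased sketch of the core dispatch: user_size only has to
	 * cover op->min_size; copy_struct_from_user() zero-fills anything
	 * the user did not pass and rejects unknown non-zero trailing bytes.
	 */
	if (ucmd.user_size < op->min_size)
		return -EINVAL;

	ucmd.cmd = &buf;
	ret = copy_struct_from_user(ucmd.cmd, op->size, ucmd.ubuffer,
				    ucmd.user_size);
	if (ret)
		return ret;	/* e.g. -E2BIG on a non-zero unknown tail */
	return op->execute(&ucmd);
--------------------------------------------------------------------------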
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index f2026cde2d64..73bf9af91e99 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -364,12 +364,27 @@ enum iommu_hwpt_type {
- @pt_id: The IOAS to connect this HWPT to
- @out_hwpt_id: The ID of the new HWPT
- @__reserved: Must be 0
- @hwpt_type: One of enum iommu_hwpt_type
- @data_len: Length of the type specific data
- @data_uptr: User pointer to the type specific data
- Explicitly allocate a hardware page table object. This is the same object
- type that is returned by iommufd_device_attach() and represents the
- underlying iommu driver's iommu_domain kernel object.
- A HWPT will be created with the IOVA mappings from the given IOAS.
- A kernel-managed HWPT will be created with the mappings from the given
- IOAS via the @pt_id. The @hwpt_type for this allocation can be set to
- either IOMMU_HWPT_TYPE_DEFAULT or a pre-defined type corresponding to
- an I/O page table type supported by the underlying IOMMU hardware.
- A user-managed HWPT will be created from a given parent HWPT via the
- @pt_id, in which the parent HWPT must be allocated previously via the
- same ioctl from a given IOAS (@pt_id). In this case, the @hwpt_type
- must be set to a pre-defined type corresponding to an I/O page table
- type supported by the underlying IOMMU hardware.
- If the @hwpt_type is set to IOMMU_HWPT_TYPE_DEFAULT, both the @data_len
- and the @data_uptr will be ignored. Otherwise, both must be
given.
If the @hwpt_type is set to IOMMU_HWPT_TYPE_DEFAULT then @data_len must be zero.
Jason
On Fri, Jul 28, 2023 at 02:55:57PM -0300, Jason Gunthorpe wrote:
On Mon, Jul 24, 2023 at 04:03:57AM -0700, Yi Liu wrote:
- switch (pt_obj->type) {
- case IOMMUFD_OBJ_IOAS:
ioas = container_of(pt_obj, struct iommufd_ioas, obj);
break;
- case IOMMUFD_OBJ_HW_PAGETABLE:
/* pt_id points HWPT only when hwpt_type is !IOMMU_HWPT_TYPE_DEFAULT */
if (cmd->hwpt_type == IOMMU_HWPT_TYPE_DEFAULT) {
rc = -EINVAL;
goto out_put_pt;
}
parent = container_of(pt_obj, struct iommufd_hw_pagetable, obj);
/*
* Cannot allocate user-managed hwpt linking to auto_created
* hwpt. If the parent hwpt is already a user-managed hwpt,
* don't allocate another user-managed hwpt linking to it.
*/
if (parent->auto_domain || parent->parent) {
rc = -EINVAL;
goto out_put_pt;
}
ioas = parent->ioas;
Why do we set ioas here? I would think it should be NULL.
I think it is looking like a mistake to try and re-use iommufd_hw_pagetable_alloc() directly for the nested case. It should not have a IOAS argument, it should not do enforce_cc, or iopt_* functions
So most of the function is removed. Probably better to make a new ioas-less function for the nesting case.
OK.
@Yi, along with the PATCH-6 (IOMMU_RESV_SW_MSI), I will rework this -- mainly decoupling the NESTED hwpt from the IOAS.
Thanks Nic
From: Nicolin Chen nicolinc@nvidia.com Sent: Saturday, July 29, 2023 3:10 AM
On Fri, Jul 28, 2023 at 02:55:57PM -0300, Jason Gunthorpe wrote:
On Mon, Jul 24, 2023 at 04:03:57AM -0700, Yi Liu wrote:
- switch (pt_obj->type) {
- case IOMMUFD_OBJ_IOAS:
ioas = container_of(pt_obj, struct iommufd_ioas, obj);
break;
- case IOMMUFD_OBJ_HW_PAGETABLE:
/* pt_id points HWPT only when hwpt_type
is !IOMMU_HWPT_TYPE_DEFAULT */
if (cmd->hwpt_type == IOMMU_HWPT_TYPE_DEFAULT) {
rc = -EINVAL;
goto out_put_pt;
}
parent = container_of(pt_obj, struct iommufd_hw_pagetable, obj);
/*
* Cannot allocate user-managed hwpt linking to auto_created
* hwpt. If the parent hwpt is already a user-managed hwpt,
* don't allocate another user-managed hwpt linking to it.
*/
if (parent->auto_domain || parent->parent) {
rc = -EINVAL;
goto out_put_pt;
}
ioas = parent->ioas;
Why do we set ioas here? I would think it should be NULL.
I think it is looking like a mistake to try and re-use iommufd_hw_pagetable_alloc() directly for the nested case. It should not have a IOAS argument, it should not do enforce_cc, or iopt_* functions
So most of the function is removed. Probably better to make a new ioas-less function for the nesting case.
OK.
@Yi, along with the PATCH-6 (IOMMU_RESV_SW_MSI), I will rework on this -- mainly breaking up NESTED hwpt with IOAS.
Thanks. Then I'll address the hw_info comments first.
Regards, Yi Liu
From: Jason Gunthorpe jgg@nvidia.com Sent: Saturday, July 29, 2023 1:56 AM
On Mon, Jul 24, 2023 at 04:03:57AM -0700, Yi Liu wrote:
- switch (pt_obj->type) {
- case IOMMUFD_OBJ_IOAS:
ioas = container_of(pt_obj, struct iommufd_ioas, obj);
break;
- case IOMMUFD_OBJ_HW_PAGETABLE:
/* pt_id points HWPT only when hwpt_type
is !IOMMU_HWPT_TYPE_DEFAULT */
if (cmd->hwpt_type == IOMMU_HWPT_TYPE_DEFAULT) {
rc = -EINVAL;
goto out_put_pt;
}
parent = container_of(pt_obj, struct iommufd_hw_pagetable,
obj);
/*
* Cannot allocate user-managed hwpt linking to
auto_created
* hwpt. If the parent hwpt is already a user-managed hwpt,
* don't allocate another user-managed hwpt linking to it.
*/
if (parent->auto_domain || parent->parent) {
rc = -EINVAL;
goto out_put_pt;
}
ioas = parent->ioas;
Why do we set ioas here? I would think it should be NULL.
I think it is looking like a mistake to try and re-use iommufd_hw_pagetable_alloc() directly for the nested case. It should not have a IOAS argument, it should not do enforce_cc, or iopt_* functions
enforce_cc is still required since it's about memory accesses after the page table walk (no matter whether the walked table is single-stage or nested).
So most of the function is removed. Probably better to make a new ioas-less function for the nesting case.
diff --git a/drivers/iommu/iommufd/main.c
b/drivers/iommu/iommufd/main.c
index 510db114fc61..5f4420626421 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -426,7 +426,7 @@ static const struct iommufd_ioctl_op
iommufd_ioctl_ops[] = {
IOCTL_OP(IOMMU_GET_HW_INFO, iommufd_get_hw_info, struct
iommu_hw_info,
__reserved),
IOCTL_OP(IOMMU_HWPT_ALLOC, iommufd_hwpt_alloc, struct
iommu_hwpt_alloc,
__reserved),
data_uptr),
Nono, these can never change once we put them in. It specifies the hard minimum size that userspace must provide. If userspace gives less than this then the ioctl always fails. Changing it breaks all the existing software.
The core code ensures that the trailing part of the cmd struct is zero'd the extended implementation must cope with Zero'd values, which this does.
Ideally expanding uAPI structure size should come with new flag bits.
This one might be a bit special. The hwpt_alloc() series was just queued to iommufd-next. If the nesting series can come together in the same cycle then changing it in the current way is probably fine since there is no existing software. Otherwise we need to follow common practice. 😊
On Mon, Jul 31, 2023 at 06:31:20AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Saturday, July 29, 2023 1:56 AM
On Mon, Jul 24, 2023 at 04:03:57AM -0700, Yi Liu wrote:
- switch (pt_obj->type) {
- case IOMMUFD_OBJ_IOAS:
ioas = container_of(pt_obj, struct iommufd_ioas, obj);
break;
- case IOMMUFD_OBJ_HW_PAGETABLE:
/* pt_id points HWPT only when hwpt_type
is !IOMMU_HWPT_TYPE_DEFAULT */
if (cmd->hwpt_type == IOMMU_HWPT_TYPE_DEFAULT) {
rc = -EINVAL;
goto out_put_pt;
}
parent = container_of(pt_obj, struct iommufd_hw_pagetable,
obj);
/*
* Cannot allocate user-managed hwpt linking to
auto_created
* hwpt. If the parent hwpt is already a user-managed hwpt,
* don't allocate another user-managed hwpt linking to it.
*/
if (parent->auto_domain || parent->parent) {
rc = -EINVAL;
goto out_put_pt;
}
ioas = parent->ioas;
Why do we set ioas here? I would think it should be NULL.
I think it is looking like a mistake to try and re-use iommufd_hw_pagetable_alloc() directly for the nested case. It should not have a IOAS argument, it should not do enforce_cc, or iopt_* functions
enforce_cc is still required since it's about memory accesses post page table walking (no matter the walked table is single stage or nested).
enforce_cc only has meaning if the kernel owns the IOPTEs, and if we are creating a nest it does not. The guest has to set any necessary IOPTE bits.
enforce_cc will be done on the parent of the nest as normal.
Ideally expanding uAPI structure size should come with new flag bits.
Flags or some kind of 'zero is the same behavior as a smaller struct' scheme.
This patch is doing the zero option:
  __u32 __reserved;
+ __u32 hwpt_type;
+ __u32 data_len;
+ __aligned_u64 data_uptr;
 };
hwpt_type == 0 means default type
data_len == 0 means no data
data_uptr is ignored (zero is safe)
So there is no need to change it
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Monday, July 31, 2023 9:16 PM
On Mon, Jul 31, 2023 at 06:31:20AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Saturday, July 29, 2023 1:56 AM
On Mon, Jul 24, 2023 at 04:03:57AM -0700, Yi Liu wrote:
- switch (pt_obj->type) {
- case IOMMUFD_OBJ_IOAS:
ioas = container_of(pt_obj, struct iommufd_ioas, obj);
break;
- case IOMMUFD_OBJ_HW_PAGETABLE:
/* pt_id points HWPT only when hwpt_type
is !IOMMU_HWPT_TYPE_DEFAULT */
if (cmd->hwpt_type == IOMMU_HWPT_TYPE_DEFAULT) {
rc = -EINVAL;
goto out_put_pt;
}
parent = container_of(pt_obj, struct iommufd_hw_pagetable,
obj);
/*
* Cannot allocate user-managed hwpt linking to
auto_created
* hwpt. If the parent hwpt is already a user-managed hwpt,
* don't allocate another user-managed hwpt linking to it.
*/
if (parent->auto_domain || parent->parent) {
rc = -EINVAL;
goto out_put_pt;
}
ioas = parent->ioas;
Why do we set ioas here? I would think it should be NULL.
I think it is looking like a mistake to try and re-use iommufd_hw_pagetable_alloc() directly for the nested case. It should not have a IOAS argument, it should not do enforce_cc, or iopt_* functions
enforce_cc is still required since it's about memory accesses post page table walking (no matter the walked table is single stage or nested).
enforce_cc only has meaning if the kernel owns the IOPTEs, and if we are creating a nest it does not. The guest has to set any necessary IOPTE bits.
enforce_cc will be done on the parent of the nest as normal.
Ah, yes. What I described is the final behavior but in concept it's done for the parent.
Ideally expanding uAPI structure size should come with new flag bits.
Flags or some kind of 'zero is the same behavior as a smaller struct' scheme.
This patch is doing the zero option:
__u32 __reserved;
- __u32 hwpt_type;
- __u32 data_len;
- __aligned_u64 data_uptr;
};
hwpt_type == 0 means default type data_len == 0 means no data data_uptr is ignored (zero is safe)
So there is no need to change it
Makes sense
On Mon, Jul 31, 2023 at 10:16:17AM -0300, Jason Gunthorpe wrote:
Ideally expanding uAPI structure size should come with new flag bits.
Flags or some kind of 'zero is the same behavior as a smaller struct' scheme.
This patch is doing the zero option:
__u32 __reserved;
- __u32 hwpt_type;
- __u32 data_len;
- __aligned_u64 data_uptr;
};
hwpt_type == 0 means default type data_len == 0 means no data data_uptr is ignored (zero is safe)
So there is no need to change it
TEST_LENGTH passes ".size = sizeof(struct _struct) - 1" and expects a -EINVAL error code from the "if (ucmd.user_size < op->min_size)" check in iommufd_fops_ioctl(). This has been working because min_size has been exactly the size of the structure.
When the size of the structure becomes larger than min_size, i.e. the size passed above is larger than min_size, it bypasses that min_size sanity check and goes down to the ioctl handler with a potential risk. And actually, the accepted size range becomes [min_size, struct_size), making it harder for us to sanitize with the existing code.
I wonder what the generic way of sanitizing this case is? And it seems that TEST_LENGTH needs some rework to test min_size only?
Thanks Nic
On Wed, Aug 02, 2023 at 04:42:10PM -0700, Nicolin Chen wrote:
On Mon, Jul 31, 2023 at 10:16:17AM -0300, Jason Gunthorpe wrote:
Ideally expanding uAPI structure size should come with new flag bits.
Flags or some kind of 'zero is the same behavior as a smaller struct' scheme.
This patch is doing the zero option:
__u32 __reserved;
- __u32 hwpt_type;
- __u32 data_len;
- __aligned_u64 data_uptr;
};
hwpt_type == 0 means default type data_len == 0 means no data data_uptr is ignored (zero is safe)
So there is no need to change it
TEST_LENGTH passing ".size = sizeof(struct _struct) - 1" expects a -EINVAL error code from "if (ucmd.user_size < op->min_size)" check in the iommufd_fops_ioctl(). This has been working when min_size is exactly the size of the structure.
When the size of the structure becomes larger than min_size, i.e. the passing size above is larger than min_size, it bypasses that min_size sanity and goes down to an ioctl handler with a potential risk. And actually, the size range can be [min_size, struct_size), making it harder for us to sanitize with the existing code.
I wonder what's the generic way of sanitizing this case? And, it seems that TEST_LENGTH needs some rework to test min_size only?
Yes, it should technically test using offsetof and a matching set of struct members.
Jason
On Wed, Aug 02, 2023 at 08:43:12PM -0300, Jason Gunthorpe wrote:
On Wed, Aug 02, 2023 at 04:42:10PM -0700, Nicolin Chen wrote:
On Mon, Jul 31, 2023 at 10:16:17AM -0300, Jason Gunthorpe wrote:
Ideally expanding uAPI structure size should come with new flag bits.
Flags or some kind of 'zero is the same behavior as a smaller struct' scheme.
This patch is doing the zero option:
__u32 __reserved;
- __u32 hwpt_type;
- __u32 data_len;
- __aligned_u64 data_uptr;
};
hwpt_type == 0 means default type data_len == 0 means no data data_uptr is ignored (zero is safe)
So there is no need to change it
TEST_LENGTH passing ".size = sizeof(struct _struct) - 1" expects a -EINVAL error code from "if (ucmd.user_size < op->min_size)" check in the iommufd_fops_ioctl(). This has been working when min_size is exactly the size of the structure.
When the size of the structure becomes larger than min_size, i.e. the passing size above is larger than min_size, it bypasses that min_size sanity and goes down to an ioctl handler with a potential risk. And actually, the size range can be [min_size, struct_size), making it harder for us to sanitize with the existing code.
I wonder what's the generic way of sanitizing this case? And, it seems that TEST_LENGTH needs some rework to test min_size only?
Yes, it should technically test using offsetof and a matching set of struct members.
OK. I copied 3 lines for offsetofend from the kernel and did this: -------------------------------------------------------------------------- diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c index 6b075a68b928..a15a475c1243 100644 --- a/tools/testing/selftests/iommu/iommufd.c +++ b/tools/testing/selftests/iommu/iommufd.c @@ -86,12 +86,13 @@ TEST_F(iommufd, cmd_fail)
TEST_F(iommufd, cmd_length) { -#define TEST_LENGTH(_struct, _ioctl) \ +#define TEST_LENGTH(_struct, _ioctl, _last) \ { \ + size_t min_size = offsetofend(struct _struct, _last); \ struct { \ struct _struct cmd; \ uint8_t extra; \ - } cmd = { .cmd = { .size = sizeof(struct _struct) - 1 }, \ + } cmd = { .cmd = { .size = min_size - 1 }, \ .extra = UINT8_MAX }; \ int old_errno; \ int rc; \ --------------------------------------------------------------------------
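For reference, the three copied lines are presumably the kernel's offsetofend() helper, along these lines (a sketch, not necessarily the exact copy):

--------------------------------------------------------------------------
/* Byte offset of the end of member MEMBER inside struct TYPE */
#ifndef offsetofend
#define offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))
#endif
--------------------------------------------------------------------------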
Any size within the range [min_size, struct_size) still doesn't have coverage though. Is this something that we have to leave uncovered, with a potential risk?
Thanks Nic
On Wed, Aug 02, 2023 at 05:53:40PM -0700, Nicolin Chen wrote:
On Wed, Aug 02, 2023 at 08:43:12PM -0300, Jason Gunthorpe wrote:
On Wed, Aug 02, 2023 at 04:42:10PM -0700, Nicolin Chen wrote:
On Mon, Jul 31, 2023 at 10:16:17AM -0300, Jason Gunthorpe wrote:
Ideally expanding uAPI structure size should come with new flag bits.
Flags or some kind of 'zero is the same behavior as a smaller struct' scheme.
This patch is doing the zero option:
__u32 __reserved;
- __u32 hwpt_type;
- __u32 data_len;
- __aligned_u64 data_uptr;
};
hwpt_type == 0 means default type data_len == 0 means no data data_uptr is ignored (zero is safe)
So there is no need to change it
TEST_LENGTH passing ".size = sizeof(struct _struct) - 1" expects a -EINVAL error code from "if (ucmd.user_size < op->min_size)" check in the iommufd_fops_ioctl(). This has been working when min_size is exactly the size of the structure.
When the size of the structure becomes larger than min_size, i.e. the passing size above is larger than min_size, it bypasses that min_size sanity and goes down to an ioctl handler with a potential risk. And actually, the size range can be [min_size, struct_size), making it harder for us to sanitize with the existing code.
I wonder what's the generic way of sanitizing this case? And, it seems that TEST_LENGTH needs some rework to test min_size only?
Yes, it should technically test using offsetof and a matching set of struct members.
OK. I copied 3 lines for offsetofend from the kernel and did this:
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c index 6b075a68b928..a15a475c1243 100644 --- a/tools/testing/selftests/iommu/iommufd.c +++ b/tools/testing/selftests/iommu/iommufd.c @@ -86,12 +86,13 @@ TEST_F(iommufd, cmd_fail)
TEST_F(iommufd, cmd_length) { -#define TEST_LENGTH(_struct, _ioctl) \ +#define TEST_LENGTH(_struct, _ioctl, _last) \ { \
size_t min_size = offsetofend(struct _struct, _last); \ struct { \ struct _struct cmd; \ uint8_t extra; \
} cmd = { .cmd = { .size = sizeof(struct _struct) - 1 }, \
} cmd = { .cmd = { .size = min_size - 1 }, \ .extra = UINT8_MAX }; \ int old_errno; \ int rc; \
Any misaligned size within the range of [min_size, struct_size) still doesn't have a coverage though. Is this something that we have to let it fail with a potential risk?
It looks about right, I didn't try to test all the permutations, it could be done but I'm not sure it has value.
Jason
In nested translation, the stage-1 page table is user-managed and used by the IOMMU hardware, so an update to any present page table entry in the stage-1 page table should be followed by an IOTLB invalidation.
This adds IOMMU_HWPT_INVALIDATE for stage-1 IOTLB invalidation.
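For illustration, a hedged userspace sketch of the intended flow after the guest modifies a present stage-1 entry. The driver-specific invalidation struct is treated as opaque here since its layout is defined by later driver patches, and s1_hwpt_id/iommufd come from earlier setup.

--------------------------------------------------------------------------
	/* driver-defined fields, e.g. address/granularity to invalidate */
	struct iommu_hwpt_vtd_s1_invalidate inv_data = { 0 };
	struct iommu_hwpt_invalidate cmd = {
		.size = sizeof(cmd),
		.hwpt_id = s1_hwpt_id,		/* user-managed stage-1 HWPT */
		.data_len = sizeof(inv_data),
		.data_uptr = (uintptr_t)&inv_data,
	};

	if (ioctl(iommufd, IOMMU_HWPT_INVALIDATE, &cmd))
		return -errno;
--------------------------------------------------------------------------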
Co-developed-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com --- drivers/iommu/iommufd/hw_pagetable.c | 45 +++++++++++++++++++++++++ drivers/iommu/iommufd/iommufd_private.h | 9 +++++ drivers/iommu/iommufd/main.c | 3 ++ include/uapi/linux/iommufd.h | 22 ++++++++++++ 4 files changed, 79 insertions(+)
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c index 97e4114226de..9064e6d181b4 100644 --- a/drivers/iommu/iommufd/hw_pagetable.c +++ b/drivers/iommu/iommufd/hw_pagetable.c @@ -286,3 +286,48 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd) iommufd_put_object(&idev->obj); return rc; } + +int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd) +{ + struct iommu_hwpt_invalidate *cmd = ucmd->cmd; + struct iommufd_hw_pagetable *hwpt; + u32 user_data_len, klen; + u64 user_ptr; + int rc = 0; + + if (!cmd->data_len || cmd->__reserved) + return -EOPNOTSUPP; + + hwpt = iommufd_get_hwpt(ucmd, cmd->hwpt_id); + if (IS_ERR(hwpt)) + return PTR_ERR(hwpt); + + /* Do not allow any kernel-managed hw_pagetable */ + if (!hwpt->parent) { + rc = -EINVAL; + goto out_put_hwpt; + } + + klen = hwpt->domain->ops->cache_invalidate_user_data_len; + if (!hwpt->domain->ops->cache_invalidate_user || !klen) { + rc = -EOPNOTSUPP; + goto out_put_hwpt; + } + + /* + * Copy the needed fields before reusing the ucmd buffer, this + * avoids memory allocation in this path. + */ + user_ptr = cmd->data_uptr; + user_data_len = cmd->data_len; + + rc = copy_struct_from_user(cmd, klen, + u64_to_user_ptr(user_ptr), user_data_len); + if (rc) + goto out_put_hwpt; + + rc = hwpt->domain->ops->cache_invalidate_user(hwpt->domain, cmd); +out_put_hwpt: + iommufd_put_object(&hwpt->obj); + return rc; +} diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 268ae0d5ae12..047317fa4df0 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -270,6 +270,7 @@ iommufd_hw_pagetable_detach(struct iommufd_device *idev); void iommufd_hw_pagetable_destroy(struct iommufd_object *obj); void iommufd_hw_pagetable_abort(struct iommufd_object *obj); int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd); +int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd);
static inline void iommufd_hw_pagetable_put(struct iommufd_ctx *ictx, struct iommufd_hw_pagetable *hwpt) @@ -281,6 +282,14 @@ static inline void iommufd_hw_pagetable_put(struct iommufd_ctx *ictx, refcount_dec(&hwpt->obj.users); }
+static inline struct iommufd_hw_pagetable * +iommufd_get_hwpt(struct iommufd_ucmd *ucmd, u32 id) +{ + return container_of(iommufd_get_object(ucmd->ictx, id, + IOMMUFD_OBJ_HW_PAGETABLE), + struct iommufd_hw_pagetable, obj); +} + struct iommufd_group { struct kref ref; struct mutex lock; diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 5f4420626421..255e8a3c5b0e 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -391,6 +391,7 @@ union ucmd_buffer { struct iommu_destroy destroy; struct iommu_hw_info info; struct iommu_hwpt_alloc hwpt; + struct iommu_hwpt_invalidate cache; struct iommu_ioas_alloc alloc; struct iommu_ioas_allow_iovas allow_iovas; struct iommu_ioas_copy ioas_copy; @@ -427,6 +428,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = { __reserved), IOCTL_OP(IOMMU_HWPT_ALLOC, iommufd_hwpt_alloc, struct iommu_hwpt_alloc, data_uptr), + IOCTL_OP(IOMMU_HWPT_INVALIDATE, iommufd_hwpt_invalidate, + struct iommu_hwpt_invalidate, data_uptr), IOCTL_OP(IOMMU_IOAS_ALLOC, iommufd_ioas_alloc_ioctl, struct iommu_ioas_alloc, out_ioas_id), IOCTL_OP(IOMMU_IOAS_ALLOW_IOVAS, iommufd_ioas_allow_iovas, diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index 73bf9af91e99..034da283cd3a 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -48,6 +48,7 @@ enum { IOMMUFD_CMD_HWPT_ALLOC, IOMMUFD_CMD_GET_HW_INFO, IOMMUFD_CMD_RESV_IOVA_RANGES, + IOMMUFD_CMD_HWPT_INVALIDATE, };
/** @@ -486,4 +487,25 @@ struct iommu_resv_iova_ranges { __aligned_u64 resv_iovas; }; #define IOMMU_RESV_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_RESV_IOVA_RANGES) + +/** + * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE) + * @size: sizeof(struct iommu_hwpt_invalidate) + * @hwpt_id: HWPT ID of target hardware page table for the invalidation + * @data_len: Length of the type specific data + * @__reserved: Must be 0 + * @data_uptr: User pointer to the type specific data + * + * Invalidate the iommu cache for user-managed page table. Modifications + * on user-managed page table should be followed with this operation to + * sync the IOTLB. The data in @data_uptr differs per the hwpt type. + */ +struct iommu_hwpt_invalidate { + __u32 size; + __u32 hwpt_id; + __u32 data_len; + __u32 __reserved; + __aligned_u64 data_uptr; +}; +#define IOMMU_HWPT_INVALIDATE _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_INVALIDATE) #endif
On Mon, Jul 24, 2023 at 04:03:58AM -0700, Yi Liu wrote:
In nested translation, the stage-1 page table is user-managed and used by IOMMU hardware, so update of any present page table entry in the stage-1 page table should be followed with an IOTLB invalidation.
This adds IOMMU_HWPT_INVALIDATE for stage-1 IOTLB invalidation.
Co-developed-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com
drivers/iommu/iommufd/hw_pagetable.c | 45 +++++++++++++++++++++++++ drivers/iommu/iommufd/iommufd_private.h | 9 +++++ drivers/iommu/iommufd/main.c | 3 ++ include/uapi/linux/iommufd.h | 22 ++++++++++++ 4 files changed, 79 insertions(+)
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c index 97e4114226de..9064e6d181b4 100644 --- a/drivers/iommu/iommufd/hw_pagetable.c +++ b/drivers/iommu/iommufd/hw_pagetable.c @@ -286,3 +286,48 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd) iommufd_put_object(&idev->obj); return rc; }
+int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd) +{
- struct iommu_hwpt_invalidate *cmd = ucmd->cmd;
- struct iommufd_hw_pagetable *hwpt;
- u32 user_data_len, klen;
- u64 user_ptr;
- int rc = 0;
- if (!cmd->data_len || cmd->__reserved)
return -EOPNOTSUPP;
- hwpt = iommufd_get_hwpt(ucmd, cmd->hwpt_id);
- if (IS_ERR(hwpt))
return PTR_ERR(hwpt);
- /* Do not allow any kernel-managed hw_pagetable */
- if (!hwpt->parent) {
I don't think this is needed because:
rc = -EINVAL;
goto out_put_hwpt;
- }
- klen = hwpt->domain->ops->cache_invalidate_user_data_len;
- if (!hwpt->domain->ops->cache_invalidate_user || !klen) {
rc = -EOPNOTSUPP;
We need to get to a place where the drivers are providing proper ops for the domains, so this op should never exist for a paging domain.
And return EINVAL here instead.
goto out_put_hwpt;
- }
- /*
* Copy the needed fields before reusing the ucmd buffer, this
* avoids memory allocation in this path.
*/
- user_ptr = cmd->data_uptr;
- user_data_len = cmd->data_len;
Uhh, who checks that klen < the temporary stack struct?
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Saturday, July 29, 2023 2:03 AM
On Mon, Jul 24, 2023 at 04:03:58AM -0700, Yi Liu wrote:
In nested translation, the stage-1 page table is user-managed and used by IOMMU hardware, so update of any present page table entry in the stage-1 page table should be followed with an IOTLB invalidation.
This adds IOMMU_HWPT_INVALIDATE for stage-1 IOTLB invalidation.
Co-developed-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Nicolin Chen nicolinc@nvidia.com Signed-off-by: Yi Liu yi.l.liu@intel.com
drivers/iommu/iommufd/hw_pagetable.c | 45 +++++++++++++++++++++++++ drivers/iommu/iommufd/iommufd_private.h | 9 +++++ drivers/iommu/iommufd/main.c | 3 ++ include/uapi/linux/iommufd.h | 22 ++++++++++++ 4 files changed, 79 insertions(+)
diff --git a/drivers/iommu/iommufd/hw_pagetable.c
b/drivers/iommu/iommufd/hw_pagetable.c
index 97e4114226de..9064e6d181b4 100644 --- a/drivers/iommu/iommufd/hw_pagetable.c +++ b/drivers/iommu/iommufd/hw_pagetable.c @@ -286,3 +286,48 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd) iommufd_put_object(&idev->obj); return rc; }
+int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd) +{
- struct iommu_hwpt_invalidate *cmd = ucmd->cmd;
- struct iommufd_hw_pagetable *hwpt;
- u32 user_data_len, klen;
- u64 user_ptr;
- int rc = 0;
- if (!cmd->data_len || cmd->__reserved)
return -EOPNOTSUPP;
- hwpt = iommufd_get_hwpt(ucmd, cmd->hwpt_id);
- if (IS_ERR(hwpt))
return PTR_ERR(hwpt);
- /* Do not allow any kernel-managed hw_pagetable */
- if (!hwpt->parent) {
I don't think this is needed because:
rc = -EINVAL;
goto out_put_hwpt;
- }
- klen = hwpt->domain->ops->cache_invalidate_user_data_len;
- if (!hwpt->domain->ops->cache_invalidate_user || !klen) {
rc = -EOPNOTSUPP;
We need to get to a place where the drivers are providing proper ops for the domains, so this op should never exist for a paging domain.
Yes.
And return EINVAL here instead.
Sure.
goto out_put_hwpt;
- }
- /*
* Copy the needed fields before reusing the ucmd buffer, this
* avoids memory allocation in this path.
*/
- user_ptr = cmd->data_uptr;
- user_data_len = cmd->data_len;
Uhh, who checks that klen < the temporary stack struct?
Take vtd as an example. The invalidate structure is struct iommu_hwpt_vtd_s1_invalidate[1]. The klen is sizeof(struct iommu_hwpt_vtd_s1_invalidate)[2]. iommu_hwpt_vtd_s1_invalidate is also placed in the temporary stack struct (actually it is a union)[1]. So the klen should be <= the size of the temporary stack buffer.
[1] https://lore.kernel.org/linux-iommu/20230724111335.107427-8-yi.l.liu@intel.c... [2] https://lore.kernel.org/linux-iommu/20230724111335.107427-10-yi.l.liu@intel....
It's not so explicit though. Perhaps it's worth having a check like the below in this patch?
if (unlikely(klen > sizeof(union ucmd_buffer)))
	return -EINVAL;
Regards, Yi Liu
On Mon, Jul 31, 2023 at 10:07:32AM +0000, Liu, Yi L wrote:
goto out_put_hwpt;
- }
- /*
* Copy the needed fields before reusing the ucmd buffer, this
* avoids memory allocation in this path.
*/
- user_ptr = cmd->data_uptr;
- user_data_len = cmd->data_len;
Uhh, who checks that klen < the temporary stack struct?
Take vtd as an example. The invalidate structure is struct iommu_hwpt_vtd_s1_invalidate[1]. The klen is sizeof(struct iommu_hwpt_vtd_s1_invalidate)[2]. iommu_hwpt_vtd_s1_invalidate is also placed in the temporary stack struct (actually it is a union)[1]. So the klen should be <= temporary stack.
Ohh, I think I would add a few comments noting that the invalidate structs need to be added to that union. Easy to miss.
It's not so explicit though. Perhaps worth to have a check like below in this patch?
if (unlikely(klen > sizeof(union ucmd_buffer))) return -EINVAL;
Yes, stick this in the domain allocate path with a WARN_ON. The driver is broken if it allocates a domain with an invalid size.
Jason
On Mon, Jul 31, 2023 at 10:19:09AM -0300, Jason Gunthorpe wrote:
On Mon, Jul 31, 2023 at 10:07:32AM +0000, Liu, Yi L wrote:
goto out_put_hwpt;
- }
- /*
* Copy the needed fields before reusing the ucmd buffer, this
* avoids memory allocation in this path.
*/
- user_ptr = cmd->data_uptr;
- user_data_len = cmd->data_len;
Uhh, who checks that klen < the temporary stack struct?
Take vtd as an example. The invalidate structure is struct iommu_hwpt_vtd_s1_invalidate[1]. The klen is sizeof(struct iommu_hwpt_vtd_s1_invalidate)[2]. iommu_hwpt_vtd_s1_invalidate is also placed in the temporary stack struct (actually it is a union)[1]. So the klen should be <= temporary stack.
Ohh, I think I would add a few comments noting that the invalidate structs need to be added to that union. Easy to miss.
Added here:
- * Copy the needed fields before reusing the ucmd buffer, this
- * avoids memory allocation in this path.
+ * Copy the needed fields before reusing the ucmd buffer, this avoids
+ * memory allocation in this path. Again, user invalidate data struct
+ * must be added to the union ucmd_buffer.
It's not so explicit though. Perhaps worth to have a check like below in this patch?
if (unlikely(klen > sizeof(union ucmd_buffer))) return -EINVAL;
Yes, stick this in the domain allocate path with a WARN_ON. The driver is broken to allocate a domain with an invalid size.
And here too with a WARN_ON_ONCE.
+	/*
+	 * Either the driver is broken by having an invalid size, or the user
+	 * invalidate data struct used by the driver is missing in the union.
+	 */
+	if (WARN_ON_ONCE(hwpt->domain->ops->cache_invalidate_user &&
+			 (!hwpt->domain->ops->cache_invalidate_user_data_len ||
+			  hwpt->domain->ops->cache_invalidate_user_data_len >
+			  sizeof(union ucmd_buffer)))) {
+		rc = -EINVAL;
+		goto out_abort;
+	}
Though I am making this cache_invalidate_user optional here, I wonder if there actually could be a case that a user-managed domain doesn't need a cache_invalidate_user op...
Thanks Nicolin
From: Nicolin Chen nicolinc@nvidia.com Sent: Thursday, August 3, 2023 10:17 AM
On Mon, Jul 31, 2023 at 10:19:09AM -0300, Jason Gunthorpe wrote:
On Mon, Jul 31, 2023 at 10:07:32AM +0000, Liu, Yi L wrote:
goto out_put_hwpt;
- }
- /*
* Copy the needed fields before reusing the ucmd buffer, this
* avoids memory allocation in this path.
*/
- user_ptr = cmd->data_uptr;
- user_data_len = cmd->data_len;
Uhh, who checks that klen < the temporary stack struct?
Take vtd as an example. The invalidate structure is struct
iommu_hwpt_vtd_s1_invalidate[1].
The klen is sizeof(struct iommu_hwpt_vtd_s1_invalidate)[2].
iommu_hwpt_vtd_s1_invalidate
is also placed in the temporary stack struct (actually it is a union)[1]. So the klen
should
be <= temporary stack.
Ohh, I think I would add a few comments noting that the invalidate structs need to be added to that union. Easy to miss.
Added here:
* Copy the needed fields before reusing the ucmd buffer, this
* avoids memory allocation in this path.
* Copy the needed fields before reusing the ucmd buffer, this avoids
* memory allocation in this path. Again, user invalidate data struct
* must be added to the union ucmd_buffer.
It's not so explicit though. Perhaps worth to have a check like below in this patch?
if (unlikely(klen > sizeof(union ucmd_buffer))) return -EINVAL;
Yes, stick this in the domain allocate path with a WARN_ON. The driver is broken to allocate a domain with an invalid size.
And here too with a WARN_ON_ONCE.
/*
* Either the driver is broken by having an invalid size, or the user
* invalidate data struct used by the driver is missing in the union.
*/
if (WARN_ON_ONCE(hwpt->domain->ops->cache_invalidate_user &&
(!hwpt->domain->ops->cache_invalidate_user_data_len ||
hwpt->domain->ops->cache_invalidate_user_data_len >
sizeof(union ucmd_buffer)))) {
rc = -EINVAL;
goto out_abort;
}
Though I am making this cache_invalidate_user optional here, I wonder if there actually could be a case that a user-managed domain doesn't need a cache_invalidate_user op...
If the user-managed domain is the stage-1 domain in nesting, then that seems not possible, as cache invalidation is a must. But I think this logic is fine, as not all the domains allocated by the user are user-managed. They may be kernel-managed, like the S2 domains.
Regards, Yi Liu
On Thu, Aug 03, 2023 at 03:07:23AM +0000, Liu, Yi L wrote:
It's not so explicit though. Perhaps worth to have a check like below in this patch?
if (unlikely(klen > sizeof(union ucmd_buffer))) return -EINVAL;
Yes, stick this in the domain allocate path with a WARN_ON. The driver is broken to allocate a domain with an invalid size.
And here too with a WARN_ON_ONCE.
/*
* Either the driver is broken by having an invalid size, or the user
* invalidate data struct used by the driver is missing in the union.
*/
if (WARN_ON_ONCE(hwpt->domain->ops->cache_invalidate_user &&
(!hwpt->domain->ops->cache_invalidate_user_data_len ||
hwpt->domain->ops->cache_invalidate_user_data_len >
sizeof(union ucmd_buffer)))) {
rc = -EINVAL;
goto out_abort;
}
Though I am making this cache_invalidate_user optional here, I wonder if there actually could be a case that a user-managed domain doesn't need a cache_invalidate_user op...
If user-managed domain is the stage-1 domain in nested, then seems not possible as cache invalidate is a must. But I think this logic is fine as not all the domains allocated by the user is user-managed. It may be kernel managed like the s2 domains.
Oh, I forgot to mention that this piece is in the user-managed HWPT allocator, following Jason's suggestion to separate them.
So, perhaps we should just mark it mandatory:
+	if (WARN_ON_ONCE(!hwpt->domain->ops->cache_invalidate_user ||
+			 !hwpt->domain->ops->cache_invalidate_user_data_len ||
+			 hwpt->domain->ops->cache_invalidate_user_data_len >
+			 sizeof(union iommu_user_cache_invalidate))) {
+		rc = -EINVAL;
+		goto out_abort;
+	}
From: Jason Gunthorpe jgg@nvidia.com Sent: Monday, July 31, 2023 9:19 PM
On Mon, Jul 31, 2023 at 10:07:32AM +0000, Liu, Yi L wrote:
goto out_put_hwpt;
- }
- /*
* Copy the needed fields before reusing the ucmd buffer, this
* avoids memory allocation in this path.
*/
- user_ptr = cmd->data_uptr;
- user_data_len = cmd->data_len;
Uhh, who checks that klen < the temporary stack struct?
Take vtd as an example. The invalidate structure is struct
iommu_hwpt_vtd_s1_invalidate[1].
The klen is sizeof(struct iommu_hwpt_vtd_s1_invalidate)[2].
iommu_hwpt_vtd_s1_invalidate
is also placed in the temporary stack struct (actually it is a union)[1]. So the klen should be <= temporary stack.
Ohh, I think I would add a few comments noting that the invalidate structs need to be added to that union. Easy to miss.
Sure. Actually, there is some description, as below, in patch [1]. I would add some here as well.
"Cache invalidation path is performance path, so it's better to avoid memory allocation in such path. To achieve it, this path reuses the ucmd_buffer to copy user data. So the new data structures are added in the ucmd_buffer union to avoid overflow."
[1] https://lore.kernel.org/linux-iommu/20230724111335.107427-8-yi.l.liu@intel.c...
It's not so explicit though. Perhaps worth to have a check like below in this patch?
if (unlikely(klen > sizeof(union ucmd_buffer))) return -EINVAL;
Yes, stick this in the domain allocate path with a WARN_ON. The driver is broken to allocate a domain with an invalid size.
Ok, so if there is any mistake in the data structure, the allocation fails in the first place.
Regards, Yi Liu
On Fri, Jul 28, 2023 at 03:02:43PM -0300, Jason Gunthorpe wrote:
+int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd) +{
- struct iommu_hwpt_invalidate *cmd = ucmd->cmd;
- struct iommufd_hw_pagetable *hwpt;
- u32 user_data_len, klen;
- u64 user_ptr;
- int rc = 0;
- if (!cmd->data_len || cmd->__reserved)
return -EOPNOTSUPP;
- hwpt = iommufd_get_hwpt(ucmd, cmd->hwpt_id);
- if (IS_ERR(hwpt))
return PTR_ERR(hwpt);
- /* Do not allow any kernel-managed hw_pagetable */
- if (!hwpt->parent) {
I don't think this is needed because:
rc = -EINVAL;
goto out_put_hwpt;
- }
- klen = hwpt->domain->ops->cache_invalidate_user_data_len;
- if (!hwpt->domain->ops->cache_invalidate_user || !klen) {
rc = -EOPNOTSUPP;
We need to get to a place where the drivers are providing proper ops for the domains, so this op should never exist for a paging domain.
And return EINVAL here instead.
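Something like this, then (a rough sketch of the suggested shape, not the final patch):

	hwpt = iommufd_get_hwpt(ucmd, cmd->hwpt_id);
	if (IS_ERR(hwpt))
		return PTR_ERR(hwpt);

	/*
	 * No hwpt->parent test: a kernel-managed (paging) domain never
	 * provides cache_invalidate_user, so checking the op is enough,
	 * and the failure becomes a plain EINVAL.
	 */
	klen = hwpt->domain->ops->cache_invalidate_user_data_len;
	if (!hwpt->domain->ops->cache_invalidate_user || !klen) {
		rc = -EINVAL;
		goto out_put_hwpt;
	}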
Fixed those two above and added the following in alloc():
-------------------------------------------------------------------------------
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 08aa4debc58a..7ddeda22ac62 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -123,6 +123,12 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx,
 		rc = -EINVAL;
 		goto out_abort;
 	}
+	/* Driver is buggy by mixing user-managed ops in kernel-managed ops */
+	if (WARN_ON_ONCE(hwpt->domain->ops->cache_invalidate_user ||
+			 hwpt->domain->ops->cache_invalidate_user_data_len)) {
+		rc = -EINVAL;
+		goto out_abort;
+	}
 	/*
 	 * Set the coherency mode before we do iopt_table_add_domain() as some
-------------------------------------------------------------------------------
Thanks Nic
There is a need to get the selftest device (sobj->type == TYPE_IDEV) in multiple places, so add a helper to get it.
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/iommufd/selftest.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index af7459e211ad..72d0c37e0b5e 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -472,29 +472,39 @@ static int iommufd_test_mock_domain(struct iommufd_ucmd *ucmd, return rc; }
-/* Replace the mock domain with a manually allocated hw_pagetable */ -static int iommufd_test_mock_domain_replace(struct iommufd_ucmd *ucmd, - unsigned int device_id, u32 pt_id, - struct iommu_test_cmd *cmd) +static struct selftest_obj *iommufd_test_get_device(struct iommufd_ctx *ictx, + u32 id) { struct iommufd_object *dev_obj; struct selftest_obj *sobj; - int rc;
/* * Prefer to use the OBJ_SELFTEST because the destroy_rwsem will ensure * it doesn't race with detach, which is not allowed. */ - dev_obj = - iommufd_get_object(ucmd->ictx, device_id, IOMMUFD_OBJ_SELFTEST); + dev_obj = iommufd_get_object(ictx, id, IOMMUFD_OBJ_SELFTEST); if (IS_ERR(dev_obj)) - return PTR_ERR(dev_obj); + return (struct selftest_obj *)dev_obj;
sobj = container_of(dev_obj, struct selftest_obj, obj); if (sobj->type != TYPE_IDEV) { - rc = -EINVAL; - goto out_dev_obj; + iommufd_put_object(dev_obj); + return ERR_PTR(-EINVAL); } + return sobj; +} + +/* Replace the mock domain with a manually allocated hw_pagetable */ +static int iommufd_test_mock_domain_replace(struct iommufd_ucmd *ucmd, + unsigned int device_id, u32 pt_id, + struct iommu_test_cmd *cmd) +{ + struct selftest_obj *sobj; + int rc; + + sobj = iommufd_test_get_device(ucmd->ictx, device_id); + if (IS_ERR(sobj)) + return PTR_ERR(sobj);
rc = iommufd_device_replace(sobj->idev.idev, &pt_id); if (rc) @@ -504,7 +514,7 @@ static int iommufd_test_mock_domain_replace(struct iommufd_ucmd *ucmd, rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
out_dev_obj: - iommufd_put_object(dev_obj); + iommufd_put_object(&sobj->obj); return rc; }
This makes the test tool able to specify reserved regions for the selftest device, and hence able to validate the IOMMU_RESV_IOVA_RANGES ioctl.
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/iommufd/iommufd_test.h |  6 ++
 drivers/iommu/iommufd/selftest.c     | 82 ++++++++++++++++++++++++++++
 2 files changed, 88 insertions(+)
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h index 3f3644375bf1..b7f4bee2ea1b 100644 --- a/drivers/iommu/iommufd/iommufd_test.h +++ b/drivers/iommu/iommufd/iommufd_test.h @@ -19,6 +19,8 @@ enum { IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT, IOMMU_TEST_OP_MOCK_DOMAIN_REPLACE, IOMMU_TEST_OP_ACCESS_REPLACE_IOAS, + IOMMU_TEST_OP_DEV_ADD_RESERVED, + IOMMU_TEST_OP_DEV_DEL_RESERVED, };
enum { @@ -50,6 +52,10 @@ struct iommu_test_cmd { __aligned_u64 start; __aligned_u64 length; } add_reserved; + struct { + __aligned_u64 start; + __aligned_u64 length; + } add_dev_reserved; struct { __u32 out_stdev_id; __u32 out_hwpt_id; diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index 72d0c37e0b5e..271c6c261eb4 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -93,6 +93,8 @@ enum selftest_obj_type {
struct mock_dev { struct device dev; + struct rw_semaphore reserved_rwsem; + struct rb_root_cached reserved_itree; };
struct selftest_obj { @@ -385,6 +387,10 @@ static struct mock_dev *mock_dev_create(void) mutex_init(&mdev->dev.iommu->lock); mdev->dev.iommu->iommu_dev = &mock_iommu_device;
+ /* Reserve range related fields */ + init_rwsem(&mdev->reserved_rwsem); + mdev->reserved_itree = RB_ROOT_CACHED; + rc = device_add(&mdev->dev); if (rc) goto err_dev_iommu; @@ -407,8 +413,20 @@ static struct mock_dev *mock_dev_create(void) return ERR_PTR(rc); }
+static void mock_dev_remove_reserved_iovas(struct mock_dev *mdev) +{ + struct interval_tree_node *node; + + while ((node = interval_tree_iter_first(&mdev->reserved_itree, 0, + ULONG_MAX))) { + interval_tree_remove(node, &mdev->reserved_itree); + kfree(node); + } +} + static void mock_dev_destroy(struct mock_dev *mdev) { + mock_dev_remove_reserved_iovas(mdev); iommu_group_remove_device(&mdev->dev); device_del(&mdev->dev); kfree(mdev->dev.iommu); @@ -536,6 +554,64 @@ static int iommufd_test_add_reserved(struct iommufd_ucmd *ucmd, return rc; }
+static int mock_dev_add_reserved_iova(struct mock_dev *mdev, + unsigned long start, unsigned long last) +{ + struct interval_tree_node *reserved; + + if (interval_tree_iter_first(&mdev->reserved_itree, start, last)) + return -EADDRINUSE; + + reserved = kzalloc(sizeof(*reserved), GFP_KERNEL_ACCOUNT); + if (!reserved) + return -ENOMEM; + reserved->start = start; + reserved->last = last; + interval_tree_insert(reserved, &mdev->reserved_itree); + return 0; +} + +/* Add a reserved IOVA region to a device */ +static int iommufd_test_dev_add_reserved(struct iommufd_ucmd *ucmd, + unsigned int device_id, + unsigned long start, size_t length) +{ + struct selftest_obj *sobj; + struct mock_dev *mdev; + int rc; + + sobj = iommufd_test_get_device(ucmd->ictx, device_id); + if (IS_ERR(sobj)) + return PTR_ERR(sobj); + + mdev = sobj->idev.mock_dev; + down_write(&mdev->reserved_rwsem); + rc = mock_dev_add_reserved_iova(sobj->idev.mock_dev, + start, start + length - 1); + up_write(&mdev->reserved_rwsem); + iommufd_put_object(&sobj->obj); + return rc; +} + +/* Remove all reserved IOVA regions of a device */ +static int iommufd_test_dev_del_reserved(struct iommufd_ucmd *ucmd, + unsigned int device_id) +{ + struct selftest_obj *sobj; + struct mock_dev *mdev; + + sobj = iommufd_test_get_device(ucmd->ictx, device_id); + if (IS_ERR(sobj)) + return PTR_ERR(sobj); + + mdev = sobj->idev.mock_dev; + down_write(&mdev->reserved_rwsem); + mock_dev_remove_reserved_iovas(mdev); + up_write(&mdev->reserved_rwsem); + iommufd_put_object(&sobj->obj); + return 0; +} + /* Check that every pfn under each iova matches the pfn under a user VA */ static int iommufd_test_md_check_pa(struct iommufd_ucmd *ucmd, unsigned int mockpt_id, unsigned long iova, @@ -1025,6 +1101,12 @@ int iommufd_test(struct iommufd_ucmd *ucmd) return iommufd_test_add_reserved(ucmd, cmd->id, cmd->add_reserved.start, cmd->add_reserved.length); + case IOMMU_TEST_OP_DEV_ADD_RESERVED: + return iommufd_test_dev_add_reserved(ucmd, cmd->id, + cmd->add_dev_reserved.start, + cmd->add_dev_reserved.length); + case IOMMU_TEST_OP_DEV_DEL_RESERVED: + return iommufd_test_dev_del_reserved(ucmd, cmd->id); case IOMMU_TEST_OP_MOCK_DOMAIN: return iommufd_test_mock_domain(ucmd, cmd); case IOMMU_TEST_OP_MOCK_DOMAIN_REPLACE:
This makes mock_dev report resv_regions.
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/iommufd/selftest.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 271c6c261eb4..e40780db88b5 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -292,6 +292,32 @@ static void mock_domain_set_plaform_dma_ops(struct device *dev)
 	 */
 }
 
+static void iommufd_test_get_resv_regions(struct device *dev,
+					  struct list_head *head)
+{
+	struct mock_dev *mdev = container_of(dev, struct mock_dev, dev);
+	struct interval_tree_node *reserved, *next;
+
+	down_read(&mdev->reserved_rwsem);
+	for (reserved = interval_tree_iter_first(&mdev->reserved_itree, 0, ULONG_MAX);
+	     reserved; reserved = next) {
+		struct iommu_resv_region *resv;
+
+		next = interval_tree_iter_next(reserved, 0, ULONG_MAX);
+		/* Only adds IOMMU_RESV_DIRECT so far */
+		resv = iommu_alloc_resv_region(reserved->start,
+					       reserved->last + 1 - reserved->start,
+					       IOMMU_READ | IOMMU_WRITE,
+					       IOMMU_RESV_DIRECT,
+					       GFP_ATOMIC);
+		if (!resv)
+			break;
+
+		list_add_tail(&resv->list, head);
+	}
+	up_read(&mdev->reserved_rwsem);
+}
+
 static const struct iommu_ops mock_ops = {
 	.owner = THIS_MODULE,
 	.pgsize_bitmap = MOCK_IO_PAGE_SIZE,
@@ -299,6 +325,7 @@ static const struct iommu_ops mock_ops = {
 	.hw_info = mock_domain_hw_info,
 	.domain_alloc = mock_domain_alloc,
 	.capable = mock_domain_capable,
+	.get_resv_regions = iommufd_test_get_resv_regions,
 	.set_platform_dma_ops = mock_domain_set_plaform_dma_ops,
 
 	.default_domain_ops = &(struct iommu_domain_ops){
This adds a selftest for the IOMMU_RESV_IOVA_RANGES ioctl.
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 tools/testing/selftests/iommu/iommufd.c | 98 +++++++++++++++++++++++++
 1 file changed, 98 insertions(+)
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c index 6b075a68b928..bfe0f76f7fb0 100644 --- a/tools/testing/selftests/iommu/iommufd.c +++ b/tools/testing/selftests/iommu/iommufd.c @@ -1780,4 +1780,102 @@ TEST_F(vfio_compat_mock_domain, huge_map) } }
+FIXTURE(iommufd_device_resv_iova) +{ + int fd; + uint32_t ioas_id; + uint32_t stdev_id; + uint32_t device_id; +}; + +FIXTURE_SETUP(iommufd_device_resv_iova) +{ + self->fd = open("/dev/iommu", O_RDWR); + ASSERT_NE(-1, self->fd); + test_ioctl_ioas_alloc(&self->ioas_id); + + test_cmd_mock_domain(self->ioas_id, &self->stdev_id, + NULL, &self->device_id); +} + +FIXTURE_TEARDOWN(iommufd_device_resv_iova) +{ + teardown_iommufd(self->fd, _metadata); +} + +TEST_F(iommufd_device_resv_iova, dev_resv_iova_ranges) +{ + struct iommu_test_cmd test_cmd = { + .size = sizeof(test_cmd), + .op = IOMMU_TEST_OP_DEV_ADD_RESERVED, + .id = self->stdev_id, + .add_dev_reserved = { .start = PAGE_SIZE, .length = PAGE_SIZE }, + }; + struct iommu_test_cmd test_cmd_del = { + .size = sizeof(test_cmd_del), + .op = IOMMU_TEST_OP_DEV_DEL_RESERVED, + .id = self->stdev_id, + }; + struct iommu_resv_iova_range *ranges = buffer; + struct iommu_resv_iova_ranges ranges_cmd = { + .size = sizeof(ranges_cmd), + .dev_id = self->device_id, + .num_iovas = BUFFER_SIZE / sizeof(*ranges), + .resv_iovas = (uintptr_t)ranges, + }; + + /* Range can be read */ + ASSERT_EQ(0, ioctl(self->fd, IOMMU_RESV_IOVA_RANGES, &ranges_cmd)); + EXPECT_EQ(0, ranges_cmd.num_iovas); + + /* 1 range */ + ASSERT_EQ(0, + ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_DEV_ADD_RESERVED), + &test_cmd)); + ranges_cmd.num_iovas = BUFFER_SIZE / sizeof(*ranges); + ASSERT_EQ(0, ioctl(self->fd, IOMMU_RESV_IOVA_RANGES, &ranges_cmd)); + EXPECT_EQ(1, ranges_cmd.num_iovas); + EXPECT_EQ(PAGE_SIZE, ranges[0].start); + EXPECT_EQ(PAGE_SIZE * 2 - 1, ranges[0].last); + + /* Buffer too small */ + memset(ranges, 0, BUFFER_SIZE); + ranges_cmd.num_iovas = 0; + EXPECT_ERRNO(EMSGSIZE, + ioctl(self->fd, IOMMU_RESV_IOVA_RANGES, &ranges_cmd)); + EXPECT_EQ(1, ranges_cmd.num_iovas); + EXPECT_EQ(0, ranges[0].start); + EXPECT_EQ(0, ranges[0].last); + + /* 2 ranges */ + test_cmd.add_dev_reserved.start = PAGE_SIZE * 4; + test_cmd.add_dev_reserved.length = PAGE_SIZE; + + ASSERT_EQ(0, + ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_DEV_ADD_RESERVED), + &test_cmd)); + ranges_cmd.num_iovas = BUFFER_SIZE / sizeof(*ranges); + ASSERT_EQ(0, ioctl(self->fd, IOMMU_RESV_IOVA_RANGES, &ranges_cmd)); + EXPECT_EQ(2, ranges_cmd.num_iovas); + EXPECT_EQ(PAGE_SIZE, ranges[0].start); + EXPECT_EQ(PAGE_SIZE * 2 - 1, ranges[0].last); + EXPECT_EQ(PAGE_SIZE * 4, ranges[1].start); + EXPECT_EQ(PAGE_SIZE * 5 - 1, ranges[1].last); + + /* Buffer too small */ + memset(ranges, 0, BUFFER_SIZE); + ranges_cmd.num_iovas = 1; + EXPECT_ERRNO(EMSGSIZE, ioctl(self->fd, IOMMU_RESV_IOVA_RANGES, + &ranges_cmd)); + EXPECT_EQ(2, ranges_cmd.num_iovas); + EXPECT_EQ(PAGE_SIZE, ranges[0].start); + EXPECT_EQ(PAGE_SIZE * 2 - 1, ranges[0].last); + EXPECT_EQ(0, ranges[1].start); + EXPECT_EQ(0, ranges[1].last); + + ASSERT_EQ(0, + ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_DEV_DEL_RESERVED), + &test_cmd_del)); +} + TEST_HARNESS_MAIN
From: Nicolin Chen nicolinc@nvidia.com
Add a mock_domain_alloc_user function and an iommu_hwpt_selftest data structure to support the user space selftest program in allocating domains with user data.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/iommufd/hw_pagetable.c |  1 +
 drivers/iommu/iommufd/iommufd_test.h | 16 ++++++
 drivers/iommu/iommufd/selftest.c     | 84 +++++++++++++++++++++++++---
 include/linux/iommu.h                |  3 +
 4 files changed, 95 insertions(+), 9 deletions(-)
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c index 9064e6d181b4..e30fce5d553b 100644 --- a/drivers/iommu/iommufd/hw_pagetable.c +++ b/drivers/iommu/iommufd/hw_pagetable.c @@ -7,6 +7,7 @@
#include "../iommu-priv.h" #include "iommufd_private.h" +#include "iommufd_test.h"
void iommufd_hw_pagetable_destroy(struct iommufd_object *obj) { diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h index b7f4bee2ea1b..eeac8100844d 100644 --- a/drivers/iommu/iommufd/iommufd_test.h +++ b/drivers/iommu/iommufd/iommufd_test.h @@ -115,4 +115,20 @@ struct iommu_test_hw_info { __u32 test_reg; };
+/* Should not be equal to any defined value in enum iommu_hwpt_type */ +#define IOMMU_HWPT_TYPE_SELFTTEST 0xdead + +/** + * struct iommu_hwpt_selftest + * + * @flags: page table entry attributes + * @test_config: default iotlb setup (value IOMMU_TEST_IOTLB_DEFAULT) + */ +struct iommu_hwpt_selftest { +#define IOMMU_TEST_FLAG_NESTED (1ULL << 0) + __u64 flags; +#define IOMMU_TEST_IOTLB_DEFAULT 0xbadbeef + __u64 test_config; +}; + #endif diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index e40780db88b5..8364003efde3 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -84,7 +84,9 @@ void iommufd_test_syz_conv_iova_id(struct iommufd_ucmd *ucmd,
struct mock_iommu_domain { struct iommu_domain domain; + struct mock_iommu_domain *parent; struct xarray pfns; + u32 iotlb; };
enum selftest_obj_type { @@ -144,26 +146,81 @@ static void *mock_domain_hw_info(struct device *dev, u32 *length) return info; }
-static struct iommu_domain *mock_domain_alloc(unsigned int iommu_domain_type) +static const struct iommu_ops mock_ops; +static struct iommu_domain_ops domain_nested_ops; + +static struct iommu_domain * +__mock_domain_alloc(unsigned int iommu_domain_type, + struct mock_iommu_domain *mock_parent, + const struct iommu_hwpt_selftest *user_cfg) { struct mock_iommu_domain *mock;
if (iommu_domain_type == IOMMU_DOMAIN_BLOCKED) return &mock_blocking_domain;
- if (WARN_ON(iommu_domain_type != IOMMU_DOMAIN_UNMANAGED)) - return NULL; - mock = kzalloc(sizeof(*mock), GFP_KERNEL); if (!mock) - return NULL; - mock->domain.geometry.aperture_start = MOCK_APERTURE_START; - mock->domain.geometry.aperture_end = MOCK_APERTURE_LAST; + return ERR_PTR(-ENOMEM); + mock->parent = mock_parent; + mock->domain.type = iommu_domain_type; mock->domain.pgsize_bitmap = MOCK_IO_PAGE_SIZE; - xa_init(&mock->pfns); + if (mock_parent) { + mock->iotlb = user_cfg->test_config; + mock->domain.ops = &domain_nested_ops; + } else { + mock->domain.geometry.aperture_start = MOCK_APERTURE_START; + mock->domain.geometry.aperture_end = MOCK_APERTURE_LAST; + mock->domain.ops = mock_ops.default_domain_ops; + xa_init(&mock->pfns); + } return &mock->domain; }
+static struct iommu_domain *mock_domain_alloc(unsigned int iommu_domain_type) +{ + struct iommu_domain *domain; + + if (WARN_ON(iommu_domain_type != IOMMU_DOMAIN_BLOCKED && + iommu_domain_type != IOMMU_DOMAIN_UNMANAGED)) + return NULL; + domain = __mock_domain_alloc(IOMMU_DOMAIN_UNMANAGED, NULL, NULL); + if (IS_ERR(domain)) + domain = NULL; + return domain; +} + +static struct iommu_domain *mock_domain_alloc_user(struct device *dev, + enum iommu_hwpt_type hwpt_type, + struct iommu_domain *parent, + const union iommu_domain_user_data *user_data) +{ + const struct iommu_hwpt_selftest *user_cfg = (struct iommu_hwpt_selftest *)user_data; + unsigned int iommu_domain_type = IOMMU_DOMAIN_UNMANAGED; + struct mock_iommu_domain *mock_parent = NULL; + + /* Check union iommu_domain_user_data in include/linux/iommu.h */ + static_assert(sizeof(struct iommu_hwpt_selftest) <= 16); + + if (hwpt_type != IOMMU_HWPT_TYPE_DEFAULT && + hwpt_type != IOMMU_HWPT_TYPE_SELFTTEST) + return ERR_PTR(-EINVAL); + + if (parent) { + if (hwpt_type != IOMMU_HWPT_TYPE_SELFTTEST) + return ERR_PTR(-EINVAL); + if (parent->ops != mock_ops.default_domain_ops) + return ERR_PTR(-EINVAL); + if (!user_cfg || !(user_cfg->flags & IOMMU_TEST_FLAG_NESTED)) + return ERR_PTR(-EINVAL); + iommu_domain_type = IOMMU_DOMAIN_NESTED; + mock_parent = container_of(parent, + struct mock_iommu_domain, domain); + } + + return __mock_domain_alloc(iommu_domain_type, mock_parent, user_cfg); +} + static void mock_domain_free(struct iommu_domain *domain) { struct mock_iommu_domain *mock = @@ -324,6 +381,7 @@ static const struct iommu_ops mock_ops = { .hw_info_type = IOMMU_HW_INFO_TYPE_SELFTEST, .hw_info = mock_domain_hw_info, .domain_alloc = mock_domain_alloc, + .domain_alloc_user = mock_domain_alloc_user, .capable = mock_domain_capable, .get_resv_regions = iommufd_test_get_resv_regions, .set_platform_dma_ops = mock_domain_set_plaform_dma_ops, @@ -337,6 +395,11 @@ static const struct iommu_ops mock_ops = { }, };
+static struct iommu_domain_ops domain_nested_ops = { + .free = mock_domain_free, + .attach_dev = mock_domain_nop_attach, +}; + static struct iommu_device mock_iommu_device = { .ops = &mock_ops, }; @@ -353,7 +416,10 @@ get_md_pagetable(struct iommufd_ucmd *ucmd, u32 mockpt_id, if (IS_ERR(obj)) return ERR_CAST(obj); hwpt = container_of(obj, struct iommufd_hw_pagetable, obj); - if (hwpt->domain->ops != mock_ops.default_domain_ops) { + if ((hwpt->domain->type == IOMMU_DOMAIN_UNMANAGED && + hwpt->domain->ops != mock_ops.default_domain_ops) || + (hwpt->domain->type == IOMMU_DOMAIN_NESTED && + hwpt->domain->ops != &domain_nested_ops)) { iommufd_put_object(&hwpt->obj); return ERR_PTR(-EINVAL); } diff --git a/include/linux/iommu.h b/include/linux/iommu.h index b8f09330b64e..e4835230d5f0 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -237,6 +237,9 @@ struct iommu_iotlb_gather { * include/uapi/linux/iommufd.h. */ union iommu_domain_user_data { +#ifdef CONFIG_IOMMUFD_TEST + __u64 test[2]; +#endif };
/**
From: Nicolin Chen nicolinc@nvidia.com
The IOMMU_HWPT_ALLOC ioctl now supports passing user_data to allocate a customized domain. Add its coverage for both a regular domain case and a nested domain case.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 tools/testing/selftests/iommu/iommufd.c       | 114 +++++++++++++++++-
 tools/testing/selftests/iommu/iommufd_utils.h |  36 ++++++
 2 files changed, 149 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c index bfe0f76f7fb0..70116c2dfc94 100644 --- a/tools/testing/selftests/iommu/iommufd.c +++ b/tools/testing/selftests/iommu/iommufd.c @@ -114,6 +114,7 @@ TEST_F(iommufd, cmd_length)
TEST_LENGTH(iommu_destroy, IOMMU_DESTROY); TEST_LENGTH(iommu_hw_info, IOMMU_GET_HW_INFO); + TEST_LENGTH(iommu_hwpt_alloc, IOMMU_HWPT_ALLOC); TEST_LENGTH(iommu_ioas_alloc, IOMMU_IOAS_ALLOC); TEST_LENGTH(iommu_ioas_iova_ranges, IOMMU_IOAS_IOVA_RANGES); TEST_LENGTH(iommu_ioas_allow_iovas, IOMMU_IOAS_ALLOW_IOVAS); @@ -194,6 +195,7 @@ FIXTURE_VARIANT(iommufd_ioas) { unsigned int mock_domains; unsigned int memory_limit; + bool new_hwpt; };
FIXTURE_SETUP(iommufd_ioas) @@ -233,6 +235,12 @@ FIXTURE_VARIANT_ADD(iommufd_ioas, mock_domain) .mock_domains = 1, };
+FIXTURE_VARIANT_ADD(iommufd_ioas, mock_domain_hwpt) +{ + .mock_domains = 1, + .new_hwpt = true, +}; + FIXTURE_VARIANT_ADD(iommufd_ioas, two_mock_domain) { .mock_domains = 2, @@ -260,6 +268,93 @@ TEST_F(iommufd_ioas, ioas_destroy) } }
+TEST_F(iommufd_ioas, hwpt_alloc) +{ + uint32_t new_hwpt_id = 0; + + if (self->stdev_id && self->device_id) { + test_cmd_hwpt_alloc(self->device_id, self->ioas_id, &new_hwpt_id); + test_cmd_mock_domain_replace(self->stdev_id, new_hwpt_id); + /* hw_pagetable cannot be freed if a device is attached to it */ + EXPECT_ERRNO(EBUSY, _test_ioctl_destroy(self->fd, new_hwpt_id)); + + /* Detach from the new hw_pagetable and try again */ + test_cmd_mock_domain_replace(self->stdev_id, self->ioas_id); + test_ioctl_destroy(new_hwpt_id); + } else { + test_err_cmd_hwpt_alloc(ENOENT, self->device_id, + self->ioas_id, &new_hwpt_id); + test_err_mock_domain_replace(ENOENT, + self->stdev_id, new_hwpt_id); + } +} + +TEST_F(iommufd_ioas, nested_hwpt_alloc) +{ + uint32_t nested_hwpt_id[2] = {}; + uint32_t parent_hwpt_id = 0; + uint32_t test_hwpt_id = 0; + + if (self->device_id) { + /* Negative tests */ + test_err_cmd_hwpt_alloc(ENOENT, self->ioas_id, self->device_id, + &test_hwpt_id); + test_err_cmd_hwpt_alloc(EINVAL, self->device_id, + self->device_id, &test_hwpt_id); + + /* Allocate two nested hwpts sharing one common parent hwpt */ + test_cmd_hwpt_alloc(self->device_id, self->ioas_id, + &parent_hwpt_id); + + test_cmd_hwpt_alloc_nested(self->device_id, parent_hwpt_id, + &nested_hwpt_id[0]); + test_cmd_hwpt_alloc_nested(self->device_id, parent_hwpt_id, + &nested_hwpt_id[1]); + + /* Negative test: a nested hwpt on top of a nested hwpt */ + test_err_cmd_hwpt_alloc_nested(EINVAL, self->device_id, + nested_hwpt_id[0], + &test_hwpt_id); + /* Negative test: parent hwpt now cannot be freed */ + EXPECT_ERRNO(EBUSY, + _test_ioctl_destroy(self->fd, parent_hwpt_id)); + + /* Attach device to nested_hwpt_id[0] that then will be busy */ + test_cmd_mock_domain_replace(self->stdev_id, + nested_hwpt_id[0]); + EXPECT_ERRNO(EBUSY, + _test_ioctl_destroy(self->fd, nested_hwpt_id[0])); + + /* Switch from nested_hwpt_id[0] to nested_hwpt_id[1] */ + test_cmd_mock_domain_replace(self->stdev_id, + nested_hwpt_id[1]); + EXPECT_ERRNO(EBUSY, + _test_ioctl_destroy(self->fd, nested_hwpt_id[1])); + test_ioctl_destroy(nested_hwpt_id[0]); + + /* Detach from nested_hwpt_id[1] and destroy it */ + test_cmd_mock_domain_replace(self->stdev_id, parent_hwpt_id); + test_ioctl_destroy(nested_hwpt_id[1]); + + /* Detach from the parent hw_pagetable and destroy it */ + test_cmd_mock_domain_replace(self->stdev_id, self->ioas_id); + test_ioctl_destroy(parent_hwpt_id); + } else { + test_err_cmd_hwpt_alloc(ENOENT, self->device_id, self->ioas_id, + &parent_hwpt_id); + test_err_cmd_hwpt_alloc_nested(ENOENT, self->device_id, + parent_hwpt_id, + &nested_hwpt_id[0]); + test_err_cmd_hwpt_alloc_nested(ENOENT, self->device_id, + parent_hwpt_id, + &nested_hwpt_id[1]); + test_err_mock_domain_replace(ENOENT, self->stdev_id, + nested_hwpt_id[0]); + test_err_mock_domain_replace(ENOENT, self->stdev_id, + nested_hwpt_id[1]); + } +} + TEST_F(iommufd_ioas, hwpt_attach) { /* Create a device attached directly to a hwpt */ @@ -663,6 +758,8 @@ TEST_F(iommufd_ioas, access_pin) MOCK_FLAGS_ACCESS_CREATE_NEEDS_PIN_PAGES);
for (npages = 1; npages < BUFFER_SIZE / PAGE_SIZE; npages++) { + uint32_t new_hwpt_id = 0; + uint32_t mock_device_id; uint32_t mock_stdev_id; uint32_t mock_hwpt_id;
@@ -696,12 +793,27 @@ TEST_F(iommufd_ioas, access_pin) _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES), &access_cmd)); test_cmd_mock_domain(self->ioas_id, &mock_stdev_id, - &mock_hwpt_id, NULL); + &mock_hwpt_id, &mock_device_id); check_map_cmd.id = mock_hwpt_id; + if (variant->new_hwpt) { + test_cmd_hwpt_alloc(mock_device_id, self->ioas_id, + &new_hwpt_id); + test_cmd_mock_domain_replace(mock_stdev_id, + new_hwpt_id); + check_map_cmd.id = new_hwpt_id; + } else { + check_map_cmd.id = mock_hwpt_id; + } ASSERT_EQ(0, ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_MAP), &check_map_cmd));
+ if (variant->new_hwpt) { + /* Detach from the new hwpt for its destroy() */ + test_cmd_mock_domain_replace(mock_stdev_id, + mock_hwpt_id); + test_ioctl_destroy(new_hwpt_id); + } test_ioctl_destroy(mock_stdev_id); test_cmd_destroy_access_pages( access_cmd.id, diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h index f13df84f6b42..365fa3da2016 100644 --- a/tools/testing/selftests/iommu/iommufd_utils.h +++ b/tools/testing/selftests/iommu/iommufd_utils.h @@ -118,6 +118,42 @@ static int _test_cmd_hwpt_alloc(int fd, __u32 device_id, __u32 pt_id,
#define test_cmd_hwpt_alloc(device_id, pt_id, hwpt_id) \ ASSERT_EQ(0, _test_cmd_hwpt_alloc(self->fd, device_id, pt_id, hwpt_id)) +#define test_err_cmd_hwpt_alloc(_errno, device_id, pt_id, hwpt_id) \ + EXPECT_ERRNO(_errno, _test_cmd_hwpt_alloc(self->fd, device_id, \ + pt_id, hwpt_id)) + +static int _test_cmd_hwpt_alloc_nested(int fd, __u32 device_id, __u32 parent_id, + __u32 *hwpt_id) +{ + struct iommu_hwpt_selftest data = { + .flags = IOMMU_TEST_FLAG_NESTED, + .test_config = IOMMU_TEST_IOTLB_DEFAULT, + }; + struct iommu_hwpt_alloc cmd = { + .size = sizeof(cmd), + .dev_id = device_id, + .pt_id = parent_id, + .hwpt_type = IOMMU_HWPT_TYPE_SELFTTEST, + .data_len = sizeof(data), + .data_uptr = (uint64_t)&data, + }; + int ret; + + ret = ioctl(fd, IOMMU_HWPT_ALLOC, &cmd); + if (ret) + return ret; + if (hwpt_id) + *hwpt_id = cmd.out_hwpt_id; + return 0; +} + +#define test_cmd_hwpt_alloc_nested(device_id, parent_id, hwpt_id) \ + ASSERT_EQ(0, _test_cmd_hwpt_alloc_nested(self->fd, device_id, \ + parent_id, hwpt_id)) +#define test_err_cmd_hwpt_alloc_nested(_errno, device_id, parent_id, hwpt_id) \ + EXPECT_ERRNO(_errno, \ + _test_cmd_hwpt_alloc_nested(self->fd, device_id, \ + parent_id, hwpt_id))
static int _test_cmd_access_replace_ioas(int fd, __u32 access_id, unsigned int ioas_id)
From: Nicolin Chen nicolinc@nvidia.com
This allows testing whether the IOTLB has been invalidated or not.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/iommufd/iommufd_test.h          |  4 ++++
 drivers/iommu/iommufd/selftest.c              | 22 +++++++++++++++++++
 tools/testing/selftests/iommu/iommufd.c       |  4 ++++
 tools/testing/selftests/iommu/iommufd_utils.h | 14 ++++++++++++
 4 files changed, 44 insertions(+)
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h index eeac8100844d..93fcbb57b904 100644 --- a/drivers/iommu/iommufd/iommufd_test.h +++ b/drivers/iommu/iommufd/iommufd_test.h @@ -21,6 +21,7 @@ enum { IOMMU_TEST_OP_ACCESS_REPLACE_IOAS, IOMMU_TEST_OP_DEV_ADD_RESERVED, IOMMU_TEST_OP_DEV_DEL_RESERVED, + IOMMU_TEST_OP_MD_CHECK_IOTLB, };
enum { @@ -101,6 +102,9 @@ struct iommu_test_cmd { struct { __u32 ioas_id; } access_replace_ioas; + struct { + __u32 iotlb; + } check_iotlb; }; __u32 last; }; diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index 8364003efde3..92c37716804e 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -795,6 +795,25 @@ static int iommufd_test_md_check_refs(struct iommufd_ucmd *ucmd, return 0; }
+static int iommufd_test_md_check_iotlb(struct iommufd_ucmd *ucmd, + u32 mockpt_id, u32 iotlb) +{ + struct iommufd_hw_pagetable *hwpt; + struct mock_iommu_domain *mock; + int rc = 0; + + hwpt = get_md_pagetable(ucmd, mockpt_id, &mock); + if (IS_ERR(hwpt)) + return PTR_ERR(hwpt); + + mock = container_of(hwpt->domain, struct mock_iommu_domain, domain); + + if (iotlb != mock->iotlb) + rc = -EINVAL; + iommufd_put_object(&hwpt->obj); + return rc; +} + struct selftest_access { struct iommufd_access *access; struct file *file; @@ -1214,6 +1233,9 @@ int iommufd_test(struct iommufd_ucmd *ucmd) return iommufd_test_md_check_refs( ucmd, u64_to_user_ptr(cmd->check_refs.uptr), cmd->check_refs.length, cmd->check_refs.refs); + case IOMMU_TEST_OP_MD_CHECK_IOTLB: + return iommufd_test_md_check_iotlb(ucmd, cmd->id, + cmd->check_iotlb.iotlb); case IOMMU_TEST_OP_CREATE_ACCESS: return iommufd_test_create_access(ucmd, cmd->id, cmd->create_access.flags); diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c index 70116c2dfc94..942860365cb6 100644 --- a/tools/testing/selftests/iommu/iommufd.c +++ b/tools/testing/selftests/iommu/iommufd.c @@ -310,6 +310,10 @@ TEST_F(iommufd_ioas, nested_hwpt_alloc) &nested_hwpt_id[0]); test_cmd_hwpt_alloc_nested(self->device_id, parent_hwpt_id, &nested_hwpt_id[1]); + test_cmd_hwpt_check_iotlb(nested_hwpt_id[0], + IOMMU_TEST_IOTLB_DEFAULT); + test_cmd_hwpt_check_iotlb(nested_hwpt_id[1], + IOMMU_TEST_IOTLB_DEFAULT);
/* Negative test: a nested hwpt on top of a nested hwpt */ test_err_cmd_hwpt_alloc_nested(EINVAL, self->device_id, diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h index 365fa3da2016..5e75dfac65e9 100644 --- a/tools/testing/selftests/iommu/iommufd_utils.h +++ b/tools/testing/selftests/iommu/iommufd_utils.h @@ -122,6 +122,20 @@ static int _test_cmd_hwpt_alloc(int fd, __u32 device_id, __u32 pt_id, EXPECT_ERRNO(_errno, _test_cmd_hwpt_alloc(self->fd, device_id, \ pt_id, hwpt_id))
+#define test_cmd_hwpt_check_iotlb(hwpt_id, expected) \ + ({ \ + struct iommu_test_cmd test_cmd = { \ + .size = sizeof(test_cmd), \ + .op = IOMMU_TEST_OP_MD_CHECK_IOTLB, \ + .id = hwpt_id, \ + .check_iotlb = { .iotlb = expected }, \ + }; \ + ASSERT_EQ(0, \ + ioctl(self->fd, \ + _IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_IOTLB), \ + &test_cmd)); \ + }) + static int _test_cmd_hwpt_alloc_nested(int fd, __u32 device_id, __u32 parent_id, __u32 *hwpt_id) {
From: Nicolin Chen nicolinc@nvidia.com
Add a mock_domain_cache_invalidate_user() and a corresponding struct iommu_hwpt_invalidate_selftest, to allow testing IOMMU_HWPT_INVALIDATE from user space, using the new IOMMU_TEST_OP_MD_CHECK_IOTLB.
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/iommufd/iommufd_test.h          | 10 ++++++++++
 drivers/iommu/iommufd/selftest.c              | 19 ++++++++++++++++++
 tools/testing/selftests/iommu/iommufd.c       |  8 ++++++++
 tools/testing/selftests/iommu/iommufd_utils.h | 20 +++++++++++++++++++
 4 files changed, 57 insertions(+)
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h index 93fcbb57b904..9768d0c9e347 100644 --- a/drivers/iommu/iommufd/iommufd_test.h +++ b/drivers/iommu/iommufd/iommufd_test.h @@ -135,4 +135,14 @@ struct iommu_hwpt_selftest { __u64 test_config; };
+/** + * struct iommu_hwpt_invalidate_selftest + * + * @flags: invalidate flags + */ +struct iommu_hwpt_invalidate_selftest { +#define IOMMU_TEST_INVALIDATE_ALL (1ULL << 0) + __u64 flags; +}; + #endif diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index 92c37716804e..5f3e1f2a24e7 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -395,9 +395,28 @@ static const struct iommu_ops mock_ops = { }, };
+static int mock_domain_cache_invalidate_user(struct iommu_domain *domain, + void *user_data) +{ + struct iommu_hwpt_invalidate_selftest *inv_info = user_data; + struct mock_iommu_domain *mock = + container_of(domain, struct mock_iommu_domain, domain); + + if (domain->type != IOMMU_DOMAIN_NESTED || !mock->parent) + return -EINVAL; + + if (inv_info->flags & IOMMU_TEST_INVALIDATE_ALL) + mock->iotlb = 0; + + return 0; +} + static struct iommu_domain_ops domain_nested_ops = { .free = mock_domain_free, .attach_dev = mock_domain_nop_attach, + .cache_invalidate_user = mock_domain_cache_invalidate_user, + .cache_invalidate_user_data_len = + sizeof(struct iommu_hwpt_invalidate_selftest), };
static struct iommu_device mock_iommu_device = { diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c index 942860365cb6..bf0082d88a38 100644 --- a/tools/testing/selftests/iommu/iommufd.c +++ b/tools/testing/selftests/iommu/iommufd.c @@ -115,6 +115,7 @@ TEST_F(iommufd, cmd_length) TEST_LENGTH(iommu_destroy, IOMMU_DESTROY); TEST_LENGTH(iommu_hw_info, IOMMU_GET_HW_INFO); TEST_LENGTH(iommu_hwpt_alloc, IOMMU_HWPT_ALLOC); + TEST_LENGTH(iommu_hwpt_invalidate, IOMMU_HWPT_INVALIDATE); TEST_LENGTH(iommu_ioas_alloc, IOMMU_IOAS_ALLOC); TEST_LENGTH(iommu_ioas_iova_ranges, IOMMU_IOAS_IOVA_RANGES); TEST_LENGTH(iommu_ioas_allow_iovas, IOMMU_IOAS_ALLOW_IOVAS); @@ -323,6 +324,13 @@ TEST_F(iommufd_ioas, nested_hwpt_alloc) EXPECT_ERRNO(EBUSY, _test_ioctl_destroy(self->fd, parent_hwpt_id));
+ /* hwpt_invalidate only supports a user-managed hwpt (nested) */ + test_err_cmd_hwpt_invalidate(EINVAL, parent_hwpt_id); + test_cmd_hwpt_invalidate(nested_hwpt_id[0]); + test_cmd_hwpt_check_iotlb(nested_hwpt_id[0], 0); + test_cmd_hwpt_invalidate(nested_hwpt_id[1]); + test_cmd_hwpt_check_iotlb(nested_hwpt_id[1], 0); + /* Attach device to nested_hwpt_id[0] that then will be busy */ test_cmd_mock_domain_replace(self->stdev_id, nested_hwpt_id[0]); diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h index 5e75dfac65e9..9b3e5f36c4a3 100644 --- a/tools/testing/selftests/iommu/iommufd_utils.h +++ b/tools/testing/selftests/iommu/iommufd_utils.h @@ -169,6 +169,26 @@ static int _test_cmd_hwpt_alloc_nested(int fd, __u32 device_id, __u32 parent_id, _test_cmd_hwpt_alloc_nested(self->fd, device_id, \ parent_id, hwpt_id))
+static int _test_cmd_hwpt_invalidate(int fd, __u32 hwpt_id) +{ + struct iommu_hwpt_invalidate_selftest data = { + .flags = IOMMU_TEST_INVALIDATE_ALL, + }; + struct iommu_hwpt_invalidate cmd = { + .size = sizeof(cmd), + .hwpt_id = hwpt_id, + .data_len = sizeof(data), + .data_uptr = (uint64_t)&data, + }; + + return ioctl(fd, IOMMU_HWPT_INVALIDATE, &cmd); +} + +#define test_cmd_hwpt_invalidate(hwpt_id) \ + ASSERT_EQ(0, _test_cmd_hwpt_invalidate(self->fd, hwpt_id)) +#define test_err_cmd_hwpt_invalidate(_errno, hwpt_id) \ + EXPECT_ERRNO(_errno, _test_cmd_hwpt_invalidate(self->fd, hwpt_id)) + static int _test_cmd_access_replace_ioas(int fd, __u32 access_id, unsigned int ioas_id) {