This is the second part to add Intel VT-d nested translation based on IOMMUFD nesting infrastructure. As the iommufd nesting infrastructure series [1], iommu core supports new ops to invalidate the cache after the modifictions in stage-1 page table. So far, the cache invalidation data is vendor specific, the data_type (IOMMU_HWPT_DATA_VTD_S1) defined for the vendor specific HWPT allocation is reused in the cache invalidation path. User should provide the correct data_type that suit with the type used in HWPT allocation.
IOMMU_HWPT_INVALIDATE iotcl returns an error in @out_driver_error_code. However Intel VT-d does not define error code so far, so it's not easy to pre-define it in iommufd neither. As a result, this field should just be ignored on VT-d platform.
Complete code can be found in [2], corresponding QEMU could can be found in [3].
[1] https://lore.kernel.org/linux-iommu/20231117130717.19875-1-yi.l.liu@intel.co... [2] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting [3] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
Change log:
v7: - No much change, just rebase on top of 6.7-rc1
v6: https://lore.kernel.org/linux-iommu/20231020093719.18725-1-yi.l.liu@intel.co... - Address comments from Kevin - Split the VT-d nesting series into two parts (Jason)
v5: https://lore.kernel.org/linux-iommu/20230921075431.125239-1-yi.l.liu@intel.c... - Add Kevin's r-b for patch 2, 3 ,5 8, 10 - Drop enforce_cache_coherency callback from the nested type domain ops (Kevin) - Remove duplicate agaw check in patch 04 (Kevin) - Remove duplicate domain_update_iommu_cap() in patch 06 (Kevin) - Check parent's force_snooping to set pgsnp in the pasid entry (Kevin) - uapi data structure check (Kevin) - Simplify the errata handling as user can allocate nested parent domain
v4: https://lore.kernel.org/linux-iommu/20230724111335.107427-1-yi.l.liu@intel.c... - Remove ascii art tables (Jason) - Drop EMT (Tina, Jason) - Drop MTS and related definitions (Kevin) - Rename macro IOMMU_VTD_PGTBL_ to IOMMU_VTD_S1_ (Kevin) - Rename struct iommu_hwpt_intel_vtd_ to iommu_hwpt_vtd_ (Kevin) - Rename struct iommu_hwpt_intel_vtd to iommu_hwpt_vtd_s1 (Kevin) - Put the vendor specific hwpt alloc data structure before enuma iommu_hwpt_type (Kevin) - Do not trim the higher page levels of S2 domain in nested domain attachment as the S2 domain may have been used independently. (Kevin) - Remove the first-stage pgd check against the maximum address of s2_domain as hw can check it anyhow. It makes sense to check every pfns used in the stage-1 page table. But it cannot make it. So just leave it to hw. (Kevin) - Split the iotlb flush part into an order of uapi, helper and callback implementation (Kevin) - Change the policy of VT-d nesting errata, disallow RO mapping once a domain is used as parent domain of a nested domain. This removes the nested_users counting. (Kevin) - Minor fix for "make htmldocs"
v3: https://lore.kernel.org/linux-iommu/20230511145110.27707-1-yi.l.liu@intel.co... - Further split the patches into an order of adding helpers for nested domain, iotlb flush, nested domain attachment and nested domain allocation callback, then report the hw_info to userspace. - Add batch support in cache invalidation from userspace - Disallow nested translation usage if RO mappings exists in stage-2 domain due to errata on readonly mappings on Sapphire Rapids platform.
v2: https://lore.kernel.org/linux-iommu/20230309082207.612346-1-yi.l.liu@intel.c... - The iommufd infrastructure is split to be separate series.
v1: https://lore.kernel.org/linux-iommu/20230209043153.14964-1-yi.l.liu@intel.co...
Regards, Yi Liu
Yi Liu (3): iommufd: Add data structure for Intel VT-d stage-1 cache invalidation iommu/vt-d: Make iotlb flush helpers to be extern iommu/vt-d: Add iotlb flush for nested domain
drivers/iommu/intel/iommu.c | 10 +++---- drivers/iommu/intel/iommu.h | 6 ++++ drivers/iommu/intel/nested.c | 54 ++++++++++++++++++++++++++++++++++++ include/uapi/linux/iommufd.h | 36 ++++++++++++++++++++++++ 4 files changed, 101 insertions(+), 5 deletions(-)
This adds the data structure for flushing iotlb for the nested domain allocated with IOMMU_HWPT_DATA_VTD_S1 type.
This only supports invalidating IOTLB, but no for device-TLB as device-TLB invalidation will be covered automatically in the IOTLB invalidation if the underlying IOMMU driver has enabled ATS for the affected device.
Signed-off-by: Yi Liu yi.l.liu@intel.com --- include/uapi/linux/iommufd.h | 36 ++++++++++++++++++++++++++++++++++++ 1 file changed, 36 insertions(+)
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index 7f92cecc87d7..cafd98642abf 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -614,6 +614,42 @@ struct iommu_hwpt_get_dirty_bitmap { #define IOMMU_HWPT_GET_DIRTY_BITMAP _IO(IOMMUFD_TYPE, \ IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP)
+/** + * enum iommu_hwpt_vtd_s1_invalidate_flags - Flags for Intel VT-d + * stage-1 cache invalidation + * @IOMMU_VTD_INV_FLAGS_LEAF: The LEAF flag indicates whether only the + * leaf PTE caching needs to be invalidated + * and other paging structure caches can be + * preserved. + */ +enum iommu_hwpt_vtd_s1_invalidate_flags { + IOMMU_VTD_INV_FLAGS_LEAF = 1 << 0, +}; + +/** + * struct iommu_hwpt_vtd_s1_invalidate - Intel VT-d cache invalidation + * (IOMMU_HWPT_DATA_VTD_S1) + * @addr: The start address of the addresses to be invalidated. It needs + * to be 4KB aligned. + * @npages: Number of contiguous 4K pages to be invalidated. + * @flags: Combination of enum iommu_hwpt_vtd_s1_invalidate_flags + * @__reserved: Must be 0 + * + * The Intel VT-d specific invalidation data for user-managed stage-1 cache + * invalidation in nested translation. Userspace uses this structure to + * tell the impacted cache scope after modifying the stage-1 page table. + * + * Invalidating all the caches related to the page table by setting @addr + * to be 0 and @npages to be __aligned_u64(-1). This includes the + * corresponding device-TLB if ATS is enabled on the attached devices. + */ +struct iommu_hwpt_vtd_s1_invalidate { + __aligned_u64 addr; + __aligned_u64 npages; + __u32 flags; + __u32 __reserved; +}; + /** * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE) * @size: sizeof(struct iommu_hwpt_invalidate)
From: Liu, Yi L yi.l.liu@intel.com Sent: Friday, November 17, 2023 9:18 PM
This adds the data structure for flushing iotlb for the nested domain allocated with IOMMU_HWPT_DATA_VTD_S1 type.
This only supports invalidating IOTLB, but no for device-TLB as device-TLB invalidation will be covered automatically in the IOTLB invalidation if the underlying IOMMU driver has enabled ATS for the affected device.
"no for device-TLB" is misleading. Here just say that cache invalidation request applies to both IOTLB and device TLB (if ATS is enabled ...)
+/**
- enum iommu_hwpt_vtd_s1_invalidate_flags - Flags for Intel VT-d
stage-1 cache invalidation
- @IOMMU_VTD_INV_FLAGS_LEAF: The LEAF flag indicates whether only
the
leaf PTE caching needs to be invalidated
and other paging structure caches can be
preserved.
remove the words after 'and'
+/**
- struct iommu_hwpt_vtd_s1_invalidate - Intel VT-d cache invalidation
(IOMMU_HWPT_DATA_VTD_S1)
- @addr: The start address of the addresses to be invalidated. It needs
to be 4KB aligned.
remove "of the addresses"
- @npages: Number of contiguous 4K pages to be invalidated.
- @flags: Combination of enum iommu_hwpt_vtd_s1_invalidate_flags
- @__reserved: Must be 0
- The Intel VT-d specific invalidation data for user-managed stage-1 cache
- invalidation in nested translation. Userspace uses this structure to
- tell the impacted cache scope after modifying the stage-1 page table.
- Invalidating all the caches related to the page table by setting @addr
- to be 0 and @npages to be __aligned_u64(-1). This includes the
- corresponding device-TLB if ATS is enabled on the attached devices.
put words about device-TLB to last paragraph. Putting it here is confusing as if it only applies to invalidate-all.
On Mon, Nov 20, 2023 at 08:26:31AM +0000, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Friday, November 17, 2023 9:18 PM
This adds the data structure for flushing iotlb for the nested domain allocated with IOMMU_HWPT_DATA_VTD_S1 type.
This only supports invalidating IOTLB, but no for device-TLB as device-TLB invalidation will be covered automatically in the IOTLB invalidation if the underlying IOMMU driver has enabled ATS for the affected device.
"no for device-TLB" is misleading. Here just say that cache invalidation request applies to both IOTLB and device TLB (if ATS is enabled ...)
I think we should forward the ATS invalidation from the guest too? That is what ARM and AMD will have to do, can we keep them all consistent?
I understand Intel keeps track of enough stuff to know what the RIDs are, but is it necessary to make it different?
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Tuesday, November 21, 2023 7:05 AM
On Mon, Nov 20, 2023 at 08:26:31AM +0000, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Friday, November 17, 2023 9:18 PM
This adds the data structure for flushing iotlb for the nested domain allocated with IOMMU_HWPT_DATA_VTD_S1 type.
This only supports invalidating IOTLB, but no for device-TLB as device-TLB invalidation will be covered automatically in the IOTLB invalidation if the underlying IOMMU driver has enabled ATS for the affected device.
"no for device-TLB" is misleading. Here just say that cache invalidation request applies to both IOTLB and device TLB (if ATS is enabled ...)
I think we should forward the ATS invalidation from the guest too? That is what ARM and AMD will have to do, can we keep them all consistent?
I understand Intel keeps track of enough stuff to know what the RIDs are, but is it necessary to make it different?
probably ask the other way. Now intel-iommu driver always flushes iotlb and device tlb together then is it necessary to separate them in uAPI for no good (except doubled syscalls)? :)
anyway this is driver specific contract. I don't see a need to keep it consistent for all.
On Tue, Nov 21, 2023 at 02:54:15AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Tuesday, November 21, 2023 7:05 AM
On Mon, Nov 20, 2023 at 08:26:31AM +0000, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Friday, November 17, 2023 9:18 PM
This adds the data structure for flushing iotlb for the nested domain allocated with IOMMU_HWPT_DATA_VTD_S1 type.
This only supports invalidating IOTLB, but no for device-TLB as device-TLB invalidation will be covered automatically in the IOTLB invalidation if the underlying IOMMU driver has enabled ATS for the affected device.
"no for device-TLB" is misleading. Here just say that cache invalidation request applies to both IOTLB and device TLB (if ATS is enabled ...)
I think we should forward the ATS invalidation from the guest too? That is what ARM and AMD will have to do, can we keep them all consistent?
I understand Intel keeps track of enough stuff to know what the RIDs are, but is it necessary to make it different?
probably ask the other way. Now intel-iommu driver always flushes iotlb and device tlb together then is it necessary to separate them in uAPI for no good (except doubled syscalls)? :)
I wish I knew more about Intel CC design to be able to answer that :|
Doesn't the VM issue the ATC flush command regardless? How does it know it has a working ATC but does not need to flush it?
anyway this is driver specific contract. I don't see a need to keep it consistent for all.
Given that ARM and AMD need this and would have serious bugs if it didn't work this way I'm mildly concerned that Intel will be missing something here..
To my mind it seems like this is just a hold over from the prior design.
Jason
On 11/21/23 8:17 PM, Jason Gunthorpe wrote:
On Tue, Nov 21, 2023 at 02:54:15AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Tuesday, November 21, 2023 7:05 AM
On Mon, Nov 20, 2023 at 08:26:31AM +0000, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Friday, November 17, 2023 9:18 PM
This adds the data structure for flushing iotlb for the nested domain allocated with IOMMU_HWPT_DATA_VTD_S1 type.
This only supports invalidating IOTLB, but no for device-TLB as device-TLB invalidation will be covered automatically in the IOTLB invalidation if the underlying IOMMU driver has enabled ATS for the affected device.
"no for device-TLB" is misleading. Here just say that cache invalidation request applies to both IOTLB and device TLB (if ATS is enabled ...)
I think we should forward the ATS invalidation from the guest too? That is what ARM and AMD will have to do, can we keep them all consistent?
I understand Intel keeps track of enough stuff to know what the RIDs are, but is it necessary to make it different?
probably ask the other way. Now intel-iommu driver always flushes iotlb and device tlb together then is it necessary to separate them in uAPI for no good (except doubled syscalls)? :)
I wish I knew more about Intel CC design to be able to answer that :|
Doesn't the VM issue the ATC flush command regardless? How does it know it has a working ATC but does not need to flush it?
The Intel VT-d spec doesn't require the driver to flush iotlb and device tlb together. Therefore, the current approach of relying on caching mode to determine whether device TLB invalidation is necessary appears to be a performance optimization rather than an architectural requirement.
The vIOMMU driver assumes that it is running within a VM guest when caching mode is enabled. This assumption leads to an omission of device TLB invalidation, relying on the hypervisor to perform a combined flush of the IOLB and device TLB.
While this optimization aims to reduce VMEXIT overhead, it introduces potential issues:
- When a Linux guest running on a hypervisor other than KVM/QEMU, the assumption of combined IOLB and device TLB flushing by the hypervisor may be incorrect, potentially leading to missed device TLB invalidation.
- The caching mode doesn't apply to first-stage translation. Therefore, if the driver uses first-stage translation and still relies on caching mode to determine device TLB invalidation, the optimization fails.
A more reasonable optimization would be to allocate a bit in the iommu capability registers. The vIOMMU driver could then leverage this bit to determine whether it could eliminate a device invalidation request.
Best regards, baolu
On 2023/11/22 10:32, Baolu Lu wrote:
On 11/21/23 8:17 PM, Jason Gunthorpe wrote:
On Tue, Nov 21, 2023 at 02:54:15AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Tuesday, November 21, 2023 7:05 AM
On Mon, Nov 20, 2023 at 08:26:31AM +0000, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Friday, November 17, 2023 9:18 PM
This adds the data structure for flushing iotlb for the nested domain allocated with IOMMU_HWPT_DATA_VTD_S1 type.
This only supports invalidating IOTLB, but no for device-TLB as device-TLB invalidation will be covered automatically in the IOTLB invalidation if the underlying IOMMU driver has enabled ATS for the affected device.
"no for device-TLB" is misleading. Here just say that cache invalidation request applies to both IOTLB and device TLB (if ATS is enabled ...)
I think we should forward the ATS invalidation from the guest too? That is what ARM and AMD will have to do, can we keep them all consistent?
I understand Intel keeps track of enough stuff to know what the RIDs are, but is it necessary to make it different?
probably ask the other way. Now intel-iommu driver always flushes iotlb and device tlb together then is it necessary to separate them in uAPI for no good (except doubled syscalls)? :)
I wish I knew more about Intel CC design to be able to answer that :|
Doesn't the VM issue the ATC flush command regardless? How does it know it has a working ATC but does not need to flush it?
The Intel VT-d spec doesn't require the driver to flush iotlb and device tlb together.
Spec has below description. Although it does not say iotlb and device tlb should be flushed together, but there is indeed requirement that both should be flushed when a page is unmapped.
Chapter 6.5.2.5: "Since translation requests-without-PASID from a device may be serviced by hardware from the IOTLB, software must always request IOTLB invalidation (iotlb_inv_dsc) before requesting corresponding Device-TLB (dev_tlb_inv_dsc) invalidation."
Therefore, the current approach of relying on caching mode to determine whether device TLB invalidation is necessary appears to be a performance optimization rather than an architectural requirement.
The vIOMMU driver assumes that it is running within a VM guest when caching mode is enabled. This assumption leads to an omission of device TLB invalidation, relying on the hypervisor to perform a combined flush of the IOLB and device TLB.
yes, this is what the current intel iommu driver does. However, whether rely on caching mode or not is orthogonal with whether we need to uapis here. I think guest iommu driver could submit both iotlb and device tlb invalidation request. But Qemu could select if it needs to forward the device tlb invalidation request to kernel if kernel iommu driver has already covered the device tlb invalidation when get the request to invalidate iotlb.
While this optimization aims to reduce VMEXIT overhead, it introduces potential issues:
- When a Linux guest running on a hypervisor other than KVM/QEMU, the
assumption of combined IOLB and device TLB flushing by the hypervisor  may be incorrect, potentially leading to missed device TLB  invalidation.
Hmmm, this appears to be an intel iommu driver bug, it should submit both iotlb invalidation and device tlb invalidation requests. But as above, I think this is orthogonal here. KVM/QEMU could define its own uapi based on the implementation to gain the best suit.
- The caching mode doesn't apply to first-stage translation. Therefore,
if the driver uses first-stage translation and still relies on caching  mode to determine device TLB invalidation, the optimization fails.
yes, caching mode does no apply to first-stage translation table. But in nested translation, guest does not need to notify hypervisor when there is page unmapped. is it? So whether caching mode applies to first-stage translation table does not matter. TBH. I didn't see the problem due to this reason. But I agree that linux guest intel iommu driver needs to submit both iotlb and device tlb invalidation request to guarantee it can work on other hypervisors. And there should be other way to do the performance optimization.
A more reasonable optimization would be to allocate a bit in the iommu capability registers. The vIOMMU driver could then leverage this bit to determine whether it could eliminate a device invalidation request.
this may be something spec can be enhanced. But again it is just to make guest intel iommu driver to gain performance optimization and also can work on other hypervisors. As of this uapi design, considering it within the linux ecosystem is enough.
From: Jason Gunthorpe jgg@nvidia.com Sent: Tuesday, November 21, 2023 8:17 PM
On Tue, Nov 21, 2023 at 02:54:15AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Tuesday, November 21, 2023 7:05 AM
On Mon, Nov 20, 2023 at 08:26:31AM +0000, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Friday, November 17, 2023 9:18 PM
This adds the data structure for flushing iotlb for the nested domain allocated with IOMMU_HWPT_DATA_VTD_S1 type.
This only supports invalidating IOTLB, but no for device-TLB as device-
TLB
invalidation will be covered automatically in the IOTLB invalidation if
the
underlying IOMMU driver has enabled ATS for the affected device.
"no for device-TLB" is misleading. Here just say that cache invalidation request applies to both IOTLB and device TLB (if ATS is enabled ...)
I think we should forward the ATS invalidation from the guest too? That is what ARM and AMD will have to do, can we keep them all consistent?
I understand Intel keeps track of enough stuff to know what the RIDs are, but is it necessary to make it different?
probably ask the other way. Now intel-iommu driver always flushes iotlb and device tlb together then is it necessary to separate them in uAPI for no good (except doubled syscalls)? :)
I wish I knew more about Intel CC design to be able to answer that :|
Doesn't the VM issue the ATC flush command regardless? How does it know it has a working ATC but does not need to flush it?
anyway this is driver specific contract. I don't see a need to keep it consistent for all.
Given that ARM and AMD need this and would have serious bugs if it didn't work this way I'm mildly concerned that Intel will be missing something here..
To my mind it seems like this is just a hold over from the prior design.
As Yi/Baolu discussed there is an issue in intel-iommu driver which incorrectly skips devtlb invalidation in the guest with the assumption that the host combines iotlb/devtlb invalidation together. This is incorrect and should be fixed.
But what I was talking about earlier is about the uAPI between viommu and iommu driver. I don't see a need of having separate invalidation cmds for each, as I'm not sure what the user can expect in the window when iotlb and devtlb are out of sync.
then we just define hwpt 'cache' invalidation in vtd always refers to both iotlb and devtlb. Then viommu just needs to call invalidation uapi once when emulating virtual iotlb invalidation descriptor while emulating the following devtlb invalidation descriptor as a nop.
On Wed, Nov 22, 2023 at 04:58:24AM +0000, Tian, Kevin wrote:
As Yi/Baolu discussed there is an issue in intel-iommu driver which incorrectly skips devtlb invalidation in the guest with the assumption that the host combines iotlb/devtlb invalidation together. This is incorrect and should be fixed.
Yes, this seems quite problematic - you guys will have to think of something and decide what kind of backward compat you want :(
Maybe the viommu driver can observe the guest and if it sees an ATC invalidation assume it is non-buggy, until seen it can do a combined flush.
But what I was talking about earlier is about the uAPI between viommu and iommu driver. I don't see a need of having separate invalidation cmds for each, as I'm not sure what the user can expect in the window when iotlb and devtlb are out of sync.
If the guest is always issuing the device invalidation then I don't see too much point in suppressing it in the kernel. Just forward it naturally.
then we just define hwpt 'cache' invalidation in vtd always refers to both iotlb and devtlb. Then viommu just needs to call invalidation uapi once when emulating virtual iotlb invalidation descriptor while emulating the following devtlb invalidation descriptor as a nop.
In principle ATC and IOMMU TLB invalidations should not always be linked.
Any scenario that allows devices to share an IOTLB cache tag requires fewer IOMMU TLB invalidations than ATC invalidations.
I like the view of this invalidation interface as reflecting the actual HW and not trying to be smarter an real HW.
I'm fully expecting that Intel will adopt an direct-DMA flush queue like SMMU and AMD have already done as a performance optimization. In this world it makes no sense that the behavior of the direct DMA queue and driver mediated queue would be different.
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Wednesday, November 22, 2023 9:26 PM
On Wed, Nov 22, 2023 at 04:58:24AM +0000, Tian, Kevin wrote:
then we just define hwpt 'cache' invalidation in vtd always refers to both iotlb and devtlb. Then viommu just needs to call invalidation uapi once when emulating virtual iotlb invalidation descriptor while emulating the following devtlb invalidation descriptor as a nop.
In principle ATC and IOMMU TLB invalidations should not always be linked.
Any scenario that allows devices to share an IOTLB cache tag requires fewer IOMMU TLB invalidations than ATC invalidations.
as long as the host iommu driver has the same knowledge then it will always do the right thing.
e.g. one iotlb entry shared by 4 devices.
guest issues: 1) iotlb invalidation 2) devtlb invalidation for dev1 3) devtlb invalidation for dev2 4) devtlb invalidation for dev3 5) devtlb invalidation for dev4
intel-viommu calls HWPT cache invalidation for 1) and treats 2-5) as nop.
intel-iommu driver internally knows the iotlb is shared by 4 devices (given the same domain is attached to those devices) to handle HWPT cache invalidation:
1) iotlb invalidation 2) devtlb invalidation for dev1 3) devtlb invalidation for dev2 4) devtlb invalidation for dev3 5) devtlb invalidation for dev4
this is a good optimization by reducing 5 syscalls to 1, with the assumption that the guest shouldn't expect any deterministic behavior before 5) is completed to bring iotlb/devtlbs in sync.
another alternative is to have guest batch 1-5) in one request which allows viommu to batch them in one invalidation call too. But this is an orthogonal optimization in guest which we don't want to rely on.
I like the view of this invalidation interface as reflecting the actual HW and not trying to be smarter an real HW.
the guest-oriented interface e.g. viommu reflects the HW.
uAPI is kind of viommu internal implementation. IMHO it's not a bad thing to make it smarter as long as no guest observable breakage.
I'm fully expecting that Intel will adopt an direct-DMA flush queue like SMMU and AMD have already done as a performance optimization. In this world it makes no sense that the behavior of the direct DMA queue and driver mediated queue would be different.
that's a orthogonal topic. I don't think the value of direct-DMA flush queue should prevent possible optimization in the mediation path (as long as guest-expected deterministic behavior is sustained).
On Fri, Nov 24, 2023 at 03:00:45AM +0000, Tian, Kevin wrote:
I'm fully expecting that Intel will adopt an direct-DMA flush queue like SMMU and AMD have already done as a performance optimization. In this world it makes no sense that the behavior of the direct DMA queue and driver mediated queue would be different.
that's a orthogonal topic. I don't think the value of direct-DMA flush queue should prevent possible optimization in the mediation path (as long as guest-expected deterministic behavior is sustained).
Okay, well as long as the guest is generating the ATC invalidations we can always make an iommufd API flag to include or exclude the ATC invalidation when doing the ASID invalidation. So we aren't trapped
Jason
On 2023/11/17 21:18, Yi Liu wrote:> This adds the data structure for flushing iotlb for the nested domain
allocated with IOMMU_HWPT_DATA_VTD_S1 type.
This only supports invalidating IOTLB, but no for device-TLB as device-TLB invalidation will be covered automatically in the IOTLB invalidation if the underlying IOMMU driver has enabled ATS for the affected device.
Signed-off-by: Yi Liu yi.l.liu@intel.com
include/uapi/linux/iommufd.h | 36 ++++++++++++++++++++++++++++++++++++ 1 file changed, 36 insertions(+)
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index 7f92cecc87d7..cafd98642abf 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -614,6 +614,42 @@ struct iommu_hwpt_get_dirty_bitmap { #define IOMMU_HWPT_GET_DIRTY_BITMAP _IO(IOMMUFD_TYPE, \ IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP)
+/**
- enum iommu_hwpt_vtd_s1_invalidate_flags - Flags for Intel VT-d
stage-1 cache invalidation
- @IOMMU_VTD_INV_FLAGS_LEAF: The LEAF flag indicates whether only the
leaf PTE caching needs to be invalidated
and other paging structure caches can be
preserved.
- */
+enum iommu_hwpt_vtd_s1_invalidate_flags {
- IOMMU_VTD_INV_FLAGS_LEAF = 1 << 0,
+};
+/**
- struct iommu_hwpt_vtd_s1_invalidate - Intel VT-d cache invalidation
(IOMMU_HWPT_DATA_VTD_S1)
- @addr: The start address of the addresses to be invalidated. It needs
to be 4KB aligned.
- @npages: Number of contiguous 4K pages to be invalidated.
- @flags: Combination of enum iommu_hwpt_vtd_s1_invalidate_flags
- @__reserved: Must be 0
- The Intel VT-d specific invalidation data for user-managed stage-1 cache
- invalidation in nested translation. Userspace uses this structure to
- tell the impacted cache scope after modifying the stage-1 page table.
- Invalidating all the caches related to the page table by setting @addr
- to be 0 and @npages to be __aligned_u64(-1). This includes the
- corresponding device-TLB if ATS is enabled on the attached devices.
- */
+struct iommu_hwpt_vtd_s1_invalidate {
- __aligned_u64 addr;
- __aligned_u64 npages;
- __u32 flags;
- __u32 __reserved;
+};
- /**
- struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
- @size: sizeof(struct iommu_hwpt_invalidate)
Hi Jason, Kevin,
Per the prior discussion[1], we agreed to move the error reporting into the driver specific part. On Intel side, we want to report two devTLB invalidation errors: ICE (invalid completion error) and ITE (invalidation timeout error). Such errors have an additional SID information to tell which device failed the devTLB invalidation. I've got the below structure.
+/** + * struct iommu_hwpt_vtd_s1_invalidate - Intel VT-d cache invalidation + * (IOMMU_HWPT_DATA_VTD_S1) + * @addr: The start address of the addresses to be invalidated. It needs + * to be 4KB aligned. + * @npages: Number of contiguous 4K pages to be invalidated. + * @flags: Combination of enum iommu_hwpt_vtd_s1_invalidate_flags + * @__reserved: Must be 0. + * @error: One of enum iommu_hwpt_vtd_s1_invalidate_error_code + * @dev_id: The device in the invalidation completion message, it's meaninfful + * when @error_code is set to IOMMU_HWPT_VTD_S1_INVALIDATE_DEVTLB_ICE + * or IOMMU_HWPT_VTD_S1_INVALIDATE_DEVTLB_ITE. + * + * The Intel VT-d specific invalidation data for user-managed stage-1 cache + * invalidation in nested translation. Userspace uses this structure to + * tell the impacted cache scope after modifying the stage-1 page table. + * + * Invalidating all the caches related to the page table by setting @addr + * to be 0 and @npages to be U64_MAX. + * + * @error_code is meaningful only if the request is handled by kernel. This + * can be known by checking struct iommu_hwpt_invalidate::req_num output. + * @error_code only covers the errors detected by hardware after submitting + * the invalidation. The software detected errors would go through the normal + * ioctl errno. + */ +struct iommu_hwpt_vtd_s1_invalidate { + __aligned_u64 addr; + __aligned_u64 npages; + __u32 flags; + __u32 __reserved; + __u32 error; + __u32 dev_id; +};
dev_id is used to report the failed device, userspace should be able to map it to a vRID, and inject it to VM as part of ITE/ICE error.
However, I got a problem when trying to get dev_id in cache invalidation path, since this is filled in intel iommu driver. Seems like there is no good way for it. I've below alternatives to move forward, wish you have a look.
- Reporting pSID instead of dev_id. This may not work if userspace for example Qemu cen get a vfio device cdev fd from management stack. Maybe you have different opinion, do let me know.
- Let iommufd to convert a SID info or device pointer to a dev_id, and then report it back to userspace. This seems easiest, but breaks layer and also requires vt-d specific logic. :(
- Reuse Nicolin's vRID->pRID mapping. If thevRID->pRID mapping is maintained, then intel iommu can report a vRID back to user. But intel iommu driver does not have viommu context, no place to hold the vRID->pRID mapping. TBH. It may require other reasons to introduce it other than the error reporting need. Anyhow, this requires more thinking and also has dependency even if it is doable in intel side.
- Only report error code, but no device info at first. May adding the device info (dev_id or vRID) in future series. In reality, the existing Qemu vIOMMU does not report ICE, ITE, neither the SID to VM. Also, VT-d spec defined the ICE/ITE errors first in 2007 spec 1.1, and added SID info later in 2019 spec 3.1. We may do it in stage as well.
What about your opinion?
[1] https://lore.kernel.org/linux-iommu/a9699f71-805a-4a5a-9282-3ec52e5bc81a@int...
From: Liu, Yi L yi.l.liu@intel.com Sent: Thursday, December 14, 2023 7:27 PM
On 2023/11/17 21:18, Yi Liu wrote:> This adds the data structure for flushing iotlb for the nested domain
+struct iommu_hwpt_vtd_s1_invalidate {
- __aligned_u64 addr;
- __aligned_u64 npages;
- __u32 flags;
- __u32 __reserved;
- __u32 error;
- __u32 dev_id;
+};
dev_id is used to report the failed device, userspace should be able to map it to a vRID, and inject it to VM as part of ITE/ICE error.
However, I got a problem when trying to get dev_id in cache invalidation path, since this is filled in intel iommu driver. Seems like there is no good way for it. I've below alternatives to move forward, wish you have a look.
- Reporting pSID instead of dev_id. This may not work if userspace for
example Qemu cen get a vfio device cdev fd from management stack. Maybe you have different opinion, do let me know.
yes, there is no guarantee that pRID is always visible to the user.
- Let iommufd to convert a SID info or device pointer to a dev_id, and then
report it back to userspace. This seems easiest, but breaks layer and also requires vt-d specific logic. :(
yes, the current philosophy of iommufd is to put diver specific knowledge out of iommufd.
- Reuse Nicolin's vRID->pRID mapping. If thevRID->pRID mapping is
maintained, then intel iommu can report a vRID back to user. But intel iommu driver does not have viommu context, no place to hold the vRID-
pRID
mapping. TBH. It may require other reasons to introduce it other than the error reporting need. Anyhow, this requires more thinking and also has dependency even if it is doable in intel side.
this sounds like a cleaner way to inject knowledge which iommu driver requires to find out the user tag. but yes it's a bit weird to introduce viommu awareness in intel iommu driver when there is no such thing in real hardware.
and for this error reporting case what we actually require is the reverse map i.e. pRID->vRID. Not sure whether we can leverage the same RID mapping uAPI as for ARM/AMD but ignore viommu_id and then store vRID under device_domain_info. a bit tricky on life cycle management and also incompatible with SIOV...
let's see whether Jason has a better idea here.
- Only report error code, but no device info at first. May adding the
device info (dev_id or vRID) in future series. In reality, the existing Qemu vIOMMU does not report ICE, ITE, neither the SID to VM. Also, VT-d
and IOAS_UNMAP doesn't provide such ATS error either.
spec defined the ICE/ITE errors first in 2007 spec 1.1, and added SID info later in 2019 spec 3.1. We may do it in stage as well.
and it's not tied to a specific iommu version. the spec is stated in a way that software treats zero value in SID as no hw support so theoretically even a modern hw may not report SID for certain reason.
What about your opinion?
[1] https://lore.kernel.org/linux-iommu/a9699f71-805a-4a5a-9282- 3ec52e5bc81a@intel.com/
I'm fine with this staged approach given the spec allows this behavior and no vIOMMU properly emulates ITE/ICE today.
let's work out a new version w/o dev info and make sure it's in a good state first. Then let Jason decide next week whether he wants to take it for this cycle or not.
On Fri, Dec 15, 2023 at 01:50:07AM +0000, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Thursday, December 14, 2023 7:27 PM
On 2023/11/17 21:18, Yi Liu wrote:> This adds the data structure for flushing iotlb for the nested domain
+struct iommu_hwpt_vtd_s1_invalidate {
__aligned_u64 addr;
__aligned_u64 npages;
__u32 flags;
__u32 __reserved;
__u32 error;
__u32 dev_id;
+};
dev_id is used to report the failed device, userspace should be able to map it to a vRID, and inject it to VM as part of ITE/ICE error.
However, I got a problem when trying to get dev_id in cache invalidation path, since this is filled in intel iommu driver. Seems like there is no good way for it. I've below alternatives to move forward, wish you have a look.
- Reuse Nicolin's vRID->pRID mapping. If thevRID->pRID mapping is
maintained, then intel iommu can report a vRID back to user. But intel iommu driver does not have viommu context, no place to hold the vRID-
pRID
mapping. TBH. It may require other reasons to introduce it other than the error reporting need. Anyhow, this requires more thinking and also has dependency even if it is doable in intel side.
this sounds like a cleaner way to inject knowledge which iommu driver requires to find out the user tag. but yes it's a bit weird to introduce viommu awareness in intel iommu driver when there is no such thing in real hardware.
I think a viommu is defined more like a software object representing the virtual IOMMU in a VM. Since VT-d has a vIOMMU in a nesting case, there could be an object for it too?
and for this error reporting case what we actually require is the reverse map i.e. pRID->vRID. Not sure whether we can leverage the same RID mapping uAPI as for ARM/AMD but ignore viommu_id and then store vRID under device_domain_info. a bit tricky on life cycle management and also incompatible with SIOV...
One thing that I am not very clear here: since both vRID and dev_id are given by the VMM, shouldn't it already know the mapping if the point is to translate (pRID->)dev_id->vRID?
Thanks Nicolin
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, December 15, 2023 10:28 AM
On Fri, Dec 15, 2023 at 01:50:07AM +0000, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Thursday, December 14, 2023 7:27 PM
On 2023/11/17 21:18, Yi Liu wrote:> This adds the data structure for flushing iotlb for the nested domain
+struct iommu_hwpt_vtd_s1_invalidate {
__aligned_u64 addr;
__aligned_u64 npages;
__u32 flags;
__u32 __reserved;
__u32 error;
__u32 dev_id;
+};
dev_id is used to report the failed device, userspace should be able to
map
it to a vRID, and inject it to VM as part of ITE/ICE error.
However, I got a problem when trying to get dev_id in cache invalidation path, since this is filled in intel iommu driver. Seems like there is no good way for it. I've below alternatives to move forward, wish you have a look.
- Reuse Nicolin's vRID->pRID mapping. If thevRID->pRID mapping is
maintained, then intel iommu can report a vRID back to user. But intel iommu driver does not have viommu context, no place to hold the vRID-
pRID
mapping. TBH. It may require other reasons to introduce it other than the error reporting need. Anyhow, this requires more thinking and also has dependency even if it is doable in intel side.
this sounds like a cleaner way to inject knowledge which iommu driver requires to find out the user tag. but yes it's a bit weird to introduce viommu awareness in intel iommu driver when there is no such thing in real hardware.
I think a viommu is defined more like a software object representing the virtual IOMMU in a VM. Since VT-d has a vIOMMU in a nesting case, there could be an object for it too?
for VT-d it's not necessary to maintain such vIOMMU awareness in the kernel (before this error reporting case) given its interfaces are simply around hwpt's. there is no vIOMMU-scope operation provided by intel-iommu driver so far.
and for this error reporting case what we actually require is the reverse map i.e. pRID->vRID. Not sure whether we can leverage the same RID mapping uAPI as for ARM/AMD but ignore viommu_id and then store vRID under device_domain_info. a bit tricky on life cycle management and also incompatible with SIOV...
One thing that I am not very clear here: since both vRID and dev_id are given by the VMM, shouldn't it already know the mapping if the point is to translate (pRID->)dev_id->vRID?
it's true for current Qemu.
but there is plan to support Qemu accepting a fd passed by Libvirt. In that case Qemu even doesn't see the sysfs path hence is not aware of pRID. otherwise yes we could leave the translation to VMM instead.
On Fri, Dec 15, 2023 at 03:04:44AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, December 15, 2023 10:28 AM On Fri, Dec 15, 2023 at 01:50:07AM +0000, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Thursday, December 14, 2023 7:27 PM
On 2023/11/17 21:18, Yi Liu wrote:> This adds the data structure for flushing iotlb for the nested domain
+struct iommu_hwpt_vtd_s1_invalidate {
__aligned_u64 addr;
__aligned_u64 npages;
__u32 flags;
__u32 __reserved;
__u32 error;
__u32 dev_id;
+};
dev_id is used to report the failed device, userspace should be able to
map
it to a vRID, and inject it to VM as part of ITE/ICE error.
and for this error reporting case what we actually require is the reverse map i.e. pRID->vRID. Not sure whether we can leverage the same RID mapping uAPI as for ARM/AMD but ignore viommu_id and then store vRID under device_domain_info. a bit tricky on life cycle management and also incompatible with SIOV...
One thing that I am not very clear here: since both vRID and dev_id are given by the VMM, shouldn't it already know the mapping if the point is to translate (pRID->)dev_id->vRID?
it's true for current Qemu.
but there is plan to support Qemu accepting a fd passed by Libvirt. In that case Qemu even doesn't see the sysfs path hence is not aware of pRID. otherwise yes we could leave the translation to VMM instead.
I think I misread Yi's narrative: dev_id is a working approach for VMM to convert to a vRID, while he is asking for a better alternative :)
On 2023/12/15 11:32, Nicolin Chen wrote:
On Fri, Dec 15, 2023 at 03:04:44AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, December 15, 2023 10:28 AM On Fri, Dec 15, 2023 at 01:50:07AM +0000, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Thursday, December 14, 2023 7:27 PM
On 2023/11/17 21:18, Yi Liu wrote:> This adds the data structure for flushing iotlb for the nested domain
+struct iommu_hwpt_vtd_s1_invalidate {
__aligned_u64 addr;
__aligned_u64 npages;
__u32 flags;
__u32 __reserved;
__u32 error;
__u32 dev_id;
+};
dev_id is used to report the failed device, userspace should be able to
map
it to a vRID, and inject it to VM as part of ITE/ICE error.
and for this error reporting case what we actually require is the reverse map i.e. pRID->vRID. Not sure whether we can leverage the same RID mapping uAPI as for ARM/AMD but ignore viommu_id and then store vRID under device_domain_info. a bit tricky on life cycle management and also incompatible with SIOV...
One thing that I am not very clear here: since both vRID and dev_id are given by the VMM, shouldn't it already know the mapping if the point is to translate (pRID->)dev_id->vRID?
it's true for current Qemu.
but there is plan to support Qemu accepting a fd passed by Libvirt. In that case Qemu even doesn't see the sysfs path hence is not aware of pRID. otherwise yes we could leave the translation to VMM instead.
I think I misread Yi's narrative: dev_id is a working approach for VMM to convert to a vRID, while he is asking for a better alternative :)
In concept, dev_id works, but in reality we have problem to get a dev_id for a given device in intel iommu driver, hence I'm asking for help here. :)
On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote:
On 2023/12/15 11:32, Nicolin Chen wrote:
On Fri, Dec 15, 2023 at 03:04:44AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, December 15, 2023 10:28 AM On Fri, Dec 15, 2023 at 01:50:07AM +0000, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Thursday, December 14, 2023 7:27 PM
On 2023/11/17 21:18, Yi Liu wrote:
and for this error reporting case what we actually require is the reverse map i.e. pRID->vRID. Not sure whether we can leverage the same RID mapping uAPI as for ARM/AMD but ignore viommu_id and then store vRID under device_domain_info. a bit tricky on life cycle management and also incompatible with SIOV...
One thing that I am not very clear here: since both vRID and dev_id are given by the VMM, shouldn't it already know the mapping if the point is to translate (pRID->)dev_id->vRID?
it's true for current Qemu.
but there is plan to support Qemu accepting a fd passed by Libvirt. In that case Qemu even doesn't see the sysfs path hence is not aware of pRID. otherwise yes we could leave the translation to VMM instead.
I think I misread Yi's narrative: dev_id is a working approach for VMM to convert to a vRID, while he is asking for a better alternative :)
In concept, dev_id works, but in reality we have problem to get a dev_id for a given device in intel iommu driver, hence I'm asking for help here. :)
Yea, I got that.
Would it be possible for us to postpone this error report in the vtd driver?
Jason is taking vacation, so he'll unlikely be very active until the new year, although he would probably spare some time taking the cache_invalidate series once it's mature.
If the final solution is to use pRID<->vRID mappings for vtd too, we'd likely need the viommu/dev_set_virtual_id series that I am still working on, which certainly won't make it to this cycle.
Thanks Nic
From: Nicolin Chen nicolinc@nvidia.com Sent: Sunday, December 17, 2023 2:49 AM
On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote:
On 2023/12/15 11:32, Nicolin Chen wrote:
On Fri, Dec 15, 2023 at 03:04:44AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, December 15, 2023 10:28 AM On Fri, Dec 15, 2023 at 01:50:07AM +0000, Tian, Kevin wrote:
> From: Liu, Yi L yi.l.liu@intel.com > Sent: Thursday, December 14, 2023 7:27 PM > > On 2023/11/17 21:18, Yi Liu wrote: and for this error reporting case what we actually require is the reverse map i.e. pRID->vRID. Not sure whether we can leverage the same RID mapping uAPI as for ARM/AMD but ignore viommu_id and then store vRID under device_domain_info. a bit tricky on life cycle management and also incompatible with SIOV...
One thing that I am not very clear here: since both vRID and dev_id are given by the VMM, shouldn't it already know the mapping if the point is to translate (pRID->)dev_id->vRID?
it's true for current Qemu.
but there is plan to support Qemu accepting a fd passed by Libvirt. In that case Qemu even doesn't see the sysfs path hence is not aware of pRID. otherwise yes we could leave the translation to VMM instead.
I think I misread Yi's narrative: dev_id is a working approach for VMM to convert to a vRID, while he is asking for a better alternative :)
In concept, dev_id works, but in reality we have problem to get a dev_id for a given device in intel iommu driver, hence I'm asking for help here. :)
Yea, I got that.
Would it be possible for us to postpone this error report in the vtd driver?
that is the plan. We'll still report error code but no dev_id error data. this is allowed by vtd spec.
On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote:
I think I misread Yi's narrative: dev_id is a working approach for VMM to convert to a vRID, while he is asking for a better alternative :)
In concept, dev_id works, but in reality we have problem to get a dev_id for a given device in intel iommu driver, hence I'm asking for help here. :)
I think we just need to solve this one way or another.. Even if you use a viommu object you still end up having difficult coupling to iommufd
Some: iommufd_get_dev_id(struct iommufd_ctx *ictx, struct device *dev)
Callable by a driver (using the driver-callable function infrastructure we made for dirty tracking) Is really all that is needed here.
Jason
On 2024/1/3 07:38, Jason Gunthorpe wrote:
On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote:
I think I misread Yi's narrative: dev_id is a working approach for VMM to convert to a vRID, while he is asking for a better alternative :)
In concept, dev_id works, but in reality we have problem to get a dev_id for a given device in intel iommu driver, hence I'm asking for help here. :)
I think we just need to solve this one way or another.. Even if you use a viommu object you still end up having difficult coupling to iommufd
Some: iommufd_get_dev_id(struct iommufd_ctx *ictx, struct device *dev)
Callable by a driver (using the driver-callable function infrastructure we made for dirty tracking) Is really all that is needed here.
yep, I noticed IOMMUFD_DRIVER was selected by intel iommu driver when IOMMUFD is configed. Maybe such an API could be compiled when IOMMUFD_DRIVER is enabled as well. This does address my concern on making intel iommu driver depending on iommufd. But still need a way to pass in the iommufd_ctx pointer to intel iommu driver, and store it. Hence intel iommu driver could call the above iommufd_get_dev_id(). One possible way is passing it when attaching device to domain and clear it in detach. However, this seems not ideal as iommufd_ctx information should be static between bind_iommufd and unbind. But we don't call into intel iommu driver in the bind and unbind operations. May need to add new iommu op. Any suggestion here?
On Wed, Jan 03, 2024 at 10:24:42AM +0800, Yi Liu wrote:
On 2024/1/3 07:38, Jason Gunthorpe wrote:
On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote:
I think I misread Yi's narrative: dev_id is a working approach for VMM to convert to a vRID, while he is asking for a better alternative :)
In concept, dev_id works, but in reality we have problem to get a dev_id for a given device in intel iommu driver, hence I'm asking for help here. :)
I think we just need to solve this one way or another.. Even if you use a viommu object you still end up having difficult coupling to iommufd
Some: iommufd_get_dev_id(struct iommufd_ctx *ictx, struct device *dev)
Callable by a driver (using the driver-callable function infrastructure we made for dirty tracking) Is really all that is needed here.
yep, I noticed IOMMUFD_DRIVER was selected by intel iommu driver when IOMMUFD is configed. Maybe such an API could be compiled when IOMMUFD_DRIVER is enabled as well. This does address my concern on making intel iommu driver depending on iommufd. But still need a way to pass in the iommufd_ctx pointer to intel iommu driver, and store it. Hence intel iommu driver could call the above iommufd_get_dev_id(). One possible way is passing it when attaching device to domain and clear it in detach. However, this seems not ideal as iommufd_ctx information should be static between bind_iommufd and unbind. But we don't call into intel iommu driver in the bind and unbind operations. May need to add new iommu op. Any suggestion here?
You can pass the ctx to the invalidate op, it is already implied because the passed iommu_domain is linked to a single iommufd ctx.
Jason
On Wed, Jan 03, 2024 at 12:01:08PM -0400, Jason Gunthorpe wrote:
On Wed, Jan 03, 2024 at 10:24:42AM +0800, Yi Liu wrote:
On 2024/1/3 07:38, Jason Gunthorpe wrote:
On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote:
I think I misread Yi's narrative: dev_id is a working approach for VMM to convert to a vRID, while he is asking for a better alternative :)
In concept, dev_id works, but in reality we have problem to get a dev_id for a given device in intel iommu driver, hence I'm asking for help here. :)
I think we just need to solve this one way or another.. Even if you use a viommu object you still end up having difficult coupling to iommufd
Some: iommufd_get_dev_id(struct iommufd_ctx *ictx, struct device *dev)
Callable by a driver (using the driver-callable function infrastructure we made for dirty tracking) Is really all that is needed here.
yep, I noticed IOMMUFD_DRIVER was selected by intel iommu driver when IOMMUFD is configed. Maybe such an API could be compiled when IOMMUFD_DRIVER is enabled as well. This does address my concern on making intel iommu driver depending on iommufd. But still need a way to pass in the iommufd_ctx pointer to intel iommu driver, and store it. Hence intel iommu driver could call the above iommufd_get_dev_id(). One possible way is passing it when attaching device to domain and clear it in detach. However, this seems not ideal as iommufd_ctx information should be static between bind_iommufd and unbind. But we don't call into intel iommu driver in the bind and unbind operations. May need to add new iommu op. Any suggestion here?
You can pass the ctx to the invalidate op, it is already implied because the passed iommu_domain is linked to a single iommufd ctx.
The device virtual id lookup API needs something similar, yet it likely needs a viommu pointer (or its id) instead? As the table is attached to a viommu while an ictx can have multiple viommus, right?
Thanks Nic
On Wed, Jan 03, 2024 at 08:48:46AM -0800, Nicolin Chen wrote:
You can pass the ctx to the invalidate op, it is already implied because the passed iommu_domain is linked to a single iommufd ctx.
The device virtual id lookup API needs something similar, yet it likely needs a viommu pointer (or its id) instead? As the table is attached to a viommu while an ictx can have multiple viommus, right?
Yes, when we get to an API for that it will have to be some op 'invalidate_viommu(..)' and it can get the necessary pointers.
The viommu object will have to be some driver object like the iommu_domain.
Jason
On Wed, Jan 03, 2024 at 12:58:48PM -0400, Jason Gunthorpe wrote:
On Wed, Jan 03, 2024 at 08:48:46AM -0800, Nicolin Chen wrote:
You can pass the ctx to the invalidate op, it is already implied because the passed iommu_domain is linked to a single iommufd ctx.
The device virtual id lookup API needs something similar, yet it likely needs a viommu pointer (or its id) instead? As the table is attached to a viommu while an ictx can have multiple viommus, right?
Yes, when we get to an API for that it will have to be some op 'invalidate_viommu(..)' and it can get the necessary pointers.
OK! I will try that first.
The viommu object will have to be some driver object like the iommu_domain.
I drafted something like this, linking it to struct iommu_device:
+struct iommufd_viommu { + struct iommufd_object obj; + struct iommufd_ctx *ictx; + struct iommu_device *iommu_dev; + struct iommufd_hwpt_paging *hwpt; + /* array of struct iommufd_device, indexed by device virtual id */ + struct xarray device_ids; +};
Thanks Nicolin
On Wed, Jan 03, 2024 at 09:06:23AM -0800, Nicolin Chen wrote:
On Wed, Jan 03, 2024 at 12:58:48PM -0400, Jason Gunthorpe wrote:
On Wed, Jan 03, 2024 at 08:48:46AM -0800, Nicolin Chen wrote:
You can pass the ctx to the invalidate op, it is already implied because the passed iommu_domain is linked to a single iommufd ctx.
The device virtual id lookup API needs something similar, yet it likely needs a viommu pointer (or its id) instead? As the table is attached to a viommu while an ictx can have multiple viommus, right?
Yes, when we get to an API for that it will have to be some op 'invalidate_viommu(..)' and it can get the necessary pointers.
OK! I will try that first.
The viommu object will have to be some driver object like the iommu_domain.
I drafted something like this, linking it to struct iommu_device:
+struct iommufd_viommu {
struct iommufd_object obj;
struct iommufd_ctx *ictx;
struct iommu_device *iommu_dev;
struct iommufd_hwpt_paging *hwpt;
/* array of struct iommufd_device, indexed by device virtual id */
struct xarray device_ids;
+};
The driver would have to create it and there would be some driver specific enclosing struct to go with it
Perhaps device_ids goes in the driver specific struct, I don't know.
Not sure it should have hwpt at all, probably vmid should come from the driver specific struct in some driver specific way
Jason
On Wed, Jan 03, 2024 at 01:52:02PM -0400, Jason Gunthorpe wrote:
On Wed, Jan 03, 2024 at 09:06:23AM -0800, Nicolin Chen wrote:
On Wed, Jan 03, 2024 at 12:58:48PM -0400, Jason Gunthorpe wrote:
On Wed, Jan 03, 2024 at 08:48:46AM -0800, Nicolin Chen wrote:
You can pass the ctx to the invalidate op, it is already implied because the passed iommu_domain is linked to a single iommufd ctx.
The device virtual id lookup API needs something similar, yet it likely needs a viommu pointer (or its id) instead? As the table is attached to a viommu while an ictx can have multiple viommus, right?
Yes, when we get to an API for that it will have to be some op 'invalidate_viommu(..)' and it can get the necessary pointers.
OK! I will try that first.
The viommu object will have to be some driver object like the iommu_domain.
I drafted something like this, linking it to struct iommu_device:
+struct iommufd_viommu {
struct iommufd_object obj;
struct iommufd_ctx *ictx;
struct iommu_device *iommu_dev;
struct iommufd_hwpt_paging *hwpt;
/* array of struct iommufd_device, indexed by device virtual id */
struct xarray device_ids;
+};
The driver would have to create it and there would be some driver specific enclosing struct to go with it
Perhaps device_ids goes in the driver specific struct, I don't know.
+struct iommufd_viommu { + struct iommufd_object obj; + struct iommufd_ctx *ictx; + struct iommu_device *iommu_dev; + struct iommufd_hwpt_paging *hwpt; /* maybe unneeded */ + + int vmid; + + union iommu_driver_user_data { + struct iommu_driver_user_vtd; + struct iommu_driver_user_arm_smmuv3; + struct iommu_driver_user_amd_viommu; + }; +};
Then iommu_ops would need something like: iommu_user_alloc/free(struct iommu_device *iommu_dev, union *iommu_driver_user_data, int *vmid); iommu_user_set/unset_dev_id(union iommu_driver_user_data, struct device* dev. u32/u64 id); iommu_user_invalidate(union iommu_driver_user_data *iommu, struct iommu_user_data_array *array);
Comments and ideas on better naming convention?
Not sure it should have hwpt at all, probably vmid should come from the driver specific struct in some driver specific way
The idea having a hwpt pointer is to share the paging hwpt among the viommu objects. I don't think it "shouldn't have", yet likely we can avoid it depending on whether it will have some use or not in the context.
Nicolin
On Wed, Jan 03, 2024 at 12:18:35PM -0800, Nicolin Chen wrote:
The driver would have to create it and there would be some driver specific enclosing struct to go with it
Perhaps device_ids goes in the driver specific struct, I don't know.
+struct iommufd_viommu {
- struct iommufd_object obj;
- struct iommufd_ctx *ictx;
- struct iommu_device *iommu_dev;
- struct iommufd_hwpt_paging *hwpt; /* maybe unneeded */
- int vmid;
- union iommu_driver_user_data {
struct iommu_driver_user_vtd;
struct iommu_driver_user_arm_smmuv3;
struct iommu_driver_user_amd_viommu;
- };
Not like that, in the usual container_of way
Jason
On Wed, Jan 03, 2024 at 08:02:04PM -0400, Jason Gunthorpe wrote:
On Wed, Jan 03, 2024 at 12:18:35PM -0800, Nicolin Chen wrote:
The driver would have to create it and there would be some driver specific enclosing struct to go with it
Perhaps device_ids goes in the driver specific struct, I don't know.
+struct iommufd_viommu {
- struct iommufd_object obj;
- struct iommufd_ctx *ictx;
- struct iommu_device *iommu_dev;
- struct iommufd_hwpt_paging *hwpt; /* maybe unneeded */
- int vmid;
- union iommu_driver_user_data {
struct iommu_driver_user_vtd;
struct iommu_driver_user_arm_smmuv3;
struct iommu_driver_user_amd_viommu;
- };
Not like that, in the usual container_of way
How about this:
// iommu.h @@ -490,6 +490,16 @@ struct iommu_ops { + /* User space instance allocation and freeing by the iommu driver */ + struct iommu_device_user *(*iommu_alloc_user)(struct iommu_device *iommu); + void (*iommu_free_user)(struct iommu_device_user *iommu_user); + int (*iommu_user_set_dev_id)(struct iommu_device_user *iommu_user, + struct device *dev, u64 dev_id); + int (*iommu_user_unset_dev_id)(struct iommu_device_user *iommu_user, + struct device *dev); + int (*iommu_cache_invalidate_usewr)(struct iommu_device_user *iommu_user, + struct iommu_user_data_array *array); ... +/** + * struct iommu_device_user - IOMMU core representation of one IOMMU virtual + * instance + * @iommu_dev: Underlying IOMMU hardware instance + * @id: Virtual instance ID, e.g. a vmid + */ +struct iommu_device_user { + struct iommu_device *iommu_dev; + struct xarray virtual_ids; + u32 id; +};
// iommufd_private.h +struct iommufd_viommu { + struct iommufd_object obj; + struct iommufd_ctx *ictx; + struct iommu_device *iommu_dev; + struct iommu_device_user *iommu_user; + struct iommufd_hwpt_paging *hwpt; +}; + +int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd); +void iommufd_viommu_destroy(struct iommufd_object *obj);
So iommu_alloc_user op allocates a driver private structure that contains an iommu_device_user and returns &priv->iommu_user.
The set/unset_dev_id ops probably need a type argument if there can be a multi-xarray case, then the virtual_ids xarray should be moved to the driver private structure too?
Also, a 64-bit virtual id in the uAPI was suggested previously. But xarray is limited to a 32-bit index? Should we compromise the uAPI to 32-bit or is there an alternative for a 64-bit id lookup?
Thanks Nicolin
On Thu, Jan 04, 2024 at 11:38:40PM -0800, Nicolin Chen wrote:
On Wed, Jan 03, 2024 at 08:02:04PM -0400, Jason Gunthorpe wrote:
On Wed, Jan 03, 2024 at 12:18:35PM -0800, Nicolin Chen wrote:
The driver would have to create it and there would be some driver specific enclosing struct to go with it
Perhaps device_ids goes in the driver specific struct, I don't know.
+struct iommufd_viommu {
- struct iommufd_object obj;
- struct iommufd_ctx *ictx;
- struct iommu_device *iommu_dev;
- struct iommufd_hwpt_paging *hwpt; /* maybe unneeded */
- int vmid;
- union iommu_driver_user_data {
struct iommu_driver_user_vtd;
struct iommu_driver_user_arm_smmuv3;
struct iommu_driver_user_amd_viommu;
- };
Not like that, in the usual container_of way
How about this:
// iommu.h @@ -490,6 +490,16 @@ struct iommu_ops {
/* User space instance allocation and freeing by the iommu driver */
struct iommu_device_user *(*iommu_alloc_user)(struct iommu_device *iommu);
struct iommufd_viommu *iommu_alloc_viommu(struct device *dev);
+/**
- struct iommu_device_user - IOMMU core representation of one IOMMU virtual
instance
- @iommu_dev: Underlying IOMMU hardware instance
- @id: Virtual instance ID, e.g. a vmid
- */
+struct iommu_device_user {
struct iommu_device *iommu_dev;
struct xarray virtual_ids;
u32 id;
+};
// iommufd_private.h +struct iommufd_viommu {
struct iommufd_object obj;
struct iommufd_ctx *ictx;
struct iommu_device *iommu_dev;
struct iommu_device_user *iommu_user;
struct iommufd_hwpt_paging *hwpt;
+};
And you probably don't need two structs, just combine them together
And don't repeat the iommu_domain misdesign, there should be some helper to init (or maybe allocate and init) the core structure along with the driver part.
The set/unset_dev_id ops probably need a type argument if there can be a multi-xarray case, then the virtual_ids xarray should be moved to the driver private structure too?
Yeah, probably. No need to add things that are not used right now, but it does make some sense that the driver could control the datastructure used for mapping. eg AMD has a HW backed datastructure so they may not need an xarray.
Also, a 64-bit virtual id in the uAPI was suggested previously. But xarray is limited to a 32-bit index? Should we compromise the uAPI to 32-bit or is there an alternative for a 64-bit id lookup?
You can use 64 bits in the uapi and forbid values that are too large. xarry uses an unsigned long for the index so it is 64 bit on 64 bit systems.
Jason
On Fri, Dec 15, 2023 at 01:50:07AM +0000, Tian, Kevin wrote:
- Reuse Nicolin's vRID->pRID mapping. If thevRID->pRID mapping is
maintained, then intel iommu can report a vRID back to user. But intel iommu driver does not have viommu context, no place to hold the vRID-
pRID
mapping. TBH. It may require other reasons to introduce it other than the error reporting need. Anyhow, this requires more thinking and also has dependency even if it is doable in intel side.
this sounds like a cleaner way to inject knowledge which iommu driver requires to find out the user tag. but yes it's a bit weird to introduce viommu awareness in intel iommu driver when there is no such thing in real hardware.
and for this error reporting case what we actually require is the reverse map i.e. pRID->vRID. Not sure whether we can leverage the same RID mapping uAPI as for ARM/AMD but ignore viommu_id and then store vRID under device_domain_info. a bit tricky on life cycle management and also incompatible with SIOV...
let's see whether Jason has a better idea here.
I think v10 is OK
struct iommu_hwpt_invalidate { __u32 size; __u32 hwpt_id; __aligned_u64 data_uptr; __u32 data_type; __u32 entry_len; __u32 entry_num; __u32 __reserved; };
Sends the invalidation to the HWPT which matches what Intel wanted where the entire HWPT and all its associated devices are invalidated. No seperate per-device invalidation.
For error and event reporting they should be returned to userspace with the IOMMU dev_id indicating the originating PCI function.
The VMM would have to convert dev_id into vRID according to the vIOMMU instance that the device is hooked up.
In iommu land we should never have a "RID" but always some kind of device-specific "device ID" which is the index into the particular HW table, and that ID is naturally scoped to within the IOMMU instance that owns the table - so it is very much not a global ID that can be used alone in any of the uAPI.
The uAPI should use the iommufd device ID to refer to specific devices.
Jason
On Thu, Dec 14, 2023 at 07:26:39PM +0800, Yi Liu wrote:
Per the prior discussion[1], we agreed to move the error reporting into the driver specific part. On Intel side, we want to report two devTLB invalidation errors: ICE (invalid completion error) and ITE (invalidation timeout error). Such errors have an additional SID information to tell which device failed the devTLB invalidation. I've got the below structure.
IMHO all of this complexity is a consequence of the decision to hide the devtlb invalidation from the VM..
On the other hand I guess you want to do this because of the SIOV troubles where the vPCI function in the VM is entirely virtual and can't be trivially mapped to a real PCI function for ATC invalidation like ARM and AMD can do (but they also can't support SIOV because of this). :(
However it also makes it very confusing about how the VM would perceive an error - eg if it invalidates an SIOV device single PASID and that devtlb fails then the error should be connected back to the vPCI function for the SIOV's specific PASID and not back to the physical PCI function for the SIOV owner.
As the iommu driver itself has no idea about the vPCI functions this seems like it is going to get really confusing. The API I suggested in the other email is not entirely going to work as the vPCI function for SIOV cases will have to be identified by the (struct device, PASID) - while it would be easy enough for the iommu driver to provide the PASID, I'm not sure how the iommufd core will relate the PASID back to the iommu_device to understand SIOV without actually being aware of SIOV to some degree :\
(Given SIOVr1 seems on track to be replaced by SIOVr2 so this is all a one-off I was hoping to minimize such awareness)
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Thursday, January 4, 2024 10:37 PM
On Thu, Dec 14, 2023 at 07:26:39PM +0800, Yi Liu wrote:
Per the prior discussion[1], we agreed to move the error reporting into the driver specific part. On Intel side, we want to report two devTLB invalidation errors: ICE (invalid completion error) and ITE (invalidation timeout error). Such errors have an additional SID information to tell which device failed the devTLB invalidation. I've got the below structure.
IMHO all of this complexity is a consequence of the decision to hide the devtlb invalidation from the VM..
On the other hand I guess you want to do this because of the SIOV troubles where the vPCI function in the VM is entirely virtual and can't be trivially mapped to a real PCI function for ATC invalidation like ARM and AMD can do (but they also can't support SIOV because of this). :(
However it also makes it very confusing about how the VM would perceive an error - eg if it invalidates an SIOV device single PASID and that devtlb fails then the error should be connected back to the vPCI function for the SIOV's specific PASID and not back to the physical PCI function for the SIOV owner.
As the iommu driver itself has no idea about the vPCI functions this seems like it is going to get really confusing. The API I suggested in the other email is not entirely going to work as the vPCI function for SIOV cases will have to be identified by the (struct device, PASID) - while it would be easy enough for the iommu driver to provide the PASID, I'm not sure how the iommufd core will relate the PASID back to the iommu_device to understand SIOV without actually being aware of SIOV to some degree :\
we plan to add such awareness with a new binding helper:
https://lore.kernel.org/lkml/20231009085123.463179-4-yi.l.liu@intel.com/
and with metadata to track association between PASID's and iommufd vdev.
but in reality the relation could be identified in an easy way due to a SIOV restriction which we discussed before - shared PASID space of PF disallows assigning sibling vdev's to a same VM (otherwise no way to identify which sibling vdev triggering an iopf when a pasid is used on both vdev's). That restriction implies that within an iommufd context every iommufd_device object should contain a unique struct device pointer. So PASID can be instead ignored in the lookup then just always do iommufd_get_dev_id() using struct device.
From: Tian, Kevin Sent: Friday, January 5, 2024 10:16 AM
From: Jason Gunthorpe jgg@nvidia.com Sent: Thursday, January 4, 2024 10:37 PM
On Thu, Dec 14, 2023 at 07:26:39PM +0800, Yi Liu wrote:
Per the prior discussion[1], we agreed to move the error reporting into
the
driver specific part. On Intel side, we want to report two devTLB invalidation errors: ICE (invalid completion error) and ITE (invalidation timeout error). Such errors have an additional SID information to tell which device failed the devTLB invalidation. I've got the below structure.
IMHO all of this complexity is a consequence of the decision to hide the devtlb invalidation from the VM..
On the other hand I guess you want to do this because of the SIOV troubles where the vPCI function in the VM is entirely virtual and can't be trivially mapped to a real PCI function for ATC invalidation like ARM and AMD can do (but they also can't support SIOV because of this). :(
However it also makes it very confusing about how the VM would perceive an error - eg if it invalidates an SIOV device single PASID and that devtlb fails then the error should be connected back to the vPCI function for the SIOV's specific PASID and not back to the physical PCI function for the SIOV owner.
As the iommu driver itself has no idea about the vPCI functions this seems like it is going to get really confusing. The API I suggested in the other email is not entirely going to work as the vPCI function for SIOV cases will have to be identified by the (struct device, PASID) - while it would be easy enough for the iommu driver to provide the PASID, I'm not sure how the iommufd core will relate the PASID back to the iommu_device to understand SIOV without actually being aware of SIOV to some degree :\
we plan to add such awareness with a new binding helper:
https://lore.kernel.org/lkml/20231009085123.463179-4-yi.l.liu@intel.com/
and with metadata to track association between PASID's and iommufd vdev.
but in reality the relation could be identified in an easy way due to a SIOV restriction which we discussed before - shared PASID space of PF disallows assigning sibling vdev's to a same VM (otherwise no way to identify which sibling vdev triggering an iopf when a pasid is used on both vdev's). That restriction implies that within an iommufd context every iommufd_device object should contain a unique struct device pointer. So PASID can be instead ignored in the lookup then just always do iommufd_get_dev_id() using struct device.
A bit more background.
Previously we thought this restriction only applies to SIOV+vSVA, as a guest process may bind to both sibling vdev's, leading to the same pasid situation.
In concept w/o vSVA it's still possible to assign sibling vdev's to a same VM as each vdev is allocated with a unique pasid to mark vRID so can be differentiated from each other in the fault/error path.
But when looking at this err code issue with Yi closely, we found there is another gap in the VT-d spec. Upon devtlb invalidation timeout the hw doesn't report pasid in the error info register. this makes it impossible to identify the source vdev if a hwpt invalidation request involves sibling vdev's from a same PF.
with that I'm inclined to always imposing this restriction for SIOV. One may argue that SIOV w/o vSVA w/o devtlb is conceptually immune but I'm with you that given SIOVr1 is one-off I prefer to limiting its usability other than complexing the kernel.
On Fri, Jan 05, 2024 at 02:52:50AM +0000, Tian, Kevin wrote:
but in reality the relation could be identified in an easy way due to a SIOV restriction which we discussed before - shared PASID space of PF disallows assigning sibling vdev's to a same VM (otherwise no way to identify which sibling vdev triggering an iopf when a pasid is used on both vdev's). That restriction implies that within an iommufd context every iommufd_device object should contain a unique struct device pointer. So PASID can be instead ignored in the lookup then just always do iommufd_get_dev_id() using struct device.
A bit more background.
Previously we thought this restriction only applies to SIOV+vSVA, as a guest process may bind to both sibling vdev's, leading to the same pasid situation.
In concept w/o vSVA it's still possible to assign sibling vdev's to a same VM as each vdev is allocated with a unique pasid to mark vRID so can be differentiated from each other in the fault/error path.
I thought the SIOV plan was that each "vdev" ie vpci function would get a slice of the pRID's PASID space statically selected at creation?
So SVA/etc doesn't matter, you reliably get a disjoint set of pRID & pPASID into each VM.
From that view you can't identify the iommufd dev_id without knowing both the pRID and pPASID which will disambiguate the different SIOV iommufd dev_id instances sharing a rid.
But when looking at this err code issue with Yi closely, we found there is another gap in the VT-d spec. Upon devtlb invalidation timeout the hw doesn't report pasid in the error info register. this makes it impossible to identify the source vdev if a hwpt invalidation request involves sibling vdev's from a same PF.
Don't you know which command timed out?
with that I'm inclined to always imposing this restriction for SIOV. One may argue that SIOV w/o vSVA w/o devtlb is conceptually immune but I'm with you that given SIOVr1 is one-off I prefer to limiting its usability other than complexing the kernel.
By this you mean give up on SIOV entirely and always assign the full pRID to an entire VM? I'm confused what restriction you mean if you can't rely on the PASID?
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Friday, January 5, 2024 10:45 PM
On Fri, Jan 05, 2024 at 02:52:50AM +0000, Tian, Kevin wrote:
but in reality the relation could be identified in an easy way due to a SIOV restriction which we discussed before - shared PASID space of PF
disallows
assigning sibling vdev's to a same VM (otherwise no way to identify which sibling vdev triggering an iopf when a pasid is used on both vdev's). That restriction implies that within an iommufd context every iommufd_device object should contain a unique struct device pointer. So PASID can be instead ignored in the lookup then just always do iommufd_get_dev_id() using struct device.
A bit more background.
Previously we thought this restriction only applies to SIOV+vSVA, as a guest process may bind to both sibling vdev's, leading to the same pasid situation.
In concept w/o vSVA it's still possible to assign sibling vdev's to a same VM as each vdev is allocated with a unique pasid to mark vRID so can be differentiated from each other in the fault/error path.
I thought the SIOV plan was that each "vdev" ie vpci function would get a slice of the pRID's PASID space statically selected at creation?
So SVA/etc doesn't matter, you reliably get a disjoint set of pRID & pPASID into each VM.
From that view you can't identify the iommufd dev_id without knowing both the pRID and pPASID which will disambiguate the different SIOV iommufd dev_id instances sharing a rid.
true when assigning those instances to different VMs.
Here I was talking about assigning them to a same VM being a problem. with rid sharing plus same ENQCMD pPASID potentially used on both instances there'd be ambiguity in vSVA e.g. iopf to identify dev_id.
we agreed before on preventing sibling vdev's in one VM for above reason, but only as far as vSVA is concerned.
then given the new finding in err reporting I wondered whether this restriction should be applied to all SIOV scenarios (but irrelevant now with below explanation after another thinking)
But when looking at this err code issue with Yi closely, we found there is another gap in the VT-d spec. Upon devtlb invalidation timeout the hw doesn't report pasid in the error info register. this makes it impossible to identify the source vdev if a hwpt invalidation request involves sibling vdev's from a same PF.
Don't you know which command timed out?
unfortunately no.
for errors related to descriptor fetch the driver can tell the command by looking at the head pointer of the invalidation queue.
command completion is indirectly detected by inserting a wait descriptor as fence. completion timeout error is reported in an error register. but this register doesn't record pasid, nor does the command location. if there are multiple pending devtlb invalidation commands upon timeout error the spec suggests the driver to treat all of them timeout as the register can only record one rid.
this is kind of moot. If the driver submits only one command (plus wait) at a time it doesn't need hw's help to identify the timeout command. If the driver batches invalidation commands it must treat all timeout if an timeout error is reported.
from this angle whether to record pasid doesn't really matter.
intel-iommu driver doesn't batch commands. so it's possible for the driver to figure out the timeout device itself and identify rid plus pasid to find dev_id from iommufd.
Thanks Kevin
On Mon, Jan 08, 2024 at 04:07:12AM +0000, Tian, Kevin wrote:
In concept w/o vSVA it's still possible to assign sibling vdev's to a same VM as each vdev is allocated with a unique pasid to mark vRID so can be differentiated from each other in the fault/error path.
I thought the SIOV plan was that each "vdev" ie vpci function would get a slice of the pRID's PASID space statically selected at creation?
So SVA/etc doesn't matter, you reliably get a disjoint set of pRID & pPASID into each VM.
From that view you can't identify the iommufd dev_id without knowing both the pRID and pPASID which will disambiguate the different SIOV iommufd dev_id instances sharing a rid.
true when assigning those instances to different VMs.
Here I was talking about assigning them to a same VM being a problem. with rid sharing plus same ENQCMD pPASID potentially used on both instances there'd be ambiguity in vSVA e.g. iopf to identify dev_id.
Oh you imaging sharing the pPASID if things have the same translation? I guess I can see why, but given where things are overall I'd say just don't do that.
Indeed we can't do that because it makes the vRID unknowable.
(again I continue to think that vt-d cache design is messed up, using the PASID for the cache tag is a *terrible* design, and causes exactly these kinds of problems)
for errors related to descriptor fetch the driver can tell the command by looking at the head pointer of the invalidation queue.
command completion is indirectly detected by inserting a wait descriptor as fence. completion timeout error is reported in an error register. but this register doesn't record pasid, nor does the command location. if there are multiple pending devtlb invalidation commands upon timeout error the spec suggests the driver to treat all of them timeout as the register can only record one rid.
Makes sense, or at least you have to re-issue them one by one
this is kind of moot. If the driver submits only one command (plus wait) at a time it doesn't need hw's help to identify the timeout command. If the driver batches invalidation commands it must treat all timeout if an timeout error is reported.
Yes
from this angle whether to record pasid doesn't really matter.
At least for error handling..
Jason
On 2024/1/8 12:07, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Friday, January 5, 2024 10:45 PM
On Fri, Jan 05, 2024 at 02:52:50AM +0000, Tian, Kevin wrote:
but in reality the relation could be identified in an easy way due to a SIOV restriction which we discussed before - shared PASID space of PF
disallows
assigning sibling vdev's to a same VM (otherwise no way to identify which sibling vdev triggering an iopf when a pasid is used on both vdev's). That restriction implies that within an iommufd context every iommufd_device object should contain a unique struct device pointer. So PASID can be instead ignored in the lookup then just always do iommufd_get_dev_id() using struct device.
A bit more background.
Previously we thought this restriction only applies to SIOV+vSVA, as a guest process may bind to both sibling vdev's, leading to the same pasid situation.
In concept w/o vSVA it's still possible to assign sibling vdev's to a same VM as each vdev is allocated with a unique pasid to mark vRID so can be differentiated from each other in the fault/error path.
I thought the SIOV plan was that each "vdev" ie vpci function would get a slice of the pRID's PASID space statically selected at creation?
So SVA/etc doesn't matter, you reliably get a disjoint set of pRID & pPASID into each VM.
From that view you can't identify the iommufd dev_id without knowing both the pRID and pPASID which will disambiguate the different SIOV iommufd dev_id instances sharing a rid.
true when assigning those instances to different VMs.
Here I was talking about assigning them to a same VM being a problem. with rid sharing plus same ENQCMD pPASID potentially used on both instances there'd be ambiguity in vSVA e.g. iopf to identify dev_id.
we agreed before on preventing sibling vdev's in one VM for above reason, but only as far as vSVA is concerned.
then given the new finding in err reporting I wondered whether this restriction should be applied to all SIOV scenarios (but irrelevant now with below explanation after another thinking)
But when looking at this err code issue with Yi closely, we found there is another gap in the VT-d spec. Upon devtlb invalidation timeout the hw doesn't report pasid in the error info register. this makes it impossible to identify the source vdev if a hwpt invalidation request involves sibling vdev's from a same PF.
Don't you know which command timed out?
unfortunately no.
for errors related to descriptor fetch the driver can tell the command by looking at the head pointer of the invalidation queue.
command completion is indirectly detected by inserting a wait descriptor as fence. completion timeout error is reported in an error register. but this register doesn't record pasid, nor does the command location. if there are multiple pending devtlb invalidation commands upon timeout error the spec suggests the driver to treat all of them timeout as the register can only record one rid.
this is kind of moot. If the driver submits only one command (plus wait) at a time it doesn't need hw's help to identify the timeout command. If the driver batches invalidation commands it must treat all timeout if an timeout error is reported.
from this angle whether to record pasid doesn't really matter.
intel-iommu driver doesn't batch commands. so it's possible for the driver to figure out the timeout device itself and identify rid plus pasid to find dev_id from iommufd.
based on this, even RID is unnecessary. Software should know which device it has sent a devTLB invalidation.
This makes the helpers visible to nested.c.
Reviewed-by: Kevin Tian kevin.tian@intel.com Signed-off-by: Yi Liu yi.l.liu@intel.com --- drivers/iommu/intel/iommu.c | 10 +++++----- drivers/iommu/intel/iommu.h | 6 ++++++ 2 files changed, 11 insertions(+), 5 deletions(-)
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 3531b956556c..074b728a3e93 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -1512,10 +1512,10 @@ static void domain_flush_pasid_iotlb(struct intel_iommu *iommu, spin_unlock_irqrestore(&domain->lock, flags); }
-static void iommu_flush_iotlb_psi(struct intel_iommu *iommu, - struct dmar_domain *domain, - unsigned long pfn, unsigned int pages, - int ih, int map) +void iommu_flush_iotlb_psi(struct intel_iommu *iommu, + struct dmar_domain *domain, + unsigned long pfn, unsigned int pages, + int ih, int map) { unsigned int aligned_pages = __roundup_pow_of_two(pages); unsigned int mask = ilog2(aligned_pages); @@ -1588,7 +1588,7 @@ static inline void __mapping_notify_one(struct intel_iommu *iommu, iommu_flush_write_buffer(iommu); }
-static void intel_flush_iotlb_all(struct iommu_domain *domain) +void intel_flush_iotlb_all(struct iommu_domain *domain) { struct dmar_domain *dmar_domain = to_dmar_domain(domain); struct iommu_domain_info *info; diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h index 65d37a138c75..4b779d5f7215 100644 --- a/drivers/iommu/intel/iommu.h +++ b/drivers/iommu/intel/iommu.h @@ -891,6 +891,12 @@ void device_block_translation(struct device *dev); int prepare_domain_attach_device(struct iommu_domain *domain, struct device *dev); void domain_update_iommu_cap(struct dmar_domain *domain); +void iommu_flush_iotlb_psi(struct intel_iommu *iommu, + struct dmar_domain *domain, + unsigned long pfn, unsigned int pages, + int ih, int map); +void intel_flush_iotlb_all(struct iommu_domain *domain); +
int dmar_ir_support(void);
This implements the .cache_invalidate_user() callback to support iotlb flush for nested domain.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com Signed-off-by: Yi Liu yi.l.liu@intel.com --- drivers/iommu/intel/nested.c | 54 ++++++++++++++++++++++++++++++++++++ 1 file changed, 54 insertions(+)
diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c index b5a5563ab32c..44ad48db7ea0 100644 --- a/drivers/iommu/intel/nested.c +++ b/drivers/iommu/intel/nested.c @@ -73,9 +73,63 @@ static void intel_nested_domain_free(struct iommu_domain *domain) kfree(to_dmar_domain(domain)); }
+static void domain_flush_iotlb_psi(struct dmar_domain *domain, + u64 addr, unsigned long npages) +{ + struct iommu_domain_info *info; + unsigned long i; + + xa_for_each(&domain->iommu_array, i, info) + iommu_flush_iotlb_psi(info->iommu, domain, + addr >> VTD_PAGE_SHIFT, npages, 1, 0); +} + +static int intel_nested_cache_invalidate_user(struct iommu_domain *domain, + struct iommu_user_data_array *array, + u32 *cerror_idx) +{ + struct dmar_domain *dmar_domain = to_dmar_domain(domain); + struct iommu_hwpt_vtd_s1_invalidate inv_info; + u32 index; + int ret; + + /* REVISIT: + * VT-d has defined ITE, ICE, IQE for invalidation failure per hardware, + * but no error code yet, so just set the error code to be 0. + */ + *cerror_idx = 0; + + for (index = 0; index < array->entry_num; index++) { + ret = iommu_copy_struct_from_user_array(&inv_info, array, + IOMMU_HWPT_DATA_VTD_S1, + index, __reserved); + if (ret) { + pr_err_ratelimited("Failed to fetch invalidation request\n"); + break; + } + + if (inv_info.__reserved || (inv_info.flags & ~IOMMU_VTD_INV_FLAGS_LEAF) || + !IS_ALIGNED(inv_info.addr, VTD_PAGE_SIZE)) { + ret = -EINVAL; + break; + } + + if (inv_info.addr == 0 && inv_info.npages == -1) + intel_flush_iotlb_all(domain); + else + domain_flush_iotlb_psi(dmar_domain, + inv_info.addr, inv_info.npages); + } + + array->entry_num = index; + + return ret; +} + static const struct iommu_domain_ops intel_nested_domain_ops = { .attach_dev = intel_nested_attach_dev, .free = intel_nested_domain_free, + .cache_invalidate_user = intel_nested_cache_invalidate_user, };
struct iommu_domain *intel_nested_domain_alloc(struct iommu_domain *parent,
From: Liu, Yi L yi.l.liu@intel.com Sent: Friday, November 17, 2023 9:18 PM
if (inv_info.__reserved || (inv_info.flags &
~IOMMU_VTD_INV_FLAGS_LEAF) ||
!IS_ALIGNED(inv_info.addr, VTD_PAGE_SIZE)) {
ret = -EINVAL;
break;
}
-EOPNOTSUPP for the first two checks.
On 2023/11/20 16:32, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Friday, November 17, 2023 9:18 PM
if (inv_info.__reserved || (inv_info.flags &
~IOMMU_VTD_INV_FLAGS_LEAF) ||
!IS_ALIGNED(inv_info.addr, VTD_PAGE_SIZE)) {
ret = -EINVAL;
break;
}
-EOPNOTSUPP for the first two checks.
yes.
On Fri, Nov 17, 2023 at 05:18:16AM -0800, Yi Liu wrote:
+static int intel_nested_cache_invalidate_user(struct iommu_domain *domain,
struct iommu_user_data_array *array,
u32 *cerror_idx)
+{
- struct dmar_domain *dmar_domain = to_dmar_domain(domain);
- struct iommu_hwpt_vtd_s1_invalidate inv_info;
- u32 index;
- int ret;
- /* REVISIT:
* VT-d has defined ITE, ICE, IQE for invalidation failure per hardware,
* but no error code yet, so just set the error code to be 0.
*/
- *cerror_idx = 0;
- for (index = 0; index < array->entry_num; index++) {
ret = iommu_copy_struct_from_user_array(&inv_info, array,
IOMMU_HWPT_DATA_VTD_S1,
index, __reserved);
if (ret) {
pr_err_ratelimited("Failed to fetch invalidation request\n");
break;
No error prints on ioctls!
if (inv_info.addr == 0 && inv_info.npages == -1)
intel_flush_iotlb_all(domain);
-1 is clearer written as U64_MAX - same remark for the comment documenting it.
Jason
On 2023/12/7 02:56, Jason Gunthorpe wrote:
On Fri, Nov 17, 2023 at 05:18:16AM -0800, Yi Liu wrote:
+static int intel_nested_cache_invalidate_user(struct iommu_domain *domain,
struct iommu_user_data_array *array,
u32 *cerror_idx)
+{
- struct dmar_domain *dmar_domain = to_dmar_domain(domain);
- struct iommu_hwpt_vtd_s1_invalidate inv_info;
- u32 index;
- int ret;
- /* REVISIT:
* VT-d has defined ITE, ICE, IQE for invalidation failure per hardware,
* but no error code yet, so just set the error code to be 0.
*/
- *cerror_idx = 0;
- for (index = 0; index < array->entry_num; index++) {
ret = iommu_copy_struct_from_user_array(&inv_info, array,
IOMMU_HWPT_DATA_VTD_S1,
index, __reserved);
if (ret) {
pr_err_ratelimited("Failed to fetch invalidation request\n");
break;
No error prints on ioctls!
ok, will remove it.
if (inv_info.addr == 0 && inv_info.npages == -1)
intel_flush_iotlb_all(domain);
-1 is clearer written as U64_MAX - same remark for the comment documenting it.
sure.
linux-kselftest-mirror@lists.linaro.org