From: Jason Gunthorpe <jgg@nvidia.com>
Sent: Wednesday, November 22, 2023 9:26 PM
On Wed, Nov 22, 2023 at 04:58:24AM +0000, Tian, Kevin wrote:
then we just define that hwpt 'cache' invalidation in vtd always refers to both iotlb and devtlb. Then viommu just needs to call the invalidation uapi once when emulating a virtual iotlb invalidation descriptor, while emulating the following devtlb invalidation descriptor as a nop.
In principle ATC and IOMMU TLB invalidations should not always be linked.
Any scenario that allows devices to share an IOTLB cache tag requires fewer IOMMU TLB invalidations than ATC invalidations.
as long as the host iommu driver has the same knowledge then it will always do the right thing.
e.g. one iotlb entry shared by 4 devices.
guest issues:
1) iotlb invalidation
2) devtlb invalidation for dev1
3) devtlb invalidation for dev2
4) devtlb invalidation for dev3
5) devtlb invalidation for dev4
intel-viommu calls HWPT cache invalidation for 1) and treats 2-5) as nop.
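To make the viommu side of this concrete, here is a minimal sketch of the emulation path, assuming the proposed semantics where one HWPT cache invalidation covers both iotlb and devtlb. All names (vdesc, viommu_emulate, hwpt_cache_invalidate, replay_example) are hypothetical and only count calls; this is not the real uAPI or the intel-viommu code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor types on the guest's virtual invalidation queue. */
enum vdesc_type { VDESC_IOTLB, VDESC_DEVTLB };

struct vdesc {
    enum vdesc_type type;
    int dev_id; /* only meaningful for VDESC_DEVTLB */
};

/* Counts calls into the (hypothetical) HWPT cache invalidation uAPI. */
static int uapi_calls;

/* Stand-in for the invalidation syscall; a real viommu would issue it here. */
static void hwpt_cache_invalidate(void)
{
    uapi_calls++;
}

/* Emulate one guest descriptor: forward iotlb invalidations, drop devtlb ones. */
static void viommu_emulate(const struct vdesc *d)
{
    if (d->type == VDESC_IOTLB)
        hwpt_cache_invalidate();
    /* VDESC_DEVTLB: nop, assuming the host call above also covers devtlbs. */
}

/* Replay the guest sequence 1)-5) and return the number of uAPI calls made. */
static int replay_example(void)
{
    const struct vdesc q[] = {
        { VDESC_IOTLB,  0 },
        { VDESC_DEVTLB, 1 },
        { VDESC_DEVTLB, 2 },
        { VDESC_DEVTLB, 3 },
        { VDESC_DEVTLB, 4 },
    };

    uapi_calls = 0;
    for (size_t i = 0; i < sizeof(q) / sizeof(q[0]); i++)
        viommu_emulate(&q[i]);
    return uapi_calls;
}
```

Under these assumptions, replaying the five guest descriptors results in a single call into the host.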
intel-iommu driver internally knows the iotlb is shared by 4 devices (given the same domain is attached to those devices) to handle HWPT cache invalidation:
1) iotlb invalidation
2) devtlb invalidation for dev1
3) devtlb invalidation for dev2
4) devtlb invalidation for dev3
5) devtlb invalidation for dev4
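The host-side fan-out described here could look roughly like the sketch below. It assumes the driver tracks which devices are attached to a domain; the types and helpers (domain, flush_stats, domain_cache_invalidate, example_four_devices) are hypothetical and only count flushes rather than touching hardware, so this is not the actual intel-iommu code.

```c
#include <assert.h>

#define MAX_DEVS 8

/* Hypothetical domain: the devices that share one iotlb cache tag. */
struct domain {
    int ndevs;
    int dev_ids[MAX_DEVS];
};

/* Counters standing in for the hardware invalidation commands. */
struct flush_stats {
    int iotlb_flushes;
    int devtlb_flushes;
};

static void flush_iotlb(struct flush_stats *s)
{
    s->iotlb_flushes++;
}

static void flush_devtlb(struct flush_stats *s, int dev_id)
{
    (void)dev_id; /* a real driver would target this device's ATC */
    s->devtlb_flushes++;
}

/* One HWPT cache invalidation: one iotlb flush plus one devtlb flush per device. */
static struct flush_stats domain_cache_invalidate(const struct domain *dom)
{
    struct flush_stats s = { 0, 0 };

    flush_iotlb(&s);
    for (int i = 0; i < dom->ndevs; i++)
        flush_devtlb(&s, dom->dev_ids[i]);
    return s;
}

/* The example from the text: one iotlb entry shared by dev1..dev4. */
static struct flush_stats example_four_devices(void)
{
    const struct domain dom = { 4, { 1, 2, 3, 4 } };

    return domain_cache_invalidate(&dom);
}
```

With four devices attached, a single request expands into one iotlb flush and four devtlb flushes, matching the sequence listed above.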
this is a good optimization, reducing 5 syscalls to 1, on the assumption that the guest shouldn't expect any deterministic behavior before 5) is completed to bring iotlb/devtlbs in sync.
another alternative is to have the guest batch 1-5) in one request, which allows viommu to batch them in one invalidation call too. But this is an orthogonal optimization in the guest which we don't want to rely on.
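For comparison, the batching alternative can be sketched as follows, assuming a uAPI that accepts an array of invalidation requests in one call. The request layout and the names (inv_req, hwpt_invalidate_batch, forward_batched, batched_example) are hypothetical, and the stub only counts calls and entries.

```c
#include <assert.h>
#include <stddef.h>

enum inv_type { INV_IOTLB, INV_DEVTLB };

/* Hypothetical request entry mirroring one guest descriptor. */
struct inv_req {
    enum inv_type type;
    int dev_id; /* only meaningful for INV_DEVTLB */
};

static int batch_calls;      /* number of calls into the stub uAPI */
static size_t batch_entries; /* total entries passed across all calls */

/* Stand-in for a uAPI call that takes an array of requests. */
static void hwpt_invalidate_batch(const struct inv_req *reqs, size_t n)
{
    (void)reqs;
    batch_calls++;
    batch_entries += n;
}

/* Forward a guest-submitted batch as-is and return the number of uAPI calls. */
static int forward_batched(const struct inv_req *reqs, size_t n)
{
    batch_calls = 0;
    batch_entries = 0;
    if (n > 0)
        hwpt_invalidate_batch(reqs, n);
    return batch_calls;
}

/* The guest batches steps 1)-5) into one request. */
static int batched_example(void)
{
    const struct inv_req reqs[] = {
        { INV_IOTLB,  0 },
        { INV_DEVTLB, 1 },
        { INV_DEVTLB, 2 },
        { INV_DEVTLB, 3 },
        { INV_DEVTLB, 4 },
    };

    return forward_batched(reqs, sizeof(reqs) / sizeof(reqs[0]));
}
```

This also gets down to one call for five entries, but only if the guest actually submits them together, which is the dependency noted above.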
I like the view of this invalidation interface as reflecting the actual HW and not trying to be smarter than real HW.
the guest-oriented interface, e.g. viommu, reflects the HW.
the uAPI is kind of a viommu-internal implementation detail. IMHO it's not a bad thing to make it smarter as long as there is no guest-observable breakage.
I'm fully expecting that Intel will adopt a direct-DMA flush queue like SMMU and AMD have already done as a performance optimization. In this world it makes no sense that the behavior of the direct DMA queue and driver mediated queue would be different.
that's an orthogonal topic. I don't think the value of a direct-DMA flush queue should prevent possible optimization in the mediation path (as long as guest-expected deterministic behavior is sustained).