From: Tian, Kevin Sent: Friday, January 5, 2024 10:16 AM
From: Jason Gunthorpe jgg@nvidia.com Sent: Thursday, January 4, 2024 10:37 PM
On Thu, Dec 14, 2023 at 07:26:39PM +0800, Yi Liu wrote:
Per the prior discussion[1], we agreed to move the error reporting into
the
driver specific part. On Intel side, we want to report two devTLB invalidation errors: ICE (invalid completion error) and ITE (invalidation timeout error). Such errors have an additional SID information to tell which device failed the devTLB invalidation. I've got the below structure.
IMHO all of this complexity is a consequence of the decision to hide the devtlb invalidation from the VM..
On the other hand I guess you want to do this because of the SIOV troubles where the vPCI function in the VM is entirely virtual and can't be trivially mapped to a real PCI function for ATC invalidation like ARM and AMD can do (but they also can't support SIOV because of this). :(
However it also makes it very confusing about how the VM would perceive an error - eg if it invalidates an SIOV device single PASID and that devtlb fails then the error should be connected back to the vPCI function for the SIOV's specific PASID and not back to the physical PCI function for the SIOV owner.
As the iommu driver itself has no idea about the vPCI functions this seems like it is going to get really confusing. The API I suggested in the other email is not entirely going to work as the vPCI function for SIOV cases will have to be identified by the (struct device, PASID) - while it would be easy enough for the iommu driver to provide the PASID, I'm not sure how the iommufd core will relate the PASID back to the iommu_device to understand SIOV without actually being aware of SIOV to some degree :\
we plan to add such awareness with a new binding helper:
https://lore.kernel.org/lkml/20231009085123.463179-4-yi.l.liu@intel.com/
and with metadata to track association between PASID's and iommufd vdev.
but in reality the relation could be identified in an easy way due to a SIOV restriction which we discussed before - shared PASID space of PF disallows assigning sibling vdev's to a same VM (otherwise no way to identify which sibling vdev triggering an iopf when a pasid is used on both vdev's). That restriction implies that within an iommufd context every iommufd_device object should contain a unique struct device pointer. So PASID can be instead ignored in the lookup then just always do iommufd_get_dev_id() using struct device.
A bit more background.
Previously we thought this restriction only applies to SIOV+vSVA, as a guest process may bind to both sibling vdev's, leading to the same pasid situation.
In concept w/o vSVA it's still possible to assign sibling vdev's to a same VM as each vdev is allocated with a unique pasid to mark vRID so can be differentiated from each other in the fault/error path.
But when looking at this err code issue with Yi closely, we found there is another gap in the VT-d spec. Upon devtlb invalidation timeout the hw doesn't report pasid in the error info register. this makes it impossible to identify the source vdev if a hwpt invalidation request involves sibling vdev's from a same PF.
with that I'm inclined to always imposing this restriction for SIOV. One may argue that SIOV w/o vSVA w/o devtlb is conceptually immune but I'm with you that given SIOVr1 is one-off I prefer to limiting its usability other than complexing the kernel.