On 5/31/23 8:33 AM, Jason Gunthorpe wrote:
On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote:
Hi folks,
This series implements the functionality of delivering IO page faults to user space through the IOMMUFD framework. The use case is nested translation, where modern IOMMU hardware supports two-stage translation tables. The second-stage translation table is managed by the host VMM while the first-stage translation table is owned by user space. Hence, any IO page fault that occurs on the first-stage page table should be delivered to user space and handled there. User space should then send the result of the page fault handling back down to the device through the IOMMUFD response uAPI.
User space indicates its capability of handling IO page faults by setting a user HWPT allocation flag, IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD will then set up its infrastructure for page fault delivery. Together with the iopf-capable flag, user space should also provide an eventfd on which it will listen for any bottom-up page fault messages.
When the allocation of an iopf-capable HWPT succeeds, a fault fd is returned. User space can read fault messages from it once the eventfd is signaled.
This is a performance path, so we really need to think about this more. Polling on an eventfd and then reading a different fd is not a good design.
What I would like is to have a design from the start that fits into io_uring, so we can have pre-posted 'recvs' in io_uring that just get completed at high speed when PRIs come in.
This suggests that the PRI should be delivered via read() on a single FD, with that same FD being pollable, without any eventfd.
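To make the single-FD idea concrete, here is a rough user-space sketch of what polling and reading on the same fd could look like. It is only an illustration: struct fault_msg, its fields and read_one_fault() are made-up names, not the IOMMUFD uAPI, and it assumes each read() returns exactly one fixed-size record.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical fixed-size fault record; NOT the real IOMMUFD uAPI layout. */
struct fault_msg {
	uint32_t dev_id;	/* assumed: device that raised the request */
	uint32_t flags;		/* assumed: request flags */
	uint64_t addr;		/* assumed: faulting IO virtual address */
};

/*
 * Wait until fault_fd is readable, then read one record from that same fd.
 * Returns 1 on success, 0 on timeout, -1 on error (errno is set).
 */
int read_one_fault(int fault_fd, struct fault_msg *out, int timeout_ms)
{
	struct pollfd pfd = { .fd = fault_fd, .events = POLLIN };
	ssize_t n;
	int ready;

	assert(out != NULL);

	/* One fd serves both readiness notification and data transfer. */
	do {
		ready = poll(&pfd, 1, timeout_ms);
	} while (ready < 0 && errno == EINTR);
	if (ready <= 0)
		return ready;	/* 0: timeout, -1: poll() failed */

	if (!(pfd.revents & POLLIN)) {
		errno = EIO;	/* hang-up or error without data */
		return -1;
	}

	do {
		n = read(fault_fd, out, sizeof(*out));
	} while (n < 0 && errno == EINTR);
	if (n != (ssize_t)sizeof(*out)) {
		if (n >= 0)
			errno = EIO;	/* short read: treated as an error here */
		return -1;
	}
	return 1;
}
```

Because readiness and data both come from one fd, the same read() could in principle be posted in advance through io_uring instead of going through poll() first.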
Good suggestion. I will head in this direction.
Besides the overall design, I'd like to hear comments on the design points below:
- The IOMMUFD fault message format. It is very similar to that in uapi/linux/iommu which has been discussed before and partially used by the IOMMU SVA implementation. I'd like to get more comments on the format when it comes to IOMMUFD.
We have to have the same discussion as always: does a generic fault message format make any sense here?
For PRI it seems more likely that it would, but it needs a big, careful cross-vendor check.
Yeah, good point.
As far as I can see, there are at least three types of IOPF hardware implementations.
- PCI/PRI: Vendors might have their own additions. For example, VT-d 3.0 allows root-complex integrated endpoints to carry device-specific private data in their page requests. This has been removed from the spec since v4.0.
- DMA stalls.
- Device-specific (non-PRI, not through IOMMU).
Does IOMMUFD want to support the last case?
Best regards, baolu