Hi Nicolin,
On Fri, 10 Jan 2025 19:32:16 -0800 Nicolin Chen nicolinc@nvidia.com wrote:
[ Background ] On ARM GIC systems and others, the target address of the MSI is translated by the IOMMU. For GIC, the MSI address page is called "ITS" page. When the IOMMU is disabled, the MSI address is programmed to the physical location of the GIC ITS page (e.g. 0x20200000). When the IOMMU is enabled, the ITS page is behind the IOMMU, so the MSI address is programmed to an allocated IO virtual address (a.k.a IOVA), e.g. 0xFFFF0000, which must be mapped to the physical ITS page: IOVA (0xFFFF0000) ===> PA (0x20200000). When a 2-stage translation is enabled, IOVA will be still used to program the MSI address, though the mappings will be in two stages: IOVA (0xFFFF0000) ===> IPA (e.g. 0x80900000) ===> PA (0x20200000) (IPA stands for Intermediate Physical Address).
If the device that generates MSI is attached to an IOMMU_DOMAIN_DMA, the IOVA is dynamically allocated from the top of the IOVA space. If attached to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO passthrough device), the IOVA is fixed to an MSI window reported by the IOMMU driver via IOMMU_RESV_SW_MSI, which is hardwired to MSI_IOVA_BASE (IOVA==0x8000000) for ARM IOMMUs.
So far, this IOMMU_RESV_SW_MSI works well as kernel is entirely in charge of the IOMMU translation (1-stage translation), since the IOVA for the ITS page is fixed and known by kernel. However, with virtual machine enabling a nested IOMMU translation (2-stage), a guest kernel directly controls the stage-1 translation with an IOMMU_DOMAIN_DMA, mapping a vITS page (at an IPA 0x80900000) onto its own IOVA space (e.g. 0xEEEE0000). Then, the host kernel can't know that guest-level IOVA to program the MSI address.
There have been two approaches to solve this problem:
- Create an identity mapping in the stage-1. VMM could insert a few
RMRs (Reserved Memory Regions) in guest's IORT. Then the guest kernel would fetch these RMR entries from the IORT and create an IOMMU_RESV_DIRECT region per iommu group for a direct mapping. Eventually, the mappings would look like: IOVA (0x8000000) === IPA (0x8000000) ===> 0x20200000 This requires an IOMMUFD ioctl for kernel and VMM to agree on the IPA.
Should this RMR be in a separate range than MSI_IOVA_BASE? The guest will have MSI_IOVA_BASE in a reserved region already, no? e.g. # cat /sys/bus/pci/devices/0015:01:00.0/iommu_group/reserved_regions 0x0000000008000000 0x00000000080fffff msi