This series adds Intel VT-d nested translation support based on the IOMMUFD nesting infrastructure. With the iommufd nesting infrastructure series [1], the iommu core supports new ops to report iommu hardware information, allocate domains with user data, and sync the stage-1 IOTLB. The data required in these three paths is vendor-specific, so:
1) IOMMU_HW_INFO_TYPE_INTEL_VTD and struct iommu_device_info_vtd are defined to report iommu hardware information for Intel VT-d.
2) IOMMU_HWPT_TYPE_VTD_S1 is defined for the Intel VT-d stage-1 page table; it is used in the stage-1 domain allocation and IOTLB syncing paths. struct iommu_hwpt_intel_vtd is defined to pass user_data for Intel VT-d stage-1 domain allocation, and struct iommu_hwpt_invalidate_intel_vtd is defined to pass the data for Intel VT-d stage-1 IOTLB invalidation.
With the above IOMMUFD extensions, the Intel iommu driver implements these three paths to support nested translation.
The first Intel platform supporting nested translation is Sapphire Rapids which, unfortunately, has a hardware erratum [2] requiring special treatment. The erratum is hit when a stage-1 page table page (at any level) is located in a stage-2 read-only region. In that case the IOMMU hardware may ignore the stage-2 RO permission and still set the A/D bits in stage-1 page table entries during the page table walk.
A flag, IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17, is introduced to report this erratum to userspace. With that restriction, the user should either disable nested translation in favor of RO stage-2 mappings, or ensure there is no RO stage-2 mapping in order to enable nested translation.
The intel-iommu driver is armed with the necessary checks to prevent such a mix in patch 10 of this series.
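To make the constraint concrete, here is a minimal sketch of the kind of check meant above; the helper name and the two flags are hypothetical illustrations, not necessarily what patch 10 actually adds:

/*
 * Sketch only: a stage-2 domain that already contains read-only mappings
 * must not become a nesting parent (errata 772415), and once it is a
 * nesting parent it must not gain read-only mappings. The field names
 * below are hypothetical.
 */
static int sketch_check_nesting_parent(struct dmar_domain *s2_domain)
{
	if (s2_domain->has_read_only_mapping)	/* hypothetical flag */
		return -EINVAL;
	s2_domain->is_nesting_parent = true;	/* hypothetical flag */
	return 0;
}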
QEMU currently does add RO mappings, though. The vfio agent in QEMU simply maps all valid regions in the GPA address space, which certainly includes RO regions, e.g. the vBIOS.
In reality we don't know of a usage that relies on DMA reads from the BIOS region. Hence, finding a way to let the user opt out of RO mappings in QEMU might be an acceptable tradeoff, but how to achieve it cleanly needs more discussion in the QEMU community. For now we just hacked QEMU for testing.
The complete code can be found in [3]; the QEMU code can be found in [4].
base-commit: ce9b593b1f74ccd090edc5d2ad397da84baa9946
[1] https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.co...
[2] https://www.intel.com/content/www/us/en/content-details/772415/content-detai...
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[4] https://github.com/yiliu1765/qemu/tree/wip/iommufd_rfcv4.mig.reset.v4_var3%2...
Change log:

v3:
 - Further split the patches into an order of adding helpers for nested domain, iotlb flush, nested domain attachment and nested domain allocation callback, then reporting the hw_info to userspace.
 - Add batch support in cache invalidation from userspace.
 - Disallow nested translation usage if RO mappings exist in the stage-2 domain, due to the errata on read-only mappings on the Sapphire Rapids platform.
v2: https://lore.kernel.org/linux-iommu/20230309082207.612346-1-yi.l.liu@intel.c...
 - The iommufd infrastructure is split out into a separate series.
v1: https://lore.kernel.org/linux-iommu/20230209043153.14964-1-yi.l.liu@intel.co...
Regards, Yi Liu
Lu Baolu (5):
  iommu/vt-d: Extend dmar_domain to support nested domain
  iommu/vt-d: Add helper for nested domain allocation
  iommu/vt-d: Add helper to setup pasid nested translation
  iommu/vt-d: Add nested domain allocation
  iommu/vt-d: Disallow nesting on domains with read-only mappings
Yi Liu (5):
  iommufd: Add data structure for Intel VT-d stage-1 domain allocation
  iommu/vt-d: Make domain attach helpers to be extern
  iommu/vt-d: Set the nested domain to a device
  iommu/vt-d: Add iotlb flush for nested domain
  iommu/vt-d: Implement hw_info for iommu capability query
 drivers/iommu/intel/Makefile |   2 +-
 drivers/iommu/intel/iommu.c  |  78 ++++++++++++---
 drivers/iommu/intel/iommu.h  |  55 +++++++++--
 drivers/iommu/intel/nested.c | 181 +++++++++++++++++++++++++++++++++++
 drivers/iommu/intel/pasid.c  | 151 +++++++++++++++++++++++++++++
 drivers/iommu/intel/pasid.h  |   2 +
 drivers/iommu/iommufd/main.c |   6 ++
 include/linux/iommu.h        |   1 +
 include/uapi/linux/iommufd.h | 149 ++++++++++++++++++++++++++++
 9 files changed, 603 insertions(+), 22 deletions(-)
 create mode 100644 drivers/iommu/intel/nested.c
This adds IOMMU_HWPT_TYPE_VTD_S1 for the stage-1 hw_pagetable of Intel VT-d, and the corresponding data structure for the userspace-specified parameters for domain allocation.
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 include/uapi/linux/iommufd.h | 57 ++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 06dcad6ab082..c2658394827a 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -353,9 +353,64 @@ struct iommu_vfio_ioas {
 /**
  * enum iommu_hwpt_type - IOMMU HWPT Type
  * @IOMMU_HWPT_TYPE_DEFAULT: default
+ * @IOMMU_HWPT_TYPE_VTD_S1: Intel VT-d stage-1 page table
  */
 enum iommu_hwpt_type {
        IOMMU_HWPT_TYPE_DEFAULT,
+       IOMMU_HWPT_TYPE_VTD_S1,
+};
+
+/**
+ * enum iommu_hwpt_intel_vtd_flags - Intel VT-d stage-1 page
+ *                                   table entry attributes
+ * @IOMMU_VTD_PGTBL_SRE: Supervisor request
+ * @IOMMU_VTD_PGTBL_EAFE: Extended access enable
+ * @IOMMU_VTD_PGTBL_PCD: Page-level cache disable
+ * @IOMMU_VTD_PGTBL_PWT: Page-level write through
+ * @IOMMU_VTD_PGTBL_EMTE: Extended mem type enable
+ * @IOMMU_VTD_PGTBL_CD: PASID-level cache disable
+ * @IOMMU_VTD_PGTBL_WPE: Write protect enable
+ */
+enum iommu_hwpt_intel_vtd_flags {
+       IOMMU_VTD_PGTBL_SRE = 1 << 0,
+       IOMMU_VTD_PGTBL_EAFE = 1 << 1,
+       IOMMU_VTD_PGTBL_PCD = 1 << 2,
+       IOMMU_VTD_PGTBL_PWT = 1 << 3,
+       IOMMU_VTD_PGTBL_EMTE = 1 << 4,
+       IOMMU_VTD_PGTBL_CD = 1 << 5,
+       IOMMU_VTD_PGTBL_WPE = 1 << 6,
+       IOMMU_VTD_PGTBL_LAST = 1 << 7,
+};
+
+/**
+ * struct iommu_hwpt_intel_vtd - Intel VT-d specific user-managed
+ *                               stage-1 page table info
+ * @flags: Combination of enum iommu_hwpt_intel_vtd_flags
+ * @pgtbl_addr: The base address of the user-managed stage-1 page table.
+ * @pat: Page attribute table data to compute effective memory type
+ * @emt: Extended memory type
+ * @addr_width: The address width of the untranslated addresses that are
+ *              subjected to the user-managed stage-1 page table.
+ * @__reserved: Must be 0
+ *
+ * The Intel VT-d specific data for creating hw_pagetable to represent
+ * the user-managed stage-1 page table that is used in nested translation.
+ *
+ * In nested translation, the stage-1 page table locates in the address
+ * space that defined by the corresponding stage-2 page table. Hence the
+ * stage-1 page table base address value should not be higher than the
+ * maximum untranslated address of stage-2 page table.
+ *
+ * The paging level of the stage-1 page table should be compatible with
+ * the hardware iommu. Otherwise, the allocation would be failed.
+ */
+struct iommu_hwpt_intel_vtd {
+       __u64 flags;
+       __u64 pgtbl_addr;
+       __u32 pat;
+       __u32 emt;
+       __u32 addr_width;
+       __u32 __reserved;
 };
 /**
@@ -391,6 +446,8 @@ enum iommu_hwpt_type {
  * +------------------------------+-------------------------------------+-----------+
  * | IOMMU_HWPT_TYPE_DEFAULT      | N/A                                 | IOAS      |
  * +------------------------------+-------------------------------------+-----------+
+ * | IOMMU_HWPT_TYPE_VTD_S1       | struct iommu_hwpt_intel_vtd         | HWPT      |
+ * +------------------------------+-------------------------------------+-----------+
  */
 struct iommu_hwpt_alloc {
        __u32 size;
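For illustration, here is a rough userspace-side sketch of filling the structure defined above for a guest stage-1 page table. How the blob is then handed to the hw_pagetable allocation ioctl is defined by the prerequisite iommufd nesting series and is not repeated here; the values are illustrative only.

#include <linux/iommufd.h>
#include <string.h>

/* Sketch: describe a guest first-stage page table to the host. */
static void fill_vtd_s1(struct iommu_hwpt_intel_vtd *vtd,
			__u64 guest_pgd_gpa, __u32 guest_addr_width)
{
	memset(vtd, 0, sizeof(*vtd));
	vtd->flags = 0;				/* e.g. no IOMMU_VTD_PGTBL_SRE */
	vtd->pgtbl_addr = guest_pgd_gpa;	/* stage-1 PGD, in GPA */
	vtd->addr_width = guest_addr_width;	/* 48 for 4-level, 57 for 5-level */
	/* pat/emt left as 0; see the EMT discussion later in this thread */
}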
From: Liu, Yi L yi.l.liu@intel.com Sent: Thursday, May 11, 2023 10:51 PM
@@ -353,9 +353,64 @@ struct iommu_vfio_ioas { /**
- enum iommu_hwpt_type - IOMMU HWPT Type
- @IOMMU_HWPT_TYPE_DEFAULT: default
*/
- @IOMMU_HWPT_TYPE_VTD_S1: Intel VT-d stage-1 page table
enum iommu_hwpt_type { IOMMU_HWPT_TYPE_DEFAULT,
- IOMMU_HWPT_TYPE_VTD_S1,
+};
+/**
- enum iommu_hwpt_intel_vtd_flags - Intel VT-d stage-1 page
table entry attributes
No need to have both 'intel' and 'vtd' in one name. There won't be another vtd. 😊
Also since it's for s1, let's be specific on the naming:
enum iommu_hwpt_vtd_s1_flags
- @IOMMU_VTD_PGTBL_SRE: Supervisor request
- @IOMMU_VTD_PGTBL_EAFE: Extended access enable
- @IOMMU_VTD_PGTBL_PCD: Page-level cache disable
- @IOMMU_VTD_PGTBL_PWT: Page-level write through
- @IOMMU_VTD_PGTBL_EMTE: Extended mem type enable
- @IOMMU_VTD_PGTBL_CD: PASID-level cache disable
- @IOMMU_VTD_PGTBL_WPE: Write protect enable
similarly above should be IOMMU_VTD_S1_SRE.
PGTBL can be skipped.
- */
+enum iommu_hwpt_intel_vtd_flags {
- IOMMU_VTD_PGTBL_SRE = 1 << 0,
- IOMMU_VTD_PGTBL_EAFE = 1 << 1,
- IOMMU_VTD_PGTBL_PCD = 1 << 2,
- IOMMU_VTD_PGTBL_PWT = 1 << 3,
- IOMMU_VTD_PGTBL_EMTE = 1 << 4,
- IOMMU_VTD_PGTBL_CD = 1 << 5,
- IOMMU_VTD_PGTBL_WPE = 1 << 6,
- IOMMU_VTD_PGTBL_LAST = 1 << 7,
+};
+/**
- struct iommu_hwpt_intel_vtd - Intel VT-d specific user-managed
stage-1 page table info
struct iommu_hwpt_vtd_s1
- @flags: Combination of enum iommu_hwpt_intel_vtd_flags
- @pgtbl_addr: The base address of the user-managed stage-1 page table.
no need to highlight 'user-managed' in every reference to stage-1
- @pat: Page attribute table data to compute effective memory type
- @emt: Extended memory type
- @addr_width: The address width of the untranslated addresses that are
subjected to the user-managed stage-1 page table.
"The address width of the stage-1 page table"
- @__reserved: Must be 0
- The Intel VT-d specific data for creating hw_pagetable to represent
- the user-managed stage-1 page table that is used in nested translation.
"VT-d specific data for creating a stage-1 page table that ..."
- In nested translation, the stage-1 page table locates in the address
- space that defined by the corresponding stage-2 page table. Hence the
"locates in the address space of the specified stage-2 page table"
- stage-1 page table base address value should not be higher than the
- maximum untranslated address of stage-2 page table.
"should not exceed the maximum allowed input address of stage-2"
- The paging level of the stage-1 page table should be compatible with
- the hardware iommu. Otherwise, the allocation would be failed.
there is no information about paging level in this structure
- */
+struct iommu_hwpt_intel_vtd {
- __u64 flags;
- __u64 pgtbl_addr;
- __u32 pat;
- __u32 emt;
- __u32 addr_width;
- __u32 __reserved;
};
I'd prefer to move the vendor-specific definitions before 'enum iommu_hwpt_type', so any newly added type can be easily correlated between the type enumeration and the following change in the comment of 'struct iommu_hwpt_alloc'.
/** @@ -391,6 +446,8 @@ enum iommu_hwpt_type {
- +------------------------------+-------------------------------------+-----------+
- | IOMMU_HWPT_TYPE_DEFAULT | N/A | IOAS |
- +------------------------------+-------------------------------------+-----------+
- | IOMMU_HWPT_TYPE_VTD_S1 | struct iommu_hwpt_intel_vtd |
HWPT |
*/
- +------------------------------+-------------------------------------+-----------+
struct iommu_hwpt_alloc { __u32 size; -- 2.34.1
Hi,
-----Original Message----- From: Yi Liu yi.l.liu@intel.com Sent: Thursday, May 11, 2023 10:51 PM To: joro@8bytes.org; alex.williamson@redhat.com; jgg@nvidia.com; Tian, Kevin kevin.tian@intel.com; robin.murphy@arm.com; baolu.lu@linux.intel.com Cc: cohuck@redhat.com; eric.auger@redhat.com; nicolinc@nvidia.com; kvm@vger.kernel.org; mjrosato@linux.ibm.com; chao.p.peng@linux.intel.com; Liu, Yi L yi.l.liu@intel.com; yi.y.sun@linux.intel.com; peterx@redhat.com; jasowang@redhat.com; shameerali.kolothum.thodi@huawei.com; lulu@redhat.com; suravee.suthikulpanit@amd.com; iommu@lists.linux.dev; linux- kernel@vger.kernel.org; linux-kselftest@vger.kernel.org; Duan, Zhenzhong zhenzhong.duan@intel.com Subject: [PATCH v3 01/10] iommufd: Add data structure for Intel VT-d stage-1 domain allocation
This adds IOMMU_HWPT_TYPE_VTD_S1 for the stage-1 hw_pagetable of Intel VT-d, and the corresponding data structure for the userspace-specified parameters for domain allocation.
Signed-off-by: Yi Liu yi.l.liu@intel.com
 include/uapi/linux/iommufd.h | 57 ++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index 06dcad6ab082..c2658394827a 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -353,9 +353,64 @@ struct iommu_vfio_ioas { /**
- enum iommu_hwpt_type - IOMMU HWPT Type
- @IOMMU_HWPT_TYPE_DEFAULT: default
*/
- @IOMMU_HWPT_TYPE_VTD_S1: Intel VT-d stage-1 page table
enum iommu_hwpt_type { IOMMU_HWPT_TYPE_DEFAULT,
- IOMMU_HWPT_TYPE_VTD_S1,
+};
+/**
- enum iommu_hwpt_intel_vtd_flags - Intel VT-d stage-1 page
table entry attributes
- @IOMMU_VTD_PGTBL_SRE: Supervisor request
- @IOMMU_VTD_PGTBL_EAFE: Extended access enable
- @IOMMU_VTD_PGTBL_PCD: Page-level cache disable
- @IOMMU_VTD_PGTBL_PWT: Page-level write through
- @IOMMU_VTD_PGTBL_EMTE: Extended mem type enable
- @IOMMU_VTD_PGTBL_CD: PASID-level cache disable
- @IOMMU_VTD_PGTBL_WPE: Write protect enable */ enum
+iommu_hwpt_intel_vtd_flags {
- IOMMU_VTD_PGTBL_SRE = 1 << 0,
- IOMMU_VTD_PGTBL_EAFE = 1 << 1,
- IOMMU_VTD_PGTBL_PCD = 1 << 2,
- IOMMU_VTD_PGTBL_PWT = 1 << 3,
- IOMMU_VTD_PGTBL_EMTE = 1 << 4,
- IOMMU_VTD_PGTBL_CD = 1 << 5,
- IOMMU_VTD_PGTBL_WPE = 1 << 6,
- IOMMU_VTD_PGTBL_LAST = 1 << 7,
+};
+/**
- struct iommu_hwpt_intel_vtd - Intel VT-d specific user-managed
stage-1 page table info
- @flags: Combination of enum iommu_hwpt_intel_vtd_flags
- @pgtbl_addr: The base address of the user-managed stage-1 page table.
- @pat: Page attribute table data to compute effective memory type
- @emt: Extended memory type
- @addr_width: The address width of the untranslated addresses that are
subjected to the user-managed stage-1 page table.
- @__reserved: Must be 0
- The Intel VT-d specific data for creating hw_pagetable to represent
- the user-managed stage-1 page table that is used in nested translation.
- In nested translation, the stage-1 page table locates in the address
- space that defined by the corresponding stage-2 page table. Hence
+the
- stage-1 page table base address value should not be higher than the
- maximum untranslated address of stage-2 page table.
- The paging level of the stage-1 page table should be compatible with
- the hardware iommu. Otherwise, the allocation would be failed.
- */
+struct iommu_hwpt_intel_vtd {
- __u64 flags;
- __u64 pgtbl_addr;
- __u32 pat;
- __u32 emt;
Do we need the emt field as part of the stage-1 page table info? IIUC, according to the VT-d spec, the emt field is in stage-2 page table entries. Since in nested mode we only expose the stage-1 page table to the guest vt-d driver, I'm curious how the guest vt-d driver would populate this EMT?
Thanks -Tina
- __u32 addr_width;
- __u32 __reserved;
};
/** @@ -391,6 +446,8 @@ enum iommu_hwpt_type {
- +------------------------------+-------------------------------------+-----------+
- | IOMMU_HWPT_TYPE_DEFAULT | N/A | IOAS |
- +------------------------------+-------------------------------------+-----------+
- | IOMMU_HWPT_TYPE_VTD_S1 | struct iommu_hwpt_intel_vtd |
HWPT |
- +------------------------------+-------------------------------------+
- -----------+ */
struct iommu_hwpt_alloc { __u32 size; -- 2.34.1
On Thu, May 25, 2023 at 02:28:18AM +0000, Zhang, Tina wrote:
+struct iommu_hwpt_intel_vtd {
- __u64 flags;
- __u64 pgtbl_addr;
- __u32 pat;
- __u32 emt;
Do we need the emt field as part of the stage-1 page table info? IIUC, according to vt-d spec, the emt field is in stage-2 page table entries. Since in nested mode we only expose stage-1 page table to guest vt-d driver, I'm curious how does guest vt-d driver populate this EMT?
Indeed. The EMT is controlling how the iommu HW parses memory that is controlled by the kernel - this simply should not be something that userspace controls.
The S2 itself has to decide if it is populating the EMT bits in the IOPTE and if so it could enable EMT. Does userspace need to be involved here?
The seemingly more tricky thing is that it feels like EMT and PAT would like to be per-iova and we don't have a means for that right now. (and even if we did, how would qemu decide what to do ?)
So probably drop it.
Jason
On Thu, May 11, 2023 at 07:51:01AM -0700, Yi Liu wrote:
This adds IOMMU_HWPT_TYPE_VTD_S1 for stage-1 hw_pagetable of Intel VT-d +/**
- struct iommu_hwpt_intel_vtd - Intel VT-d specific user-managed
stage-1 page table info
- @flags: Combination of enum iommu_hwpt_intel_vtd_flags
- @pgtbl_addr: The base address of the user-managed stage-1 page table.
- @pat: Page attribute table data to compute effective memory type
- @emt: Extended memory type
- @addr_width: The address width of the untranslated addresses that are
subjected to the user-managed stage-1 page table.
- @__reserved: Must be 0
- The Intel VT-d specific data for creating hw_pagetable to represent
- the user-managed stage-1 page table that is used in nested translation.
- In nested translation, the stage-1 page table locates in the address
- space that defined by the corresponding stage-2 page table. Hence the
- stage-1 page table base address value should not be higher than the
- maximum untranslated address of stage-2 page table.
- The paging level of the stage-1 page table should be compatible with
- the hardware iommu. Otherwise, the allocation would be failed.
- */
+struct iommu_hwpt_intel_vtd {
- __u64 flags;
- __u64 pgtbl_addr;
__aligned_u64
- __u32 pat;
- __u32 emt;
- __u32 addr_width;
- __u32 __reserved;
}; /** @@ -391,6 +446,8 @@ enum iommu_hwpt_type {
- +------------------------------+-------------------------------------+-----------+
- | IOMMU_HWPT_TYPE_DEFAULT | N/A | IOAS |
- +------------------------------+-------------------------------------+-----------+
- | IOMMU_HWPT_TYPE_VTD_S1 | struct iommu_hwpt_intel_vtd | HWPT |
- +------------------------------+-------------------------------------+-----------+
Please don't make ascii art tables.
Note beside the struct what enum it is for
Jason
From: Lu Baolu baolu.lu@linux.intel.com
The nested domain fields are exclusive to those used for a DMA remapping domain. Use a union to avoid memory waste.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/intel/iommu.h | 35 +++++++++++++++++++++++++++++------
 1 file changed, 29 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h index 1c5e1d88862b..e818520f4068 100644 --- a/drivers/iommu/intel/iommu.h +++ b/drivers/iommu/intel/iommu.h @@ -596,15 +596,38 @@ struct dmar_domain { spinlock_t lock; /* Protect device tracking lists */ struct list_head devices; /* all devices' list */
- struct dma_pte *pgd; /* virtual address */ - int gaw; /* max guest address width */ - - /* adjusted guest address width, 0 is level 2 30-bit */ - int agaw; int iommu_superpage;/* Level of superpages supported: 0 == 4KiB (no superpages), 1 == 2MiB, 2 == 1GiB, 3 == 512GiB, 4 == 1TiB */ - u64 max_addr; /* maximum mapped address */ + union { + /* DMA remapping domain */ + struct { + /* virtual address */ + struct dma_pte *pgd; + /* max guest address width */ + int gaw; + /* + * adjusted guest address width: + * 0: level 2 30-bit + * 1: level 3 39-bit + * 2: level 4 48-bit + * 3: level 5 57-bit + */ + int agaw; + /* maximum mapped address */ + u64 max_addr; + }; + + /* Nested user domain */ + struct { + /* 2-level page table the user domain nested */ + struct dmar_domain *s2_domain; + /* user page table pointer (in GPA) */ + unsigned long s1_pgtbl; + /* page table attributes */ + struct iommu_hwpt_intel_vtd s1_cfg; + }; + };
struct iommu_domain domain; /* generic domain data structure for iommu core */
From: Liu, Yi L yi.l.liu@intel.com Sent: Thursday, May 11, 2023 10:51 PM
From: Lu Baolu baolu.lu@linux.intel.com
The nested domain fields are exclusive to those that used for a DMA remapping domain. Use union to avoid memory waste.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com Signed-off-by: Yi Liu yi.l.liu@intel.com
drivers/iommu/intel/iommu.h | 35 +++++++++++++++++++++++++++++------ 1 file changed, 29 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h index 1c5e1d88862b..e818520f4068 100644 --- a/drivers/iommu/intel/iommu.h +++ b/drivers/iommu/intel/iommu.h @@ -596,15 +596,38 @@ struct dmar_domain { spinlock_t lock; /* Protect device tracking lists */ struct list_head devices; /* all devices' list */
- struct dma_pte *pgd; /* virtual address */
- int gaw; /* max guest address width */
- /* adjusted guest address width, 0 is level 2 30-bit */
- int agaw; int iommu_superpage;/* Level of superpages supported: 0 == 4KiB (no superpages), 1 ==
2MiB, 2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
- u64 max_addr; /* maximum mapped address */
- union {
/* DMA remapping domain */
struct {
/* virtual address */
struct dma_pte *pgd;
/* max guest address width */
int gaw;
/*
* adjusted guest address width:
* 0: level 2 30-bit
* 1: level 3 39-bit
* 2: level 4 48-bit
* 3: level 5 57-bit
*/
int agaw;
/* maximum mapped address */
u64 max_addr;
};
what about 'nid'?
/* Nested user domain */
struct {
/* 2-level page table the user domain nested */
/* parent page table which the user domain is nested on */
struct dmar_domain *s2_domain;
/* user page table pointer (in GPA) */
unsigned long s1_pgtbl;
/* page table attributes */
struct iommu_hwpt_intel_vtd s1_cfg;
};
};
struct iommu_domain domain; /* generic domain data structure for iommu core */
-- 2.34.1
On 5/24/23 3:02 PM, Tian, Kevin wrote:
From: Liu, Yi Lyi.l.liu@intel.com Sent: Thursday, May 11, 2023 10:51 PM
From: Lu Baolubaolu.lu@linux.intel.com
The nested domain fields are exclusive to those that used for a DMA remapping domain. Use union to avoid memory waste.
Signed-off-by: Lu Baolubaolu.lu@linux.intel.com Signed-off-by: Yi Liuyi.l.liu@intel.com
drivers/iommu/intel/iommu.h | 35 +++++++++++++++++++++++++++++------ 1 file changed, 29 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h index 1c5e1d88862b..e818520f4068 100644 --- a/drivers/iommu/intel/iommu.h +++ b/drivers/iommu/intel/iommu.h @@ -596,15 +596,38 @@ struct dmar_domain { spinlock_t lock; /* Protect device tracking lists */ struct list_head devices; /* all devices' list */
- struct dma_pte *pgd; /* virtual address */
- int gaw; /* max guest address width */
- /* adjusted guest address width, 0 is level 2 30-bit */
- int agaw; int iommu_superpage;/* Level of superpages supported: 0 == 4KiB (no superpages), 1 ==
2MiB, 2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
- u64 max_addr; /* maximum mapped address */
- union {
/* DMA remapping domain */
struct {
/* virtual address */
struct dma_pte *pgd;
/* max guest address width */
int gaw;
/*
* adjusted guest address width:
* 0: level 2 30-bit
* 1: level 3 39-bit
* 2: level 4 48-bit
* 3: level 5 57-bit
*/
int agaw;
/* maximum mapped address */
u64 max_addr;
};
what about 'nid'?
"nid" represents which NUMA node should we allocate pages from for this domain. It's updated every time when a domain is attached/detached to/from a device or pasid.
Generally speaking, "nid" is common for all types of domain. But in this case, only a DMA remapping domain has a need to allocate pages. I intend to keep it as it for now. There's more cleanup rooms if we limit it only for DMA remapping domain.
Best regards, baolu
From: Lu Baolu baolu.lu@linux.intel.com
This adds a helper for accepting user parameters and allocating a nested domain.
Signed-off-by: Jacob Pan jacob.jun.pan@linux.intel.com
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/intel/Makefile |  2 +-
 drivers/iommu/intel/iommu.h  |  2 ++
 drivers/iommu/intel/nested.c | 47 ++++++++++++++++++++++++++++++++++++
 3 files changed, 50 insertions(+), 1 deletion(-)
 create mode 100644 drivers/iommu/intel/nested.c
diff --git a/drivers/iommu/intel/Makefile b/drivers/iommu/intel/Makefile index 7af3b8a4f2a0..5dabf081a779 100644 --- a/drivers/iommu/intel/Makefile +++ b/drivers/iommu/intel/Makefile @@ -1,6 +1,6 @@ # SPDX-License-Identifier: GPL-2.0 obj-$(CONFIG_DMAR_TABLE) += dmar.o -obj-$(CONFIG_INTEL_IOMMU) += iommu.o pasid.o +obj-$(CONFIG_INTEL_IOMMU) += iommu.o pasid.o nested.o obj-$(CONFIG_DMAR_TABLE) += trace.o cap_audit.o obj-$(CONFIG_DMAR_PERF) += perf.o obj-$(CONFIG_INTEL_IOMMU_DEBUGFS) += debugfs.o diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h index e818520f4068..8d0c7970c1ad 100644 --- a/drivers/iommu/intel/iommu.h +++ b/drivers/iommu/intel/iommu.h @@ -858,6 +858,8 @@ void *alloc_pgtable_page(int node, gfp_t gfp); void free_pgtable_page(void *vaddr); void iommu_flush_write_buffer(struct intel_iommu *iommu); struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8 *devfn); +struct iommu_domain *intel_nested_domain_alloc(struct iommu_domain *s2_domain, + const union iommu_domain_user_data *user_data);
#ifdef CONFIG_INTEL_IOMMU_SVM void intel_svm_check(struct intel_iommu *iommu); diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c new file mode 100644 index 000000000000..f83931bb44b6 --- /dev/null +++ b/drivers/iommu/intel/nested.c @@ -0,0 +1,47 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * nested.c - nested mode translation support + * + * Copyright (C) 2023 Intel Corporation + * + * Author: Lu Baolu baolu.lu@linux.intel.com + * Jacob Pan jacob.jun.pan@linux.intel.com + */ + +#define pr_fmt(fmt) "DMAR: " fmt + +#include <linux/iommu.h> + +#include "iommu.h" + +static void intel_nested_domain_free(struct iommu_domain *domain) +{ + kfree(to_dmar_domain(domain)); +} + +static const struct iommu_domain_ops intel_nested_domain_ops = { + .free = intel_nested_domain_free, +}; + +struct iommu_domain *intel_nested_domain_alloc(struct iommu_domain *s2_domain, + const union iommu_domain_user_data *user_data) +{ + const struct iommu_hwpt_intel_vtd *vtd = (struct iommu_hwpt_intel_vtd *)user_data; + struct dmar_domain *domain; + + domain = kzalloc(sizeof(*domain), GFP_KERNEL_ACCOUNT); + if (!domain) + return NULL; + + domain->use_first_level = true; + domain->s2_domain = to_dmar_domain(s2_domain); + domain->s1_pgtbl = vtd->pgtbl_addr; + domain->s1_cfg = *vtd; + domain->domain.ops = &intel_nested_domain_ops; + domain->domain.type = IOMMU_DOMAIN_NESTED; + INIT_LIST_HEAD(&domain->devices); + spin_lock_init(&domain->lock); + xa_init(&domain->iommu_array); + + return &domain->domain; +}
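As a usage note (not part of this patch): a caller wired up later in the series, e.g. a domain_alloc_user-style callback, would validate the hwpt type and then hand the stage-2 parent plus the copied user data to this helper. A hedged sketch with a hypothetical caller name:

/* Sketch only: hypothetical call site for intel_nested_domain_alloc(). */
static struct iommu_domain *
example_alloc_vtd_s1(struct iommu_domain *s2_domain,
		     const union iommu_domain_user_data *user_data)
{
	/* the nesting parent is expected to be a stage-2 (unmanaged) domain */
	if (!s2_domain || s2_domain->type != IOMMU_DOMAIN_UNMANAGED)
		return NULL;

	return intel_nested_domain_alloc(s2_domain, user_data);
}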
From: Lu Baolu baolu.lu@linux.intel.com
The configurations are passed in from the user when the user domain is allocated. This helper interprets these configurations according to the data structure defined in uapi/linux/iommufd.h. An EINVAL error will be returned if any of the configurations is not compatible with the hardware capabilities. The caller can retry with another compatible user domain. The encoding of the fields of each pasid entry is defined in section 9.6 of the VT-d spec.
Signed-off-by: Jacob Pan jacob.jun.pan@linux.intel.com
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/intel/pasid.c | 151 ++++++++++++++++++++++++++++++++++++
 drivers/iommu/intel/pasid.h |   2 +
 2 files changed, 153 insertions(+)
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c index c5d479770e12..ee4c7b69c6c6 100644 --- a/drivers/iommu/intel/pasid.c +++ b/drivers/iommu/intel/pasid.c @@ -21,6 +21,11 @@ #include "iommu.h" #include "pasid.h"
+#define IOMMU_VTD_PGTBL_MTS_MASK (IOMMU_VTD_PGTBL_CD | \ + IOMMU_VTD_PGTBL_EMTE | \ + IOMMU_VTD_PGTBL_PCD | \ + IOMMU_VTD_PGTBL_PWT) + /* * Intel IOMMU system wide PASID name space: */ @@ -335,6 +340,15 @@ static inline void pasid_set_fault_enable(struct pasid_entry *pe) pasid_set_bits(&pe->val[0], 1 << 1, 0); }
+/* + * Setup the SRE(Supervisor Request Enable) field (Bit 128) of a + * scalable mode PASID entry. + */ +static inline void pasid_set_sre(struct pasid_entry *pe) +{ + pasid_set_bits(&pe->val[2], 1 << 0, 1); +} + /* * Setup the WPE(Write Protect Enable) field (Bit 132) of a * scalable mode PASID entry. @@ -402,6 +416,15 @@ pasid_set_flpm(struct pasid_entry *pe, u64 value) pasid_set_bits(&pe->val[2], GENMASK_ULL(3, 2), value << 2); }
+/* + * Setup the Extended Access Flag Enable (EAFE) field (Bit 135) + * of a scalable mode PASID entry. + */ +static inline void pasid_set_eafe(struct pasid_entry *pe) +{ + pasid_set_bits(&pe->val[2], 1 << 7, 1 << 7); +} + static void pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu, u16 did, u32 pasid) @@ -713,3 +736,131 @@ void intel_pasid_setup_page_snoop_control(struct intel_iommu *iommu, if (!cap_caching_mode(iommu->cap)) devtlb_invalidation_with_pasid(iommu, dev, pasid); } + +/** + * intel_pasid_setup_nested() - Set up PASID entry for nested translation. + * This could be used for guest shared virtual address. In this case, the + * first level page tables are used for GVA-GPA translation in the guest, + * second level page tables are used for GPA-HPA translation. + * + * @iommu: IOMMU which the device belong to + * @dev: Device to be set up for translation + * @pasid: PASID to be programmed in the device PASID table + * @domain: User domain nested on a s2 domain + */ +int intel_pasid_setup_nested(struct intel_iommu *iommu, struct device *dev, + u32 pasid, struct dmar_domain *domain) +{ + struct iommu_hwpt_intel_vtd *s1_cfg = &domain->s1_cfg; + pgd_t *s1_gpgd = (pgd_t *)(uintptr_t)domain->s1_pgtbl; + struct dmar_domain *s2_domain = domain->s2_domain; + u16 did = domain_id_iommu(domain, iommu); + struct dma_pte *pgd = s2_domain->pgd; + struct pasid_entry *pte; + int agaw; + + if (!ecap_nest(iommu->ecap)) { + pr_err_ratelimited("%s: No nested translation support\n", + iommu->name); + return -ENODEV; + } + + /* + * Sanity checking performed by caller to make sure address width + * matching in two dimensions: CPU vs. IOMMU, guest vs. host. + */ + switch (s1_cfg->addr_width) { + case ADDR_WIDTH_4LEVEL: + break; +#ifdef CONFIG_X86 + case ADDR_WIDTH_5LEVEL: + if (!cpu_feature_enabled(X86_FEATURE_LA57) || + !cap_fl5lp_support(iommu->cap)) { + dev_err_ratelimited(dev, + "5-level paging not supported\n"); + return -EINVAL; + } + break; +#endif + default: + dev_err_ratelimited(dev, "Invalid guest address width %d\n", + s1_cfg->addr_width); + return -EINVAL; + } + + if ((s1_cfg->flags & IOMMU_VTD_PGTBL_SRE) && !ecap_srs(iommu->ecap)) { + pr_err_ratelimited("No supervisor request support on %s\n", + iommu->name); + return -EINVAL; + } + + if ((s1_cfg->flags & IOMMU_VTD_PGTBL_EAFE) && !ecap_eafs(iommu->ecap)) { + pr_err_ratelimited("No extended access flag support on %s\n", + iommu->name); + return -EINVAL; + } + + /* + * Memory type is only applicable to devices inside processor coherent + * domain. Will add MTS support once coherent devices are available. + */ + if (s1_cfg->flags & IOMMU_VTD_PGTBL_MTS_MASK) { + pr_warn_ratelimited("No memory type support %s\n", + iommu->name); + return -EINVAL; + } + + agaw = iommu_skip_agaw(s2_domain, iommu, &pgd); + if (agaw < 0) { + dev_err_ratelimited(dev, "Invalid domain page table\n"); + return -EINVAL; + } + + /* First level PGD (in GPA) must be supported by the second level. 
*/ + if ((uintptr_t)s1_gpgd > (1ULL << s2_domain->gaw)) { + dev_err_ratelimited(dev, + "Guest PGD %lx not supported, max %llx\n", + (uintptr_t)s1_gpgd, s2_domain->max_addr); + return -EINVAL; + } + + spin_lock(&iommu->lock); + pte = intel_pasid_get_entry(dev, pasid); + if (!pte) { + spin_unlock(&iommu->lock); + return -ENODEV; + } + if (pasid_pte_is_present(pte)) { + spin_unlock(&iommu->lock); + return -EBUSY; + } + + pasid_clear_entry(pte); + + if (s1_cfg->addr_width == ADDR_WIDTH_5LEVEL) + pasid_set_flpm(pte, 1); + + pasid_set_flptr(pte, (uintptr_t)s1_gpgd); + + if (s1_cfg->flags & IOMMU_VTD_PGTBL_SRE) { + pasid_set_sre(pte); + if (s1_cfg->flags & IOMMU_VTD_PGTBL_WPE) + pasid_set_wpe(pte); + } + + if (s1_cfg->flags & IOMMU_VTD_PGTBL_EAFE) + pasid_set_eafe(pte); + + pasid_set_slptr(pte, virt_to_phys(pgd)); + pasid_set_fault_enable(pte); + pasid_set_domain_id(pte, did); + pasid_set_address_width(pte, agaw); + pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap)); + pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED); + pasid_set_present(pte); + spin_unlock(&iommu->lock); + + pasid_flush_caches(iommu, pte, pasid, did); + + return 0; +} diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h index d6b7d21244b1..864b12848392 100644 --- a/drivers/iommu/intel/pasid.h +++ b/drivers/iommu/intel/pasid.h @@ -111,6 +111,8 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu, int intel_pasid_setup_pass_through(struct intel_iommu *iommu, struct dmar_domain *domain, struct device *dev, u32 pasid); +int intel_pasid_setup_nested(struct intel_iommu *iommu, struct device *dev, + u32 pasid, struct dmar_domain *domain); void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev, u32 pasid, bool fault_ignore);
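For readers not familiar with the pasid_set_*() setters used above: they are thin wrappers around an existing read-modify-write helper in pasid.c, which roughly looks like the following (quoted from memory for reference, not part of this patch):

/* Existing helper (approximate): update selected bits of one 64-bit word
 * of the 512-bit scalable-mode PASID entry. */
static inline void pasid_set_bits(u64 *ptr, u64 mask, u64 bits)
{
	u64 old;

	old = READ_ONCE(*ptr);
	WRITE_ONCE(*ptr, (old & ~mask) | bits);
}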
From: Yi Liu yi.l.liu@intel.com Sent: Thursday, May 11, 2023 10:51 PM
+/**
- intel_pasid_setup_nested() - Set up PASID entry for nested translation.
- This could be used for guest shared virtual address. In this case, the
- first level page tables are used for GVA-GPA translation in the guest,
- second level page tables are used for GPA-HPA translation.
it's not just for guest SVA. Actually in this series it's RID_PASID nested translation.
- @iommu: IOMMU which the device belong to
- @dev: Device to be set up for translation
- @pasid: PASID to be programmed in the device PASID table
- @domain: User domain nested on a s2 domain
"User stage-1 domain"
- */
+int intel_pasid_setup_nested(struct intel_iommu *iommu, struct device *dev,
u32 pasid, struct dmar_domain *domain)
+{
- struct iommu_hwpt_intel_vtd *s1_cfg = &domain->s1_cfg;
- pgd_t *s1_gpgd = (pgd_t *)(uintptr_t)domain->s1_pgtbl;
- struct dmar_domain *s2_domain = domain->s2_domain;
- u16 did = domain_id_iommu(domain, iommu);
- struct dma_pte *pgd = s2_domain->pgd;
- struct pasid_entry *pte;
- int agaw;
- if (!ecap_nest(iommu->ecap)) {
pr_err_ratelimited("%s: No nested translation support\n",
iommu->name);
return -ENODEV;
- }
- /*
* Sanity checking performed by caller to make sure address width
"by caller"? it's checked in this function.
* matching in two dimensions: CPU vs. IOMMU, guest vs. host.
*/
- switch (s1_cfg->addr_width) {
- case ADDR_WIDTH_4LEVEL:
break;
+#ifdef CONFIG_X86
- case ADDR_WIDTH_5LEVEL:
if (!cpu_feature_enabled(X86_FEATURE_LA57) ||
!cap_fl5lp_support(iommu->cap)) {
dev_err_ratelimited(dev,
"5-level paging not supported\n");
return -EINVAL;
}
break;
+#endif
- default:
dev_err_ratelimited(dev, "Invalid guest address width %d\n",
s1_cfg->addr_width);
return -EINVAL;
- }
- if ((s1_cfg->flags & IOMMU_VTD_PGTBL_SRE) && !ecap_srs(iommu-
ecap)) {
pr_err_ratelimited("No supervisor request support on %s\n",
iommu->name);
return -EINVAL;
- }
- if ((s1_cfg->flags & IOMMU_VTD_PGTBL_EAFE)
&& !ecap_eafs(iommu->ecap)) {
pr_err_ratelimited("No extended access flag support
on %s\n",
iommu->name);
return -EINVAL;
- }
- /*
* Memory type is only applicable to devices inside processor
coherent
* domain. Will add MTS support once coherent devices are available.
*/
- if (s1_cfg->flags & IOMMU_VTD_PGTBL_MTS_MASK) {
pr_warn_ratelimited("No memory type support %s\n",
iommu->name);
return -EINVAL;
- }
If it's unsupported, why expose these flags in the uAPI at this point?
- agaw = iommu_skip_agaw(s2_domain, iommu, &pgd);
- if (agaw < 0) {
dev_err_ratelimited(dev, "Invalid domain page table\n");
return -EINVAL;
- }
this looks problematic.
static inline int iommu_skip_agaw(struct dmar_domain *domain,
				  struct intel_iommu *iommu,
				  struct dma_pte **pgd)
{
	int agaw;

	for (agaw = domain->agaw; agaw > iommu->agaw; agaw--) {
		*pgd = phys_to_virt(dma_pte_addr(*pgd));
		if (!dma_pte_present(*pgd))
			return -EINVAL;
	}

	return agaw;
}
why is it safe to change the pgd level of the s2 domain when it's used as the parent? This s2 pgtbl might be used by other devices behind other iommus which already map GPAs at a level which this iommu doesn't support...
shouldn't we simply fail it as another incompatible condition?
- /* First level PGD (in GPA) must be supported by the second level. */
- if ((uintptr_t)s1_gpgd > (1ULL << s2_domain->gaw)) {
dev_err_ratelimited(dev,
"Guest PGD %lx not supported,
max %llx\n",
(uintptr_t)s1_gpgd, s2_domain-
max_addr);
return -EINVAL;
- }
I'm not sure how useful this check is. Even if the pgd is sane, the lower-level PTEs could include unsupported GPAs. If a guest really doesn't want to follow the GPA restriction which the vIOMMU reports, it can easily cause an IOMMU fault in many ways.
Then why treating pgd specially?
On 5/24/23 3:16 PM, Tian, Kevin wrote:
From: Yi Liu yi.l.liu@intel.com Sent: Thursday, May 11, 2023 10:51 PM
+/**
- intel_pasid_setup_nested() - Set up PASID entry for nested translation.
- This could be used for guest shared virtual address. In this case, the
- first level page tables are used for GVA-GPA translation in the guest,
- second level page tables are used for GPA-HPA translation.
it's not just for guest SVA. Actually in this series it's RID_PASID nested translation.
Yes.
- @iommu: IOMMU which the device belong to
- @dev: Device to be set up for translation
- @pasid: PASID to be programmed in the device PASID table
- @domain: User domain nested on a s2 domain
"User stage-1 domain"
Yes.
- */
+int intel_pasid_setup_nested(struct intel_iommu *iommu, struct device *dev,
u32 pasid, struct dmar_domain *domain)
+{
- struct iommu_hwpt_intel_vtd *s1_cfg = &domain->s1_cfg;
- pgd_t *s1_gpgd = (pgd_t *)(uintptr_t)domain->s1_pgtbl;
- struct dmar_domain *s2_domain = domain->s2_domain;
- u16 did = domain_id_iommu(domain, iommu);
- struct dma_pte *pgd = s2_domain->pgd;
- struct pasid_entry *pte;
- int agaw;
- if (!ecap_nest(iommu->ecap)) {
pr_err_ratelimited("%s: No nested translation support\n",
iommu->name);
return -ENODEV;
- }
- /*
* Sanity checking performed by caller to make sure address width
"by caller"? it's checked in this function.
This comment needs to be updated.
* matching in two dimensions: CPU vs. IOMMU, guest vs. host.
*/
- switch (s1_cfg->addr_width) {
- case ADDR_WIDTH_4LEVEL:
break;
+#ifdef CONFIG_X86
- case ADDR_WIDTH_5LEVEL:
if (!cpu_feature_enabled(X86_FEATURE_LA57) ||
!cap_fl5lp_support(iommu->cap)) {
dev_err_ratelimited(dev,
"5-level paging not supported\n");
return -EINVAL;
}
break;
+#endif
- default:
dev_err_ratelimited(dev, "Invalid guest address width %d\n",
s1_cfg->addr_width);
return -EINVAL;
- }
- if ((s1_cfg->flags & IOMMU_VTD_PGTBL_SRE) && !ecap_srs(iommu-
ecap)) {
pr_err_ratelimited("No supervisor request support on %s\n",
iommu->name);
return -EINVAL;
- }
- if ((s1_cfg->flags & IOMMU_VTD_PGTBL_EAFE)
&& !ecap_eafs(iommu->ecap)) {
pr_err_ratelimited("No extended access flag support
on %s\n",
iommu->name);
return -EINVAL;
- }
- /*
* Memory type is only applicable to devices inside processor
coherent
* domain. Will add MTS support once coherent devices are available.
*/
- if (s1_cfg->flags & IOMMU_VTD_PGTBL_MTS_MASK) {
pr_warn_ratelimited("No memory type support %s\n",
iommu->name);
return -EINVAL;
- }
If it's unsupported why exposing them in the uAPI at this point?
Agreed. We can remove this flag for now.
- agaw = iommu_skip_agaw(s2_domain, iommu, &pgd);
- if (agaw < 0) {
dev_err_ratelimited(dev, "Invalid domain page table\n");
return -EINVAL;
- }
this looks problematic.
static inline int iommu_skip_agaw(struct dmar_domain *domain, struct intel_iommu *iommu, struct dma_pte **pgd) { int agaw;
for (agaw = domain->agaw; agaw > iommu->agaw; agaw--) { *pgd = phys_to_virt(dma_pte_addr(*pgd)); if (!dma_pte_present(*pgd)) return -EINVAL; }
return agaw; }
why is it safe to change pgd level of s2 domain when it's used as the parent? this s2 pgtbl might be used by other devices behind other iommus which already maps GPAs in a level which this iommu doesn't support...
shouldn't we simply fail it as another incompatible condition?
You are right. We can change it to this:
	if (domain->agaw > iommu->agaw)
		return -EINVAL;
- /* First level PGD (in GPA) must be supported by the second level. */
- if ((uintptr_t)s1_gpgd > (1ULL << s2_domain->gaw)) {
dev_err_ratelimited(dev,
"Guest PGD %lx not supported,
max %llx\n",
(uintptr_t)s1_gpgd, s2_domain-
max_addr);
return -EINVAL;
- }
I'm not sure how useful this check is. Even if the pgd is sane the lower level PTEs could include unsupported GPA's. If a guest really doesn't want to follow the GPA restriction which vIOMMU reports, it can easily cause IOMMU fault in many ways.
You are right.
Then why treating pgd specially?
I have no memory of this check now. Yi, any thoughts?
Best regards, baolu
From: Baolu Lu baolu.lu@linux.intel.com Sent: Friday, May 26, 2023 12:16 PM
- /* First level PGD (in GPA) must be supported by the second level. */
- if ((uintptr_t)s1_gpgd > (1ULL << s2_domain->gaw)) {
dev_err_ratelimited(dev,
"Guest PGD %lx not supported,
max %llx\n",
(uintptr_t)s1_gpgd, s2_domain-
max_addr);
return -EINVAL;
- }
I'm not sure how useful this check is. Even if the pgd is sane the lower level PTEs could include unsupported GPA's. If a guest really doesn't want to follow the GPA restriction which vIOMMU reports, it can easily cause IOMMU fault in many ways.
You are right.
Then why treating pgd specially?
I have no memory about this check for now. Yi, any thought?
I don’t think there is another special reason. I just wanted to ensure that the page table base address follows the GPA restriction. But as Kevin mentioned, hw can detect it anyway. So, I think this check can be dropped.
Regards, Yi Liu
From: Liu, Yi L yi.l.liu@intel.com Sent: Wednesday, June 7, 2023 4:34 PM
From: Baolu Lu baolu.lu@linux.intel.com Sent: Friday, May 26, 2023 12:16 PM
- /* First level PGD (in GPA) must be supported by the second level. */
- if ((uintptr_t)s1_gpgd > (1ULL << s2_domain->gaw)) {
dev_err_ratelimited(dev,
"Guest PGD %lx not supported,
max %llx\n",
(uintptr_t)s1_gpgd, s2_domain-
max_addr);
return -EINVAL;
- }
I'm not sure how useful this check is. Even if the pgd is sane the lower level PTEs could include unsupported GPA's. If a guest really doesn't want to follow the GPA restriction which vIOMMU reports, it can easily cause IOMMU fault in many ways.
You are right.
Then why treating pgd specially?
I have no memory about this check for now. Yi, any thought?
I don’t think there is another special reason. Just want to ensure the page table base address follows the GPA restriction. But as Kevin mentioned, hw can detect it anyway. So, I think this check can be dropped.
Then we also need to remove the below words in the uapi header.

+ * ... Hence the
+ * stage-1 page table base address value should not be higher than the
+ * maximum untranslated address of stage-2 page table.
Regards, Yi Liu
From: Baolu Lu baolu.lu@linux.intel.com Sent: Friday, May 26, 2023 12:16 PM On 5/24/23 3:16 PM, Tian, Kevin wrote:
From: Yi Liu yi.l.liu@intel.com Sent: Thursday, May 11, 2023 10:51 PM
- /*
* Memory type is only applicable to devices inside processor
coherent
* domain. Will add MTS support once coherent devices are available.
*/
- if (s1_cfg->flags & IOMMU_VTD_PGTBL_MTS_MASK) {
pr_warn_ratelimited("No memory type support %s\n",
iommu->name);
return -EINVAL;
- }
If it's unsupported why exposing them in the uAPI at this point?
Agreed. We can remove this flag for now.
So we shall remove the below flags in the uapi as well, right?
+#define IOMMU_VTD_PGTBL_MTS_MASK	(IOMMU_VTD_PGTBL_CD | \
+					 IOMMU_VTD_PGTBL_EMTE | \
+					 IOMMU_VTD_PGTBL_PCD | \
+					 IOMMU_VTD_PGTBL_PWT)
Regards, Yi Liu
On 6/8/23 11:35 AM, Liu, Yi L wrote:
From: Baolu Lubaolu.lu@linux.intel.com Sent: Friday, May 26, 2023 12:16 PM On 5/24/23 3:16 PM, Tian, Kevin wrote:
From: Yi Liuyi.l.liu@intel.com Sent: Thursday, May 11, 2023 10:51 PM
- /*
* Memory type is only applicable to devices inside processor
coherent
* domain. Will add MTS support once coherent devices are available.
*/
- if (s1_cfg->flags & IOMMU_VTD_PGTBL_MTS_MASK) {
pr_warn_ratelimited("No memory type support %s\n",
iommu->name);
return -EINVAL;
- }
If it's unsupported why exposing them in the uAPI at this point?
Agreed. We can remove this flag for now.
So we shall remove the below flags in uapi as well, is it?
+#define IOMMU_VTD_PGTBL_MTS_MASK	(IOMMU_VTD_PGTBL_CD | \
+					 IOMMU_VTD_PGTBL_EMTE | \
+					 IOMMU_VTD_PGTBL_PCD | \
+					 IOMMU_VTD_PGTBL_PWT)
I suppose so.
Best regards, baolu
This makes several domain attach helpers extern, hence they can be used in nested.c.
Suggested-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/intel/iommu.c | 17 +++++++----------
 drivers/iommu/intel/iommu.h |  8 ++++++++
 2 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index b871a6afd803..e6536a43dd82 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -277,7 +277,6 @@ static LIST_HEAD(dmar_satc_units); #define for_each_rmrr_units(rmrr) \ list_for_each_entry(rmrr, &dmar_rmrr_units, list)
-static void device_block_translation(struct device *dev); static void intel_iommu_domain_free(struct iommu_domain *domain);
int dmar_disabled = !IS_ENABLED(CONFIG_INTEL_IOMMU_DEFAULT_ON); @@ -555,7 +554,7 @@ static unsigned long domain_super_pgsize_bitmap(struct dmar_domain *domain) }
/* Some capabilities may be different across iommus */ -static void domain_update_iommu_cap(struct dmar_domain *domain) +void domain_update_iommu_cap(struct dmar_domain *domain) { domain_update_iommu_coherency(domain); domain->iommu_superpage = domain_update_iommu_superpage(domain, NULL); @@ -1740,8 +1739,7 @@ static struct dmar_domain *alloc_domain(unsigned int type) return domain; }
-static int domain_attach_iommu(struct dmar_domain *domain, - struct intel_iommu *iommu) +int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu) { struct iommu_domain_info *info, *curr; unsigned long ndomains; @@ -1790,8 +1788,7 @@ static int domain_attach_iommu(struct dmar_domain *domain, return ret; }
-static void domain_detach_iommu(struct dmar_domain *domain, - struct intel_iommu *iommu) +void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu) { struct iommu_domain_info *info;
@@ -3994,7 +3991,7 @@ static void dmar_remove_one_dev_info(struct device *dev) * all DMA requests without PASID from the device are blocked. If the page * table has been set, clean up the data structures. */ -static void device_block_translation(struct device *dev) +void device_block_translation(struct device *dev) { struct device_domain_info *info = dev_iommu_priv_get(dev); struct intel_iommu *iommu = info->iommu; @@ -4101,8 +4098,8 @@ static void intel_iommu_domain_free(struct iommu_domain *domain) domain_exit(to_dmar_domain(domain)); }
-static int prepare_domain_attach_device(struct iommu_domain *domain, - struct device *dev) +int prepare_domain_attach_device(struct iommu_domain *domain, + struct device *dev) { struct dmar_domain *dmar_domain = to_dmar_domain(domain); struct intel_iommu *iommu; @@ -4342,7 +4339,7 @@ static void domain_set_force_snooping(struct dmar_domain *domain) PASID_RID2PASID); }
-static bool intel_iommu_enforce_cache_coherency(struct iommu_domain *domain) +bool intel_iommu_enforce_cache_coherency(struct iommu_domain *domain) { struct dmar_domain *dmar_domain = to_dmar_domain(domain); unsigned long flags; diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h index 8d0c7970c1ad..ccb93aed6cf2 100644 --- a/drivers/iommu/intel/iommu.h +++ b/drivers/iommu/intel/iommu.h @@ -852,6 +852,14 @@ int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc, */ #define QI_OPT_WAIT_DRAIN BIT(0)
+int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu); +void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu); +void device_block_translation(struct device *dev); +int prepare_domain_attach_device(struct iommu_domain *domain, + struct device *dev); +bool intel_iommu_enforce_cache_coherency(struct iommu_domain *domain); +void domain_update_iommu_cap(struct dmar_domain *domain); + int dmar_ir_support(void);
void *alloc_pgtable_page(int node, gfp_t gfp);
This adds the attach_dev callback for the nested domain, hence enabling nested domain usage on Intel VT-d.
Signed-off-by: Jacob Pan jacob.jun.pan@linux.intel.com
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/intel/nested.c | 47 ++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)
diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c index f83931bb44b6..fd38424b78f0 100644 --- a/drivers/iommu/intel/nested.c +++ b/drivers/iommu/intel/nested.c @@ -11,8 +11,53 @@ #define pr_fmt(fmt) "DMAR: " fmt
#include <linux/iommu.h> +#include <linux/pci.h> +#include <linux/pci-ats.h>
#include "iommu.h" +#include "pasid.h" + +static int intel_nested_attach_dev(struct iommu_domain *domain, + struct device *dev) +{ + struct device_domain_info *info = dev_iommu_priv_get(dev); + struct dmar_domain *dmar_domain = to_dmar_domain(domain); + struct intel_iommu *iommu = info->iommu; + unsigned long flags; + int ret = 0; + + if (info->domain) + device_block_translation(dev); + + /* Is s2_domain compatible with this IOMMU? */ + ret = prepare_domain_attach_device(&dmar_domain->s2_domain->domain, dev); + if (ret) { + dev_err_ratelimited(dev, "s2 domain is not compatible\n"); + return ret; + } + + ret = domain_attach_iommu(dmar_domain, iommu); + if (ret) { + dev_err_ratelimited(dev, "Failed to attach domain to iommu\n"); + return ret; + } + + ret = intel_pasid_setup_nested(iommu, dev, + PASID_RID2PASID, dmar_domain); + if (ret) { + domain_detach_iommu(dmar_domain, iommu); + dev_err_ratelimited(dev, "Failed to setup pasid entry\n"); + return ret; + } + + info->domain = dmar_domain; + spin_lock_irqsave(&dmar_domain->lock, flags); + list_add(&info->link, &dmar_domain->devices); + spin_unlock_irqrestore(&dmar_domain->lock, flags); + domain_update_iommu_cap(dmar_domain); + + return 0; +}
static void intel_nested_domain_free(struct iommu_domain *domain) { @@ -20,7 +65,9 @@ static void intel_nested_domain_free(struct iommu_domain *domain) }
static const struct iommu_domain_ops intel_nested_domain_ops = { + .attach_dev = intel_nested_attach_dev, .free = intel_nested_domain_free, + .enforce_cache_coherency = intel_iommu_enforce_cache_coherency, };
struct iommu_domain *intel_nested_domain_alloc(struct iommu_domain *s2_domain,
From: Liu, Yi L yi.l.liu@intel.com Sent: Thursday, May 11, 2023 10:51 PM
+static int intel_nested_attach_dev(struct iommu_domain *domain,
struct device *dev)
+{
- struct device_domain_info *info = dev_iommu_priv_get(dev);
- struct dmar_domain *dmar_domain = to_dmar_domain(domain);
- struct intel_iommu *iommu = info->iommu;
- unsigned long flags;
- int ret = 0;
- if (info->domain)
device_block_translation(dev);
- /* Is s2_domain compatible with this IOMMU? */
- ret = prepare_domain_attach_device(&dmar_domain->s2_domain-
domain, dev);
- if (ret) {
dev_err_ratelimited(dev, "s2 domain is not compatible\n");
return ret;
- }
this also includes logic to trim higher page levels:
	/*
	 * Knock out extra levels of page tables if necessary
	 */
	while (iommu->agaw < dmar_domain->agaw) {
		struct dma_pte *pte;

		pte = dmar_domain->pgd;
		if (dma_pte_present(pte)) {
			dmar_domain->pgd = phys_to_virt(dma_pte_addr(pte));
			free_pgtable_page(pte);
		}
		dmar_domain->agaw--;
	}
What's the background of doing such truncation instead of simply failing the request?
In any case it's probably fine before the domain includes any mapping, but it's really unreasonable to apply it to an existing s2 when it's used as a parent.
On 5/24/23 3:22 PM, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Thursday, May 11, 2023 10:51 PM
+static int intel_nested_attach_dev(struct iommu_domain *domain,
struct device *dev)
+{
- struct device_domain_info *info = dev_iommu_priv_get(dev);
- struct dmar_domain *dmar_domain = to_dmar_domain(domain);
- struct intel_iommu *iommu = info->iommu;
- unsigned long flags;
- int ret = 0;
- if (info->domain)
device_block_translation(dev);
- /* Is s2_domain compatible with this IOMMU? */
- ret = prepare_domain_attach_device(&dmar_domain->s2_domain-
domain, dev);
- if (ret) {
dev_err_ratelimited(dev, "s2 domain is not compatible\n");
return ret;
- }
this also includes logic to trim higher page levels:
/* * Knock out extra levels of page tables if necessary */ while (iommu->agaw < dmar_domain->agaw) { struct dma_pte *pte;
pte = dmar_domain->pgd; if (dma_pte_present(pte)) { dmar_domain->pgd = phys_to_virt(dma_pte_addr(pte)); free_pgtable_page(pte); } dmar_domain->agaw--;
}
What's the background of doing such truncation instead of simply failing the request?
This code has existed for a long time. I'm not sure whether it's still reasonable.
In any means it's probably fine before the domain includes any mapping but really unreasonable to apply it to an existing s2 when it's used as a parent.
But for the new nested translation, it is obviously unreasonable.
Let me revisit it.
Best regards, baolu
This is needed as the stage-1 page table of the nested domain is maintained outside the iommu subsystem; hence, it needs to support iotlb flush requests.

This adds the data structure for flushing the iotlb for the nested domain allocated with the IOMMU_HWPT_TYPE_VTD_S1 type, and the related callback to accept iotlb flush requests from IOMMUFD.

This only exposes the interface for invalidating the IOTLB, but not for the device-TLB, as device-TLB invalidation will be covered automatically in the IOTLB invalidation path if the affected device is ATS-capable.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/intel/iommu.c  | 10 +++---
 drivers/iommu/intel/iommu.h  |  6 ++++
 drivers/iommu/intel/nested.c | 69 ++++++++++++++++++++++++++++++++++++
 drivers/iommu/iommufd/main.c |  6 ++++
 include/uapi/linux/iommufd.h | 59 ++++++++++++++++++++++++++++++
 5 files changed, 145 insertions(+), 5 deletions(-)
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index e6536a43dd82..5f27cee4656a 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -1474,10 +1474,10 @@ static void iommu_flush_dev_iotlb(struct dmar_domain *domain, spin_unlock_irqrestore(&domain->lock, flags); }
-static void iommu_flush_iotlb_psi(struct intel_iommu *iommu, - struct dmar_domain *domain, - unsigned long pfn, unsigned int pages, - int ih, int map) +void iommu_flush_iotlb_psi(struct intel_iommu *iommu, + struct dmar_domain *domain, + unsigned long pfn, unsigned int pages, + int ih, int map) { unsigned int aligned_pages = __roundup_pow_of_two(pages); unsigned int mask = ilog2(aligned_pages); @@ -1550,7 +1550,7 @@ static inline void __mapping_notify_one(struct intel_iommu *iommu, iommu_flush_write_buffer(iommu); }
-static void intel_flush_iotlb_all(struct iommu_domain *domain) +void intel_flush_iotlb_all(struct iommu_domain *domain) { struct dmar_domain *dmar_domain = to_dmar_domain(domain); struct iommu_domain_info *info; diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h index ccb93aed6cf2..581596d90c1b 100644 --- a/drivers/iommu/intel/iommu.h +++ b/drivers/iommu/intel/iommu.h @@ -859,6 +859,12 @@ int prepare_domain_attach_device(struct iommu_domain *domain, struct device *dev); bool intel_iommu_enforce_cache_coherency(struct iommu_domain *domain); void domain_update_iommu_cap(struct dmar_domain *domain); +void iommu_flush_iotlb_psi(struct intel_iommu *iommu, + struct dmar_domain *domain, + unsigned long pfn, unsigned int pages, + int ih, int map); +void intel_flush_iotlb_all(struct iommu_domain *domain); +
int dmar_ir_support(void);
diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c index fd38424b78f0..d13fbcd3f5a6 100644 --- a/drivers/iommu/intel/nested.c +++ b/drivers/iommu/intel/nested.c @@ -64,8 +64,77 @@ static void intel_nested_domain_free(struct iommu_domain *domain) kfree(to_dmar_domain(domain)); }
+static void intel_nested_invalidate(struct device *dev, + struct dmar_domain *domain, + u64 addr, unsigned long npages) +{ + struct device_domain_info *info = dev_iommu_priv_get(dev); + struct intel_iommu *iommu = info->iommu; + + if (addr == 0 && npages == -1) + intel_flush_iotlb_all(&domain->domain); + else + iommu_flush_iotlb_psi(iommu, domain, + addr >> VTD_PAGE_SHIFT, + npages, 1, 0); +} + +static int intel_nested_cache_invalidate_user(struct iommu_domain *domain, + void *user_data) +{ + struct iommu_hwpt_invalidate_request_intel_vtd *req = user_data; + struct iommu_hwpt_invalidate_intel_vtd *inv_info = user_data; + struct dmar_domain *dmar_domain = to_dmar_domain(domain); + unsigned int entry_size = inv_info->entry_size; + u64 uptr = inv_info->inv_data_uptr; + u64 nr_uptr = inv_info->entry_nr_uptr; + struct device_domain_info *info; + u32 entry_nr, index; + unsigned long flags; + int ret = 0; + + if (WARN_ON(!user_data)) + return 0; + + if (get_user(entry_nr, (uint32_t __user *)u64_to_user_ptr(nr_uptr))) + return -EFAULT; + + if (!entry_nr) + return -EINVAL; + + for (index = 0; index < entry_nr; index++) { + ret = copy_struct_from_user(req, sizeof(*req), + u64_to_user_ptr(uptr + index * entry_size), + entry_size); + if (ret) { + pr_err_ratelimited("Failed to fetch invalidation request\n"); + break; + } + + if (req->__reserved || (req->flags & ~IOMMU_VTD_QI_FLAGS_LEAF) || + !IS_ALIGNED(req->addr, VTD_PAGE_SIZE)) { + ret = -EINVAL; + break; + } + + spin_lock_irqsave(&dmar_domain->lock, flags); + list_for_each_entry(info, &dmar_domain->devices, link) + intel_nested_invalidate(info->dev, dmar_domain, + req->addr, req->npages); + spin_unlock_irqrestore(&dmar_domain->lock, flags); + } + + if (ret && put_user(index, (uint32_t __user *)u64_to_user_ptr(nr_uptr))) + return -EFAULT; + + return ret; +} + static const struct iommu_domain_ops intel_nested_domain_ops = { .attach_dev = intel_nested_attach_dev, + .cache_invalidate_user = intel_nested_cache_invalidate_user, + .cache_invalidate_user_data_len = + sizeof(struct iommu_hwpt_invalidate_intel_vtd), .free = intel_nested_domain_free, .enforce_cache_coherency = intel_iommu_enforce_cache_coherency, }; diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 39922f83ce34..b338b082950b 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -282,6 +282,12 @@ union ucmd_buffer { #ifdef CONFIG_IOMMUFD_TEST struct iommu_test_cmd test; #endif + /* + * hwpt_type specific structure used in the cache invalidation + * path. + */ + struct iommu_hwpt_invalidate_intel_vtd vtd; + struct iommu_hwpt_invalidate_request_intel_vtd req_vtd; };
 struct iommufd_ioctl_op {
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index c2658394827a..2e658fa346ad 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -505,6 +505,63 @@ struct iommu_hw_info {
 };
 #define IOMMU_DEVICE_GET_HW_INFO _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DEVICE_GET_HW_INFO)

+/**
+ * enum iommu_hwpt_intel_vtd_invalidate_flags - Flags for Intel VT-d
+ *                                              stage-1 page table cache
+ *                                              invalidation
+ * @IOMMU_VTD_QI_FLAGS_LEAF: The LEAF flag indicates whether only the
+ *                           leaf PTE caching needs to be invalidated
+ *                           and other paging structure caches can be
+ *                           preserved.
+ */
+enum iommu_hwpt_intel_vtd_invalidate_flags {
+	IOMMU_VTD_QI_FLAGS_LEAF = 1 << 0,
+};
+
+/**
+ * struct iommu_hwpt_invalidate_request_intel_vtd - Intel VT-d cache invalidation request
+ * @addr: The start address of the addresses to be invalidated.
+ * @npages: Number of contiguous 4K pages to be invalidated.
+ * @flags: Combination of enum iommu_hwpt_intel_vtd_invalidate_flags
+ * @__reserved: Must be 0
+ *
+ * The Intel VT-d specific invalidation data for user-managed stage-1 cache
+ * invalidation under nested translation. Userspace uses this structure to
+ * tell host about the impacted caches after modifying the stage-1 page table.
+ *
+ * Invalidating all the caches related to the hw_pagetable by setting
+ * @addr==0 and @npages==__u64(-1).
+ */
+struct iommu_hwpt_invalidate_request_intel_vtd {
+	__u64 addr;
+	__u64 npages;
+	__u32 flags;
+	__u32 __reserved;
+};
+
+/**
+ * struct iommu_hwpt_invalidate_intel_vtd - Intel VT-d cache invalidation info
+ * @flags: Must be 0
+ * @entry_size: Size in bytes of each cache invalidation request
+ * @entry_nr_uptr: User pointer to the number of invalidation requests.
+ *                 Kernel reads it to get the number of requests and
+ *                 updates the buffer with the number of requests that
+ *                 have been processed successfully. This pointer must
+ *                 point to a __u32 type of memory location.
+ * @inv_data_uptr: Pointer to the cache invalidation requests
+ *
+ * The Intel VT-d specific invalidation data for a set of cache invalidation
+ * requests. Kernel loops the requests one-by-one and stops when failure
+ * is encountered. The number of handled requests is reported to user by
+ * writing the buffer pointed by @entry_nr_uptr.
+ */
+struct iommu_hwpt_invalidate_intel_vtd {
+	__u32 flags;
+	__u32 entry_size;
+	__u64 entry_nr_uptr;
+	__u64 inv_data_uptr;
+};
+
 /**
  * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
  * @size: sizeof(struct iommu_hwpt_invalidate)
@@ -520,6 +577,8 @@ struct iommu_hw_info {
  * +==============================+========================================+
  * | @hwpt_type                   | Data structure in @data_uptr           |
  * +------------------------------+----------------------------------------+
+ * | IOMMU_HWPT_TYPE_VTD_S1       | struct iommu_hwpt_invalidate_intel_vtd |
+ * +------------------------------+----------------------------------------+
  */
 struct iommu_hwpt_invalidate {
	__u32 size;
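For orientation, here is a minimal userspace sketch of how the two structures above chain together to request a full flush of a nested hwpt. The surrounding IOMMU_HWPT_INVALIDATE ioctl fields are omitted since they are defined elsewhere in the series, and all variable names are illustrative only:

	/* One "invalidate everything" request: addr == 0 && npages == -1 */
	struct iommu_hwpt_invalidate_request_intel_vtd req = {
		.addr = 0,
		.npages = (__u64)-1,
		.flags = 0,
	};
	__u32 entry_nr = 1;
	struct iommu_hwpt_invalidate_intel_vtd inv = {
		.entry_size = sizeof(req),
		.entry_nr_uptr = (__u64)(uintptr_t)&entry_nr,
		.inv_data_uptr = (__u64)(uintptr_t)&req,
	};
	/*
	 * 'inv' is then passed as the IOMMU_HWPT_TYPE_VTD_S1 data of
	 * IOMMU_HWPT_INVALIDATE; on return, entry_nr holds the number of
	 * requests the kernel actually processed.
	 */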
From: Liu, Yi L yi.l.liu@intel.com Sent: Thursday, May 11, 2023 10:51 PM
This is needed as the stage-1 page table of the nested domain is maintained outside the iommu subsystem, hence, needs to support iotlb flush requests.
This adds the data structure for flushing iotlb for the nested domain allocated with IOMMU_HWPT_TYPE_VTD_S1 type and the related callback to accept iotlb flush request from IOMMUFD.
This only exposes the interface for invalidating the IOTLB, but not for the device-TLB, as device-TLB invalidation is covered automatically by the IOTLB invalidation if the affected device is ATS-capable.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
Following how you split patches in the former part of the series, this should be split into three patches: one to introduce the uAPI changes, the second to export symbols, and the last to actually add the iotlb flush.
+static int intel_nested_cache_invalidate_user(struct iommu_domain *domain,
void *user_data)
+{
- struct iommu_hwpt_invalidate_request_intel_vtd *req = user_data;
- struct iommu_hwpt_invalidate_intel_vtd *inv_info = user_data;
- struct dmar_domain *dmar_domain = to_dmar_domain(domain);
- unsigned int entry_size = inv_info->entry_size;
- u64 uptr = inv_info->inv_data_uptr;
- u64 nr_uptr = inv_info->entry_nr_uptr;
- struct device_domain_info *info;
- u32 entry_nr, index;
- unsigned long flags;
- int ret = 0;
- if (WARN_ON(!user_data))
return 0;
A WARN_ON should lead to an error being returned.
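A minimal sketch of what the suggested change could look like (assuming -EINVAL is the preferred error code):

	if (WARN_ON(!user_data))
		return -EINVAL;	/* fail loudly instead of pretending success */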
- if (get_user(entry_nr, (uint32_t __user *)u64_to_user_ptr(nr_uptr)))
return -EFAULT;
- if (!entry_nr)
return -EINVAL;
Having zero entries is not an error; it just means there is no work to do.
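A sketch of the suggested behavior:

	if (!entry_nr)
		return 0;	/* no entries means nothing to invalidate */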
- for (index = 0; index < entry_nr; index++) {
ret = copy_struct_from_user(req, sizeof(*req),
u64_to_user_ptr(uptr + index *
entry_size),
entry_size);
if (ret) {
pr_err_ratelimited("Failed to fetch invalidation
request\n");
break;
}
if (req->__reserved || (req->flags &
~IOMMU_VTD_QI_FLAGS_LEAF) ||
!IS_ALIGNED(req->addr, VTD_PAGE_SIZE)) {
ret = -EINVAL;
break;
}
spin_lock_irqsave(&dmar_domain->lock, flags);
list_for_each_entry(info, &dmar_domain->devices, link)
intel_nested_invalidate(info->dev, dmar_domain,
req->addr, req->npages);
spin_unlock_irqrestore(&dmar_domain->lock, flags);
- }
- if (ret && put_user(index, (uint32_t __user
*)u64_to_user_ptr(nr_uptr)))
return -EFAULT;
You want to always update the number of handled entries, no matter success or failure.
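A sketch of the suggested tail of the function, reporting progress unconditionally:

	/* always tell userspace how many requests were handled */
	if (put_user(index, (uint32_t __user *)u64_to_user_ptr(nr_uptr)))
		return -EFAULT;

	return ret;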
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 39922f83ce34..b338b082950b 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -282,6 +282,12 @@ union ucmd_buffer { #ifdef CONFIG_IOMMUFD_TEST struct iommu_test_cmd test; #endif
- /*
* hwpt_type specific structure used in the cache invalidation
* path.
*/
- struct iommu_hwpt_invalidate_intel_vtd vtd;
- struct iommu_hwpt_invalidate_request_intel_vtd req_vtd;
};
Can you add some explanation in the commit message of why such vendor-specific structures must be put in the generic ucmd_buffer?
+/**
- enum iommu_hwpt_intel_vtd_invalidate_flags - Flags for Intel VT-d
enum iommu_hwpt_vtd_s1_invalidate_flags
stage-1 page table cache
invalidation
- @IOMMU_VTD_QI_FLAGS_LEAF: The LEAF flag indicates whether only the
leaf PTE caching needs to be invalidated
and other paging structure caches can be
preserved.
- */
what about "Drain Reads" and "Drain Writes"? Is the user allowed/required to provide those hints?
+/**
- struct iommu_hwpt_invalidate_request_intel_vtd - Intel VT-d cache
invalidation request
here you put "intel_vtd" in the end of the name. let's follow the same order as earlier definitions.
struct iommu_hwpt_vtd_s1_invalidate_desc
- @addr: The start address of the addresses to be invalidated.
- @npages: Number of contiguous 4K pages to be invalidated.
- @flags: Combination of enum iommu_hwpt_intel_vtd_invalidate_flags
- @__reserved: Must be 0
- The Intel VT-d specific invalidation data for user-managed stage-1 cache
- invalidation under nested translation. Userspace uses this structure to
- tell host about the impacted caches after modifying the stage-1 page
table.
- Invalidating all the caches related to the hw_pagetable by setting
- @addr==0 and @npages==__u64(-1).
- */
+struct iommu_hwpt_invalidate_request_intel_vtd {
- __u64 addr;
- __u64 npages;
- __u32 flags;
- __u32 __reserved;
+};
+/**
- struct iommu_hwpt_invalidate_intel_vtd - Intel VT-d cache invalidation
info
iommu_hwpt_vtd_s1_invalidate
- @flags: Must be 0
- @entry_size: Size in bytes of each cache invalidation request
- @entry_nr_uptr: User pointer to the number of invalidation requests.
Kernel reads it to get the number of requests and
updates the buffer with the number of requests that
have been processed successfully. This pointer must
point to a __u32 type of memory location.
- @inv_data_uptr: Pointer to the cache invalidation requests
- The Intel VT-d specific invalidation data for a set of cache invalidation
- requests. Kernel loops the requests one-by-one and stops when failure
- is encountered. The number of handled requests is reported to user by
- writing the buffer pointed by @entry_nr_uptr.
- */
+struct iommu_hwpt_invalidate_intel_vtd {
- __u32 flags;
- __u32 entry_size;
- __u64 entry_nr_uptr;
- __u64 inv_data_uptr;
+};
/**
- struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
- @size: sizeof(struct iommu_hwpt_invalidate)
@@ -520,6 +577,8 @@ struct iommu_hw_info {
+==============================+================================ ========+
- | @hwpt_type | Data structure in @data_uptr |
- +------------------------------+----------------------------------------+
- | IOMMU_HWPT_TYPE_VTD_S1 | struct
iommu_hwpt_invalidate_intel_vtd |
*/
- +------------------------------+----------------------------------------+
struct iommu_hwpt_invalidate { __u32 size; -- 2.34.1
From: Tian, Kevin kevin.tian@intel.com Sent: Wednesday, May 24, 2023 3:34 PM
From: Liu, Yi L yi.l.liu@intel.com Sent: Thursday, May 11, 2023 10:51 PM
This is needed as the stage-1 page table of the nested domain is maintained outside the iommu subsystem, hence, needs to support iotlb flush requests.
This adds the data structure for flushing iotlb for the nested domain allocated with IOMMU_HWPT_TYPE_VTD_S1 type and the related callback to accept iotlb flush request from IOMMUFD.
This only exposes the interface for invalidating the IOTLB, but not for the device-TLB, as device-TLB invalidation is covered automatically by the IOTLB invalidation if the affected device is ATS-capable.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com Signed-off-by: Yi Liu yi.l.liu@intel.com
Following how you split patches in former part of the series this should be split into three patches: one to introduce the uAPI changes, the 2nd to export symbols and the last to actually add iotlb flush.
Will do.
+static int intel_nested_cache_invalidate_user(struct iommu_domain *domain,
void *user_data)
+{
- struct iommu_hwpt_invalidate_request_intel_vtd *req = user_data;
- struct iommu_hwpt_invalidate_intel_vtd *inv_info = user_data;
- struct dmar_domain *dmar_domain = to_dmar_domain(domain);
- unsigned int entry_size = inv_info->entry_size;
- u64 uptr = inv_info->inv_data_uptr;
- u64 nr_uptr = inv_info->entry_nr_uptr;
- struct device_domain_info *info;
- u32 entry_nr, index;
- unsigned long flags;
- int ret = 0;
- if (WARN_ON(!user_data))
return 0;
WARN_ON should lead to error returned.
Yes, or we may just remove it; the caller should provide a valid pointer anyhow.
- if (get_user(entry_nr, (uint32_t __user *)u64_to_user_ptr(nr_uptr)))
return -EFAULT;
- if (!entry_nr)
return -EINVAL;
Having zero number of entries is instead not an error. Just means no work to do.
- for (index = 0; index < entry_nr; index++) {
ret = copy_struct_from_user(req, sizeof(*req),
u64_to_user_ptr(uptr + index *
entry_size),
entry_size);
if (ret) {
pr_err_ratelimited("Failed to fetch invalidation
request\n");
break;
}
if (req->__reserved || (req->flags &
~IOMMU_VTD_QI_FLAGS_LEAF) ||
!IS_ALIGNED(req->addr, VTD_PAGE_SIZE)) {
ret = -EINVAL;
break;
}
spin_lock_irqsave(&dmar_domain->lock, flags);
list_for_each_entry(info, &dmar_domain->devices, link)
intel_nested_invalidate(info->dev, dmar_domain,
req->addr, req->npages);
spin_unlock_irqrestore(&dmar_domain->lock, flags);
- }
- if (ret && put_user(index, (uint32_t __user
*)u64_to_user_ptr(nr_uptr)))
return -EFAULT;
You want to always update the nr no matter success or failure
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 39922f83ce34..b338b082950b 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -282,6 +282,12 @@ union ucmd_buffer { #ifdef CONFIG_IOMMUFD_TEST struct iommu_test_cmd test; #endif
- /*
* hwpt_type specific structure used in the cache invalidation
* path.
*/
- struct iommu_hwpt_invalidate_intel_vtd vtd;
- struct iommu_hwpt_invalidate_request_intel_vtd req_vtd;
};
Can you add some explanation in commit msg why such vendor specific structures must be put in the generic ucmd_buffer?
+/**
- enum iommu_hwpt_intel_vtd_invalidate_flags - Flags for Intel VT-d
enum iommu_hwpt_vtd_s1_invalidate_flags
stage-1 page table cache
invalidation
- @IOMMU_VTD_QI_FLAGS_LEAF: The LEAF flag indicates whether only the
leaf PTE caching needs to be invalidated
and other paging structure caches can be
preserved.
- */
what about "Drain Reads" and "Drain Writes"? Is the user allowed/required to provide those hints?
Got it on all the other comments. For these two hints, the two flags come from the IOTLB Invalidation descriptor. Per the description below, hardware that supports nesting always performs the required drain and does not require software to ask for it. So there appears to be no need to define them in the uAPI.
"Hardware implementation with Major Version 2 or higher (VER_REG), always performs required drain without software explicitly requesting a drain in IOTLB invalidation. This field is deprecated and hardware will always report it as 1 to maintain backward compatibility with software"
Regards, Yi Liu
+/**
- struct iommu_hwpt_invalidate_request_intel_vtd - Intel VT-d cache
invalidation request
here you put "intel_vtd" in the end of the name. let's follow the same order as earlier definitions.
struct iommu_hwpt_vtd_s1_invalidate_desc
- @addr: The start address of the addresses to be invalidated.
- @npages: Number of contiguous 4K pages to be invalidated.
- @flags: Combination of enum iommu_hwpt_intel_vtd_invalidate_flags
- @__reserved: Must be 0
- The Intel VT-d specific invalidation data for user-managed stage-1 cache
- invalidation under nested translation. Userspace uses this structure to
- tell host about the impacted caches after modifying the stage-1 page
table.
- Invalidating all the caches related to the hw_pagetable by setting
- @addr==0 and @npages==__u64(-1).
- */
+struct iommu_hwpt_invalidate_request_intel_vtd {
- __u64 addr;
- __u64 npages;
- __u32 flags;
- __u32 __reserved;
+};
+/**
- struct iommu_hwpt_invalidate_intel_vtd - Intel VT-d cache invalidation
info
iommu_hwpt_vtd_s1_invalidate
- @flags: Must be 0
- @entry_size: Size in bytes of each cache invalidation request
- @entry_nr_uptr: User pointer to the number of invalidation requests.
Kernel reads it to get the number of requests and
updates the buffer with the number of requests that
have been processed successfully. This pointer must
point to a __u32 type of memory location.
- @inv_data_uptr: Pointer to the cache invalidation requests
- The Intel VT-d specific invalidation data for a set of cache invalidation
- requests. Kernel loops the requests one-by-one and stops when failure
- is encountered. The number of handled requests is reported to user by
- writing the buffer pointed by @entry_nr_uptr.
- */
+struct iommu_hwpt_invalidate_intel_vtd {
- __u32 flags;
- __u32 entry_size;
- __u64 entry_nr_uptr;
- __u64 inv_data_uptr;
+};
/**
- struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
- @size: sizeof(struct iommu_hwpt_invalidate)
@@ -520,6 +577,8 @@ struct iommu_hw_info {
+==============================+================================ ========+
- | @hwpt_type | Data structure in @data_uptr |
- +------------------------------+----------------------------------------+
- | IOMMU_HWPT_TYPE_VTD_S1 | struct
iommu_hwpt_invalidate_intel_vtd |
*/
- +------------------------------+----------------------------------------+
struct iommu_hwpt_invalidate { __u32 size; -- 2.34.1
On 2023/6/8 15:14, Liu, Yi L wrote:
stage-1 page table cache
invalidation
- @IOMMU_VTD_QI_FLAGS_LEAF: The LEAF flag indicates whether only the
leaf PTE caching needs to be invalidated
and other paging structure caches can be
preserved.
- */
what about "Drain Reads" and "Drain Writes"? Is the user allowed/required to provide those hints?
All other comments got. For these two hints, the two flags are from the IOTLB Invalidation descriptor. Per below description, the hardware that supports nested should support drain and does not require software to ask for it. So it appears no need to define them in uapi.
"Hardware implementation with Major Version 2 or higher (VER_REG), always performs required drain without software explicitly requesting a drain in IOTLB invalidation. This field is deprecated and hardware will always report it as 1 to maintain backward compatibility with software"
Makes sense. Perhaps we can also remove the below code in __iommu_flush_iotlb():
	/* Note: set drain read/write */
#if 0
	/*
	 * This is probably to be super secure.. Looks like we can
	 * ignore it without any impact.
	 */
	if (cap_read_drain(iommu->cap))
		val |= DMA_TLB_READ_DRAIN;
#endif
Best regards, baolu
From: Baolu Lu baolu.lu@linux.intel.com Sent: Thursday, June 8, 2023 4:08 PM
On 2023/6/8 15:14, Liu, Yi L wrote:
stage-1 page table cache
invalidation
- @IOMMU_VTD_QI_FLAGS_LEAF: The LEAF flag indicates whether only the
leaf PTE caching needs to be invalidated
and other paging structure caches can be
preserved.
- */
what about "Drain Reads" and "Drain Writes"? Is the user allowed/required to provide those hints?
All other comments got. For these two hints, the two flags are from the IOTLB Invalidation descriptor. Per below description, the hardware that supports nested should support drain and does not require software to ask for it. So it appears no need to define them in uapi.
"Hardware implementation with Major Version 2 or higher (VER_REG), always performs required drain without software explicitly requesting a drain in IOTLB invalidation. This field is deprecated and hardware will always report it as 1 to maintain backward compatibility with software"
Make sense. Perhaps we can also remove below code in __iommu_flush_iotlb():
/* Note: set drain read/write */
#if 0
	/*
	 * This is probably to be super secure.. Looks like we can
	 * ignore it without any impact.
	 */
	if (cap_read_drain(iommu->cap))
		val |= DMA_TLB_READ_DRAIN;
#endif
This seems to be dead code, but it has been there for a long time, since the commit below.
ba39592764ed20cee09aae5352e603a27bf56b0d
Regards, Yi Liu
From: Lu Baolu baolu.lu@linux.intel.com
This adds support for the IOMMU_HWPT_TYPE_VTD_S1 hwpt type.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/intel/iommu.c | 19 +++++++++++++++++++
 include/linux/iommu.h       |  1 +
 2 files changed, 20 insertions(+)
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 5f27cee4656a..51612da4a1ea 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4092,6 +4092,16 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type)
	return NULL;
 }

+static struct iommu_domain *
+intel_iommu_domain_alloc_user(struct device *dev, struct iommu_domain *parent,
+			      const union iommu_domain_user_data *user_data)
+{
+	if (parent)
+		return intel_nested_domain_alloc(parent, user_data);
+	else
+		return iommu_domain_alloc(dev->bus);
+}
+
 static void intel_iommu_domain_free(struct iommu_domain *domain)
 {
	if (domain != &si_domain->domain && domain != &blocking_domain)
@@ -4736,9 +4746,18 @@ static void intel_iommu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
	intel_pasid_tear_down_entry(iommu, dev, pasid, false);
 }

+static int intel_iommu_domain_user_data_len(u32 hwpt_type)
+{
+	if (hwpt_type != IOMMU_HWPT_TYPE_VTD_S1)
+		return -EOPNOTSUPP;
+	return sizeof(struct iommu_hwpt_intel_vtd);
+};
+
 const struct iommu_ops intel_iommu_ops = {
	.capable		= intel_iommu_capable,
	.domain_alloc		= intel_iommu_domain_alloc,
+	.domain_alloc_user	= intel_iommu_domain_alloc_user,
+	.domain_alloc_user_data_len = intel_iommu_domain_user_data_len,
	.probe_device		= intel_iommu_probe_device,
	.probe_finalize		= intel_iommu_probe_finalize,
	.release_device		= intel_iommu_release_device,
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index c696d8e7f43b..32324f0756de 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -234,6 +234,7 @@ union iommu_domain_user_data {
 #ifdef CONFIG_IOMMUFD_TEST
	__u64 test[2];
 #endif
+	struct iommu_hwpt_intel_vtd vtd;
 };
/**
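For context, a rough sketch of the user data a VMM would supply through this allocation path; only @pgtbl_addr is referenced later in this series, the remaining fields of struct iommu_hwpt_intel_vtd are elided, and the variable name is illustrative:

	/* Illustrative only: stage-1 configuration passed with the hwpt allocation */
	struct iommu_hwpt_intel_vtd vtd_data = {
		.pgtbl_addr = s1_pgd_gpa,	/* guest first-stage page-table root (hypothetical variable) */
	};
	/* a non-NULL parent selects intel_nested_domain_alloc() in the callback above */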
Add intel_iommu_hw_info() to report cap_reg and ecap_reg information.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/intel/iommu.c  | 19 +++++++++++++++++++
 include/uapi/linux/iommufd.h | 23 +++++++++++++++++++++++
 2 files changed, 42 insertions(+)
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 51612da4a1ea..20d4ae1cb8a6 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4746,6 +4746,23 @@ static void intel_iommu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
	intel_pasid_tear_down_entry(iommu, dev, pasid, false);
 }

+static void *intel_iommu_hw_info(struct device *dev, u32 *length)
+{
+	struct device_domain_info *info = dev_iommu_priv_get(dev);
+	struct intel_iommu *iommu = info->iommu;
+	struct iommu_hw_info_vtd *vtd;
+
+	vtd = kzalloc(sizeof(*vtd), GFP_KERNEL);
+	if (!vtd)
+		return ERR_PTR(-ENOMEM);
+
+	vtd->cap_reg = iommu->cap;
+	vtd->ecap_reg = iommu->ecap;
+	*length = sizeof(*vtd);
+
+	return vtd;
+}
+
 static int intel_iommu_domain_user_data_len(u32 hwpt_type)
 {
	if (hwpt_type != IOMMU_HWPT_TYPE_VTD_S1)
@@ -4755,6 +4772,7 @@ static int intel_iommu_domain_user_data_len(u32 hwpt_type)

 const struct iommu_ops intel_iommu_ops = {
	.capable		= intel_iommu_capable,
+	.hw_info		= intel_iommu_hw_info,
	.domain_alloc		= intel_iommu_domain_alloc,
	.domain_alloc_user	= intel_iommu_domain_alloc_user,
	.domain_alloc_user_data_len = intel_iommu_domain_user_data_len,
@@ -4769,6 +4787,7 @@ const struct iommu_ops intel_iommu_ops = {
	.def_domain_type	= device_def_domain_type,
	.remove_dev_pasid	= intel_iommu_remove_dev_pasid,
	.pgsize_bitmap		= SZ_4K,
+	.hw_info_type		= IOMMU_HW_INFO_TYPE_INTEL_VTD,
 #ifdef CONFIG_INTEL_IOMMU_SVM
	.page_response		= intel_svm_page_response,
 #endif
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 2e658fa346ad..b46270a4cf46 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -464,9 +464,32 @@ struct iommu_hwpt_alloc {

 /**
  * enum iommu_hw_info_type - IOMMU Hardware Info Types
+ * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
  */
 enum iommu_hw_info_type {
	IOMMU_HW_INFO_TYPE_NONE,
+	IOMMU_HW_INFO_TYPE_INTEL_VTD,
 };
+
+/**
+ * struct iommu_hw_info_vtd - Intel VT-d hardware information
+ *
+ * @flags: Must be 0
+ * @__reserved: Must be 0
+ *
+ * @cap_reg: Value of Intel VT-d capability register defined in VT-d spec
+ *           section 11.4.2 Capability Register.
+ * @ecap_reg: Value of Intel VT-d capability register defined in VT-d spec
+ *            section 11.4.3 Extended Capability Register.
+ *
+ * User needs to understand the Intel VT-d specification to decode the
+ * register value.
+ */
+struct iommu_hw_info_vtd {
+	__u32 flags;
+	__u32 __reserved;
+	__aligned_u64 cap_reg;
+	__aligned_u64 ecap_reg;
+};
/**
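As a usage note, a userspace consumer of this report might sanity-check the nesting capability before trying to allocate a nested hwpt. This sketch assumes the NEST bit is bit 26 of the extended capability register per the VT-d specification; verify against the spec before relying on it:

	/* Sketch: decode the reported ecap_reg (bit 26 = NEST, an assumption here) */
	static bool vtd_supports_nesting(const struct iommu_hw_info_vtd *vtd)
	{
		return (vtd->ecap_reg >> 26) & 0x1;
	}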
From: Lu Baolu baolu.lu@linux.intel.com
When remapping hardware is configured by system software in scalable mode as Nested (PGTT=011b) and with PWSNP field Set in the PASID-table-entry, it may Set Accessed bit and Dirty bit (and Extended Access bit if enabled) in first-stage page-table entries even when second-stage mappings indicate that corresponding first-stage page-table is Read-Only.
As a result, contents of pages designated by the VMM as Read-Only can be modified by the IOMMU via PML5E (PML4E for 4-level tables) accesses as part of the address translation process, due to DMAs issued by the Guest.
Disallow nested translation when there are read-only pages in the corresponding second-stage mappings, and disallow read-only pages from being configured into the second-stage table of a nested translation. For the latter, an alternative is to disallow read-only mappings in any stage-2 domain that has ever been used as a parent; in that way, the user counter can simply be replaced with a flag.
Conceptually, if the user understands this errata and does expect to enable nested translation, it should never install any RO mapping in stage-2 during the entire VM life cycle.
Reference from Sapphire Rapids Specification Update [1], errata details, SPR17.
[1] https://www.intel.com/content/www/us/en/content-details/772415/content-detai...
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/intel/iommu.c  | 13 +++++++++++++
 drivers/iommu/intel/iommu.h  |  4 ++++
 drivers/iommu/intel/nested.c | 22 ++++++++++++++++++++--
 include/uapi/linux/iommufd.h | 12 +++++++++++-
 4 files changed, 48 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 20d4ae1cb8a6..42288bd449a0 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2150,6 +2150,7 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
	struct dma_pte *first_pte = NULL, *pte = NULL;
	unsigned int largepage_lvl = 0;
	unsigned long lvl_pages = 0;
+	unsigned long flags;
	phys_addr_t pteval;
	u64 attr;

@@ -2159,6 +2160,17 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
	if ((prot & (DMA_PTE_READ|DMA_PTE_WRITE)) == 0)
		return -EINVAL;

+	if (!(prot & DMA_PTE_WRITE) && !domain->read_only_mapped) {
+		spin_lock_irqsave(&domain->lock, flags);
+		if (domain->nested_users > 0) {
+			spin_unlock_irqrestore(&domain->lock, flags);
+			return -EINVAL;
+		}
+
+		domain->read_only_mapped = true;
+		spin_unlock_irqrestore(&domain->lock, flags);
+	}
+
	attr = prot & (DMA_PTE_READ | DMA_PTE_WRITE | DMA_PTE_SNP);
	attr |= DMA_FL_PTE_PRESENT;
	if (domain->use_first_level) {
@@ -4756,6 +4768,7 @@ static void *intel_iommu_hw_info(struct device *dev, u32 *length)
	if (!vtd)
		return ERR_PTR(-ENOMEM);

+	vtd->flags = IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17;
	vtd->cap_reg = iommu->cap;
	vtd->ecap_reg = iommu->ecap;
	*length = sizeof(*vtd);
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 581596d90c1b..95644c6815af 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -616,6 +616,10 @@ struct dmar_domain {
	int		agaw;
	/* maximum mapped address */
	u64		max_addr;
+	/* domain has mappings with read-only permission */
+	bool		read_only_mapped;
+	/* user nested domain count */
+	int		nested_users;
 };

 /* Nested user domain */
diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c
index d13fbcd3f5a6..9092ce28382c 100644
--- a/drivers/iommu/intel/nested.c
+++ b/drivers/iommu/intel/nested.c
@@ -61,7 +61,14 @@ static int intel_nested_attach_dev(struct iommu_domain *domain,

 static void intel_nested_domain_free(struct iommu_domain *domain)
 {
-	kfree(to_dmar_domain(domain));
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	struct dmar_domain *s2_domain = dmar_domain->s2_domain;
+	unsigned long flags;
+
+	spin_lock_irqsave(&s2_domain->lock, flags);
+	s2_domain->nested_users--;
+	spin_unlock_irqrestore(&s2_domain->lock, flags);
+	kfree(dmar_domain);
 }

 static void intel_nested_invalidate(struct device *dev,
@@ -143,14 +150,25 @@ struct iommu_domain *intel_nested_domain_alloc(struct iommu_domain *s2_domain,
					       const union iommu_domain_user_data *user_data)
 {
	const struct iommu_hwpt_intel_vtd *vtd = (struct iommu_hwpt_intel_vtd *)user_data;
+	struct dmar_domain *s2_dmar_domain = to_dmar_domain(s2_domain);
	struct dmar_domain *domain;
+	unsigned long flags;

	domain = kzalloc(sizeof(*domain), GFP_KERNEL_ACCOUNT);
	if (!domain)
		return NULL;

+	spin_lock_irqsave(&s2_dmar_domain->lock, flags);
+	if (s2_dmar_domain->read_only_mapped) {
+		spin_unlock_irqrestore(&s2_dmar_domain->lock, flags);
+		kfree(domain);
+		return NULL;
+	}
+	s2_dmar_domain->nested_users++;
+	spin_unlock_irqrestore(&s2_dmar_domain->lock, flags);
+
	domain->use_first_level = true;
-	domain->s2_domain = to_dmar_domain(s2_domain);
+	domain->s2_domain = s2_dmar_domain;
	domain->s1_pgtbl = vtd->pgtbl_addr;
	domain->s1_cfg = *vtd;
	domain->domain.ops = &intel_nested_domain_ops;
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index b46270a4cf46..8626f36e0353 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -471,10 +471,20 @@ enum iommu_hw_info_type {
	IOMMU_HW_INFO_TYPE_INTEL_VTD,
 };

+/**
+ * enum iommu_hw_info_vtd_flags - Flags for VT-d hw_info
+ * @IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17: If set, disallow nesting on domains
+ *                                         with read-only mapping.
+ *                                         https://www.intel.com/content/www/us/en/content-details/772415/content-detai...
+ */
+enum iommu_hw_info_vtd_flags {
+	IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 = 1 << 0,
+};
+
 /**
  * struct iommu_hw_info_vtd - Intel VT-d hardware information
  *
- * @flags: Must be 0
+ * @flags: Combination of enum iommu_hw_info_vtd_flags
  * @__reserved: Must be 0
  *
  * @cap_reg: Value of Intel VT-d capability register defined in VT-d spec
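As a usage note, a VMM consuming this report could adjust its stage-2 mapping policy accordingly; a small hypothetical sketch (variable names are illustrative only):

	/* Hypothetical VMM-side policy decision based on the reported flag */
	if (vtd->flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17)
		skip_readonly_stage2_mappings = true;	/* required to keep nesting usable */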
From: Liu, Yi L yi.l.liu@intel.com Sent: Thursday, May 11, 2023 10:51 PM
From: Lu Baolu baolu.lu@linux.intel.com
When remapping hardware is configured by system software in scalable mode as Nested (PGTT=011b) and with PWSNP field Set in the PASID-table-entry, it may Set Accessed bit and Dirty bit (and Extended Access bit if enabled) in first-stage page-table entries even when second-stage mappings indicate that corresponding first-stage page-table is Read-Only.
As the result, contents of pages designated by VMM as Read-Only can be modified by IOMMU via PML5E (PML4E for 4-level tables) access as part of address translation process due to DMAs issued by Guest.
Disallow the nested translation when there are read-only pages in the corresponding second-stage mappings. And, no read-only pages are allowed to be configured in the second-stage table of a nested translation. For the latter, an alternative is to disallow read-only mappings in any stage-2 domain as long as it's ever been used as a parent. In this way, we can simply replace the user counter with a flag.
In concept if the user understands this errata and does expect to enable nested translation it should never install any RO mapping in stage-2 in the entire VM life cycle."
IMHO the alternative is reasonable and simpler. If the user decides to enable nesting, it should keep the nesting-friendly configuration static, since whether nesting is enabled on a device follows the vIOMMU configuration (i.e. whether the guest attaches the device to the identity domain or a non-identity domain). It's not good to break the nesting setup just because the host inadvertently adds an RO mapping to stage-2 in the window between the guest detaching/putting the device back to the identity domain and re-attaching it to an unmanaged domain.
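A rough sketch of the flag-based alternative (the field name nested_parent is hypothetical):

	/* Once a domain has ever been used as a nesting parent, refuse RO mappings. */
	if (!(prot & DMA_PTE_WRITE) && domain->nested_parent)
		return -EINVAL;
	/*
	 * ...while intel_nested_domain_alloc() keeps rejecting parents that
	 * already have RO mappings, but sets the sticky flag instead of
	 * incrementing a counter.
	 */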
- if (!(prot & DMA_PTE_WRITE) && !domain->read_only_mapped) {
spin_lock_irqsave(&domain->lock, flags);
if (domain->nested_users > 0) {
spin_unlock_irqrestore(&domain->lock, flags);
return -EINVAL;
}
this is worth a one-off warning. Same in the other path.
On 5/24/23 3:44 PM, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Thursday, May 11, 2023 10:51 PM
From: Lu Baolu baolu.lu@linux.intel.com
When remapping hardware is configured by system software in scalable mode as Nested (PGTT=011b) and with PWSNP field Set in the PASID-table-entry, it may Set Accessed bit and Dirty bit (and Extended Access bit if enabled) in first-stage page-table entries even when second-stage mappings indicate that corresponding first-stage page-table is Read-Only.
As the result, contents of pages designated by VMM as Read-Only can be modified by IOMMU via PML5E (PML4E for 4-level tables) access as part of address translation process due to DMAs issued by Guest.
Disallow the nested translation when there are read-only pages in the corresponding second-stage mappings. And, no read-only pages are allowed to be configured in the second-stage table of a nested translation. For the latter, an alternative is to disallow read-only mappings in any stage-2 domain as long as it's ever been used as a parent. In this way, we can simply replace the user counter with a flag.
In concept if the user understands this errata and does expect to enable nested translation it should never install any RO mapping in stage-2 in the entire VM life cycle."
IMHO the alternative is reasonable and simpler. If the user decides to enable nesting it should keep the nesting-friendly configuration static since whether nesting is enabled on a device is according to viommu configuration (i.e. whether the guest attaches the device to identity domain or non-identity domain) and it's not good to break the nesting setup just because the host inadvertently adds a RO mapping to s2 in the middle between guest is detached/put back to identity domain and then re-attach to an unmanaged domain.
Fair enough.
- if (!(prot & DMA_PTE_WRITE) && !domain->read_only_mapped) {
spin_lock_irqsave(&domain->lock, flags);
if (domain->nested_users > 0) {
spin_unlock_irqrestore(&domain->lock, flags);
return -EINVAL;
}
this is worth a one-off warning. Same in the other path.
Sure.
Best regards, baolu
From: Liu, Yi L yi.l.liu@intel.com Sent: Thursday, May 11, 2023 10:51 PM
The first Intel platform supporting nested translation is Sapphire Rapids which, unfortunately, has a hardware errata [2] requiring special treatment. This errata happens when a stage-1 page table page (either level) is located in a stage-2 read-only region. In that case the IOMMU hardware may ignore the stage-2 RO permission and still set the A/D bit in stage-1 page table entries during page table walking.
A flag IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is introduced to report this errata to userspace. With that restriction the user should either disable nested translation to favor RO stage-2 mappings or ensure no RO stage-2 mapping to enable nested translation.
Intel-iommu driver is armed with necessary checks to prevent such mix in patch10 of this series.
Qemu currently does add RO mappings though. The vfio agent in Qemu simply maps all valid regions in the GPA address space which certainly includes RO regions e.g. vbios.
In reality we don't know a usage relying on DMA reads from the BIOS region. Hence finding a way to allow user opt-out RO mappings in Qemu might be an acceptable tradeoff. But how to achieve it cleanly needs more discussion in Qemu community. For now we just hacked Qemu to test.
Hi, Alex,
Want to touch base on your thoughts about this errata before we actually go to discuss how to handle it in Qemu.
Overall it affects all Sapphire Rapids platforms. Fully disabling nested translation in the kernel just for this rare vulnerability sounds like overkill.
So we decided to enforce the exclusive check (RO in stage-2 vs. nesting) in the kernel and expose the restriction to userspace, so the VMM can choose which one to enable based on its own requirements.
At least this looks like a reasonable tradeoff for some proprietary VMMs, which never add RO mappings in stage-2 today.
But we do want to get QEMU, as the widely-used reference VMM, to support nested translation on those platforms!
Do you see any major oversight before we pursue such a change in QEMU, e.g. having a way for the user to opt out of adding RO mappings in stage-2? 😊
Thanks Kevin
On Wed, 24 May 2023 08:59:43 +0000 "Tian, Kevin" kevin.tian@intel.com wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Thursday, May 11, 2023 10:51 PM
The first Intel platform supporting nested translation is Sapphire Rapids which, unfortunately, has a hardware errata [2] requiring special treatment. This errata happens when a stage-1 page table page (either level) is located in a stage-2 read-only region. In that case the IOMMU hardware may ignore the stage-2 RO permission and still set the A/D bit in stage-1 page table entries during page table walking.
A flag IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is introduced to report this errata to userspace. With that restriction the user should either disable nested translation to favor RO stage-2 mappings or ensure no RO stage-2 mapping to enable nested translation.
Intel-iommu driver is armed with necessary checks to prevent such mix in patch10 of this series.
Qemu currently does add RO mappings though. The vfio agent in Qemu simply maps all valid regions in the GPA address space which certainly includes RO regions e.g. vbios.
In reality we don't know a usage relying on DMA reads from the BIOS region. Hence finding a way to allow user opt-out RO mappings in Qemu might be an acceptable tradeoff. But how to achieve it cleanly needs more discussion in Qemu community. For now we just hacked Qemu to test.
Hi, Alex,
Want to touch base on your thoughts about this errata before we actually go to discuss how to handle it in Qemu.
Overall it affects all Sapphire Rapids platforms. Fully disabling nested translation in the kernel just for this rare vulnerability sounds an overkill.
So we decide to enforce the exclusive check (RO in stage-2 vs. nesting) in the kernel and expose the restriction to userspace so the VMM can choose which one to enable based on its own requirement.
At least this looks a reasonable tradeoff to some proprietary VMMs which never adds RO mappings in stage-2 today.
But we do want to get Qemu support nested translation on those platform as the widely-used reference VMM!
Do you see any major oversight before pursuing such change in Qemu e.g. having a way for the user to opt-out adding RO mappings in stage-2? 😊
I don't feel like I have enough info to know what common scenarios are going to make use of 2-stage and nested configurations and how likely a user is to need such an opt-out. If it's likely that a user is going to encounter this configuration, an opt-out is at best a workaround. It's a significant support issue if a user needs to generate a failure in QEMU, notice and decipher any log messages that failure may have generated, and take action to introduce specific changes in their VM configuration to support a usage restriction.
For QEMU I might lean more towards an effort to better filter the mappings we create to avoid these read-only ranges that likely don't require DMA mappings anyway.
How much does this affect arbitrary userspace vfio drivers? For example are there scenarios where running in a VM with a vIOMMU introduces nested support that's unknown to the user which now prevents this usage? An example might be running an L2 guest with a version of QEMU that does create read-only mappings. If necessary, how would lack of read-only mapping support be conveyed to those nested use cases? Thanks,
Alex
From: Alex Williamson alex.williamson@redhat.com Sent: Friday, May 26, 2023 2:07 AM
On Wed, 24 May 2023 08:59:43 +0000 "Tian, Kevin" kevin.tian@intel.com wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Thursday, May 11, 2023 10:51 PM
The first Intel platform supporting nested translation is Sapphire Rapids which, unfortunately, has a hardware errata [2] requiring special treatment. This errata happens when a stage-1 page table page (either level) is located in a stage-2 read-only region. In that case the IOMMU hardware may ignore the stage-2 RO permission and still set the A/D bit in stage-1 page table entries during page table walking.
A flag IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is introduced to report this errata to userspace. With that restriction the user should either disable nested translation to favor RO stage-2 mappings or ensure no RO stage-2 mapping to enable nested translation.
Intel-iommu driver is armed with necessary checks to prevent such mix in patch10 of this series.
Qemu currently does add RO mappings though. The vfio agent in Qemu simply maps all valid regions in the GPA address space which certainly includes RO regions e.g. vbios.
In reality we don't know a usage relying on DMA reads from the BIOS region. Hence finding a way to allow user opt-out RO mappings in Qemu might be an acceptable tradeoff. But how to achieve it cleanly needs more discussion in Qemu community. For now we just hacked
Qemu
to test.
Hi, Alex,
Want to touch base on your thoughts about this errata before we actually go to discuss how to handle it in Qemu.
Overall it affects all Sapphire Rapids platforms. Fully disabling nested translation in the kernel just for this rare vulnerability sounds an overkill.
So we decide to enforce the exclusive check (RO in stage-2 vs. nesting) in the kernel and expose the restriction to userspace so the VMM can choose which one to enable based on its own requirement.
At least this looks a reasonable tradeoff to some proprietary VMMs which never adds RO mappings in stage-2 today.
But we do want to get Qemu support nested translation on those platform as the widely-used reference VMM!
Do you see any major oversight before pursuing such change in Qemu e.g. having a way for the user to opt-out adding RO mappings in stage-2?
😊
I don't feel like I have enough info to know what common scenarios are going to make use of 2-stage and nested configurations and how likely a user is to need such an opt-out. If it's likely that a user is going to encounter this configuration, an opt-out is at best a workaround. It's a significant support issue if a user needs to generate a failure in QEMU, notice and decipher any log messages that failure may have generated, and take action to introduce specific changes in their VM configuration to support a usage restriction.
Thanks. This is a good point.
For QEMU I might lean more towards an effort to better filter the mappings we create to avoid these read-only ranges that likely don't require DMA mappings anyway.
We thought about having intel-viommu register a discard memory manager to filter out such regions in case the kernel reports this errata.
Our original thought was that, even with that, we might still want to explicitly let the user opt in, given that this configuration doesn't match bare metal. But with your explanation, doing so probably causes more trouble than it's worth.
How much does this affect arbitrary userspace vfio drivers? For example are there scenarios where running in a VM with a vIOMMU introduces nested support that's unknown to the user which now prevents this usage? An example might be running an L2 guest with a version of QEMU that does create read-only mappings. If necessary, how would lack of read-only mapping support be conveyed to those nested use cases?
To enable nested translation it's expected to have the guest use stage-1 while the host uses stage-2. So the L0 QEMU will expose a vIOMMU with only stage-1 capability to L1.
In that case it's perfectly fine to have RO mappings in stage-1, no matter whether L1 further creates an L2 guest inside.
Then only L0 QEMU needs to care about this RO thing in stage-2.
If L0 QEMU exposes a legacy vIOMMU which supports only stage-2, then nesting cannot be enabled. Instead it will fall back to the old shadowing path, so RO mappings from the guest don't matter either.
Exposing a vIOMMU which supports stage-1, stage-2, and nesting is another story. But I believe it's far from the point where this becomes useful, and it's reasonable to just have L0 QEMU not support this configuration until this errata is fixed. 😊
Thanks, Kevin
On Wed, May 24, 2023 at 08:59:43AM +0000, Tian, Kevin wrote:
At least this looks a reasonable tradeoff to some proprietary VMMs which never adds RO mappings in stage-2 today.
What is the reason for the RO anyhow?
Would it be so bad if it was DMA mapped as RW due to the errata?
Jason
On Mon, 29 May 2023 15:43:02 -0300 Jason Gunthorpe jgg@nvidia.com wrote:
On Wed, May 24, 2023 at 08:59:43AM +0000, Tian, Kevin wrote:
At least this looks a reasonable tradeoff to some proprietary VMMs which never adds RO mappings in stage-2 today.
What is the reason for the RO anyhow?
Would it be so bad if it was DMA mapped as RW due to the errata?
What if it's the zero page? Thanks,
Alex
On Mon, May 29, 2023 at 06:16:44PM -0600, Alex Williamson wrote:
On Mon, 29 May 2023 15:43:02 -0300 Jason Gunthorpe jgg@nvidia.com wrote:
On Wed, May 24, 2023 at 08:59:43AM +0000, Tian, Kevin wrote:
At least this looks a reasonable tradeoff to some proprietary VMMs which never adds RO mappings in stage-2 today.
What is the reason for the RO anyhow?
Would it be so bad if it was DMA mapped as RW due to the errata?
What if it's the zero page? Thanks,
GUP doesn't return the zero page if FOLL_WRITE is specified.
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Tuesday, May 30, 2023 2:43 AM
On Wed, May 24, 2023 at 08:59:43AM +0000, Tian, Kevin wrote:
At least this looks a reasonable tradeoff to some proprietary VMMs which never adds RO mappings in stage-2 today.
What is the reason for the RO anyhow?
vfio simply follows the permission in the CPU address space.
vBIOS regions are marked RO there, and that permission is also carried over to the vfio mappings.
Would it be so bad if it was DMA mapped as RW due to the errata?
Think of a scenario where the vBIOS memory is shared by multiple QEMU instances; RW would then allow a malicious VM to modify the shared content, potentially attacking other VMs.
skipping the mapping is safest in this regard.
On Wed, Jun 14, 2023 at 08:07:30AM +0000, Tian, Kevin wrote:
think of a scenario where the vbios memory is shared by multiple qemu instances then RW allows a malicious VM to modify the shared content then potentially attacking other VMs.
qemu would have to map the vbios as MAP_PRIVATE WRITE before the iommu side could map it writable, so this is not a real worry.
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Wednesday, June 14, 2023 7:53 PM
On Wed, Jun 14, 2023 at 08:07:30AM +0000, Tian, Kevin wrote:
think of a scenario where the vbios memory is shared by multiple qemu instances then RW allows a malicious VM to modify the shared content then potentially attacking other VMs.
qemu would have to map the vbios as MAP_PRIVATE WRITE before the iommu side could map it writable, so this is not a real worry.
Make sense.
But IMHO it's still safer to reduce the permission (RO->NP) than to increase it (RO->RW) when faithfully emulating bare-metal behavior is impossible, especially when there is no real usage counting on it. 😊