This is to add Intel VT-d nested translation based on the IOMMUFD nesting infrastructure. Per the iommufd nesting infrastructure series [1], the iommu core supports new ops to report iommu hardware information, allocate domains with user data, and invalidate the stage-1 IOTLB when a mapping is changed in the stage-1 page table. The data required in these three paths is vendor-specific, so:
1) IOMMU_HWPT_TYPE_VTD_S1 is defined for the Intel VT-d stage-1 page table; it is used in the stage-1 domain allocation and IOTLB syncing paths. struct iommu_hwpt_vtd_s1 is defined to pass user_data for Intel VT-d stage-1 domain allocation, and struct iommu_hwpt_vtd_s1_invalidate is defined to pass the data for Intel VT-d stage-1 IOTLB invalidation.
2) IOMMU_HW_INFO_TYPE_INTEL_VTD and struct iommu_hw_info_vtd are defined to report iommu hardware information for Intel VT-d.
With the above IOMMUFD extensions, the Intel iommu driver implements the three paths to support nested translation.
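As a rough illustration of the first path, a VMM would describe the guest stage-1 page table to the kernel with the new structure roughly as below. This is a minimal sketch only: the hwpt-allocation ioctl plumbing that carries this data belongs to the iommufd nesting series [1] and is not shown, and the helper name and the GPA/width values are made up for the example.

#include <string.h>
#include <linux/iommufd.h>

/* Hypothetical helper: describe a 4-level guest stage-1 page table. */
static void fill_vtd_s1(struct iommu_hwpt_vtd_s1 *s1, __u64 guest_pgd_gpa)
{
	memset(s1, 0, sizeof(*s1));
	s1->flags      = IOMMU_VTD_S1_SRE;	/* allow supervisor requests */
	s1->pgtbl_addr = guest_pgd_gpa;		/* stage-1 pgd, in guest physical address */
	s1->addr_width = 48;			/* 4-level stage-1 page table */
	/*
	 * __reserved stays 0; the struct is then passed as the vendor
	 * data of the stage-1 hwpt allocation, with the stage-2 hwpt
	 * as its parent.
	 */
}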
The first Intel platform supporting nested translation is Sapphire Rapids which, unfortunately, has a hardware errata [2] requiring special treatment. The errata is hit when a stage-1 page table page (at any level) is located in a stage-2 read-only region. In that case the IOMMU hardware may ignore the stage-2 RO permission and still set the A/D bits in stage-1 page table entries during the page table walk.
A flag IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is introduced to report this errata to userspace. With that restriction, the user should either disable nested translation in favor of RO stage-2 mappings, or ensure there is no RO stage-2 mapping in order to enable nested translation.
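The expected userspace policy can be summarized in a small sketch. This assumes the errata bit is reported through a flags field of struct iommu_hw_info_vtd returned by the hw_info query; the field name is an assumption here since the cover letter does not spell the structure out.

#include <stdbool.h>
#include <linux/iommufd.h>

/*
 * Sketch: decide whether RO stage-2 mappings can coexist with nesting.
 * On Sapphire Rapids with errata 772415 they cannot, so the VMM must
 * either drop the RO mappings or give up nested translation.
 */
static bool ro_stage2_ok_with_nesting(const struct iommu_hw_info_vtd *vtd)
{
	return !(vtd->flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17);	/* 'flags' field assumed */
}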
The intel-iommu driver is armed with the necessary checks to prevent such a mix in patch 12 of this series.
QEMU currently does add RO mappings, though. The vfio agent in QEMU simply maps all valid regions in the GPA address space, which certainly includes RO regions, e.g. the vbios.
In reality we don't know of any usage that relies on DMA reads from the BIOS region. Hence finding a way to skip RO regions (e.g. via a discard manager) in QEMU might be an acceptable tradeoff. The actual change needs more discussion in the QEMU community; for now we just hacked QEMU for testing.
Complete code can be found in [3]; the corresponding QEMU code can be found in [4].
[1] https://lore.kernel.org/linux-iommu/20230724110406.107212-1-yi.l.liu@intel.c...
[2] https://www.intel.com/content/www/us/en/content-details/772415/content-detai...
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[4] https://github.com/yiliu1765/qemu/tree/wip/iommufd_rfcv4_nesting
Change log:

v4:
 - Remove ascii art tables (Jason)
 - Drop EMT (Tina, Jason)
 - Drop MTS and related definitions (Kevin)
 - Rename macro IOMMU_VTD_PGTBL_ to IOMMU_VTD_S1_ (Kevin)
 - Rename struct iommu_hwpt_intel_vtd_ to iommu_hwpt_vtd_ (Kevin)
 - Rename struct iommu_hwpt_intel_vtd to iommu_hwpt_vtd_s1 (Kevin)
 - Put the vendor specific hwpt alloc data structure before enum iommu_hwpt_type (Kevin)
 - Do not trim the higher page levels of the S2 domain in nested domain attachment, as the S2 domain may have been used independently (Kevin)
 - Remove the first-stage pgd check against the maximum address of s2_domain, as hw can check it anyhow. It would make sense to check every pfn used in the stage-1 page table, but that is not feasible, so just leave it to hw (Kevin)
 - Split the iotlb flush part into an order of uapi, helper and callback implementation (Kevin)
 - Change the policy of the VT-d nesting errata: disallow RO mappings once a domain is used as the parent domain of a nested domain. This removes the nested_users counting (Kevin)
 - Minor fix for "make htmldocs"
v3: https://lore.kernel.org/linux-iommu/20230511145110.27707-1-yi.l.liu@intel.co...
 - Further split the patches into an order of adding helpers for nested domain, iotlb flush, nested domain attachment and nested domain allocation callback, then report the hw_info to userspace.
 - Add batch support in cache invalidation from userspace
 - Disallow nested translation usage if RO mappings exist in the stage-2 domain, due to the errata on read-only mappings on the Sapphire Rapids platform.
v2: https://lore.kernel.org/linux-iommu/20230309082207.612346-1-yi.l.liu@intel.c...
 - The iommufd infrastructure is split out to be a separate series.
v1: https://lore.kernel.org/linux-iommu/20230209043153.14964-1-yi.l.liu@intel.co...
Regards, Yi Liu
Lu Baolu (5):
  iommu/vt-d: Extend dmar_domain to support nested domain
  iommu/vt-d: Add helper for nested domain allocation
  iommu/vt-d: Add helper to setup pasid nested translation
  iommu/vt-d: Add nested domain allocation
  iommu/vt-d: Disallow nesting on domains with read-only mappings
Yi Liu (7):
  iommufd: Add data structure for Intel VT-d stage-1 domain allocation
  iommu/vt-d: Make domain attach helpers to be extern
  iommu/vt-d: Set the nested domain to a device
  iommufd: Add data structure for Intel VT-d stage-1 cache invalidation
  iommu/vt-d: Make iotlb flush helpers to be extern
  iommu/vt-d: Add iotlb flush for nested domain
  iommu/vt-d: Implement hw_info for iommu capability query
 drivers/iommu/intel/Makefile |   2 +-
 drivers/iommu/intel/iommu.c  |  80 +++++++++++++---
 drivers/iommu/intel/iommu.h  |  55 +++++++++--
 drivers/iommu/intel/nested.c | 174 +++++++++++++++++++++++++++++++++++
 drivers/iommu/intel/pasid.c  | 127 +++++++++++++++++++++++++
 drivers/iommu/intel/pasid.h  |   2 +
 drivers/iommu/iommufd/main.c |   6 ++
 include/linux/iommu.h        |   1 +
 include/uapi/linux/iommufd.h | 124 +++++++++++++++++++++++++
 9 files changed, 549 insertions(+), 22 deletions(-)
 create mode 100644 drivers/iommu/intel/nested.c
This adds IOMMU_HWPT_TYPE_VTD_S1 for the stage-1 hw_pagetable of Intel VT-d and the corresponding data structure for the userspace-specified parameters of domain allocation.
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 include/uapi/linux/iommufd.h | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index ede822e5acbb..90b0d3f603a7 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -351,12 +351,45 @@ struct iommu_vfio_ioas { }; #define IOMMU_VFIO_IOAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VFIO_IOAS)
+/** + * enum iommu_hwpt_vtd_s1_flags - Intel VT-d stage-1 page table + * entry attributes + * @IOMMU_VTD_S1_SRE: Supervisor request + * @IOMMU_VTD_S1_EAFE: Extended access enable + * @IOMMU_VTD_S1_WPE: Write protect enable + */ +enum iommu_hwpt_vtd_s1_flags { + IOMMU_VTD_S1_SRE = 1 << 0, + IOMMU_VTD_S1_EAFE = 1 << 1, + IOMMU_VTD_S1_WPE = 1 << 2, +}; + +/** + * struct iommu_hwpt_vtd_s1 - Intel VT-d specific user-managed stage-1 + * page table info (IOMMU_HWPT_TYPE_VTD_S1) + * @flags: Combination of enum iommu_hwpt_vtd_s1_flags + * @pgtbl_addr: The base address of the stage-1 page table. + * @addr_width: The address width of the stage-1 page table + * @__reserved: Must be 0 + * + * VT-d specific data for creating a stage-1 page table that is used + * in nested translation. + */ +struct iommu_hwpt_vtd_s1 { + __aligned_u64 flags; + __aligned_u64 pgtbl_addr; + __u32 addr_width; + __u32 __reserved; +}; + /** * enum iommu_hwpt_type - IOMMU HWPT Type * @IOMMU_HWPT_TYPE_DEFAULT: default + * @IOMMU_HWPT_TYPE_VTD_S1: Intel VT-d stage-1 page table */ enum iommu_hwpt_type { IOMMU_HWPT_TYPE_DEFAULT, + IOMMU_HWPT_TYPE_VTD_S1, };
/**
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:13 PM
+/**
- struct iommu_hwpt_vtd_s1 - Intel VT-d specific user-managed stage-1
page table info (IOMMU_HWPT_TYPE_VTD_S1)
remove "specific user-managed"
- @flags: Combination of enum iommu_hwpt_vtd_s1_flags
- @pgtbl_addr: The base address of the stage-1 page table.
- @addr_width: The address width of the stage-1 page table
- @__reserved: Must be 0
- VT-d specific data for creating a stage-1 page table that is used
- in nested translation.
this doesn't add more info. could be removed.
From: Lu Baolu baolu.lu@linux.intel.com
The nested domain fields are exclusive with those used for a DMA remapping domain. Use a union to avoid memory waste.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/intel/iommu.h | 35 +++++++++++++++++++++++++++++------
 1 file changed, 29 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h index 1c5e1d88862b..565e6ae54d32 100644 --- a/drivers/iommu/intel/iommu.h +++ b/drivers/iommu/intel/iommu.h @@ -596,15 +596,38 @@ struct dmar_domain { spinlock_t lock; /* Protect device tracking lists */ struct list_head devices; /* all devices' list */
- struct dma_pte *pgd; /* virtual address */ - int gaw; /* max guest address width */ - - /* adjusted guest address width, 0 is level 2 30-bit */ - int agaw; int iommu_superpage;/* Level of superpages supported: 0 == 4KiB (no superpages), 1 == 2MiB, 2 == 1GiB, 3 == 512GiB, 4 == 1TiB */ - u64 max_addr; /* maximum mapped address */ + union { + /* DMA remapping domain */ + struct { + /* virtual address */ + struct dma_pte *pgd; + /* max guest address width */ + int gaw; + /* + * adjusted guest address width: + * 0: level 2 30-bit + * 1: level 3 39-bit + * 2: level 4 48-bit + * 3: level 5 57-bit + */ + int agaw; + /* maximum mapped address */ + u64 max_addr; + }; + + /* Nested user domain */ + struct { + /* parent page table which the user domain is nested on */ + struct dmar_domain *s2_domain; + /* user page table pointer (in GPA) */ + unsigned long s1_pgtbl; + /* page table attributes */ + struct iommu_hwpt_vtd_s1 s1_cfg; + }; + };
struct iommu_domain domain; /* generic domain data structure for iommu core */
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:13 PM
From: Lu Baolu baolu.lu@linux.intel.com
The nested domain fields are exclusive with those used for a DMA remapping domain. Use a union to avoid memory waste.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
From: Lu Baolu baolu.lu@linux.intel.com
This adds a helper for accepting user parameters and allocating a nested domain.
Signed-off-by: Jacob Pan jacob.jun.pan@linux.intel.com
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/intel/Makefile |  2 +-
 drivers/iommu/intel/iommu.h  |  2 ++
 drivers/iommu/intel/nested.c | 47 ++++++++++++++++++++++++++++++++++++
 3 files changed, 50 insertions(+), 1 deletion(-)
 create mode 100644 drivers/iommu/intel/nested.c
diff --git a/drivers/iommu/intel/Makefile b/drivers/iommu/intel/Makefile index 7af3b8a4f2a0..5dabf081a779 100644 --- a/drivers/iommu/intel/Makefile +++ b/drivers/iommu/intel/Makefile @@ -1,6 +1,6 @@ # SPDX-License-Identifier: GPL-2.0 obj-$(CONFIG_DMAR_TABLE) += dmar.o -obj-$(CONFIG_INTEL_IOMMU) += iommu.o pasid.o +obj-$(CONFIG_INTEL_IOMMU) += iommu.o pasid.o nested.o obj-$(CONFIG_DMAR_TABLE) += trace.o cap_audit.o obj-$(CONFIG_DMAR_PERF) += perf.o obj-$(CONFIG_INTEL_IOMMU_DEBUGFS) += debugfs.o diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h index 565e6ae54d32..b3edad7359c9 100644 --- a/drivers/iommu/intel/iommu.h +++ b/drivers/iommu/intel/iommu.h @@ -858,6 +858,8 @@ void *alloc_pgtable_page(int node, gfp_t gfp); void free_pgtable_page(void *vaddr); void iommu_flush_write_buffer(struct intel_iommu *iommu); struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8 *devfn); +struct iommu_domain *intel_nested_domain_alloc(struct iommu_domain *s2_domain, + const union iommu_domain_user_data *user_data);
#ifdef CONFIG_INTEL_IOMMU_SVM void intel_svm_check(struct intel_iommu *iommu); diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c new file mode 100644 index 000000000000..80a64ba87d46 --- /dev/null +++ b/drivers/iommu/intel/nested.c @@ -0,0 +1,47 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * nested.c - nested mode translation support + * + * Copyright (C) 2023 Intel Corporation + * + * Author: Lu Baolu baolu.lu@linux.intel.com + * Jacob Pan jacob.jun.pan@linux.intel.com + */ + +#define pr_fmt(fmt) "DMAR: " fmt + +#include <linux/iommu.h> + +#include "iommu.h" + +static void intel_nested_domain_free(struct iommu_domain *domain) +{ + kfree(to_dmar_domain(domain)); +} + +static const struct iommu_domain_ops intel_nested_domain_ops = { + .free = intel_nested_domain_free, +}; + +struct iommu_domain *intel_nested_domain_alloc(struct iommu_domain *s2_domain, + const union iommu_domain_user_data *user_data) +{ + const struct iommu_hwpt_vtd_s1 *vtd = (struct iommu_hwpt_vtd_s1 *)user_data; + struct dmar_domain *domain; + + domain = kzalloc(sizeof(*domain), GFP_KERNEL_ACCOUNT); + if (!domain) + return NULL; + + domain->use_first_level = true; + domain->s2_domain = to_dmar_domain(s2_domain); + domain->s1_pgtbl = vtd->pgtbl_addr; + domain->s1_cfg = *vtd; + domain->domain.ops = &intel_nested_domain_ops; + domain->domain.type = IOMMU_DOMAIN_NESTED; + INIT_LIST_HEAD(&domain->devices); + spin_lock_init(&domain->lock); + xa_init(&domain->iommu_array); + + return &domain->domain; +}
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:13 PM
From: Lu Baolu baolu.lu@linux.intel.com
This adds a helper for accepting user parameters and allocating a nested domain.
Signed-off-by: Jacob Pan jacob.jun.pan@linux.intel.com
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
From: Lu Baolu baolu.lu@linux.intel.com
The configurations are passed in from the user when the user domain is allocated. This helper interprets these configurations according to the data structure defined in uapi/linux/iommufd.h. The EINVAL error will be returned if any of the configurations is not compatible with the hardware capabilities; the caller can then retry with another compatible user domain. The encoding of the fields of each pasid entry is defined in section 9.6 of the VT-d spec.
Signed-off-by: Jacob Pan jacob.jun.pan@linux.intel.com
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/intel/pasid.c | 127 ++++++++++++++++++++++++++++++++++++
 drivers/iommu/intel/pasid.h |   2 +
 2 files changed, 129 insertions(+)
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c index c5d479770e12..af9cfd2d5c52 100644 --- a/drivers/iommu/intel/pasid.c +++ b/drivers/iommu/intel/pasid.c @@ -335,6 +335,15 @@ static inline void pasid_set_fault_enable(struct pasid_entry *pe) pasid_set_bits(&pe->val[0], 1 << 1, 0); }
+/* + * Setup the SRE(Supervisor Request Enable) field (Bit 128) of a + * scalable mode PASID entry. + */ +static inline void pasid_set_sre(struct pasid_entry *pe) +{ + pasid_set_bits(&pe->val[2], 1 << 0, 1); +} + /* * Setup the WPE(Write Protect Enable) field (Bit 132) of a * scalable mode PASID entry. @@ -402,6 +411,15 @@ pasid_set_flpm(struct pasid_entry *pe, u64 value) pasid_set_bits(&pe->val[2], GENMASK_ULL(3, 2), value << 2); }
+/* + * Setup the Extended Access Flag Enable (EAFE) field (Bit 135) + * of a scalable mode PASID entry. + */ +static inline void pasid_set_eafe(struct pasid_entry *pe) +{ + pasid_set_bits(&pe->val[2], 1 << 7, 1 << 7); +} + static void pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu, u16 did, u32 pasid) @@ -713,3 +731,112 @@ void intel_pasid_setup_page_snoop_control(struct intel_iommu *iommu, if (!cap_caching_mode(iommu->cap)) devtlb_invalidation_with_pasid(iommu, dev, pasid); } + +/** + * intel_pasid_setup_nested() - Set up PASID entry for nested translation. + * This could be used for nested translation based vIOMMU. e.g. guest IOVA + * and guest shared virtual address. In this case, the first level page + * tables are used for GVA/GIOVA-GPA translation in the guest, second level + * page tables are used for GPA-HPA translation. + * + * @iommu: IOMMU which the device belong to + * @dev: Device to be set up for translation + * @pasid: PASID to be programmed in the device PASID table + * @domain: User stage-1 domain nested on a s2 domain + */ +int intel_pasid_setup_nested(struct intel_iommu *iommu, struct device *dev, + u32 pasid, struct dmar_domain *domain) +{ + struct iommu_hwpt_vtd_s1 *s1_cfg = &domain->s1_cfg; + pgd_t *s1_gpgd = (pgd_t *)(uintptr_t)domain->s1_pgtbl; + struct dmar_domain *s2_domain = domain->s2_domain; + u16 did = domain_id_iommu(domain, iommu); + struct dma_pte *pgd = s2_domain->pgd; + struct pasid_entry *pte; + + if (!ecap_nest(iommu->ecap)) { + pr_err_ratelimited("%s: No nested translation support\n", + iommu->name); + return -ENODEV; + } + + /* + * Address width should match in two dimensions: CPU vs. IOMMU, + * guest vs. host. + */ + switch (s1_cfg->addr_width) { + case ADDR_WIDTH_4LEVEL: + break; +#ifdef CONFIG_X86 + case ADDR_WIDTH_5LEVEL: + if (!cpu_feature_enabled(X86_FEATURE_LA57) || + !cap_fl5lp_support(iommu->cap)) { + dev_err_ratelimited(dev, + "5-level paging not supported\n"); + return -EINVAL; + } + break; +#endif + default: + dev_err_ratelimited(dev, "Invalid guest address width %d\n", + s1_cfg->addr_width); + return -EINVAL; + } + + if ((s1_cfg->flags & IOMMU_VTD_S1_SRE) && !ecap_srs(iommu->ecap)) { + pr_err_ratelimited("No supervisor request support on %s\n", + iommu->name); + return -EINVAL; + } + + if ((s1_cfg->flags & IOMMU_VTD_S1_EAFE) && !ecap_eafs(iommu->ecap)) { + pr_err_ratelimited("No extended access flag support on %s\n", + iommu->name); + return -EINVAL; + } + + if (s2_domain->agaw > iommu->agaw) { + pr_err_ratelimited("Incompatible agaw %s\n", iommu->name); + return -EINVAL; + } + + spin_lock(&iommu->lock); + pte = intel_pasid_get_entry(dev, pasid); + if (!pte) { + spin_unlock(&iommu->lock); + return -ENODEV; + } + if (pasid_pte_is_present(pte)) { + spin_unlock(&iommu->lock); + return -EBUSY; + } + + pasid_clear_entry(pte); + + if (s1_cfg->addr_width == ADDR_WIDTH_5LEVEL) + pasid_set_flpm(pte, 1); + + pasid_set_flptr(pte, (uintptr_t)s1_gpgd); + + if (s1_cfg->flags & IOMMU_VTD_S1_SRE) { + pasid_set_sre(pte); + if (s1_cfg->flags & IOMMU_VTD_S1_WPE) + pasid_set_wpe(pte); + } + + if (s1_cfg->flags & IOMMU_VTD_S1_EAFE) + pasid_set_eafe(pte); + + pasid_set_slptr(pte, virt_to_phys(pgd)); + pasid_set_fault_enable(pte); + pasid_set_domain_id(pte, did); + pasid_set_address_width(pte, s2_domain->agaw); + pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap)); + pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED); + pasid_set_present(pte); + spin_unlock(&iommu->lock); + + pasid_flush_caches(iommu, pte, pasid, did); + + return 0; +} 
diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h index d6b7d21244b1..864b12848392 100644 --- a/drivers/iommu/intel/pasid.h +++ b/drivers/iommu/intel/pasid.h @@ -111,6 +111,8 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu, int intel_pasid_setup_pass_through(struct intel_iommu *iommu, struct dmar_domain *domain, struct device *dev, u32 pasid); +int intel_pasid_setup_nested(struct intel_iommu *iommu, struct device *dev, + u32 pasid, struct dmar_domain *domain); void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev, u32 pasid, bool fault_ignore);
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:13 PM
+/**
- intel_pasid_setup_nested() - Set up PASID entry for nested translation.
- This could be used for nested translation based vIOMMU. e.g. guest IOVA
s/could be/is/
- and guest shared virtual address. In this case, the first level page
- tables are used for GVA/GIOVA-GPA translation in the guest, second level
- page tables are used for GPA-HPA translation.
let's be consistent on using stage-1/stage-2
btw the convention is to have 1-line summary, then the list of parameters followed by detail explanation of the function.
- @iommu: IOMMU which the device belong to
- @dev: Device to be set up for translation
- @pasid: PASID to be programmed in the device PASID table
- @domain: User stage-1 domain nested on a s2 domain
- */
+int intel_pasid_setup_nested(struct intel_iommu *iommu, struct device *dev,
u32 pasid, struct dmar_domain *domain)
+{
- struct iommu_hwpt_vtd_s1 *s1_cfg = &domain->s1_cfg;
- pgd_t *s1_gpgd = (pgd_t *)(uintptr_t)domain->s1_pgtbl;
- struct dmar_domain *s2_domain = domain->s2_domain;
- u16 did = domain_id_iommu(domain, iommu);
- struct dma_pte *pgd = s2_domain->pgd;
- struct pasid_entry *pte;
- if (!ecap_nest(iommu->ecap)) {
pr_err_ratelimited("%s: No nested translation support\n",
iommu->name);
return -ENODEV;
- }
-EINVAL
- if (s2_domain->agaw > iommu->agaw) {
pr_err_ratelimited("Incompatible agaw %s\n", iommu-
name);
return -EINVAL;
- }
there is a duplicated check in intel_nested_attach_dev().
- pasid_set_slptr(pte, virt_to_phys(pgd));
- pasid_set_fault_enable(pte);
- pasid_set_domain_id(pte, did);
- pasid_set_address_width(pte, s2_domain->agaw);
- pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
- pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED);
- pasid_set_present(pte);
- spin_unlock(&iommu->lock);
this lacks handling of force_snooping
On 2023/8/2 15:10, Tian, Kevin wrote:
- pasid_set_slptr(pte, virt_to_phys(pgd));
- pasid_set_fault_enable(pte);
- pasid_set_domain_id(pte, did);
- pasid_set_address_width(pte, s2_domain->agaw);
- pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
- pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED);
- pasid_set_present(pte);
- spin_unlock(&iommu->lock);
this lacks handling of force_snooping
If stage-2 domain has force_snooping attribute set, then we should set the bit field in the pasid entry, right?
How about below additional change?
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 19ac4084913b..86db81ec91f1 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -827,6 +827,9 @@ int intel_pasid_setup_nested(struct intel_iommu *iommu, struct device *dev,
 	if (s1_cfg->flags & IOMMU_VTD_S1_EAFE)
 		pasid_set_eafe(pte);
+	if (s2_domain>force_snooping)
+		pasid_set_pgsnp(pte);
 	pasid_set_slptr(pte, virt_to_phys(pgd));
 	pasid_set_fault_enable(pte);
 	pasid_set_domain_id(pte, did);
Best regards, baolu
On 2023/8/2 16:09, Baolu Lu wrote:
On 2023/8/2 15:10, Tian, Kevin wrote:
+	pasid_set_slptr(pte, virt_to_phys(pgd));
+	pasid_set_fault_enable(pte);
+	pasid_set_domain_id(pte, did);
+	pasid_set_address_width(pte, s2_domain->agaw);
+	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
+	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED);
+	pasid_set_present(pte);
+	spin_unlock(&iommu->lock);
this lacks handling of force_snooping
If stage-2 domain has force_snooping attribute set, then we should set the bit field in the pasid entry, right?
How about below additional change?
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 19ac4084913b..86db81ec91f1 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -827,6 +827,9 @@ int intel_pasid_setup_nested(struct intel_iommu *iommu, struct device *dev,
 	if (s1_cfg->flags & IOMMU_VTD_S1_EAFE)
 		pasid_set_eafe(pte);
+	if (s2_domain>force_snooping)
+	if (s2_domain->force_snooping)
Sorry for typo.
+		pasid_set_pgsnp(pte);
 	pasid_set_slptr(pte, virt_to_phys(pgd));
 	pasid_set_fault_enable(pte);
 	pasid_set_domain_id(pte, did);
Best regards, baolu
From: Baolu Lu baolu.lu@linux.intel.com Sent: Wednesday, August 2, 2023 4:12 PM
On 2023/8/2 16:09, Baolu Lu wrote:
On 2023/8/2 15:10, Tian, Kevin wrote:
+	pasid_set_slptr(pte, virt_to_phys(pgd));
+	pasid_set_fault_enable(pte);
+	pasid_set_domain_id(pte, did);
+	pasid_set_address_width(pte, s2_domain->agaw);
+	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
+	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED);
+	pasid_set_present(pte);
+	spin_unlock(&iommu->lock);
this lacks handling of force_snooping
If stage-2 domain has force_snooping attribute set, then we should set the bit field in the pasid entry, right?
How about below additional change?
yes
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 19ac4084913b..86db81ec91f1 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -827,6 +827,9 @@ int intel_pasid_setup_nested(struct intel_iommu *iommu, struct device *dev,
 	if (s1_cfg->flags & IOMMU_VTD_S1_EAFE)
 		pasid_set_eafe(pte);
+	if (s2_domain>force_snooping)
if (s2_domain->force_snooping)
Sorry for typo.
+		pasid_set_pgsnp(pte);
 	pasid_set_slptr(pte, virt_to_phys(pgd));
 	pasid_set_fault_enable(pte);
 	pasid_set_domain_id(pte, did);
Best regards, baolu
On 2023/8/2 15:10, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:13 PM
+/**
- intel_pasid_setup_nested() - Set up PASID entry for nested translation.
- This could be used for nested translation based vIOMMU. e.g. guest IOVA
s/could be/is/
Ack.
- and guest shared virtual address. In this case, the first level page
- tables are used for GVA/GIOVA-GPA translation in the guest, second level
- page tables are used for GPA-HPA translation.
let's be consistent on using stage-1/stage-2
btw the convention is to have 1-line summary, then the list of parameters followed by detail explanation of the function.
This patch just follows the existing code style in this file. A separate patch is needed to clean this up.
- @iommu: IOMMU which the device belong to
- @dev: Device to be set up for translation
- @pasid: PASID to be programmed in the device PASID table
- @domain: User stage-1 domain nested on a s2 domain
- */
+int intel_pasid_setup_nested(struct intel_iommu *iommu, struct device *dev,
u32 pasid, struct dmar_domain *domain)
+{
- struct iommu_hwpt_vtd_s1 *s1_cfg = &domain->s1_cfg;
- pgd_t *s1_gpgd = (pgd_t *)(uintptr_t)domain->s1_pgtbl;
- struct dmar_domain *s2_domain = domain->s2_domain;
- u16 did = domain_id_iommu(domain, iommu);
- struct dma_pte *pgd = s2_domain->pgd;
- struct pasid_entry *pte;
- if (!ecap_nest(iommu->ecap)) {
pr_err_ratelimited("%s: No nested translation support\n",
iommu->name);
return -ENODEV;
- }
-EINVAL
This is in the attach domain path. -EINVAL has the special meaning of "this domain is not compatible with iommu for the device".
So here, I still think we should return -ENODEV and the caller doesn't need to retry anymore.
- if (s2_domain->agaw > iommu->agaw) {
pr_err_ratelimited("Incompatible agaw %s\n", iommu-
name);
return -EINVAL;
- }
there is a duplicated check in intel_nested_attach_dev().
Yeah, should be removed.
Best regards, baolu
From: Baolu Lu baolu.lu@linux.intel.com Sent: Thursday, August 3, 2023 11:13 AM
On 2023/8/2 15:10, Tian, Kevin wrote:
From: Liu, Yi Lyi.l.liu@intel.com Sent: Monday, July 24, 2023 7:13 PM
- if (!ecap_nest(iommu->ecap)) {
pr_err_ratelimited("%s: No nested translation support\n",
iommu->name);
return -ENODEV;
- }
-EINVAL
This is in the attach domain path. -EINVAL has the special meaning of "this domain is not compatible with iommu for the device".
So here, I still think we should return -ENODEV and the caller doesn't need to retry anymore.
You are right. I overlooked that this validation is for a device cap.
This makes the helpers visible to nested.c.
Suggested-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/intel/iommu.c | 17 +++++++----------
 drivers/iommu/intel/iommu.h |  8 ++++++++
 2 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 5c8c5cdc36cf..289e8c2417ad 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -277,7 +277,6 @@ static LIST_HEAD(dmar_satc_units); #define for_each_rmrr_units(rmrr) \ list_for_each_entry(rmrr, &dmar_rmrr_units, list)
-static void device_block_translation(struct device *dev); static void intel_iommu_domain_free(struct iommu_domain *domain);
int dmar_disabled = !IS_ENABLED(CONFIG_INTEL_IOMMU_DEFAULT_ON); @@ -555,7 +554,7 @@ static unsigned long domain_super_pgsize_bitmap(struct dmar_domain *domain) }
/* Some capabilities may be different across iommus */ -static void domain_update_iommu_cap(struct dmar_domain *domain) +void domain_update_iommu_cap(struct dmar_domain *domain) { domain_update_iommu_coherency(domain); domain->iommu_superpage = domain_update_iommu_superpage(domain, NULL); @@ -1732,8 +1731,7 @@ static struct dmar_domain *alloc_domain(unsigned int type) return domain; }
-static int domain_attach_iommu(struct dmar_domain *domain, - struct intel_iommu *iommu) +int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu) { struct iommu_domain_info *info, *curr; unsigned long ndomains; @@ -1782,8 +1780,7 @@ static int domain_attach_iommu(struct dmar_domain *domain, return ret; }
-static void domain_detach_iommu(struct dmar_domain *domain, - struct intel_iommu *iommu) +void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu) { struct iommu_domain_info *info;
@@ -3987,7 +3984,7 @@ static void dmar_remove_one_dev_info(struct device *dev) * all DMA requests without PASID from the device are blocked. If the page * table has been set, clean up the data structures. */ -static void device_block_translation(struct device *dev) +void device_block_translation(struct device *dev) { struct device_domain_info *info = dev_iommu_priv_get(dev); struct intel_iommu *iommu = info->iommu; @@ -4093,8 +4090,8 @@ static void intel_iommu_domain_free(struct iommu_domain *domain) domain_exit(to_dmar_domain(domain)); }
-static int prepare_domain_attach_device(struct iommu_domain *domain, - struct device *dev) +int prepare_domain_attach_device(struct iommu_domain *domain, + struct device *dev) { struct dmar_domain *dmar_domain = to_dmar_domain(domain); struct intel_iommu *iommu; @@ -4334,7 +4331,7 @@ static void domain_set_force_snooping(struct dmar_domain *domain) PASID_RID2PASID); }
-static bool intel_iommu_enforce_cache_coherency(struct iommu_domain *domain) +bool intel_iommu_enforce_cache_coherency(struct iommu_domain *domain) { struct dmar_domain *dmar_domain = to_dmar_domain(domain); unsigned long flags; diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h index b3edad7359c9..4b12166a9c3f 100644 --- a/drivers/iommu/intel/iommu.h +++ b/drivers/iommu/intel/iommu.h @@ -852,6 +852,14 @@ int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc, */ #define QI_OPT_WAIT_DRAIN BIT(0)
+int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu); +void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu); +void device_block_translation(struct device *dev); +int prepare_domain_attach_device(struct iommu_domain *domain, + struct device *dev); +bool intel_iommu_enforce_cache_coherency(struct iommu_domain *domain); +void domain_update_iommu_cap(struct dmar_domain *domain); + int dmar_ir_support(void);
void *alloc_pgtable_page(int node, gfp_t gfp);
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:13 PM
This makes the helpers visible to nested.c.
Suggested-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
This adds the helper for setting the nested domain to a device, hence enabling nested domain usage on Intel VT-d.
Signed-off-by: Jacob Pan jacob.jun.pan@linux.intel.com
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/intel/nested.c | 52 ++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)
diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c index 80a64ba87d46..98164894f22f 100644 --- a/drivers/iommu/intel/nested.c +++ b/drivers/iommu/intel/nested.c @@ -11,8 +11,58 @@ #define pr_fmt(fmt) "DMAR: " fmt
#include <linux/iommu.h> +#include <linux/pci.h> +#include <linux/pci-ats.h>
#include "iommu.h" +#include "pasid.h" + +static int intel_nested_attach_dev(struct iommu_domain *domain, + struct device *dev) +{ + struct device_domain_info *info = dev_iommu_priv_get(dev); + struct dmar_domain *dmar_domain = to_dmar_domain(domain); + struct intel_iommu *iommu = info->iommu; + unsigned long flags; + int ret = 0; + + if (info->domain) + device_block_translation(dev); + + if (iommu->agaw < dmar_domain->s2_domain->agaw) { + dev_err_ratelimited(dev, "Adjusted guest address width not compatible\n"); + return -ENODEV; + } + + /* Is s2_domain compatible with this IOMMU? */ + ret = prepare_domain_attach_device(&dmar_domain->s2_domain->domain, dev); + if (ret) { + dev_err_ratelimited(dev, "s2 domain is not compatible\n"); + return ret; + } + + ret = domain_attach_iommu(dmar_domain, iommu); + if (ret) { + dev_err_ratelimited(dev, "Failed to attach domain to iommu\n"); + return ret; + } + + ret = intel_pasid_setup_nested(iommu, dev, + PASID_RID2PASID, dmar_domain); + if (ret) { + domain_detach_iommu(dmar_domain, iommu); + dev_err_ratelimited(dev, "Failed to setup pasid entry\n"); + return ret; + } + + info->domain = dmar_domain; + spin_lock_irqsave(&dmar_domain->lock, flags); + list_add(&info->link, &dmar_domain->devices); + spin_unlock_irqrestore(&dmar_domain->lock, flags); + domain_update_iommu_cap(dmar_domain); + + return 0; +}
static void intel_nested_domain_free(struct iommu_domain *domain) { @@ -20,7 +70,9 @@ static void intel_nested_domain_free(struct iommu_domain *domain) }
static const struct iommu_domain_ops intel_nested_domain_ops = { + .attach_dev = intel_nested_attach_dev, .free = intel_nested_domain_free, + .enforce_cache_coherency = intel_iommu_enforce_cache_coherency, };
struct iommu_domain *intel_nested_domain_alloc(struct iommu_domain *s2_domain,
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:13 PM
+static int intel_nested_attach_dev(struct iommu_domain *domain,
struct device *dev)
+{
- struct device_domain_info *info = dev_iommu_priv_get(dev);
- struct dmar_domain *dmar_domain = to_dmar_domain(domain);
- struct intel_iommu *iommu = info->iommu;
- unsigned long flags;
- int ret = 0;
- if (info->domain)
device_block_translation(dev);
- if (iommu->agaw < dmar_domain->s2_domain->agaw) {
dev_err_ratelimited(dev, "Adjusted guest address width not
compatible\n");
return -ENODEV;
- }
this is the check duplicated with patch04.
- ret = domain_attach_iommu(dmar_domain, iommu);
- if (ret) {
dev_err_ratelimited(dev, "Failed to attach domain to
iommu\n");
return ret;
- }
[...]
- domain_update_iommu_cap(dmar_domain);
iommu_cap is already updated in domain_attach_iommu().
static const struct iommu_domain_ops intel_nested_domain_ops = {
- .attach_dev = intel_nested_attach_dev, .free = intel_nested_domain_free,
- .enforce_cache_coherency = intel_iommu_enforce_cache_coherency,
this is not required. enforce_cache_coherency() will be called on parent hwpt when it's being created. patch04 should check parent's force_snooping to set pgsnp in the pasid entry.
As Jason explained it should be done only for kernel owned page table.
On 2023/8/2 15:22, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:13 PM
+static int intel_nested_attach_dev(struct iommu_domain *domain,
struct device *dev)
+{
- struct device_domain_info *info = dev_iommu_priv_get(dev);
- struct dmar_domain *dmar_domain = to_dmar_domain(domain);
- struct intel_iommu *iommu = info->iommu;
- unsigned long flags;
- int ret = 0;
- if (info->domain)
device_block_translation(dev);
- if (iommu->agaw < dmar_domain->s2_domain->agaw) {
dev_err_ratelimited(dev, "Adjusted guest address width not
compatible\n");
return -ENODEV;
- }
this is the check duplicated with patch04.
Ack. No need for a check here.
- ret = domain_attach_iommu(dmar_domain, iommu);
- if (ret) {
dev_err_ratelimited(dev, "Failed to attach domain to
iommu\n");
return ret;
- }
[...]
- domain_update_iommu_cap(dmar_domain);
iommu_cap is already updated in domain_attach_iommu().
Ack. We will eventually remove this helper when every domain is allocated for a specific device.
static const struct iommu_domain_ops intel_nested_domain_ops = {
- .attach_dev = intel_nested_attach_dev, .free = intel_nested_domain_free,
- .enforce_cache_coherency = intel_iommu_enforce_cache_coherency,
this is not required. enforce_cache_coherency() will be called on parent hwpt when it's being created. patch04 should check parent's force_snooping to set pgsnp in the pasid entry.
As Jason explained it should be done only for kernel owned page table.
Ack and thanks!
Best regards, baolu
This adds the data structure for flushing the IOTLB of the nested domain allocated with the IOMMU_HWPT_TYPE_VTD_S1 type.
The cache invalidation path is a performance path, so it's better to avoid memory allocation there. To achieve this, the path reuses the ucmd_buffer to copy user data, hence the new data structures are added to the ucmd_buffer union to avoid overflow.
This only supports invalidating the IOTLB, but not the device-TLB, as device-TLB invalidation will be covered automatically by the IOTLB invalidation if the underlying IOMMU driver has enabled ATS for the affected device.
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/iommufd/main.c |  6 ++++
 include/uapi/linux/iommufd.h | 58 ++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+)
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index d49837397dfa..b927ace7f3af 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -485,6 +485,12 @@ union ucmd_buffer { #ifdef CONFIG_IOMMUFD_TEST struct iommu_test_cmd test; #endif + /* + * hwpt_type specific structure used in the cache invalidation + * path. + */ + struct iommu_hwpt_vtd_s1_invalidate vtd; + struct iommu_hwpt_vtd_s1_invalidate_desc req_vtd; };
struct iommufd_ioctl_op { diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index 90b0d3f603a7..2c1241448c87 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -523,6 +523,64 @@ struct iommu_resv_iova_ranges { }; #define IOMMU_RESV_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_RESV_IOVA_RANGES)
+/** + * enum iommu_hwpt_vtd_s1_invalidate_flags - Flags for Intel VT-d + * stage-1 cache invalidation + * @IOMMU_VTD_QI_FLAGS_LEAF: The LEAF flag indicates whether only the + * leaf PTE caching needs to be invalidated + * and other paging structure caches can be + * preserved. + */ +enum iommu_hwpt_vtd_s1_invalidate_flags { + IOMMU_VTD_QI_FLAGS_LEAF = 1 << 0, +}; + +/** + * struct iommu_hwpt_vtd_s1_invalidate_desc - Intel VT-d stage-1 cache + * invalidation descriptor + * @addr: The start address of the addresses to be invalidated. + * @npages: Number of contiguous 4K pages to be invalidated. + * @flags: Combination of enum iommu_hwpt_vtd_s1_invalidate_flags + * @__reserved: Must be 0 + * + * The Intel VT-d specific invalidation data for user-managed stage-1 cache + * invalidation under nested translation. Userspace uses this structure to + * tell host about the impacted caches after modifying the stage-1 page table. + * + * Invalidating all the caches related to the hw_pagetable by setting @addr + * to be 0 and @npages to be __aligned_u64(-1). + */ +struct iommu_hwpt_vtd_s1_invalidate_desc { + __aligned_u64 addr; + __aligned_u64 npages; + __u32 flags; + __u32 __reserved; +}; + +/** + * struct iommu_hwpt_vtd_s1_invalidate - Intel VT-d cache invalidation + * (IOMMU_HWPT_TYPE_VTD_S1) + * @flags: Must be 0 + * @entry_size: Size in bytes of each cache invalidation request + * @entry_nr_uptr: User pointer to the number of invalidation requests. + * Kernel reads it to get the number of requests and + * updates the buffer with the number of requests that + * have been processed successfully. This pointer must + * point to a __u32 type of memory location. + * @inv_data_uptr: Pointer to the cache invalidation requests + * + * The Intel VT-d specific invalidation data for a set of cache invalidation + * requests. Kernel loops the requests one-by-one and stops when failure + * is encountered. The number of handled requests is reported to user by + * writing the buffer pointed by @entry_nr_uptr. + */ +struct iommu_hwpt_vtd_s1_invalidate { + __u32 flags; + __u32 entry_size; + __aligned_u64 entry_nr_uptr; + __aligned_u64 inv_data_uptr; +}; + /** * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE) * @size: sizeof(struct iommu_hwpt_invalidate)
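For reference, a minimal userspace sketch of the batching contract described above. The IOMMU_HWPT_INVALIDATE ioctl wrapper that carries struct iommu_hwpt_vtd_s1_invalidate comes from the iommufd nesting series [1] and is not shown; the addresses and counts are made up for the example.

#include <stdint.h>
#include <linux/iommufd.h>

static void build_invalidate_batch(void)
{
	struct iommu_hwpt_vtd_s1_invalidate_desc reqs[2] = {
		/* invalidate 16 pages of stage-1 mappings, leaf PTEs only */
		{ .addr = 0x100000, .npages = 16, .flags = IOMMU_VTD_QI_FLAGS_LEAF },
		/* invalidate all caches related to this hw_pagetable */
		{ .addr = 0, .npages = (__u64)-1 },
	};
	__u32 entry_nr = 2;	/* kernel writes back the number handled */
	struct iommu_hwpt_vtd_s1_invalidate inv = {
		.entry_size    = sizeof(reqs[0]),
		.entry_nr_uptr = (__u64)(uintptr_t)&entry_nr,
		.inv_data_uptr = (__u64)(uintptr_t)reqs,
	};

	/*
	 * 'inv' is then passed as the hwpt_type specific data of the
	 * IOMMU_HWPT_INVALIDATE ioctl; on return, entry_nr reports how
	 * many descriptors were processed before any failure.
	 */
	(void)inv;
}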
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:13 PM
This adds the data structure for flushing the IOTLB of the nested domain allocated with the IOMMU_HWPT_TYPE_VTD_S1 type.
The cache invalidation path is a performance path, so it's better to avoid memory allocation there. To achieve this, the path reuses the ucmd_buffer to copy user data, hence the new data structures are added to the ucmd_buffer union to avoid overflow.
this patch has nothing to do with ucmd_buffer
+/**
- struct iommu_hwpt_vtd_s1_invalidate - Intel VT-d cache invalidation
(IOMMU_HWPT_TYPE_VTD_S1)
- @flags: Must be 0
- @entry_size: Size in bytes of each cache invalidation request
- @entry_nr_uptr: User pointer to the number of invalidation requests.
Kernel reads it to get the number of requests and
updates the buffer with the number of requests that
have been processed successfully. This pointer must
point to a __u32 type of memory location.
- @inv_data_uptr: Pointer to the cache invalidation requests
- The Intel VT-d specific invalidation data for a set of cache invalidation
- requests. Kernel loops the requests one-by-one and stops when failure
- is encountered. The number of handled requests is reported to user by
- writing the buffer pointed by @entry_nr_uptr.
- */
+struct iommu_hwpt_vtd_s1_invalidate {
- __u32 flags;
- __u32 entry_size;
- __aligned_u64 entry_nr_uptr;
- __aligned_u64 inv_data_uptr;
+};
I wonder whether this array can be defined directly in the common struct iommu_hwpt_invalidate so there is no need for underlying iommu driver to further deal with user buffers, including various minsz/backward compat. check.
smmu may not require it by using a native queue format. But that could be considered as a special case of 1-entry array. With careful coding the added cost should be negligible.
Jason, your thought?
On Wed, Aug 02, 2023 at 07:41:05AM +0000, Tian, Kevin wrote:
+/**
- struct iommu_hwpt_vtd_s1_invalidate - Intel VT-d cache invalidation
(IOMMU_HWPT_TYPE_VTD_S1)
- @flags: Must be 0
- @entry_size: Size in bytes of each cache invalidation request
- @entry_nr_uptr: User pointer to the number of invalidation requests.
Kernel reads it to get the number of requests and
updates the buffer with the number of requests that
have been processed successfully. This pointer must
point to a __u32 type of memory location.
- @inv_data_uptr: Pointer to the cache invalidation requests
- The Intel VT-d specific invalidation data for a set of cache invalidation
- requests. Kernel loops the requests one-by-one and stops when failure
- is encountered. The number of handled requests is reported to user by
- writing the buffer pointed by @entry_nr_uptr.
- */
+struct iommu_hwpt_vtd_s1_invalidate {
- __u32 flags;
- __u32 entry_size;
- __aligned_u64 entry_nr_uptr;
- __aligned_u64 inv_data_uptr;
+};
I wonder whether this array can be defined directly in the common struct iommu_hwpt_invalidate so there is no need for underlying iommu driver to further deal with user buffers, including various minsz/backward compat. check.
You want to have an array and another chunk of data?
What is the array for? To do batching?
It means we have to allocate memory on this path, that doesn't seem like the right direction for a performance improvement..
Having the driver copy in a loop might be better
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Wednesday, August 2, 2023 9:48 PM
On Wed, Aug 02, 2023 at 07:41:05AM +0000, Tian, Kevin wrote:
+/**
- struct iommu_hwpt_vtd_s1_invalidate - Intel VT-d cache invalidation
(IOMMU_HWPT_TYPE_VTD_S1)
- @flags: Must be 0
- @entry_size: Size in bytes of each cache invalidation request
- @entry_nr_uptr: User pointer to the number of invalidation requests.
Kernel reads it to get the number of requests and
updates the buffer with the number of requests that
have been processed successfully. This pointer must
point to a __u32 type of memory location.
- @inv_data_uptr: Pointer to the cache invalidation requests
- The Intel VT-d specific invalidation data for a set of cache invalidation
- requests. Kernel loops the requests one-by-one and stops when
failure
- is encountered. The number of handled requests is reported to user
by
- writing the buffer pointed by @entry_nr_uptr.
- */
+struct iommu_hwpt_vtd_s1_invalidate {
- __u32 flags;
- __u32 entry_size;
- __aligned_u64 entry_nr_uptr;
- __aligned_u64 inv_data_uptr;
+};
I wonder whether this array can be defined directly in the common struct iommu_hwpt_invalidate so there is no need for underlying iommu driver to further deal with user buffers, including various minsz/backward compat. check.
You want to have an array and another chunk of data?
What is the array for? To do batching?
yes, it's for batching
It means we have to allocate memory on this path, that doesn't seem like the right direction for a performance improvement..
It reuses the ucmd_buffer to avoid memory allocation:
@@ -485,6 +485,12 @@ union ucmd_buffer { #ifdef CONFIG_IOMMUFD_TEST struct iommu_test_cmd test; #endif + /* + * hwpt_type specific structure used in the cache invalidation + * path. + */ + struct iommu_hwpt_vtd_s1_invalidate vtd; + struct iommu_hwpt_vtd_s1_invalidate_desc req_vtd; };
I don't quite like this way.
Having the driver copy in a loop might be better
Can you elaborate?
From: Tian, Kevin kevin.tian@intel.com Sent: Thursday, August 3, 2023 8:39 AM
From: Jason Gunthorpe jgg@nvidia.com Sent: Wednesday, August 2, 2023 9:48 PM
On Wed, Aug 02, 2023 at 07:41:05AM +0000, Tian, Kevin wrote:
+/**
- struct iommu_hwpt_vtd_s1_invalidate - Intel VT-d cache invalidation
(IOMMU_HWPT_TYPE_VTD_S1)
- @flags: Must be 0
- @entry_size: Size in bytes of each cache invalidation request
- @entry_nr_uptr: User pointer to the number of invalidation requests.
Kernel reads it to get the number of requests and
updates the buffer with the number of requests that
have been processed successfully. This pointer must
point to a __u32 type of memory location.
- @inv_data_uptr: Pointer to the cache invalidation requests
- The Intel VT-d specific invalidation data for a set of cache invalidation
- requests. Kernel loops the requests one-by-one and stops when
failure
- is encountered. The number of handled requests is reported to user
by
- writing the buffer pointed by @entry_nr_uptr.
- */
+struct iommu_hwpt_vtd_s1_invalidate {
- __u32 flags;
- __u32 entry_size;
- __aligned_u64 entry_nr_uptr;
- __aligned_u64 inv_data_uptr;
+};
I wonder whether this array can be defined directly in the common struct iommu_hwpt_invalidate so there is no need for underlying iommu driver to further deal with user buffers, including various minsz/backward compat. check.
You want to have an array and another chunk of data?
What is the array for? To do batching?
yes, it's for batching
It means we have to allocate memory on this path, that doesn't seem like the right direction for a performance improvement..
It reuses the ucmd_buffer to avoid memory allocation:
I guess your point is to copy each invalidation descriptor in the common layer and pass the descriptor to iommu driver. right?
@@ -485,6 +485,12 @@ union ucmd_buffer { #ifdef CONFIG_IOMMUFD_TEST struct iommu_test_cmd test; #endif
- /*
* hwpt_type specific structure used in the cache invalidation
* path.
*/
- struct iommu_hwpt_vtd_s1_invalidate vtd;
- struct iommu_hwpt_vtd_s1_invalidate_desc req_vtd;
};
I don't quite like this way.
This is because each descriptor is stored in the ucmd_buffer, so we need to put struct iommu_hwpt_vtd_s1_invalidate_desc here.
Having the driver copy in a loop might be better
Can you elaborate?
I think Jason means the way in patch 09.
Regards, Yi Liu
On Fri, Aug 04, 2023 at 01:04:57PM +0000, Liu, Yi L wrote:
Having the driver copy in a loop might be better
Can you elaborate?
I think Jason means the way in patch 09.
Yeah, you can't reuse the stack buffer for an array, so patch 9 copies each element uniquely.
This is more calls to copy_to_user, which has some cost
But we avoid a memory allocation
Patch 9 should not abuse the user_data, cast it to the inv_info and just put req on the stack:
struct iommu_hwpt_vtd_s1_invalidate *inv_info = user_data; struct iommu_hwpt_vtd_s1_invalidate_desc req;
But I'm not sure about this entry_size logic, what happens if the entry_size is larger than the kernel supports? I think it should fail..
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Friday, August 4, 2023 10:04 PM
On Fri, Aug 04, 2023 at 01:04:57PM +0000, Liu, Yi L wrote:
Having the driver copy in a loop might be better
Can you elaborate?
I think Jason means the way in patch 09.
Yeah, you can't reuse the stack buffer for an array, so patch 9 copies each element uniquely.
This is more calls to copy_to_user, which has some cost
But we avoid a memory allocation
Yes.
Patch 9 should not abuse the user_data, cast it to the inv_info and just put req on the stack:
struct iommu_hwpt_vtd_s1_invalidate *inv_info = user_data; struct iommu_hwpt_vtd_s1_invalidate_desc req;
Sure. The way in patch 09 is a bit tricky. The above is better and clearer. 😊
But I'm not sure about this entry_size logic, what happens if the entry_size is larger than the kernel supports? I think it should fail..
Yes, it should fail. It will fail in copy_struct_from_user() as I use it to copy struct iommu_hwpt_vtd_s1_invalidate_desc:
* -E2BIG: (@usize > @ksize) and there are non-zero trailing bytes in @src.
Regards, Yi Liu
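Putting the above together, the driver-side loop in patch 09 would end up roughly like below: user_data is cast to the iommu_hwpt_vtd_s1_invalidate pointer, each descriptor is copied onto the stack with copy_struct_from_user(), and an oversized entry_size with non-zero trailing bytes fails with -E2BIG. This is only a sketch of the agreed direction, not the final patch.

static int intel_nested_cache_invalidate_user(struct iommu_domain *domain,
					      void *user_data)
{
	struct iommu_hwpt_vtd_s1_invalidate *inv_info = user_data;
	struct iommu_hwpt_vtd_s1_invalidate_desc req;	/* on the stack, no ucmd_buffer reuse */
	u64 uptr = inv_info->inv_data_uptr;
	u64 nr_uptr = inv_info->entry_nr_uptr;
	u32 entry_nr, index;
	int ret = 0;

	if (get_user(entry_nr, (u32 __user *)u64_to_user_ptr(nr_uptr)))
		return -EFAULT;

	for (index = 0; index < entry_nr; index++) {
		/*
		 * copy_struct_from_user() returns -E2BIG when the user's
		 * entry_size is larger than the kernel's struct and the
		 * trailing bytes are non-zero, which covers the
		 * "entry_size too large" concern.
		 */
		ret = copy_struct_from_user(&req, sizeof(req),
					    u64_to_user_ptr(uptr + index * inv_info->entry_size),
					    inv_info->entry_size);
		if (ret)
			break;

		/* ... validate req and flush the IOTLB as in the original loop ... */
	}

	if (put_user(index, (u32 __user *)u64_to_user_ptr(nr_uptr)))
		return -EFAULT;

	return ret;
}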
This makes the helpers visible to nested.c.
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/intel/iommu.c | 10 +++++-----
 drivers/iommu/intel/iommu.h |  6 ++++++
 2 files changed, 11 insertions(+), 5 deletions(-)
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 289e8c2417ad..3119a79ebc83 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -1466,10 +1466,10 @@ static void iommu_flush_dev_iotlb(struct dmar_domain *domain, spin_unlock_irqrestore(&domain->lock, flags); }
-static void iommu_flush_iotlb_psi(struct intel_iommu *iommu, - struct dmar_domain *domain, - unsigned long pfn, unsigned int pages, - int ih, int map) +void iommu_flush_iotlb_psi(struct intel_iommu *iommu, + struct dmar_domain *domain, + unsigned long pfn, unsigned int pages, + int ih, int map) { unsigned int aligned_pages = __roundup_pow_of_two(pages); unsigned int mask = ilog2(aligned_pages); @@ -1542,7 +1542,7 @@ static inline void __mapping_notify_one(struct intel_iommu *iommu, iommu_flush_write_buffer(iommu); }
-static void intel_flush_iotlb_all(struct iommu_domain *domain) +void intel_flush_iotlb_all(struct iommu_domain *domain) { struct dmar_domain *dmar_domain = to_dmar_domain(domain); struct iommu_domain_info *info; diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h index 4b12166a9c3f..5b292213bcb8 100644 --- a/drivers/iommu/intel/iommu.h +++ b/drivers/iommu/intel/iommu.h @@ -859,6 +859,12 @@ int prepare_domain_attach_device(struct iommu_domain *domain, struct device *dev); bool intel_iommu_enforce_cache_coherency(struct iommu_domain *domain); void domain_update_iommu_cap(struct dmar_domain *domain); +void iommu_flush_iotlb_psi(struct intel_iommu *iommu, + struct dmar_domain *domain, + unsigned long pfn, unsigned int pages, + int ih, int map); +void intel_flush_iotlb_all(struct iommu_domain *domain); +
int dmar_ir_support(void);
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:14 PM
This makes the helpers visible to nested.c.
Signed-off-by: Yi Liu yi.l.liu@intel.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
This implements the .cache_invalidate_user() callback and sets the .cache_invalidate_user_data_len to support iotlb flush for nested domain.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/intel/nested.c | 63 ++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)
diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c index 98164894f22f..2739c0d7880d 100644 --- a/drivers/iommu/intel/nested.c +++ b/drivers/iommu/intel/nested.c @@ -69,8 +69,71 @@ static void intel_nested_domain_free(struct iommu_domain *domain) kfree(to_dmar_domain(domain)); }
+static void intel_nested_invalidate(struct device *dev, + struct dmar_domain *domain, + u64 addr, unsigned long npages) +{ + struct device_domain_info *info = dev_iommu_priv_get(dev); + struct intel_iommu *iommu = info->iommu; + + if (addr == 0 && npages == -1) + intel_flush_iotlb_all(&domain->domain); + else + iommu_flush_iotlb_psi(iommu, domain, + addr >> VTD_PAGE_SHIFT, + npages, 1, 0); +} + +static int intel_nested_cache_invalidate_user(struct iommu_domain *domain, + void *user_data) +{ + struct iommu_hwpt_vtd_s1_invalidate_desc *req = user_data; + struct iommu_hwpt_vtd_s1_invalidate *inv_info = user_data; + struct dmar_domain *dmar_domain = to_dmar_domain(domain); + unsigned int entry_size = inv_info->entry_size; + u64 uptr = inv_info->inv_data_uptr; + u64 nr_uptr = inv_info->entry_nr_uptr; + struct device_domain_info *info; + u32 entry_nr, index; + unsigned long flags; + int ret = 0; + + if (get_user(entry_nr, (uint32_t __user *)u64_to_user_ptr(nr_uptr))) + return -EFAULT; + + for (index = 0; index < entry_nr; index++) { + ret = copy_struct_from_user(req, sizeof(*req), + u64_to_user_ptr(uptr + index * entry_size), + entry_size); + if (ret) { + pr_err_ratelimited("Failed to fetch invalidation request\n"); + break; + } + + if (req->__reserved || (req->flags & ~IOMMU_VTD_QI_FLAGS_LEAF) || + !IS_ALIGNED(req->addr, VTD_PAGE_SIZE)) { + ret = -EINVAL; + break; + } + + spin_lock_irqsave(&dmar_domain->lock, flags); + list_for_each_entry(info, &dmar_domain->devices, link) + intel_nested_invalidate(info->dev, dmar_domain, + req->addr, req->npages); + spin_unlock_irqrestore(&dmar_domain->lock, flags); + } + + if (put_user(index, (uint32_t __user *)u64_to_user_ptr(nr_uptr))) + return -EFAULT; + + return ret; +} + static const struct iommu_domain_ops intel_nested_domain_ops = { .attach_dev = intel_nested_attach_dev, + .cache_invalidate_user = intel_nested_cache_invalidate_user, + .cache_invalidate_user_data_len = + sizeof(struct iommu_hwpt_vtd_s1_invalidate), .free = intel_nested_domain_free, .enforce_cache_coherency = intel_iommu_enforce_cache_coherency, };
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:14 PM
+static int intel_nested_cache_invalidate_user(struct iommu_domain *domain,
void *user_data)
+{
- struct iommu_hwpt_vtd_s1_invalidate_desc *req = user_data;
- struct iommu_hwpt_vtd_s1_invalidate *inv_info = user_data;
- struct dmar_domain *dmar_domain = to_dmar_domain(domain);
- unsigned int entry_size = inv_info->entry_size;
- u64 uptr = inv_info->inv_data_uptr;
- u64 nr_uptr = inv_info->entry_nr_uptr;
- struct device_domain_info *info;
- u32 entry_nr, index;
- unsigned long flags;
- int ret = 0;
- if (get_user(entry_nr, (uint32_t __user *)u64_to_user_ptr(nr_uptr)))
return -EFAULT;
- for (index = 0; index < entry_nr; index++) {
ret = copy_struct_from_user(req, sizeof(*req),
u64_to_user_ptr(uptr + index *
entry_size),
entry_size);
If continuing this direction then the driver should also check minsz etc. for struct iommu_hwpt_vtd_s1_invalidate and iommu_hwpt_vtd_s1_invalidate_desc since they are uAPI and subject to change.
if (ret) {
pr_err_ratelimited("Failed to fetch invalidation
request\n");
break;
}
if (req->__reserved || (req->flags &
~IOMMU_VTD_QI_FLAGS_LEAF) ||
!IS_ALIGNED(req->addr, VTD_PAGE_SIZE)) {
ret = -EINVAL;
break;
}
spin_lock_irqsave(&dmar_domain->lock, flags);
list_for_each_entry(info, &dmar_domain->devices, link)
intel_nested_invalidate(info->dev, dmar_domain,
req->addr, req->npages);
spin_unlock_irqrestore(&dmar_domain->lock, flags);
Disabling interrupt while invalidating iotlb is certainly unacceptable.
Actually there is no need to walk devices. Under dmar_domain there is already a list of attached iommu's.
On 2023/8/2 15:46, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:14 PM
+static int intel_nested_cache_invalidate_user(struct iommu_domain *domain,
void *user_data)
+{
- struct iommu_hwpt_vtd_s1_invalidate_desc *req = user_data;
- struct iommu_hwpt_vtd_s1_invalidate *inv_info = user_data;
- struct dmar_domain *dmar_domain = to_dmar_domain(domain);
- unsigned int entry_size = inv_info->entry_size;
- u64 uptr = inv_info->inv_data_uptr;
- u64 nr_uptr = inv_info->entry_nr_uptr;
- struct device_domain_info *info;
- u32 entry_nr, index;
- unsigned long flags;
- int ret = 0;
- if (get_user(entry_nr, (uint32_t __user *)u64_to_user_ptr(nr_uptr)))
return -EFAULT;
- for (index = 0; index < entry_nr; index++) {
ret = copy_struct_from_user(req, sizeof(*req),
u64_to_user_ptr(uptr + index *
entry_size),
entry_size);
If continuing this direction then the driver should also check minsz etc. for struct iommu_hwpt_vtd_s1_invalidate and iommu_hwpt_vtd_s1_invalidate_desc since they are uAPI and subject to change.
Agreed.
if (ret) {
pr_err_ratelimited("Failed to fetch invalidation
request\n");
break;
}
if (req->__reserved || (req->flags &
~IOMMU_VTD_QI_FLAGS_LEAF) ||
!IS_ALIGNED(req->addr, VTD_PAGE_SIZE)) {
ret = -EINVAL;
break;
}
spin_lock_irqsave(&dmar_domain->lock, flags);
list_for_each_entry(info, &dmar_domain->devices, link)
intel_nested_invalidate(info->dev, dmar_domain,
req->addr, req->npages);
spin_unlock_irqrestore(&dmar_domain->lock, flags);
Disabling interrupt while invalidating iotlb is certainly unacceptable.
Actually there is no need to walk devices. Under dmar_domain there is already a list of attached iommu's.
Walking devices is only necessary when invalidating the device TLB. For IOTLB invalidation, it only needs to know the iommus.
Best regards, baolu
From: Baolu Lu baolu.lu@linux.intel.com Sent: Thursday, August 3, 2023 11:25 AM
On 2023/8/2 15:46, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:14 PM
spin_lock_irqsave(&dmar_domain->lock, flags);
list_for_each_entry(info, &dmar_domain->devices, link)
intel_nested_invalidate(info->dev, dmar_domain,
req->addr, req->npages);
spin_unlock_irqrestore(&dmar_domain->lock, flags);
Disabling interrupt while invalidating iotlb is certainly unacceptable.
Actually there is no need to walk devices. Under dmar_domain there is already a list of attached iommu's.
Walking device is only necessary when invalidating device TLB. For iotlb invalidation, it only needs to know the iommu's.
Even for device TLB we should think about whether there is a better way to avoid disabling interrupts. It's a slow path, especially in a guest.
On 2023/8/3 12:00, Tian, Kevin wrote:
From: Baolu Lu baolu.lu@linux.intel.com Sent: Thursday, August 3, 2023 11:25 AM
On 2023/8/2 15:46, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:14 PM
spin_lock_irqsave(&dmar_domain->lock, flags);
list_for_each_entry(info, &dmar_domain->devices, link)
intel_nested_invalidate(info->dev, dmar_domain,
req->addr, req->npages);
spin_unlock_irqrestore(&dmar_domain->lock, flags);
Disabling interrupt while invalidating iotlb is certainly unacceptable.
Actually there is no need to walk devices. Under dmar_domain there is already a list of attached iommu's.
Walking device is only necessary when invalidating device TLB. For iotlb invalidation, it only needs to know the iommu's.
even for device tlb we may think whether there is any better way to avoid disabling interrupt. It's a slow path, especially in a guest.
I have tried this before. But some device drivers call iommu_unmap() in the interrupt critical path. :-( So we have a long way to go.
Best regards, baolu
From: Baolu Lu baolu.lu@linux.intel.com Sent: Thursday, August 3, 2023 12:06 PM
On 2023/8/3 12:00, Tian, Kevin wrote:
From: Baolu Lu baolu.lu@linux.intel.com Sent: Thursday, August 3, 2023 11:25 AM
On 2023/8/2 15:46, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:14 PM
spin_lock_irqsave(&dmar_domain->lock, flags);
list_for_each_entry(info, &dmar_domain->devices, link)
intel_nested_invalidate(info->dev, dmar_domain,
req->addr, req->npages);
spin_unlock_irqrestore(&dmar_domain->lock, flags);
Disabling interrupt while invalidating iotlb is certainly unacceptable.
Actually there is no need to walk devices. Under dmar_domain there is already a list of attached iommu's.
Walking device is only necessary when invalidating device TLB. For iotlb invalidation, it only needs to know the iommu's.
even for device tlb we may think whether there is any better way to avoid disabling interrupt. It's a slow path, especially in a guest.
I ever tried this. But some device drivers call iommu_unmap() in the interrupt critical path. :-( So we have a long way to go.
emmm... this path only comes from iommufd and the domain is user-managed. There won't be kernel drivers to call iommu_unmap() on such domain.
On 2023/8/3 12:13, Tian, Kevin wrote:
From: Baolu Lu baolu.lu@linux.intel.com Sent: Thursday, August 3, 2023 12:06 PM
On 2023/8/3 12:00, Tian, Kevin wrote:
From: Baolu Lu baolu.lu@linux.intel.com Sent: Thursday, August 3, 2023 11:25 AM
On 2023/8/2 15:46, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:14 PM
spin_lock_irqsave(&dmar_domain->lock, flags);
list_for_each_entry(info, &dmar_domain->devices, link)
intel_nested_invalidate(info->dev, dmar_domain,
req->addr, req->npages);
spin_unlock_irqrestore(&dmar_domain->lock, flags);
Disabling interrupt while invalidating iotlb is certainly unacceptable.
Actually there is no need to walk devices. Under dmar_domain there is already a list of attached iommu's.
Walking device is only necessary when invalidating device TLB. For iotlb invalidation, it only needs to know the iommu's.
even for device tlb we may think whether there is any better way to avoid disabling interrupt. It's a slow path, especially in a guest.
I ever tried this. But some device drivers call iommu_unmap() in the interrupt critical path. :-( So we have a long way to go.
emmm... this path only comes from iommufd and the domain is user-managed. There won't be kernel drivers to call iommu_unmap() on such domain.
Probably we can use a different lock for the nested domain and add a comment around the lock with the above explanation.
Best regards, baolu
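For what it's worth, a minimal sketch of that direction: a separate lock taken without disabling interrupts on the user-invalidation path. The cache_lock field is hypothetical and only illustrates the proposal, it is not part of the current struct dmar_domain.

/*
 * Hedged sketch: a hypothetical dmar_domain->cache_lock that only the
 * iommufd (user-managed nested domain) invalidation path takes, so
 * interrupts stay enabled. It would need a comment noting that no
 * interrupt-context user exists for such domains.
 */
spin_lock(&dmar_domain->cache_lock);
list_for_each_entry(info, &dmar_domain->devices, link)
        intel_nested_invalidate(info->dev, dmar_domain,
                                req->addr, req->npages);
spin_unlock(&dmar_domain->cache_lock);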
From: Tian, Kevin kevin.tian@intel.com Sent: Wednesday, August 2, 2023 3:47 PM
Subject: RE: [PATCH v4 09/12] iommu/vt-d: Add iotlb flush for nested domain
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:14 PM
+static int intel_nested_cache_invalidate_user(struct iommu_domain *domain,
void *user_data)
+{
- struct iommu_hwpt_vtd_s1_invalidate_desc *req = user_data;
- struct iommu_hwpt_vtd_s1_invalidate *inv_info = user_data;
- struct dmar_domain *dmar_domain = to_dmar_domain(domain);
- unsigned int entry_size = inv_info->entry_size;
- u64 uptr = inv_info->inv_data_uptr;
- u64 nr_uptr = inv_info->entry_nr_uptr;
- struct device_domain_info *info;
- u32 entry_nr, index;
- unsigned long flags;
- int ret = 0;
- if (get_user(entry_nr, (uint32_t __user *)u64_to_user_ptr(nr_uptr)))
return -EFAULT;
- for (index = 0; index < entry_nr; index++) {
ret = copy_struct_from_user(req, sizeof(*req),
u64_to_user_ptr(uptr + index *
entry_size),
entry_size);
If continuing this direction then the driver should also check minsz etc. for struct iommu_hwpt_vtd_s1_invalidate and iommu_hwpt_vtd_s1_invalidate_desc since they are uAPI and subject to change.
Then we need to define a size field in the uAPI data structure, copy the size first, and check minsz before going forward. How about the structures for hwpt alloc, like struct iommu_hwpt_vtd_s1? Should we check minsz for them as well?
Regards, Yi Liu
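For illustration only, a minimal sketch of the kind of driver-side check being discussed, slotted into the copy loop from the patch. It assumes a copy_struct_from_user()/offsetofend() convention; treating npages as the last mandatory member of the per-entry structure is an assumption, not the final uAPI.

/*
 * Hedged sketch: reject entries shorter than the minimum layout the driver
 * understands. copy_struct_from_user() already returns -E2BIG when the user
 * entry is larger than the kernel struct and the trailing bytes are non-zero,
 * so only the lower bound needs an explicit check here.
 */
if (entry_size < offsetofend(struct iommu_hwpt_vtd_s1_invalidate_desc, npages))
        return -EINVAL;

ret = copy_struct_from_user(req, sizeof(*req),
                            u64_to_user_ptr(uptr + index * entry_size),
                            entry_size);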
On Mon, Aug 07, 2023 at 03:08:29PM +0000, Liu, Yi L wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:14 PM
+static int intel_nested_cache_invalidate_user(struct iommu_domain *domain,
void *user_data)
+{
- struct iommu_hwpt_vtd_s1_invalidate_desc *req = user_data;
- struct iommu_hwpt_vtd_s1_invalidate *inv_info = user_data;
- struct dmar_domain *dmar_domain = to_dmar_domain(domain);
- unsigned int entry_size = inv_info->entry_size;
- u64 uptr = inv_info->inv_data_uptr;
- u64 nr_uptr = inv_info->entry_nr_uptr;
- struct device_domain_info *info;
- u32 entry_nr, index;
- unsigned long flags;
- int ret = 0;
- if (get_user(entry_nr, (uint32_t __user *)u64_to_user_ptr(nr_uptr)))
return -EFAULT;
- for (index = 0; index < entry_nr; index++) {
ret = copy_struct_from_user(req, sizeof(*req),
u64_to_user_ptr(uptr + index *
entry_size),
entry_size);
If continuing this direction then the driver should also check minsz etc. for struct iommu_hwpt_vtd_s1_invalidate and iommu_hwpt_vtd_s1_invalidate_desc since they are uAPI and subject to change.
Then needs to define size in the uapi data structure, and copy size first and check minsz before going forward. How about the structures for hwpt alloc like struct iommu_hwpt_vtd_s1? Should check minsz for them as well?
Assuming that every uAPI data structure needs a min_size, we can either add a structure holding all min_sizes like iommufd main.c or have another xx_min_len in iommu_/domain_ops.
Currently, we have the union of structures in iommu.h and also a xx_data_len in iommu_ops. So, neither of the two places seems optimal from my p.o.v... any suggestion?
Also, alloc allows data_len=0, so a min_size check will only be applied to the data_len > 0 case.
Thanks Nic
On Mon, Aug 07, 2023 at 08:12:37PM -0700, Nicolin Chen wrote:
On Mon, Aug 07, 2023 at 03:08:29PM +0000, Liu, Yi L wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:14 PM
+static int intel_nested_cache_invalidate_user(struct iommu_domain *domain,
void *user_data)
+{
- struct iommu_hwpt_vtd_s1_invalidate_desc *req = user_data;
- struct iommu_hwpt_vtd_s1_invalidate *inv_info = user_data;
- struct dmar_domain *dmar_domain = to_dmar_domain(domain);
- unsigned int entry_size = inv_info->entry_size;
- u64 uptr = inv_info->inv_data_uptr;
- u64 nr_uptr = inv_info->entry_nr_uptr;
- struct device_domain_info *info;
- u32 entry_nr, index;
- unsigned long flags;
- int ret = 0;
- if (get_user(entry_nr, (uint32_t __user *)u64_to_user_ptr(nr_uptr)))
return -EFAULT;
- for (index = 0; index < entry_nr; index++) {
ret = copy_struct_from_user(req, sizeof(*req),
u64_to_user_ptr(uptr + index *
entry_size),
entry_size);
If continuing this direction then the driver should also check minsz etc. for struct iommu_hwpt_vtd_s1_invalidate and iommu_hwpt_vtd_s1_invalidate_desc since they are uAPI and subject to change.
Then needs to define size in the uapi data structure, and copy size first and check minsz before going forward. How about the structures for hwpt alloc like struct iommu_hwpt_vtd_s1? Should check minsz for them as well?
Assuming that every uAPI data structure needs a min_size, we can either add a structure holding all min_sizes like iommufd main.c or have another xx_min_len in iommu_/domain_ops.
If the driver is doing the copy, it is OK for the driver to do the min_size check too.
Jason
On Tue, Aug 08, 2023 at 09:34:03AM -0300, Jason Gunthorpe wrote:
On Mon, Aug 07, 2023 at 08:12:37PM -0700, Nicolin Chen wrote:
On Mon, Aug 07, 2023 at 03:08:29PM +0000, Liu, Yi L wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:14 PM
+static int intel_nested_cache_invalidate_user(struct iommu_domain *domain,
void *user_data)
+{
- struct iommu_hwpt_vtd_s1_invalidate_desc *req = user_data;
- struct iommu_hwpt_vtd_s1_invalidate *inv_info = user_data;
- struct dmar_domain *dmar_domain = to_dmar_domain(domain);
- unsigned int entry_size = inv_info->entry_size;
- u64 uptr = inv_info->inv_data_uptr;
- u64 nr_uptr = inv_info->entry_nr_uptr;
- struct device_domain_info *info;
- u32 entry_nr, index;
- unsigned long flags;
- int ret = 0;
- if (get_user(entry_nr, (uint32_t __user *)u64_to_user_ptr(nr_uptr)))
return -EFAULT;
- for (index = 0; index < entry_nr; index++) {
ret = copy_struct_from_user(req, sizeof(*req),
u64_to_user_ptr(uptr + index *
entry_size),
entry_size);
If continuing this direction then the driver should also check minsz etc. for struct iommu_hwpt_vtd_s1_invalidate and iommu_hwpt_vtd_s1_invalidate_desc since they are uAPI and subject to change.
Then needs to define size in the uapi data structure, and copy size first and check minsz before going forward. How about the structures for hwpt alloc like struct iommu_hwpt_vtd_s1? Should check minsz for them as well?
Assuming that every uAPI data structure needs a min_size, we can either add a structure holding all min_sizes like iommufd main.c or have another xx_min_len in iommu_/domain_ops.
If driver is doing the copy it is OK that driver does the min_size check too
Ah, just realized my reply above was missing a context..
Yi and I have a concern that the core iommu_hwpt_alloc() and iommu_hwpt_cache_invalidate(), in the nesting series, copy data without checking the min_sizes. So, we are trying to add the missing piece into the next version but are not sure which way would be optimal.
It probably makes sense to add cache_invalidate_user_min_len next to the existing cache_invalidate_user_data_len. For the iommu_hwpt_alloc, we are missing a data_len, as the core just uses sizeof(union iommu_domain_user_data) when calling the copy_struct_from_user().
Perhaps we could add two pairs of data_len/min_len in the ops structs:
// iommu_ops
const size_t domain_alloc_user_data_len;     // for sanity & copy
const size_t domain_alloc_user_min_len;      // for sanity only
// iommu_domain_ops
const size_t cache_invalidate_user_data_len; // for sanity & copy
const size_t cache_invalidate_user_min_len;  // for sanity only
Thanks Nic
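As a sketch only, this is roughly how the core might consume such a pair before copying. The cache_invalidate_user_min_len field follows the proposal above and does not exist in the current ops; the helper name is made up for illustration.

/*
 * Hedged sketch: core-side bounds check ahead of the copy. The existing
 * cache_invalidate_user_data_len would stay the kernel-side size handed to
 * copy_struct_from_user(); the proposed min_len only guards the lower bound.
 */
static int iommufd_check_invalidate_len(const struct iommu_domain_ops *ops,
                                        size_t user_len)
{
        if (user_len < ops->cache_invalidate_user_min_len)
                return -EINVAL;
        return 0;
}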
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, August 9, 2023 1:42 AM
On Tue, Aug 08, 2023 at 09:34:03AM -0300, Jason Gunthorpe wrote:
On Mon, Aug 07, 2023 at 08:12:37PM -0700, Nicolin Chen wrote:
On Mon, Aug 07, 2023 at 03:08:29PM +0000, Liu, Yi L wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:14 PM
+static int intel_nested_cache_invalidate_user(struct iommu_domain *domain,
+                                              void *user_data)
+{
+        struct iommu_hwpt_vtd_s1_invalidate_desc *req = user_data;
+        struct iommu_hwpt_vtd_s1_invalidate *inv_info = user_data;
+        struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+        unsigned int entry_size = inv_info->entry_size;
+        u64 uptr = inv_info->inv_data_uptr;
+        u64 nr_uptr = inv_info->entry_nr_uptr;
+        struct device_domain_info *info;
+        u32 entry_nr, index;
+        unsigned long flags;
+        int ret = 0;
+
+        if (get_user(entry_nr, (uint32_t __user *)u64_to_user_ptr(nr_uptr)))
+                return -EFAULT;
+
+        for (index = 0; index < entry_nr; index++) {
+                ret = copy_struct_from_user(req, sizeof(*req),
+                                            u64_to_user_ptr(uptr + index * entry_size),
+                                            entry_size);
If continuing this direction then the driver should also check minsz etc. for struct iommu_hwpt_vtd_s1_invalidate and iommu_hwpt_vtd_s1_invalidate_desc since they are uAPI and subject to change.
Then needs to define size in the uapi data structure, and copy size first and check minsz before going forward. How about the structures for hwpt alloc like struct iommu_hwpt_vtd_s1? Should check minsz for them as well?
Assuming that every uAPI data structure needs a min_size, we can either add a structure holding all min_sizes like iommufd main.c or have another xx_min_len in iommu_/domain_ops.
If driver is doing the copy it is OK that driver does the min_size check too
Ah, just realized my reply above was missing a context..
Yi and I are having a concern that the core iommu_hpwt_alloc() and iommu_hwpt_cache_invalidate(), in the nesting series, copy data without checking the min_sizes. So, we are trying to add the missing piece into the next version but not sure which way could be optimal.
It probably makes sense to add cache_invalidate_user_min_len next to the existing cache_invalidate_user_data_len. For the iommu_hwpt_alloc, we are missing a data_len, as the core just uses sizeof(union iommu_domain_user_data) when calling the copy_struct_from_user().
Perhaps we could add two pairs of data_len/min_len in the ops structs:
// iommu_ops
const size_t domain_alloc_user_data_len;     // for sanity & copy
const size_t domain_alloc_user_min_len;      // for sanity only
// iommu_domain_ops
const size_t cache_invalidate_user_data_len; // for sanity & copy
const size_t cache_invalidate_user_min_len;  // for sanity only
What about creating a simple array to track type specific len in iommufd instead of adding more fields to iommu/domain_ops? anyway it's iommufd doing the copy and all the type specific structures are already defined in the uapi header.
and a similar example already exists in union ucmd_buffer which includes type specific structures to avoid memory copy...
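A sketch of the array-in-iommufd idea: per-type user data sizes tracked inside iommufd and indexed by the hwpt type, rather than new fields in iommu/domain_ops. Only the VT-d type from this series is shown, and the array itself is hypothetical.

/* Hedged sketch: type-indexed invalidation data sizes kept inside iommufd */
static const size_t iommufd_hwpt_invalidate_data_size[] = {
        [IOMMU_HWPT_TYPE_VTD_S1] = sizeof(struct iommu_hwpt_vtd_s1_invalidate),
};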
From: Tian, Kevin kevin.tian@intel.com Sent: Wednesday, August 9, 2023 4:23 PM
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, August 9, 2023 1:42 AM
On Tue, Aug 08, 2023 at 09:34:03AM -0300, Jason Gunthorpe wrote:
On Mon, Aug 07, 2023 at 08:12:37PM -0700, Nicolin Chen wrote:
On Mon, Aug 07, 2023 at 03:08:29PM +0000, Liu, Yi L wrote:
> From: Liu, Yi L yi.l.liu@intel.com
> Sent: Monday, July 24, 2023 7:14 PM
>
> +static int intel_nested_cache_invalidate_user(struct iommu_domain *domain,
> +                                              void *user_data)
> +{
> +        struct iommu_hwpt_vtd_s1_invalidate_desc *req = user_data;
> +        struct iommu_hwpt_vtd_s1_invalidate *inv_info = user_data;
> +        struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +        unsigned int entry_size = inv_info->entry_size;
> +        u64 uptr = inv_info->inv_data_uptr;
> +        u64 nr_uptr = inv_info->entry_nr_uptr;
> +        struct device_domain_info *info;
> +        u32 entry_nr, index;
> +        unsigned long flags;
> +        int ret = 0;
> +
> +        if (get_user(entry_nr, (uint32_t __user *)u64_to_user_ptr(nr_uptr)))
> +                return -EFAULT;
> +
> +        for (index = 0; index < entry_nr; index++) {
> +                ret = copy_struct_from_user(req, sizeof(*req),
> +                                            u64_to_user_ptr(uptr + index * entry_size),
> +                                            entry_size);
If continuing this direction then the driver should also check minsz etc. for struct iommu_hwpt_vtd_s1_invalidate and iommu_hwpt_vtd_s1_invalidate_desc since they are uAPI and subject to change.
Then needs to define size in the uapi data structure, and copy size first and check minsz before going forward. How about the structures for hwpt alloc like struct iommu_hwpt_vtd_s1? Should check minsz for them as well?
Assuming that every uAPI data structure needs a min_size, we can either add a structure holding all min_sizes like iommufd main.c or have another xx_min_len in iommu_/domain_ops.
If driver is doing the copy it is OK that driver does the min_size check too
Ah, just realized my reply above was missing a context..
Yi and I are having a concern that the core iommu_hpwt_alloc() and iommu_hwpt_cache_invalidate(), in the nesting series, copy data without checking the min_sizes. So, we are trying to add the missing piece into the next version but not sure which way could be optimal.
It probably makes sense to add cache_invalidate_user_min_len next to the existing cache_invalidate_user_data_len. For the iommu_hwpt_alloc, we are missing a data_len, as the core just uses sizeof(union iommu_domain_user_data) when calling the copy_struct_from_user().
Perhaps we could add two pairs of data_len/min_len in the ops structs:
// iommu_ops
const size_t domain_alloc_user_data_len;     // for sanity & copy
const size_t domain_alloc_user_min_len;      // for sanity only
// iommu_domain_ops
const size_t cache_invalidate_user_data_len; // for sanity & copy
const size_t cache_invalidate_user_min_len;  // for sanity only
What about creating a simple array to track type specific len in iommufd instead of adding more fields to iommu/domain_ops? anyway it's iommufd doing the copy and all the type specific structures are already defined in the uapi header.
Then we'd index the array with the type value, right? Seems like we defined such an array before for the lengths of the hwpt_alloc and invalidate structures, but we finally dropped it since the array may grow large with new types.
and a similar example already exists in union ucmd_buffer which includes type specific structures to avoid memory copy...
Not quite getting it here. ucmd_buffer is a union used to copy any user data. But here we want to check the minsz of the user data. Seems not related.
Regards, Yi Liu
From: Liu, Yi L yi.l.liu@intel.com Sent: Wednesday, August 9, 2023 4:50 PM
From: Tian, Kevin kevin.tian@intel.com Sent: Wednesday, August 9, 2023 4:23 PM
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, August 9, 2023 1:42 AM
On Tue, Aug 08, 2023 at 09:34:03AM -0300, Jason Gunthorpe wrote:
On Mon, Aug 07, 2023 at 08:12:37PM -0700, Nicolin Chen wrote:
On Mon, Aug 07, 2023 at 03:08:29PM +0000, Liu, Yi L wrote:
> > From: Liu, Yi L yi.l.liu@intel.com
> > Sent: Monday, July 24, 2023 7:14 PM
> >
> > +static int intel_nested_cache_invalidate_user(struct iommu_domain *domain,
> > +                                              void *user_data)
> > +{
> > +        struct iommu_hwpt_vtd_s1_invalidate_desc *req = user_data;
> > +        struct iommu_hwpt_vtd_s1_invalidate *inv_info = user_data;
> > +        struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> > +        unsigned int entry_size = inv_info->entry_size;
> > +        u64 uptr = inv_info->inv_data_uptr;
> > +        u64 nr_uptr = inv_info->entry_nr_uptr;
> > +        struct device_domain_info *info;
> > +        u32 entry_nr, index;
> > +        unsigned long flags;
> > +        int ret = 0;
> > +
> > +        if (get_user(entry_nr, (uint32_t __user *)u64_to_user_ptr(nr_uptr)))
> > +                return -EFAULT;
> > +
> > +        for (index = 0; index < entry_nr; index++) {
> > +                ret = copy_struct_from_user(req, sizeof(*req),
> > +                                            u64_to_user_ptr(uptr + index * entry_size),
> > +                                            entry_size);
> If continuing this direction then the driver should also check minsz etc. for struct iommu_hwpt_vtd_s1_invalidate and iommu_hwpt_vtd_s1_invalidate_desc since they are uAPI and subject to change.
Then needs to define size in the uapi data structure, and copy size first and check minsz before going forward. How about the structures for hwpt alloc like struct iommu_hwpt_vtd_s1? Should check minsz for them as well?
Assuming that every uAPI data structure needs a min_size, we can either add a structure holding all min_sizes like iommufd main.c or have another xx_min_len in iommu_/domain_ops.
If driver is doing the copy it is OK that driver does the min_size check too
Ah, just realized my reply above was missing a context..
Yi and I are having a concern that the core iommu_hpwt_alloc() and iommu_hwpt_cache_invalidate(), in the nesting series, copy data without checking the min_sizes. So, we are trying to add the missing piece into the next version but not sure which way could be optimal.
It probably makes sense to add cache_invalidate_user_min_len next to the existing cache_invalidate_user_data_len. For the iommu_hwpt_alloc, we are missing a data_len, as the core just uses sizeof(union iommu_domain_user_data) when calling the copy_struct_from_user().
Perhaps we could add two pairs of data_len/min_len in the ops structs:
// iommu_ops
const size_t domain_alloc_user_data_len;     // for sanity & copy
const size_t domain_alloc_user_min_len;      // for sanity only
// iommu_domain_ops
const size_t cache_invalidate_user_data_len; // for sanity & copy
const size_t cache_invalidate_user_min_len;  // for sanity only
What about creating a simple array to track type specific len in iommufd instead of adding more fields to iommu/domain_ops? anyway it's iommufd doing the copy and all the type specific structures are already defined in the uapi header.
Then index the array with type value, is it? Seems like we have defined such array before for the length of hwpt_alloc and invalidate structures. but finally we dropped it the array may grow largely per new types.
I'm not sure how many types iommufd will support in reality.
Just my personal feeling that having information contained in iommufd is cleaner than expanding iommu core abstraction to assist iommufd user buffer copy/verification.
and a similar example already exists in union ucmd_buffer which includes type specific structures to avoid memory copy...
Not quite get here. ucmd_buffer is a union used to copy any user data. But here we want to check the minsz of the the user data. Seems not related.
that's the example for recording vendor specific structure information in iommufd and it will also grow along with new added types.
From: Tian, Kevin kevin.tian@intel.com Sent: Wednesday, August 9, 2023 4:58 PM
From: Liu, Yi L yi.l.liu@intel.com Sent: Wednesday, August 9, 2023 4:50 PM
From: Tian, Kevin kevin.tian@intel.com Sent: Wednesday, August 9, 2023 4:23 PM
From: Nicolin Chen nicolinc@nvidia.com Sent: Wednesday, August 9, 2023 1:42 AM
On Tue, Aug 08, 2023 at 09:34:03AM -0300, Jason Gunthorpe wrote:
On Mon, Aug 07, 2023 at 08:12:37PM -0700, Nicolin Chen wrote:
On Mon, Aug 07, 2023 at 03:08:29PM +0000, Liu, Yi L wrote:
> > > From: Liu, Yi L yi.l.liu@intel.com
> > > Sent: Monday, July 24, 2023 7:14 PM
> > >
> > > +static int intel_nested_cache_invalidate_user(struct iommu_domain *domain,
> > > +                                              void *user_data)
> > > +{
> > > +        struct iommu_hwpt_vtd_s1_invalidate_desc *req = user_data;
> > > +        struct iommu_hwpt_vtd_s1_invalidate *inv_info = user_data;
> > > +        struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> > > +        unsigned int entry_size = inv_info->entry_size;
> > > +        u64 uptr = inv_info->inv_data_uptr;
> > > +        u64 nr_uptr = inv_info->entry_nr_uptr;
> > > +        struct device_domain_info *info;
> > > +        u32 entry_nr, index;
> > > +        unsigned long flags;
> > > +        int ret = 0;
> > > +
> > > +        if (get_user(entry_nr, (uint32_t __user *)u64_to_user_ptr(nr_uptr)))
> > > +                return -EFAULT;
> > > +
> > > +        for (index = 0; index < entry_nr; index++) {
> > > +                ret = copy_struct_from_user(req, sizeof(*req),
> > > +                                            u64_to_user_ptr(uptr + index * entry_size),
> > > +                                            entry_size);
> > If continuing this direction then the driver should also check minsz etc. for struct iommu_hwpt_vtd_s1_invalidate and iommu_hwpt_vtd_s1_invalidate_desc since they are uAPI and subject to change.
> Then needs to define size in the uapi data structure, and copy size first and check minsz before going forward. How about the structures for hwpt alloc like struct iommu_hwpt_vtd_s1? Should check minsz for them as well?
Assuming that every uAPI data structure needs a min_size, we can either add a structure holding all min_sizes like iommufd main.c or have another xx_min_len in iommu_/domain_ops.
If driver is doing the copy it is OK that driver does the min_size check too
Ah, just realized my reply above was missing a context..
Yi and I are having a concern that the core iommu_hpwt_alloc() and iommu_hwpt_cache_invalidate(), in the nesting series, copy data without checking the min_sizes. So, we are trying to add the missing piece into the next version but not sure which way could be optimal.
It probably makes sense to add cache_invalidate_user_min_len next to the existing cache_invalidate_user_data_len. For the iommu_hwpt_alloc, we are missing a data_len, as the core just uses sizeof(union iommu_domain_user_data) when calling the copy_struct_from_user().
Perhaps we could add two pairs of data_len/min_len in the ops structs:
// iommu_ops
const size_t domain_alloc_user_data_len;     // for sanity & copy
const size_t domain_alloc_user_min_len;      // for sanity only
// iommu_domain_ops
const size_t cache_invalidate_user_data_len; // for sanity & copy
const size_t cache_invalidate_user_min_len;  // for sanity only
What about creating a simple array to track type specific len in iommufd instead of adding more fields to iommu/domain_ops? anyway it's iommufd doing the copy and all the type specific structures are already defined in the uapi header.
Then index the array with type value, is it? Seems like we have defined such array before for the length of hwpt_alloc and invalidate structures. but finally we dropped it the array may grow largely per new types.
I'm not sure how many types iommufd will support in reality.
Just my personal feeling that having information contained in iommufd is cleaner than expanding iommu core abstraction to assist iommufd user buffer copy/verification.
and a similar example already exists in union ucmd_buffer which includes type specific structures to avoid memory copy...
Not quite get here. ucmd_buffer is a union used to copy any user data. But here we want to check the minsz of the the user data. Seems not related.
that's the example for recording vendor specific structure information in iommufd and it will also grow along with new added types.
Yeah, adding new structures to ucmd_buffer may increase its size as well if the new one is larger, while for an array a new entry is sure to increase the size. I remember there was one tricky thing when handling the selftest type: it is defined as 0xbadbeef, so using it to index an array doesn't work, and we had some special handling for it. If defining the things in iommu_ops, it is simpler. Selftest may not be so critical to determining the direction though.
Regards, Yi Liu
On Wed, Aug 09, 2023 at 09:30:12AM +0000, Liu, Yi L wrote:
Yeah, adding new structures to ucmd_buffer may increase the size as well if the new one is larger. While for an array, if there is new entry, it is for sure to increase the size. I remember there is one tricky thing when handling the selftest type. E.g. it is defined as 0xbadbeef, if using it to index array, it would expire. So we have some special handling on it. If defining the things in iommu_ops, it is simpler. Selftest may be not so critical to determining the direction though.
Maybe we are trying too hard to make it "easy" on the driver.
Can't we just have the driver invoke some:
driver_iommufd_invalidate_op(??? *opaque)
{
        struct driver_base_struct args;

        rc = iommufd_get_args(opaque, &args, sizeof(args),
                              offsetof(args, last));
        if (rc)
                return rc;
}
The whole point of this exercise was to avoid the mistake where drivers code the uAPI checks incorrectly. We can achieve the same outcome by providing a mandatory helper.
Similarly for managing the array of invalidation commands.
Jason
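To make the shape of that helper concrete, here is a hedged sketch of what it might do internally. The struct and function names are placeholders taken from this discussion, not an existing API.

struct iommu_user_data {
        void __user *uptr;
        size_t len;
};

/*
 * Hedged sketch of the proposed mandatory helper: enforce the minimum
 * length the driver understands, then let copy_struct_from_user() zero
 * the tail of the kernel struct or reject non-zero trailing bytes.
 */
static int iommufd_get_args(const struct iommu_user_data *data,
                            void *dst, size_t dst_size, size_t min_len)
{
        if (data->len < min_len)
                return -EINVAL;
        return copy_struct_from_user(dst, dst_size, data->uptr, data->len);
}

A driver would then pass something like offsetofend(struct driver_base_struct, last) as min_len, mirroring the offsetof(args, last) call in the sketch above.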
On Wed, Aug 09, 2023 at 01:24:56PM -0300, Jason Gunthorpe wrote:
On Wed, Aug 09, 2023 at 09:30:12AM +0000, Liu, Yi L wrote:
Yeah, adding new structures to ucmd_buffer may increase the size as well if the new one is larger. While for an array, if there is new entry, it is for sure to increase the size. I remember there is one tricky thing when handling the selftest type. E.g. it is defined as 0xbadbeef, if using it to index array, it would expire. So we have some special handling on it. If defining the things in iommu_ops, it is simpler. Selftest may be not so critical to determining the direction though.
Maybe we are trying too hard to make it "easy" on the driver.
Can't we just have the driver invoke some:
driver_iommufd_invalidate_op(??? *opaque) { struct driver_base_struct args;
rc = iommufd_get_args(opaque, &args, sizeof(args), offsetof(args, last));
OK. So, IIUIC, the opaque should be:
struct iommu_user_data {
        void __user *data_uptr;
        size_t data_len;
} user_data;
And the core does a basic sanity check of data_uptr != NULL and data_len != 0 before passing this to the driver, and then does a full sanity check during the iommufd_get_args (or iommufd_get_user_data?) call.
The whole point of this excercise was to avoid the mistake where drivers code the uapi checks incorrectly. We can achieve the same outcome by providing a mandatory helper.
OK. I will rework this part today.
Similarly for managing the array of invalidation commands.
You mean an embedded uptr inside a driver user data struct right? Sure, that should go through the new helper too.
Thanks Nicolin
On Wed, Aug 09, 2023 at 12:12:25PM -0700, Nicolin Chen wrote:
On Wed, Aug 09, 2023 at 01:24:56PM -0300, Jason Gunthorpe wrote:
On Wed, Aug 09, 2023 at 09:30:12AM +0000, Liu, Yi L wrote:
Yeah, adding new structures to ucmd_buffer may increase the size as well if the new one is larger. While for an array, if there is new entry, it is for sure to increase the size. I remember there is one tricky thing when handling the selftest type. E.g. it is defined as 0xbadbeef, if using it to index array, it would expire. So we have some special handling on it. If defining the things in iommu_ops, it is simpler. Selftest may be not so critical to determining the direction though.
Maybe we are trying too hard to make it "easy" on the driver.
Can't we just have the driver invoke some:
driver_iommufd_invalidate_op(??? *opaque) { struct driver_base_struct args;
rc = iommufd_get_args(opaque, &args, sizeof(args), offsetof(args, last));
OK. So, IIUIC, the opaque should be:
struct iommu_user_data { void __user *data_uptr; size_t data_len; }user_data;
And core does basic sanity of data_uptr != NULL and data_len !=0 before passing this to driver, and then do a full sanity during the iommufd_get_args (or iommufd_get_user_data?) call.
We don't even need to check data_uptr and data_len; the helper should check the size, and NULL is caught by the copy from user.
Similarly for managing the array of invalidation commands.
You mean an embedded uptr inside a driver user data struct right? Sure, that should go through the new helper too.
If we are committed that all drivers have to process an array then put the array in the top level struct and pass it in the same user_data struct and use another helper to allow the driver to iterate through it.
Jason
On Wed, Aug 09, 2023 at 04:19:01PM -0300, Jason Gunthorpe wrote:
On Wed, Aug 09, 2023 at 12:12:25PM -0700, Nicolin Chen wrote:
On Wed, Aug 09, 2023 at 01:24:56PM -0300, Jason Gunthorpe wrote:
On Wed, Aug 09, 2023 at 09:30:12AM +0000, Liu, Yi L wrote:
Yeah, adding new structures to ucmd_buffer may increase the size as well if the new one is larger. While for an array, if there is new entry, it is for sure to increase the size. I remember there is one tricky thing when handling the selftest type. E.g. it is defined as 0xbadbeef, if using it to index array, it would expire. So we have some special handling on it. If defining the things in iommu_ops, it is simpler. Selftest may be not so critical to determining the direction though.
Maybe we are trying too hard to make it "easy" on the driver.
Can't we just have the driver invoke some:
driver_iommufd_invalidate_op(??? *opaque) { struct driver_base_struct args;
rc = iommufd_get_args(opaque, &args, sizeof(args), offsetof(args, last));
OK. So, IIUIC, the opaque should be:
struct iommu_user_data { void __user *data_uptr; size_t data_len; }user_data;
And core does basic sanity of data_uptr != NULL and data_len !=0 before passing this to driver, and then do a full sanity during the iommufd_get_args (or iommufd_get_user_data?) call.
Don't even need to check datA_uptr and data_len, the helper should check the size and null is caught by copy from user
I see. I was worried about the alloc path since its data input is optional upon IOMMU_DOMAIN_UNMANAGED. But this helper should work for that also.
In that case, we might not even need to define the union with all structures, in iommu.h.
Similarly for managing the array of invalidation commands.
You mean an embedded uptr inside a driver user data struct right? Sure, that should go through the new helper too.
If we are committed that all drivers have to process an array then put the array in the top level struct and pass it in the same user_data struct and use another helper to allow the driver to iterate through it.
I see. Both VTD and SMMU pass a uptr to the array of invalidation commands/requests. The only difference is that SMMU's array is a ring buffer rather than a plain one indexed from the beginning. But the helper could take two index inputs, which should work for the VTD case too. If another IOMMU driver only supports one request, rather than an array of requests, we can treat that as a single-entry array.
Then, the driver-specific data structure will be the array entry level only.
@Yi, This seems to be a bigger rework than the top level struct. Along with Jason's request for fail_nth below, we'd need to bisect the workload between us, or can just continue each other's daily work. Let me know which one you prefer. https://lore.kernel.org/linux-iommu/ZNPCtPTcHvITt6fk@nvidia.com/
Thanks! Nic
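Purely as a sketch of the array-walking helper idea discussed above; the name, signature, and callback shape are hypothetical.

/*
 * Hedged sketch: core-side walk over a flat user array of fixed-size
 * entries. Each entry is copied into a driver-provided buffer and handed
 * to a driver callback; the walk stops at the first failure so the core
 * can report how many entries were handled.
 */
static int iommufd_for_each_inv_entry(u64 uptr, u32 entry_num, u32 entry_len,
                                      void *buf, size_t buf_len,
                                      int (*fn)(void *entry, void *data),
                                      void *data, u32 *done)
{
        u32 i;
        int rc = 0;

        for (i = 0; i < entry_num; i++) {
                rc = copy_struct_from_user(buf, buf_len,
                                           u64_to_user_ptr(uptr + (u64)i * entry_len),
                                           entry_len);
                if (rc)
                        break;
                rc = fn(buf, data);
                if (rc)
                        break;
        }
        *done = i;
        return rc;
}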
From: Nicolin Chen nicolinc@nvidia.com Sent: Thursday, August 10, 2023 4:17 AM
On Wed, Aug 09, 2023 at 04:19:01PM -0300, Jason Gunthorpe wrote:
On Wed, Aug 09, 2023 at 12:12:25PM -0700, Nicolin Chen wrote:
On Wed, Aug 09, 2023 at 01:24:56PM -0300, Jason Gunthorpe wrote:
Similarly for managing the array of invalidation commands.
You mean an embedded uptr inside a driver user data struct right? Sure, that should go through the new helper too.
If we are committed that all drivers have to process an array then put the array in the top level struct and pass it in the same user_data struct and use another helper to allow the driver to iterate through it.
I see. Both VTD and SMMU pass uptr to the arrays of invalidation commands/requests. The only difference is that SMMU's array is a ring buffer other than a plain one indexing from the beginning. But the helper could take two index inputs, which should work for VTD case too. If another IOMMU driver only supports one request, rather than a array of requests, we can treat that as a single- entry array.
I like this approach.
On Thu, Aug 10, 2023 at 02:49:59AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Thursday, August 10, 2023 4:17 AM
On Wed, Aug 09, 2023 at 04:19:01PM -0300, Jason Gunthorpe wrote:
On Wed, Aug 09, 2023 at 12:12:25PM -0700, Nicolin Chen wrote:
On Wed, Aug 09, 2023 at 01:24:56PM -0300, Jason Gunthorpe wrote:
Similarly for managing the array of invalidation commands.
You mean an embedded uptr inside a driver user data struct right? Sure, that should go through the new helper too.
If we are committed that all drivers have to process an array then put the array in the top level struct and pass it in the same user_data struct and use another helper to allow the driver to iterate through it.
I see. Both VTD and SMMU pass uptr to the arrays of invalidation commands/requests. The only difference is that SMMU's array is a ring buffer other than a plain one indexing from the beginning. But the helper could take two index inputs, which should work for VTD case too. If another IOMMU driver only supports one request, rather than a array of requests, we can treat that as a single- entry array.
I like this approach.
Do we need to worry about the ring wrap around? It is already the case that the VMM has to scan the ring and extract the invalidation commands, wouldn't it already just linearize them?
Is there a use case for invalidation-only SW emulated rings, and do we care about optimizing for the wrap-around case?
Let's answer those questions before designing something complicated :)
Jason
On Thu, Aug 10, 2023 at 12:57:04PM -0300, Jason Gunthorpe wrote:
On Thu, Aug 10, 2023 at 02:49:59AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Thursday, August 10, 2023 4:17 AM
On Wed, Aug 09, 2023 at 04:19:01PM -0300, Jason Gunthorpe wrote:
On Wed, Aug 09, 2023 at 12:12:25PM -0700, Nicolin Chen wrote:
On Wed, Aug 09, 2023 at 01:24:56PM -0300, Jason Gunthorpe wrote:
Similarly for managing the array of invalidation commands.
You mean an embedded uptr inside a driver user data struct right? Sure, that should go through the new helper too.
If we are committed that all drivers have to process an array then put the array in the top level struct and pass it in the same user_data struct and use another helper to allow the driver to iterate through it.
I see. Both VTD and SMMU pass uptr to the arrays of invalidation commands/requests. The only difference is that SMMU's array is a ring buffer other than a plain one indexing from the beginning. But the helper could take two index inputs, which should work for VTD case too. If another IOMMU driver only supports one request, rather than a array of requests, we can treat that as a single- entry array.
I like this approach.
Do we need to worry about the ring wrap around? It is already the case that the VMM has to scan the ring and extract the invalidation commands, wouldn't it already just linearize them?
I haven't got the chance to send the latest vSMMU series but I pass down the raw user CMDQ to the host to go through, as it'd be easier to stall the consumer index movement when a command in the middle fails.
Is there a use case for invaliation only SW emulated rings, and do we care about optimizing for the wrap around case?
Hmm, why a SW emulated ring?
Yes for the latter question. The SMMU kernel driver has something like Q_WRP and other helpers, so it wasn't difficult to process the user CMDQ in the same raw form. But it does complicate the common code if we want to do it there.
Thanks Nic
On Thu, Aug 10, 2023 at 10:14:37AM -0700, Nicolin Chen wrote:
On Thu, Aug 10, 2023 at 12:57:04PM -0300, Jason Gunthorpe wrote:
On Thu, Aug 10, 2023 at 02:49:59AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Thursday, August 10, 2023 4:17 AM
On Wed, Aug 09, 2023 at 04:19:01PM -0300, Jason Gunthorpe wrote:
On Wed, Aug 09, 2023 at 12:12:25PM -0700, Nicolin Chen wrote:
On Wed, Aug 09, 2023 at 01:24:56PM -0300, Jason Gunthorpe wrote: > Similarly for managing the array of invalidation commands.
You mean an embedded uptr inside a driver user data struct right? Sure, that should go through the new helper too.
If we are committed that all drivers have to process an array then put the array in the top level struct and pass it in the same user_data struct and use another helper to allow the driver to iterate through it.
I see. Both VTD and SMMU pass uptr to the arrays of invalidation commands/requests. The only difference is that SMMU's array is a ring buffer other than a plain one indexing from the beginning. But the helper could take two index inputs, which should work for VTD case too. If another IOMMU driver only supports one request, rather than a array of requests, we can treat that as a single- entry array.
I like this approach.
Do we need to worry about the ring wrap around? It is already the case that the VMM has to scan the ring and extract the invalidation commands, wouldn't it already just linearize them?
I haven't got the chance to send the latest vSMMU series but I pass down the raw user CMDQ to the host to go through, as it'd be easier to stall the consumer index movement when a command in the middle fails.
Don't some commands have to be executed by the VMM?
Even so, it seems straightforward enough for the kernel to report the number of commands it executed and the VMM can adjust the virtual consumer index.
Is there a use case for invaliation only SW emulated rings, and do we care about optimizing for the wrap around case?
Hmm, why a SW emulated ring?
That is what you are building. The VMM catches the write of the producer pointer and the VMM SW bundles it up to call into the kernel.
Yes for the latter question. SMMU kernel driver has something like Q_WRP and other helpers, so it wasn't difficult to process the user CMDQ in the same raw form. But it does complicates the common code if we want to do it there.
Optimizing wrap around means when the producer/consumer pointers pass the end of the queue memory we execute one, not two ioctls toward the kernel. That is possible a very minor optimization, it depends how big the queues are and how frequent multi-entry items will be present.
Jason
On Thu, Aug 10, 2023 at 04:27:02PM -0300, Jason Gunthorpe wrote:
Do we need to worry about the ring wrap around? It is already the case that the VMM has to scan the ring and extract the invalidation commands, wouldn't it already just linearize them?
I haven't got the chance to send the latest vSMMU series but I pass down the raw user CMDQ to the host to go through, as it'd be easier to stall the consumer index movement when a command in the middle fails.
Don't some commands have to be executed by the VMM?
Well, they do. The VMM would go through the queue and "execute" the non-invalidation commands, then defer the queue to the kernel to go through it once more. So, the flaw could be that some of the commands behind the failing TLB flush command got "executed", though in a real case most of the other commands would be "executed" standalone with a CMD_SYNC, i.e. not mixed with any invalidation command.
Even so, it seems straightforward enough for the kernel to report the number of commands it executed and the VMM can adjust the virtual consumer index.
It is not that straightforward to revert an array index back to a consumer index because they might not be 1:1 mapped, since in theory there could be other commands mixing in-between, although it unlikely happens.
So, another index-mapping array would be needed for this matter. And this doesn't address the flaw that I mentioned above either. So, I took the former solution to reduce the complication.
Is there a use case for invaliation only SW emulated rings, and do we care about optimizing for the wrap around case?
Hmm, why a SW emulated ring?
That is what you are building. The VMM catches the write of the producer pointer and the VMM SW bundles it up to call into the kernel.
Still not fully getting it. Do you mean a ring that is prepared by the VMM? I think the only case that we need to handle a ring is what I did by forwarding the guest CMDQ (a ring) to the host directly. Not sure why VMM would need another ring for those linearized invalidation commands. Or maybe I misunderstood..
Yes for the latter question. SMMU kernel driver has something like Q_WRP and other helpers, so it wasn't difficult to process the user CMDQ in the same raw form. But it does complicates the common code if we want to do it there.
Optimizing wrap around means when the producer/consumer pointers pass the end of the queue memory we execute one, not two ioctls toward the kernel. That is possible a very minor optimization, it depends how big the queues are and how frequent multi-entry items will be present.
There could be other commands being issued by other VMs or even the host between the two ioctls. So probably we'd need to handle the wrapping case when doing a ring solution?
Thanks Nicolin
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, August 11, 2023 5:03 AM
Is there a use case for invaliation only SW emulated rings, and do we care about optimizing for the wrap around case?
Hmm, why a SW emulated ring?
That is what you are building. The VMM catches the write of the producer pointer and the VMM SW bundles it up to call into the kernel.
Still not fully getting it. Do you mean a ring that is prepared by the VMM? I think the only case that we need to handle a ring is what I did by forwarding the guest CMDQ (a ring) to the host directly. Not sure why VMM would need another ring for those linearized invalidation commands. Or maybe I misunderstood..
IIUC the point of a ring-based native format is to maximize code reuse when the in-kernel fast invalidation path (from KVM to the SMMU driver) is enabled later. Then both the slow (via VMM) and fast (in-kernel) paths use the same logic to handle the guest invalidation queue.
But if stepping back a bit supporting an array-based non-native format could simplify the uAPI design and allows code sharing for array among vendor drivers. You can still keep the entry as native format then the only difference with future in-kernel fast path is just on walking an array vs. walking a ring. And VMM doesn't need to expose non-invalidate cmds to the kernel and then be skipped.
Thanks Kevin
On Fri, Aug 11, 2023 at 03:52:52AM +0000, Tian, Kevin wrote:
From: Nicolin Chen nicolinc@nvidia.com Sent: Friday, August 11, 2023 5:03 AM
Is there a use case for invaliation only SW emulated rings, and do we care about optimizing for the wrap around case?
Hmm, why a SW emulated ring?
That is what you are building. The VMM catches the write of the producer pointer and the VMM SW bundles it up to call into the kernel.
Still not fully getting it. Do you mean a ring that is prepared by the VMM? I think the only case that we need to handle a ring is what I did by forwarding the guest CMDQ (a ring) to the host directly. Not sure why VMM would need another ring for those linearized invalidation commands. Or maybe I misunderstood..
iiuc the point of a ring-based native format is to maximum code reuse when later in-kernel fast invalidation path (from kvm to smmu driver) is enabled. Then both slow (via vmm) and fast (in-kernel) path use the same logic to handle guest invalidation queue.
I see. That's about the fast path topic. Thanks for the input.
But if stepping back a bit supporting an array-based non-native format could simplify the uAPI design and allows code sharing for array among vendor drivers. You can still keep the entry as native format then the only difference with future in-kernel fast path is just on walking an array vs. walking a ring. And VMM doesn't need to expose non-invalidate cmds to the kernel and then be skipped.
Ah, so we might still design the uAPI to be ring based at this moment, yet not support the CONS > 0 case, leaving that to a future upgrade.
I will try estimating a bit how complicated to implement the ring, to see if we could just start with that. Otherwise, will just start with an array.
Thanks Nic
On Fri, Aug 11, 2023 at 09:45:21AM -0700, Nicolin Chen wrote:
But if stepping back a bit supporting an array-based non-native format could simplify the uAPI design and allows code sharing for array among vendor drivers. You can still keep the entry as native format then the only difference with future in-kernel fast path is just on walking an array vs. walking a ring. And VMM doesn't need to expose non-invalidate cmds to the kernel and then be skipped.
Ah, so we might still design the uAPI to be ring based at this moment, yet don't support a case CONS > 0 to leave that to an upgrade in the future.
I will try estimating a bit how complicated to implement the ring, to see if we could just start with that. Otherwise, will just start with an array.
I drafted a uAPI structure for a ring-based SW queue. While I am trying an implementation, I'd like to collect some comments at the structure, to see if it overall makes sense.
One thing that I couldn't add to this common structure for SMMU is the hardware error code, which should be encoded in the higher bits of the consumer index register, following the SMMU spec:
  ERR, bits [30:24] Error reason code.
  - When a command execution error is detected, ERR is set to a reason code and then the SMMU_GERROR.CMDQ_ERR global error becomes active.
  - The value in this field is UNKNOWN when the CMDQ_ERR global error is not active. This field resets to an UNKNOWN value.
But, I feel it odd to do the same to the generic structure. So, perhaps an optional @out_error can be added to this structure. Or some other idea?
Thanks Nic
/**
 * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
 * @size: sizeof(struct iommu_hwpt_invalidate)
 * @hwpt_id: HWPT ID of target hardware page table for the invalidation
 * @q_uptr: User pointer to an invalidation queue, which can be used as a flat
 *          array or a circular ring queue. The entry(s) in the queue must be
 *          at a fixed width @q_entry_len, containing a user data structure for
 *          an invalidation request, specific to the given hardware pagetable.
 * @q_cons_uptr: User pointer to the consumer index (with its wrap flag) of an
 *               invalidation queue. This pointer must point to a __u32 type of
 *               memory location. The consumer index tells kernel to read from
 *               the entry pointed by it (and then its next entry) until kernel
 *               reaches the entry pointed by the producer index @q_prod, and
 *               allows kernel to update the consumer index to where it stops:
 *               on success, it should be updated to @q_prod; otherwise, to the
 *               index pointing to the failed entry.
 * @q_prod: Producer index (with its wrap flag) of an invalidation queue. This
 *          index points to the entry next to the last requested entry in the
 *          invalidation queue. In case of using the queue as a flat array, it
 *          equals to the number of entries @q_entry_num.
 * @q_index_bits: Effective bits of both indexes. Defines the maximum value an
 *                index can be. Must not be greater than 31 bits. A wrap flag
 *                is defined at the next higher bit adjacent to the index bits:
 *                e.g. if @q_index_bits is 20, @q_prod[19:0] are the index bits
 *                and @q_prod[20] is the wrap flag. The wrap flag, acting like
 *                a sign flag, must be toggled each time an index overflows and
 *                wraps to the lower end of the circular queue.
 * @q_entry_num: Total number of the entries in an invalidation queue
 * @q_entry_len: Length (in bytes) of an entry of an invalidation queue
 *
 * Invalidate the iommu cache for user-managed page table. Modifications on a
 * user-managed page table should be followed by this operation to sync cache.
 *
 * One request supports multiple invalidations via a multi-entry queue:
 *   |<----------- Length of Queue = @q_entry_num * @q_entry_len ------------>|
 *   --------------------------------------------------------------------------
 *   | 0 | 1 | 2 | 3 | ... | @q_entry_num-3 | @q_entry_num-2 | @q_entry_num-1 |
 *   --------------------------------------------------------------------------
 *   ^        ^             ^               |<-@q_entry_len->|
 *   |        |             |
 *   @q_uptr  @q_cons_uptr  @q_prod
 *
 * A queue index can wrap its index bits off the high end of the queue and back
 * onto the low end by toggling its wrap flag: e.g. when @q_entry_num=0x10 and
 * @q_index_bits=4, *@q_cons_uptr=0xf and @q_prod=0x11 inputs mean the producer
 * index is wrapped to 0x1 with its wrap flag set, so kernel reads/handles the
 * entry starting from the consumer index (0xf) and wraps it back to 0x0 and
 * 0x1 by toggling the wrap flag, i.e. *@q_cons_uptr has a final value of 0x11.
 */
struct iommu_hwpt_invalidate {
        __u32 size;
        __u32 hwpt_id;
        __aligned_u64 q_uptr;
        __aligned_u64 q_cons_uptr;
        __u32 q_prod;
        __u32 q_index_bits;
        __u32 q_entry_num;
        __u32 q_entry_len;
};
#define IOMMU_HWPT_INVALIDATE _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_INVALIDATE)
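For reference, a small sketch of how the wrap flag described above could be decoded when computing the number of pending entries between the two indexes. It is purely illustrative and simply follows the @q_index_bits semantics in the comment; the function name is made up.

/* Hedged sketch: pending entries in the circular queue described above */
static u32 invq_pending(u32 cons, u32 prod, u32 index_bits, u32 entry_num)
{
        u32 mask = (1U << index_bits) - 1;
        u32 cons_idx = cons & mask;
        u32 prod_idx = prod & mask;
        /* Wrap flags differ when the producer is one lap ahead */
        bool wrapped = ((cons ^ prod) >> index_bits) & 0x1;

        return wrapped ? entry_num - cons_idx + prod_idx
                       : prod_idx - cons_idx;
}

With the example in the comment (entry_num=0x10, index_bits=4, cons=0xf, prod=0x11) this yields 2 pending entries: 0xf and the wrapped 0x0.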
On Mon, Aug 14, 2023 at 11:18:46PM -0700, Nicolin Chen wrote:
On Fri, Aug 11, 2023 at 09:45:21AM -0700, Nicolin Chen wrote:
But if stepping back a bit supporting an array-based non-native format could simplify the uAPI design and allows code sharing for array among vendor drivers. You can still keep the entry as native format then the only difference with future in-kernel fast path is just on walking an array vs. walking a ring. And VMM doesn't need to expose non-invalidate cmds to the kernel and then be skipped.
Ah, so we might still design the uAPI to be ring based at this moment, yet don't support a case CONS > 0 to leave that to an upgrade in the future.
I will try estimating a bit how complicated to implement the ring, to see if we could just start with that. Otherwise, will just start with an array.
I drafted a uAPI structure for a ring-based SW queue. While I am trying an implementation, I'd like to collect some comments at the structure, to see if it overall makes sense.
I don't think a ring makes a lot of sense at this point. The only thing it optimizes is a system call if the kernel needs to wrap around the tail of the ring. It would possibly be better to have a small gather list than to try to describe the ring logic.
Further, the VMM already has to process it, so the VMM already knows which ops are going to the kernel and which are not. The VMM can just organize them in a linear list one way or another. We need to copy and read this stuff in the VMM anyhow to protect against a hostile VM.
One thing that I couldn't add to this common structure for SMMU is the hardware error code, which should be encoded in the higher bits of the consumer index register, following the SMMU spec: ERR, bits [30:24] Error reason code. - When a command execution error is detected, ERR is set to a reason code and then the SMMU_GERROR.CMDQ_ERR global error becomes active. - The value in this field is UNKNOWN when the CMDQ_ERR global error is not active. This field resets to an UNKNOWN value.
The invalidate ioctl should fail in some deterministic way and report back the error code and the highest array index that maybe could have triggered it.
The highest array index sounds generic, the error code maybe is too
Jason
On Fri, Aug 18, 2023 at 01:56:54PM -0300, Jason Gunthorpe wrote:
On Mon, Aug 14, 2023 at 11:18:46PM -0700, Nicolin Chen wrote:
On Fri, Aug 11, 2023 at 09:45:21AM -0700, Nicolin Chen wrote:
But stepping back a bit, supporting an array-based non-native format could simplify the uAPI design and allow code sharing for the array handling among vendor drivers. You can still keep the entries in native format; then the only difference from a future in-kernel fast path is just walking an array vs. walking a ring. And the VMM doesn't need to expose non-invalidate cmds to the kernel only to have them skipped.
Ah, so we might still design the uAPI to be ring based at this moment, yet not support a case of CONS > 0, leaving that to an upgrade in the future.
I will try estimating how complicated it would be to implement the ring, to see if we could just start with that. Otherwise, we will just start with an array.
I drafted a uAPI structure for a ring-based SW queue. While I am trying an implementation, I'd like to collect some comments on the structure, to see if it overall makes sense.
I don't think a ring makes a lot of sense at this point. The only thing it optimizes is a system call if the kernel needs to wrap around the tail of the ring. It would possibly be better to have a small gather list than try to describe the ring logic.
Further, the VMM already has to process it, so the VMM already knows which ops are going to the kernel and which are not. The VMM can just organize them in a linear list in one way or another. We need to copy and read this stuff in the VMM anyhow to protect against a hostile VM.
OK. Then a linear array it is.
One thing that I couldn't add to this common structure for SMMU is the hardware error code, which should be encoded in the higher bits of the consumer index register, following the SMMU spec:
  ERR, bits [30:24]: Error reason code.
  - When a command execution error is detected, ERR is set to a reason code and then the SMMU_GERROR.CMDQ_ERR global error becomes active.
  - The value in this field is UNKNOWN when the CMDQ_ERR global error is not active. This field resets to an UNKNOWN value.
The invalidate ioctl should fail in some deterministic way and report back the error code and the highest array index that may have triggered it.
Yea. With an error code in the highest bits of array_index, "array_index != array_max" could be the deterministic way to indicate a failure. And a kernel errno could also be returned to the invalidate ioctl.
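For reference, a small sketch of how such a packed value could be split on the VMM side if the error code were carried in the upper bits of a returned 32-bit index, as discussed. The ERR field layout follows the SMMU CMDQ CONS register quoted above; the function and variable names are made up for illustration.

/* Split a 32-bit value into an SMMU-style ERR[30:24] code and the index
 * bits, assuming the error code were packed into the upper bits.
 */
#include <stdint.h>

static inline uint8_t cmdq_err_code(uint32_t cons)
{
	return (cons >> 24) & 0x7f;	/* ERR, bits [30:24] per the SMMU spec */
}

static inline uint32_t cmdq_index(uint32_t cons, uint32_t index_bits)
{
	return cons & ((1u << index_bits) - 1);	/* index without the wrap flag */
}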
The highest array index sounds generic, the error code maybe is too
We could do that and report the error code in its raw form:
	__u32 out_array_index;
	/* number of bits used to report error code in the returned array_index */
	__u32 out_array_index_error_code_bits;
Or just:
	__u32 out_array_index;
	__u32 out_error_code;
Do we have to define a list of generic error codes?
Thanks! Nic
On Fri, Aug 18, 2023 at 10:56:45AM -0700, Nicolin Chen wrote:
The highest array index sounds generic, the error code maybe is too
We could do that and report the error code in its raw form:
	__u32 out_array_index;
	/* number of bits used to report error code in the returned array_index */
	__u32 out_array_index_error_code_bits;
Or just:
	__u32 out_array_index;
	__u32 out_error_code;
Do we have to define a list of generic error codes?
out_driver_error_code may be OK
Jason
On Fri, Aug 18, 2023 at 03:20:55PM -0300, Jason Gunthorpe wrote:
On Fri, Aug 18, 2023 at 10:56:45AM -0700, Nicolin Chen wrote:
The highest array index sounds generic, the error code maybe is too
We could do that and report the error code in its raw form:
	__u32 out_array_index;
	/* number of bits used to report error code in the returned array_index */
	__u32 out_array_index_error_code_bits;
Or just:
	__u32 out_array_index;
	__u32 out_error_code;
Do we have to define a list of generic error codes?
out_driver_error_code may be OK
Ack. Will implement the array in that way.
Thanks! Nic
From: Nicolin Chen nicolinc@nvidia.com Sent: Saturday, August 19, 2023 2:26 AM
On Fri, Aug 18, 2023 at 03:20:55PM -0300, Jason Gunthorpe wrote:
On Fri, Aug 18, 2023 at 10:56:45AM -0700, Nicolin Chen wrote:
The highest array index sounds generic, the error code maybe is too
We could do that and report the error code in its raw form:
	__u32 out_array_index;
	/* number of bits used to report error code in the returned array_index */
	__u32 out_array_index_error_code_bits;
Or just:
	__u32 out_array_index;
	__u32 out_error_code;
Do we have to define a list of generic error codes?
out_driver_error_code may be OK
Ack. Will implement the array in that way.
Yes. Error code is driver specific. Just having a field to carry it is sufficient.
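To summarize the direction agreed above, here is a rough sketch of what an array-based request with an output index and a raw driver-specific error code might look like. The struct name, field names and layout are illustrative assumptions only, not the final uAPI, and assume the usual <linux/types.h> fixed-width types.

/* Sketch only: array-based invalidation request with out_array_index and
 * out_driver_error_code, per the discussion above. Not the final uAPI.
 */
#include <linux/types.h>

struct iommu_hwpt_invalidate_array_sketch {
	__u32 size;
	__u32 hwpt_id;
	__aligned_u64 reqs_uptr;	/* flat array of fixed-width entries */
	__u32 req_len;			/* bytes per entry */
	__u32 req_num;			/* number of entries in the array */
	__u32 out_array_index;		/* highest index processed, or the failed one */
	__u32 out_driver_error_code;	/* raw, driver-specific error code */
};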
From: Nicolin Chen nicolinc@nvidia.com Sent: Thursday, August 10, 2023 4:17 AM
On Wed, Aug 09, 2023 at 04:19:01PM -0300, Jason Gunthorpe wrote:
On Wed, Aug 09, 2023 at 12:12:25PM -0700, Nicolin Chen wrote:
On Wed, Aug 09, 2023 at 01:24:56PM -0300, Jason Gunthorpe wrote:
On Wed, Aug 09, 2023 at 09:30:12AM +0000, Liu, Yi L wrote:
Yeah, adding new structures to ucmd_buffer may increase its size as well if the new one is larger, while for an array a new entry is sure to increase the size. I remember there is one tricky thing when handling the selftest type: e.g. it is defined as 0xbadbeef, so using it to index an array would go out of bounds, and we have some special handling for it. If we define the things in iommu_ops, it is simpler. Selftest may not be so critical to determining the direction though.
Maybe we are trying too hard to make it "easy" on the driver.
Can't we just have the driver invoke some:
driver_iommufd_invalidate_op(??? *opaque)
{
	struct driver_base_struct args;

	rc = iommufd_get_args(opaque, &args, sizeof(args),
			      offsetof(args, last));
OK. So, IIUIC, the opaque should be:
struct iommu_user_data {
	void __user *data_uptr;
	size_t data_len;
} user_data;
And the core does a basic sanity check of data_uptr != NULL and data_len != 0 before passing this to the driver, and then does a full sanity check during the iommufd_get_args (or iommufd_get_user_data?) call.
We don't even need to check data_uptr and data_len; the helper should check the size, and a NULL pointer is caught by the copy from user.
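(For reference, a minimal sketch of what such a helper could look like, assuming the struct iommu_user_data wrapper proposed above. copy_struct_from_user() faults on a bad or NULL pointer and rejects oversized input with non-zero trailing bytes, which is why no extra NULL check is needed. The helper name, signature and min_len parameter are assumptions, not the final iommufd API.)

/* Sketch: copy a driver-defined user data struct via copy_struct_from_user(),
 * which handles NULL pointers (-EFAULT) and size mismatches itself.
 */
#include <linux/uaccess.h>

struct iommu_user_data {
	void __user *data_uptr;
	size_t data_len;
};

static int iommufd_get_user_data(const struct iommu_user_data *user_data,
				 void *dst, size_t dst_size, size_t min_len)
{
	if (user_data->data_len < min_len)
		return -EINVAL;
	return copy_struct_from_user(dst, dst_size, user_data->data_uptr,
				     user_data->data_len);
}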
I see. I was worried about the alloc path since its data input is optional upon IOMMU_DOMAIN_UNMANAGED. But this helper should work for that also.
In that case, we might not even need to define the union with all structures, in iommu.h.
Similarly for managing the array of invalidation commands.
You mean an embedded uptr inside a driver user data struct right? Sure, that should go through the new helper too.
If we are committed that all drivers have to process an array then put the array in the top level struct and pass it in the same user_data struct and use another helper to allow the driver to iterate through it.
I see. Both VTD and SMMU pass a uptr to the arrays of invalidation commands/requests. The only difference is that SMMU's array is a ring buffer rather than a plain one indexed from the beginning. But the helper could take two index inputs, which should work for the VTD case too. If another IOMMU driver only supports one request, rather than an array of requests, we can treat that as a single-entry array.
Then, the driver-specific data structure will be the array entry level only.
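(As an illustration of that array-walking idea: the core copies one fixed-width, driver-defined entry at a time from the user array, between two indexes, and hands each to the driver. The helper name, signature and callback shape below are assumptions for the sketch, not an existing iommufd interface.)

/* Sketch: iterate a user array of fixed-width entries between two indexes,
 * copying each into a kernel buffer and passing it to a driver callback.
 */
#include <linux/uaccess.h>

static int iommufd_for_each_user_entry(void __user *uptr, u32 entry_len,
				       u32 start, u32 end,
				       int (*fn)(void *entry, void *cb_arg),
				       void *entry_buf, void *cb_arg)
{
	u32 i;
	int rc;

	for (i = start; i < end; i++) {
		if (copy_from_user(entry_buf, uptr + (size_t)i * entry_len,
				   entry_len))
			return -EFAULT;
		rc = fn(entry_buf, cb_arg);
		if (rc)
			return rc;	/* caller can report i as the failed index */
	}
	return 0;
}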
@Yi, this seems to be a bigger rework than the top level struct. Along with Jason's request for fail_nth below, we'd need to split the workload between us, or each of us can just continue our own daily work. Let me know which one you prefer. https://lore.kernel.org/linux-iommu/ZNPCtPTcHvITt6fk@nvidia.com/
Let me address the fail_nth request first. You may rework iommufd_get_user_data(). If I can finish the fail_nth soon, I may then help to lift the array to the top level. If not, you may do it as well. 😊 I guess I need to study fail_nth a bit as well.
Regards, Yi Liu
On Thu, Aug 10, 2023 at 03:28:10AM +0000, Liu, Yi L wrote:
@Yi, this seems to be a bigger rework than the top level struct. Along with Jason's request for fail_nth below, we'd need to split the workload between us, or each of us can just continue our own daily work. Let me know which one you prefer. https://lore.kernel.org/linux-iommu/ZNPCtPTcHvITt6fk@nvidia.com/
Let me address the fail_nth request first. You may rework iommufd_get_user_data(). If I can finish the fail_nth soon, I may then help to lift the array to the top level. If not, you may do it as well. 😊 I guess I need to study fail_nth a bit as well.
Ack. Am working on the iommufd_get_user_data() already. Will continue the ring buffer thing.
Nic
From: Lu Baolu baolu.lu@linux.intel.com
This adds the support for IOMMU_HWPT_TYPE_VTD_S1 type.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/intel/iommu.c | 20 ++++++++++++++++++++
 include/linux/iommu.h       |  1 +
 2 files changed, 21 insertions(+)
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 3119a79ebc83..6977d320c440 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4084,6 +4084,25 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type)
 	return NULL;
 }
+static struct iommu_domain *
+intel_iommu_domain_alloc_user(struct device *dev,
+			      enum iommu_hwpt_type hwpt_type,
+			      struct iommu_domain *parent,
+			      const union iommu_domain_user_data *user_data)
+{
+	if (hwpt_type != IOMMU_HWPT_TYPE_DEFAULT &&
+	    hwpt_type != IOMMU_HWPT_TYPE_VTD_S1)
+		return ERR_PTR(-EINVAL);
+
+	if ((hwpt_type == IOMMU_HWPT_TYPE_DEFAULT) == !!parent)
+		return ERR_PTR(-EINVAL);
+
+	if (parent)
+		return intel_nested_domain_alloc(parent, user_data);
+	else
+		return iommu_domain_alloc(dev->bus);
+}
+
 static void intel_iommu_domain_free(struct iommu_domain *domain)
 {
 	if (domain != &si_domain->domain && domain != &blocking_domain)
@@ -4732,6 +4751,7 @@ static void intel_iommu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
 const struct iommu_ops intel_iommu_ops = {
 	.capable = intel_iommu_capable,
 	.domain_alloc = intel_iommu_domain_alloc,
+	.domain_alloc_user = intel_iommu_domain_alloc_user,
 	.probe_device = intel_iommu_probe_device,
 	.probe_finalize = intel_iommu_probe_finalize,
 	.release_device = intel_iommu_release_device,
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 231920efab84..09b8e800b55e 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -240,6 +240,7 @@ union iommu_domain_user_data {
 #ifdef CONFIG_IOMMUFD_TEST
 	__u64 test[2];
 #endif
+	struct iommu_hwpt_vtd_s1 vtd;
 };

 /**
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:14 PM
From: Lu Baolu baolu.lu@linux.intel.com
This adds the support for IOMMU_HWPT_TYPE_VTD_S1 type.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
Add intel_iommu_hw_info() to report cap_reg and ecap_reg information.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/intel/iommu.c  | 19 +++++++++++++++++++
 include/uapi/linux/iommufd.h | 23 +++++++++++++++++++++++
 2 files changed, 42 insertions(+)
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 6977d320c440..ba34827045e6 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4748,8 +4748,26 @@ static void intel_iommu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
 	intel_pasid_tear_down_entry(iommu, dev, pasid, false);
 }

+static void *intel_iommu_hw_info(struct device *dev, u32 *length)
+{
+	struct device_domain_info *info = dev_iommu_priv_get(dev);
+	struct intel_iommu *iommu = info->iommu;
+	struct iommu_hw_info_vtd *vtd;
+
+	vtd = kzalloc(sizeof(*vtd), GFP_KERNEL);
+	if (!vtd)
+		return ERR_PTR(-ENOMEM);
+
+	vtd->cap_reg = iommu->cap;
+	vtd->ecap_reg = iommu->ecap;
+	*length = sizeof(*vtd);
+
+	return vtd;
+}
+
 const struct iommu_ops intel_iommu_ops = {
 	.capable = intel_iommu_capable,
+	.hw_info = intel_iommu_hw_info,
 	.domain_alloc = intel_iommu_domain_alloc,
 	.domain_alloc_user = intel_iommu_domain_alloc_user,
 	.probe_device = intel_iommu_probe_device,
@@ -4763,6 +4781,7 @@ const struct iommu_ops intel_iommu_ops = {
 	.def_domain_type = device_def_domain_type,
 	.remove_dev_pasid = intel_iommu_remove_dev_pasid,
 	.pgsize_bitmap = SZ_4K,
+	.hw_info_type = IOMMU_HW_INFO_TYPE_INTEL_VTD,
 #ifdef CONFIG_INTEL_IOMMU_SVM
 	.page_response = intel_svm_page_response,
 #endif
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 2c1241448c87..0dfb6f3d8dda 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -435,12 +435,35 @@ struct iommu_hwpt_alloc {
 };
 #define IOMMU_HWPT_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_ALLOC)
+/**
+ * struct iommu_hw_info_vtd - Intel VT-d hardware information
+ *
+ * @flags: Must be 0
+ * @__reserved: Must be 0
+ *
+ * @cap_reg: Value of Intel VT-d capability register defined in VT-d spec
+ *           section 11.4.2 Capability Register.
+ * @ecap_reg: Value of Intel VT-d capability register defined in VT-d spec
+ *            section 11.4.3 Extended Capability Register.
+ *
+ * User needs to understand the Intel VT-d specification to decode the
+ * register value.
+ */
+struct iommu_hw_info_vtd {
+	__u32 flags;
+	__u32 __reserved;
+	__aligned_u64 cap_reg;
+	__aligned_u64 ecap_reg;
+};
+
 /**
  * enum iommu_hw_info_type - IOMMU Hardware Info Types
  * @IOMMU_HW_INFO_TYPE_NONE: Used by the drivers that does not report hardware info
+ * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
  */
 enum iommu_hw_info_type {
 	IOMMU_HW_INFO_TYPE_NONE,
+	IOMMU_HW_INFO_TYPE_INTEL_VTD,
 };

 /**
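To illustrate what userspace can do with the reported registers, here is a hypothetical decoding sketch. The bit positions (ECAP.NEST at bit 26, CAP.SAGAW at bits 12:8) follow the Intel VT-d specification as used by the kernel driver, but they are stated here from memory and should be double-checked against the spec revision you target; the function names are made up.

/* Illustrative userspace decoding of a couple of fields from the returned
 * cap_reg/ecap_reg values of struct iommu_hw_info_vtd.
 */
#include <stdbool.h>
#include <stdint.h>

static bool vtd_supports_nesting(uint64_t ecap_reg)
{
	return (ecap_reg >> 26) & 0x1;		/* ECAP.NEST (assumed bit 26) */
}

static unsigned int vtd_sagaw(uint64_t cap_reg)
{
	return (cap_reg >> 8) & 0x1f;		/* CAP.SAGAW (assumed bits 12:8) */
}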
On Mon, Jul 24, 2023 at 04:13:33AM -0700, Yi Liu wrote:
Add intel_iommu_hw_info() to report cap_reg and ecap_reg information.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Yi Liu yi.l.liu@intel.com

 drivers/iommu/intel/iommu.c  | 19 +++++++++++++++++++
 include/uapi/linux/iommufd.h | 23 +++++++++++++++++++++++
 2 files changed, 42 insertions(+)
I would like to pick this patch out of this series to go with the main get_info stuff so that we have drivers implementing what is merged. I made the trivial fixup.
Lu are you OK?
Thanks, Jason
On 2023/8/16 0:31, Jason Gunthorpe wrote:
On Mon, Jul 24, 2023 at 04:13:33AM -0700, Yi Liu wrote:
Add intel_iommu_hw_info() to report cap_reg and ecap_reg information.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Yi Liu yi.l.liu@intel.com

 drivers/iommu/intel/iommu.c  | 19 +++++++++++++++++++
 include/uapi/linux/iommufd.h | 23 +++++++++++++++++++++++
 2 files changed, 42 insertions(+)
I would like to pick this patch out of this series to go with the main get_info stuff so that we have drivers implementing what is merged. I made the trivial fixup.
Lu are you OK?
Yes.
Reviewed-by: Lu Baolu baolu.lu@linux.intel.com
Best regards, baolu
On Wed, Aug 16, 2023 at 08:35:00AM +0800, Baolu Lu wrote:
On 2023/8/16 0:31, Jason Gunthorpe wrote:
On Mon, Jul 24, 2023 at 04:13:33AM -0700, Yi Liu wrote:
Add intel_iommu_hw_info() to report cap_reg and ecap_reg information.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Yi Liu yi.l.liu@intel.com

 drivers/iommu/intel/iommu.c  | 19 +++++++++++++++++++
 include/uapi/linux/iommufd.h | 23 +++++++++++++++++++++++
 2 files changed, 42 insertions(+)
I would like to pick this patch out of this series to go with the main get_info stuff so that we have drivers implementing what is merged. I made the trivial fixup.
Lu are you OK?
Yes.
Reviewed-by: Lu Baolu baolu.lu@linux.intel.com
Hmm...
We have Yi in the author line and Baolu in the first signed-off line, and now Baolu with "Reviewed-by" again...
I guess we might need a bit of fix or re-arrange? :)
Thanks Nic
On Wed, Aug 16, 2023 at 08:35:00AM +0800, Baolu Lu wrote:
On 2023/8/16 0:31, Jason Gunthorpe wrote:
On Mon, Jul 24, 2023 at 04:13:33AM -0700, Yi Liu wrote:
Add intel_iommu_hw_info() to report cap_reg and ecap_reg information.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Yi Liu yi.l.liu@intel.com

 drivers/iommu/intel/iommu.c  | 19 +++++++++++++++++++
 include/uapi/linux/iommufd.h | 23 +++++++++++++++++++++++
 2 files changed, 42 insertions(+)
I would like to pick this patch out of this series to go with the main get_info stuff so that we have drivers implementing what is merged. I made the trivial fixup.
Lu are you OK?
Yes.
Reviewed-by: Lu Baolu baolu.lu@linux.intel.com
I changed this to an acked-by since you helped write the patch :)
Jason
On 2023/8/16 19:46, Jason Gunthorpe wrote:
On Wed, Aug 16, 2023 at 08:35:00AM +0800, Baolu Lu wrote:
On 2023/8/16 0:31, Jason Gunthorpe wrote:
On Mon, Jul 24, 2023 at 04:13:33AM -0700, Yi Liu wrote:
Add intel_iommu_hw_info() to report cap_reg and ecap_reg information.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Yi Liu yi.l.liu@intel.com

 drivers/iommu/intel/iommu.c  | 19 +++++++++++++++++++
 include/uapi/linux/iommufd.h | 23 +++++++++++++++++++++++
 2 files changed, 42 insertions(+)
I would like to pick this patch out of this series to go with the main get_info stuff so that we have drivers implementing what is merged. I made the trivial fixup.
Lu are you OK?
Yes.
Reviewed-by: Lu Baolu baolu.lu@linux.intel.com
I changed this to an acked-by since you helped write the patch 😄
Okay, fine to me.
Best regards, baolu
From: Baolu Lu baolu.lu@linux.intel.com Sent: Wednesday, August 16, 2023 7:49 PM
On 2023/8/16 19:46, Jason Gunthorpe wrote:
On Wed, Aug 16, 2023 at 08:35:00AM +0800, Baolu Lu wrote:
On 2023/8/16 0:31, Jason Gunthorpe wrote:
On Mon, Jul 24, 2023 at 04:13:33AM -0700, Yi Liu wrote:
Add intel_iommu_hw_info() to report cap_reg and ecap_reg information.
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Nicolin Chen nicolinc@nvidia.com
Signed-off-by: Yi Liu yi.l.liu@intel.com

 drivers/iommu/intel/iommu.c  | 19 +++++++++++++++++++
 include/uapi/linux/iommufd.h | 23 +++++++++++++++++++++++
 2 files changed, 42 insertions(+)
I would like to pick this patch out of this series to go with the main get_info stuff so that we have drivers implementing what is merged. I made the trivial fixup.
Lu are you OK?
Yes.
Reviewed-by: Lu Baolu baolu.lu@linux.intel.com
I changed this to an acked-by since you helpd write the patch 😄
Okay, fine to me.
I'm going to send hw_info v8 which includes this patch, so will add below tag, thanks.
Acked-by: Lu Baolu baolu.lu@linux.intel.com
Regards, Yi Liu
From: Lu Baolu baolu.lu@linux.intel.com
When remapping hardware is configured by system software in scalable mode as Nested (PGTT=011b) and with PWSNP field Set in the PASID-table-entry, it may Set Accessed bit and Dirty bit (and Extended Access bit if enabled) in first-stage page-table entries even when second-stage mappings indicate that corresponding first-stage page-table is Read-Only.
As a result, the contents of pages designated by the VMM as Read-Only can be modified by the IOMMU via PML5E (PML4E for 4-level tables) accesses as part of the address translation process, due to DMAs issued by the Guest.
Disallow nested translation when there are read-only pages in the corresponding second-stage mappings, and do not allow read-only pages to be configured in the second-stage table of a nested translation.
For simplicity the second restriction is not relaxed even when nesting is turned off later due to a vIOMMU config change. Conceptually, if the user understands this errata and expects to enable nested translation, it should never install any RO mapping in stage-2 during the entire VM life cycle. Accordingly, introduce a single sticky bit to mark the parent role on a domain instead of tracking the role with a counter.
Reference from Sapphire Rapids Specification Update [1], errata details, SPR17.
[1] https://www.intel.com/content/www/us/en/content-details/772415/content-detai...
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Signed-off-by: Yi Liu yi.l.liu@intel.com
---
 drivers/iommu/intel/iommu.c  | 14 ++++++++++++++
 drivers/iommu/intel/iommu.h  |  4 ++++
 drivers/iommu/intel/nested.c | 14 +++++++++++++-
 include/uapi/linux/iommufd.h | 12 +++++++++++-
 4 files changed, 42 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ba34827045e6..caaa3a58dc94 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2138,6 +2138,7 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 	struct dma_pte *first_pte = NULL, *pte = NULL;
 	unsigned int largepage_lvl = 0;
 	unsigned long lvl_pages = 0;
+	unsigned long flags;
 	phys_addr_t pteval;
 	u64 attr;
@@ -2147,6 +2148,18 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 	if ((prot & (DMA_PTE_READ|DMA_PTE_WRITE)) == 0)
 		return -EINVAL;
+	if (!(prot & DMA_PTE_WRITE) && !domain->read_only_mapped) {
+		spin_lock_irqsave(&domain->lock, flags);
+		if (domain->set_nested) {
+			pr_err_ratelimited("No read-only mapping permitted\n");
+			spin_unlock_irqrestore(&domain->lock, flags);
+			return -EINVAL;
+		}
+
+		domain->read_only_mapped = true;
+		spin_unlock_irqrestore(&domain->lock, flags);
+	}
+
 	attr = prot & (DMA_PTE_READ | DMA_PTE_WRITE | DMA_PTE_SNP);
 	attr |= DMA_FL_PTE_PRESENT;
 	if (domain->use_first_level) {
@@ -4758,6 +4771,7 @@ static void *intel_iommu_hw_info(struct device *dev, u32 *length)
 	if (!vtd)
 		return ERR_PTR(-ENOMEM);
+	vtd->flags = IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17;
 	vtd->cap_reg = iommu->cap;
 	vtd->ecap_reg = iommu->ecap;
 	*length = sizeof(*vtd);
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 5b292213bcb8..2a14fab6ac4f 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -592,6 +592,10 @@ struct dmar_domain {
 					 * otherwise, goes through the second
 					 * level.
 					 */
+	u8 read_only_mapped:1;		/* domain has mappings with read-only
+					 * permission.
+					 */
+	u8 set_nested:1;		/* has other domains nested on it */
 	spinlock_t lock;		/* Protect device tracking lists */
 	struct list_head devices;	/* all devices' list */
diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c
index 2739c0d7880d..50934da613fa 100644
--- a/drivers/iommu/intel/nested.c
+++ b/drivers/iommu/intel/nested.c
@@ -142,14 +142,26 @@ struct iommu_domain *intel_nested_domain_alloc(struct iommu_domain *s2_domain,
 					       const union iommu_domain_user_data *user_data)
 {
 	const struct iommu_hwpt_vtd_s1 *vtd = (struct iommu_hwpt_vtd_s1 *)user_data;
+	struct dmar_domain *s2_dmar_domain = to_dmar_domain(s2_domain);
 	struct dmar_domain *domain;
+	unsigned long flags;
 	domain = kzalloc(sizeof(*domain), GFP_KERNEL_ACCOUNT);
 	if (!domain)
 		return NULL;
+	spin_lock_irqsave(&s2_dmar_domain->lock, flags);
+	if (s2_dmar_domain->read_only_mapped) {
+		spin_unlock_irqrestore(&s2_dmar_domain->lock, flags);
+		pr_err_ratelimited("S2 domain has read-only mappings\n");
+		kfree(domain);
+		return NULL;
+	}
+	s2_dmar_domain->set_nested = true;
+	spin_unlock_irqrestore(&s2_dmar_domain->lock, flags);
+
 	domain->use_first_level = true;
-	domain->s2_domain = to_dmar_domain(s2_domain);
+	domain->s2_domain = s2_dmar_domain;
 	domain->s1_pgtbl = vtd->pgtbl_addr;
 	domain->s1_cfg = *vtd;
 	domain->domain.ops = &intel_nested_domain_ops;
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 0dfb6f3d8dda..2f8f2dab95a7 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -435,10 +435,20 @@ struct iommu_hwpt_alloc {
 };
 #define IOMMU_HWPT_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_ALLOC)
+/**
+ * enum iommu_hw_info_vtd_flags - Flags for VT-d hw_info
+ * @IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17: If set, disallow nesting on domains
+ *                                         with read-only mapping.
+ *                                         https://www.intel.com/content/www/us/en/content-details/772415/content-detai...
+ */
+enum iommu_hw_info_vtd_flags {
+	IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 = 1 << 0,
+};
+
 /**
  * struct iommu_hw_info_vtd - Intel VT-d hardware information
  *
- * @flags: Must be 0
+ * @flags: Combination of enum iommu_hw_info_vtd_flags
 * @__reserved: Must be 0
 *
 * @cap_reg: Value of Intel VT-d capability register defined in VT-d spec
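To illustrate how userspace might consume the new flag, here is a hypothetical VMM-side policy check: if the errata flag is reported, either refuse to enable nesting or ensure no read-only stage-2 mapping exists, matching the restriction enforced by the driver above. The function name, parameters and policy shape are assumptions for illustration, not an existing QEMU or kernel interface.

/* Hypothetical VMM policy: RO stage-2 mappings and nesting are mutually
 * exclusive on hardware that reports ERRATA_772415_SPR17.
 */
#include <stdbool.h>

static bool nesting_allowed(const struct iommu_hw_info_vtd *vtd,
			    bool has_ro_stage2_mappings)
{
	if (!(vtd->flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17))
		return true;			/* unaffected hardware */
	return !has_ro_stage2_mappings;		/* errata: no RO S2 if nesting */
}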
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:14 PM
@@ -2147,6 +2148,18 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 	if ((prot & (DMA_PTE_READ|DMA_PTE_WRITE)) == 0)
 		return -EINVAL;

+	if (!(prot & DMA_PTE_WRITE) && !domain->read_only_mapped) {
+		spin_lock_irqsave(&domain->lock, flags);
+		if (domain->set_nested) {
+			pr_err_ratelimited("No read-only mapping permitted\n");
"Read-only mapping is disallowed on the domain which serves as the parent in a nested configuration, due to HW errata (ERRATA_772415_SPR17)"
+	u8 read_only_mapped:1;		/* domain has mappings with read-only
+					 * permission.
+					 */
+	u8 set_nested:1;		/* has other domains nested on it */
what about "is_parent"?
+	spin_lock_irqsave(&s2_dmar_domain->lock, flags);
+	if (s2_dmar_domain->read_only_mapped) {
+		spin_unlock_irqrestore(&s2_dmar_domain->lock, flags);
+		pr_err_ratelimited("S2 domain has read-only mappings\n");
"Nested configuration is disallowed when the stage-2 domain already has read-only mappings, due to HW errata (ERRATA_772415_SPR17)"
On 2023/8/2 15:59, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:14 PM
@@ -2147,6 +2148,18 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 	if ((prot & (DMA_PTE_READ|DMA_PTE_WRITE)) == 0)
 		return -EINVAL;

+	if (!(prot & DMA_PTE_WRITE) && !domain->read_only_mapped) {
+		spin_lock_irqsave(&domain->lock, flags);
+		if (domain->set_nested) {
+			pr_err_ratelimited("No read-only mapping permitted\n");
"Read-only mapping is disallowed on the domain which serves as the parent in a nested configuration, due to HW errata (ERRATA_772415_SPR17)"
Ack.
+	u8 read_only_mapped:1;		/* domain has mappings with read-only
+					 * permission.
+					 */
+	u8 set_nested:1;		/* has other domains nested on it */
what about "is_parent"?
"is_parent" is still a bit generic. How about "is_nested_parent"?
+	spin_lock_irqsave(&s2_dmar_domain->lock, flags);
+	if (s2_dmar_domain->read_only_mapped) {
+		spin_unlock_irqrestore(&s2_dmar_domain->lock, flags);
+		pr_err_ratelimited("S2 domain has read-only mappings\n");
"Nested configuration is disallowed when the stage-2 domain already has read-only mappings, due to HW errata (ERRATA_772415_SPR17)"
Ack.
Best regards, baolu
From: Baolu Lu baolu.lu@linux.intel.com Sent: Thursday, August 3, 2023 11:27 AM
On 2023/8/2 15:59, Tian, Kevin wrote:
From: Liu, Yi L yi.l.liu@intel.com Sent: Monday, July 24, 2023 7:14 PM
+	u8 read_only_mapped:1;		/* domain has mappings with read-only
+					 * permission.
+					 */
+	u8 set_nested:1;		/* has other domains nested on it */
what about "is_parent"?
"is_parent" is still a bit generic. How about "is_nested_parent"?
looks good.