On 9/3/2025 11:16 PM, Jason Gunthorpe wrote:
This is used by x86 CPUs and can be used in AMD/VT-D x86 IOMMUs. When an x86 IOMMU is running SVA it is using this page table format.
This implementation follows the AMD v2 io-pgtable version.
There is nothing remarkable here: the format can have 4 or 5 levels and has limited support for different page sizes. There is no contiguous page support.
x86 uses a sign extension mechanism where the top bits of the VA must match the sign bit. The core code supports this through PT_FEAT_SIGN_EXTEND, which creates an upper and a lower VA range. All the new operations will work correctly in both spaces; however, there is currently no way to report the upper space to other layers. Future patches can improve that.
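As an illustration only (not code from this patch): for the 4-level format the sign rule means bits 63:47 of a valid VA must all equal bit 47, so the lower space is 0 - 0x00007fffffffffff and the upper space is 0xffff800000000000 - 0xffffffffffffffff; with 5 levels the same rule applies at bit 56. A minimal sketch of such a canonical check, using a made-up helper name:

	/*
	 * Illustrative sketch, not from this patch: true if @va is canonical
	 * for a sign-extended format with @va_bits of usable VA
	 * (48 for 4-level, 57 for 5-level).
	 */
	static inline bool x86_va_is_canonical(u64 va, unsigned int va_bits)
	{
		/* Sign-extend from bit (va_bits - 1) and compare */
		return (u64)((s64)(va << (64 - va_bits)) >> (64 - va_bits)) == va;
	}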
In principle this can support 3 page table levels matching the 32 bit PAE table format, but no iommu driver needs this. The focus is on the modern 64 bit 4 and 5 level formats.
Comparing the performance of several operations to the existing version:
iommu_map()
      pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
      2^12,    71,61      ,    66,58       ,  -13.13
      2^21,    66,60      ,    61,55       ,  -10.10
      2^30,    59,56      ,    56,54       ,   -3.03
  256*2^12,   392,1360    ,   345,1289     ,   73.73
  256*2^21,   383,1159    ,   335,1145     ,   70.70
  256*2^30,   378,965     ,   331,892      ,   62.62
iommu_unmap()
      pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
      2^12,    77,71      ,    73,68       ,   -7.07
      2^21,    76,70      ,    70,66       ,   -6.06
      2^30,    69,66      ,    66,63       ,   -4.04
  256*2^12,   225,899     ,   210,870      ,   75.75
  256*2^21,   262,722     ,   248,710      ,   65.65
  256*2^30,   251,643     ,   244,634      ,   61.61
The small -ve values in the iommu_unmap() results are due to the core code calling iommu_pgsize() before invoking the domain op. This is unnecessary with this implementation. Future work optimizes this and gets to 2%, 4%, 3%.
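For reference, the core loop in question looks roughly like this (paraphrased from __iommu_unmap() in drivers/iommu/iommu.c, details vary by kernel version); each iteration pays for an iommu_pgsize() call to split the range before the domain op sees it:

	while (unmapped < size) {
		size_t pgsize, count;

		pgsize = iommu_pgsize(domain, iova, iova, size - unmapped,
				      &count);
		unmapped_page = ops->unmap_pages(domain, iova, pgsize, count,
						 iotlb_gather);
		if (!unmapped_page)
			break;

		iova += unmapped_page;
		unmapped += unmapped_page;
	}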
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
.../...
diff --git a/drivers/iommu/generic_pt/fmt/x86_64.h b/drivers/iommu/generic_pt/fmt/x86_64.h
new file mode 100644
index 00000000000000..c01815b6229cce
--- /dev/null
+++ b/drivers/iommu/generic_pt/fmt/x86_64.h
@@ -0,0 +1,248 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES
+ *
+ * x86 page table. Supports the 4 and 5 level variations.
+ *
+ * The 4 and 5 level version is described in:
+ *   Section "4.4 4-Level Paging and 5-Level Paging" of the Intel Software
+ *   Developer's Manual Volume 3
+ *
+ *   Section "9.7 First-Stage Paging Entries" of the "Intel Virtualization
+ *   Technology for Directed I/O Architecture Specification"
+ *
+ *   Section "2.2.6 I/O Page Tables for Guest Translations" of the "AMD I/O
+ *   Virtualization Technology (IOMMU) Specification"
+ *
+ * It is used by x86 CPUs, AMD and VT-D IOMMU HW.
+ *
+ * Note the 3 level format is very similar and almost implemented here. The
+ * reserved/ignored layout is different and there are functional bit
+ * differences.
+ *
+ * This format uses PT_FEAT_SIGN_EXTEND to have a upper/non-canonical/lower
+ * split. PT_FEAT_SIGN_EXTEND is optional as AMD IOMMU sometimes uses non-sign
+ * extended addressing with this page table format.
+ *
+ * The named levels in the spec map to the pts->level as:
+ *   Table/PTE - 0
+ *   Directory/PDE - 1
+ *   Directory Ptr/PDPTE - 2
+ *   PML4/PML4E - 3
+ *   PML5/PML5E - 4
+ */
+
+#ifndef __GENERIC_PT_FMT_X86_64_H
+#define __GENERIC_PT_FMT_X86_64_H
+
+#include "defs_x86_64.h"
+#include "../pt_defs.h"
+
+#include <linux/bitfield.h>
+#include <linux/container_of.h>
+#include <linux/log2.h>
+#include <linux/mem_encrypt.h>
+
+enum {
+	PT_MAX_OUTPUT_ADDRESS_LG2 = 52,
+	PT_MAX_VA_ADDRESS_LG2 = 57,
+	PT_ITEM_WORD_SIZE = sizeof(u64),
+	PT_MAX_TOP_LEVEL = 4,
+	PT_GRANULE_LG2SZ = 12,
+	PT_TABLEMEM_LG2SZ = 12,
+	/*
+	 * For AMD the GCR3 Base only has these bits. For VT-D FSPTPTR is 4k
+	 * aligned and is limited by the architected HAW
+	 */
+	PT_TOP_PHYS_MASK = GENMASK_ULL(51, 12),
+};
+
+/* Shared descriptor bits */
+enum {
+	X86_64_FMT_P = BIT(0),
+	X86_64_FMT_RW = BIT(1),
+	X86_64_FMT_U = BIT(2),
+	X86_64_FMT_A = BIT(5),
+	X86_64_FMT_D = BIT(6),
+	X86_64_FMT_OA = GENMASK_ULL(51, 12),
+	X86_64_FMT_XD = BIT_ULL(63),
+};
+
+/* PDPTE/PDE */
+enum {
+	X86_64_FMT_PS = BIT(7),
+};
+
+#define common_to_x86_64_pt(common_ptr) \
+	container_of_const(common_ptr, struct pt_x86_64, common)
+#define to_x86_64_pt(pts) common_to_x86_64_pt((pts)->range->common)
+
+static inline pt_oaddr_t x86_64_pt_table_pa(const struct pt_state *pts)
+{
+	return oalog2_mul(FIELD_GET(X86_64_FMT_OA, pts->entry),
+			  PT_TABLEMEM_LG2SZ);
Same as amdv1.h, we need __sme_clr() here.
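i.e. something like the below (untested, just to show the idea, mirroring the amdv1.h pattern):

	return __sme_clr(oalog2_mul(FIELD_GET(X86_64_FMT_OA, pts->entry),
				    PT_TABLEMEM_LG2SZ));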
+}
+#define pt_table_pa x86_64_pt_table_pa
+
+static inline pt_oaddr_t x86_64_pt_entry_oa(const struct pt_state *pts)
+{
+	return oalog2_mul(FIELD_GET(X86_64_FMT_OA, pts->entry),
+			  PT_GRANULE_LG2SZ);
Also here.
-Vasant