Commit fcc8e46f768f ("vdso/gettimeofday: Return bool from clock_gettime()
helpers") changed the return value from clock_gettime() helpers, but it
missed updating the one call to the do_hres() function, what breaks VDSO
operation on some of my ARM 32bit based test boards. Fix this.
Fixes: fcc8e46f768f ("vdso/gettimeofday: Return bool from clock_gettime() helpers")
Signed-off-by: Marek Szyprowski <m.szyprowski(a)samsung.com>
---
lib/vdso/gettimeofday.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index d6743ed756a1..97aa9059a5c9 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -395,7 +395,7 @@ __cvdso_gettimeofday_data(const struct vdso_time_data *vd,
if (likely(tv != NULL)) {
struct __kernel_timespec ts;
- if (do_hres(vd, &vc[CS_HRES_COARSE], CLOCK_REALTIME, &ts))
+ if (!do_hres(vd, &vc[CS_HRES_COARSE], CLOCK_REALTIME, &ts))
return gettimeofday_fallback(tv, tz);
tv->tv_sec = ts.tv_sec;
--
2.34.1
The vIOMMU object is designed to represent a slice of an IOMMU HW for its
virtualization features shared with or passed to user space (a VM mostly)
in a way of HW acceleration. This extended the HWPT-based design for more
advanced virtualization feature.
HW QUEUE introduced by this series as a part of the vIOMMU infrastructure
represents a HW accelerated queue/buffer for VM to use exclusively, e.g.
- NVIDIA's Virtual Command Queue
- AMD vIOMMU's Command Buffer, Event Log Buffer, and PPR Log Buffer
each of which allows its IOMMU HW to directly access a queue memory owned
by a guest VM and allows a guest OS to control the HW queue direclty, to
avoid VM Exit overheads to improve the performance.
Introduce IOMMUFD_OBJ_HW_QUEUE and its pairing IOMMUFD_CMD_HW_QUEUE_ALLOC
allowing VMM to forward the IOMMU-specific queue info, such as queue base
address, size, and etc.
Meanwhile, a guest-owned queue needs the guest kernel to control the queue
by reading/writing its consumer and producer indexes, via MMIO acceses to
the hardware MMIO registers. Introduce an mmap infrastructure for iommufd
to support passing through a piece of MMIO region from the host physical
address space to the guest physical address space. The mmap info (offset/
length) used by an mmap syscall must be pre-allocated and returned to the
user space via an output driver-data during an IOMMUFD_CMD_HW_QUEUE_ALLOC
call. Thus, it requires a driver-specific user data support in the vIOMMU
allocation flow.
As a real-world use case, this series implements a HW QUEUE support in the
tegra241-cmdqv driver for VCMDQs on NVIDIA Grace CPU. In another word, it
is also the Tegra CMDQV series Part-2 (user-space support), reworked from
Previous RFCv1:
https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/
This enables the HW accelerated feature for NVIDIA Grace CPU. Compared to
the standard SMMUv3 operating in the nested translation mode trapping CMDQ
for TLBI and ATC_INV commands, this gives a huge performance improvement:
70% to 90% reductions of invalidation time were measured by various DMA
unmap tests running in a guest OS.
// Unmap latencies from "dma_map_benchmark -g @granule -t @threads",
// by toggling "/sys/kernel/debug/iommu/tegra241_cmdqv/bypass_vcmdq"
@granule | @threads | bypass_vcmdq=1 | bypass_vcmdq=0
4KB 1 35.7 us 5.3 us
16KB 1 41.8 us 6.8 us
64KB 1 68.9 us 9.9 us
128KB 1 109.0 us 12.6 us
256KB 1 187.1 us 18.0 us
4KB 2 96.9 us 6.8 us
16KB 2 97.8 us 7.5 us
64KB 2 151.5 us 10.7 us
128KB 2 257.8 us 12.7 us
256KB 2 443.0 us 17.9 us
This is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_hw_queue-v8
Paring QEMU branch for testing:
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_hw_queue-v8
Changelog (attached git-diff v7..v8 at the end of this letter)
v8
* Add Reviewed-by from Pranj, Kevin and Jason
* Improve kdoc and comments
* [iommufd] Skip selftest for no_viommu variants
* [iommufd] Add unmap coverage for non internal area
* [iommufd] Skip the first page when mtree_alloc_range()
* [iommufd] Correct the passed in index to mtree_erase()
* [iommufd] Correct variable types in iommufd_hw_queue_alloc_phys()
* [iommufd] Reject iopt_unmap_iova_range() if area->num_locked is set
* [tegra] Rename "SID replacement" with "SID mapping"
* [tegra] Unwrap useless _tegra241_vcmdq_hw_init helper
v7
https://lore.kernel.org/all/cover.1750966133.git.nicolinc@nvidia.com/
* Rebased on Jason's for-next tree (iommufd_hw_queue-prep series)
* Add Reviewed-by from Baolu, Jason, Pranjal
* Update kdocs and notes
* [iommu] Replace "u32" with "enum iommu_hw_info_type"
* [iommufd] Rename vdev->id to vdev->virt_id
* [iommufd] Replace macros with inline helpers
* [iommufd] Report unmapped_bytes in error path
* [iommufd] Add iommufd_access_is_internal helper
* [iommufd] Do not drop ops->unmap check for mdevs
* [iommufd] Store physical addresses in immap structure
* [iommufd] Reorder access and hw_queue object allocations
* [iommufd] Scan for an internal access before any unmap call
* [iommufd] Drop unused ictx pointer in struct iommufd_hw_queue
* [iommufd] Use kcalloc to avoid failure due to memory fragmentation
* [tegra] Use "else"
* [tegra] Lock destroy() using lvcmdq_mutex
v6
https://lore.kernel.org/all/cover.1749884998.git.nicolinc@nvidia.com/
* Rebase on iommufd_hw_queue-prep-v2
* Add Reviewed-by from Kevin and Jason
* [iommufd] Update kdocs and notes
* [iommufd] Drop redundant pages[i] check
* [iommufd] Allow nesting_parent_iova to be 0
* [iommufd] Add iommufd_hw_queue_alloc_phys()
* [iommufd] Revise iommufd_viommu_alloc/destroy_mmap APIs
* [iommufd] Move destroy ops to vdevice/hw_queue structures
* [iommufd] Add union in hw_info struct to share out_data_type field
* [iommufd] Replace iopt_pin/unpin_pages() with internal access APIs
* [iommufd] Replace vdevice_alloc with vdevice_size and vdevice_init
* [iommufd] Replace hw_queue_alloc with get_hw_queue_size/hw_queue_init
* [iommufd] Replace IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA with init_phys
* [smmu] Drop arm_smmu_domain_ipa_to_pa
* [smmu] Update arm_smmu_impl_ops changes for vsmmu_init
* [tegra] Add a vdev_to_vsid macro
* [tegra] Add lvcmdq_mutex to protect multi queues
* [tegra] Drop duplicated kcalloc for vintf->lvcmdqs (memory leak)
v5
https://lore.kernel.org/all/cover.1747537752.git.nicolinc@nvidia.com/
* Rebase on v6.15-rc6
* Add Reviewed-by from Jason and Kevin
* Correct typos in kdoc and update commit logs
* [iommufd] Add a cosmetic fix
* [iommufd] Drop unused num_pfns
* [iommufd] Drop unnecessary check
* [iommufd] Reorder patch sequence
* [iommufd] Use io_remap_pfn_range()
* [iommufd] Use success oriented flow
* [iommufd] Fix max_npages calculation
* [iommufd] Add more selftest coverage
* [iommufd] Drop redundant static_assert
* [iommufd] Fix mmap pfn range validation
* [iommufd] Reject unmap on pinned iovas
* [iommufd] Drop redundant vm_flags_set()
* [iommufd] Drop iommufd_struct_destroy()
* [iommufd] Drop redundant queue iova test
* [iommufd] Use "mmio_addr" and "mmio_pfn"
* [iommufd] Rename to "nesting_parent_iova"
* [iommufd] Make iopt_pin_pages call option
* [iommufd] Add ictx comparison in depend()
* [iommufd] Add iommufd_object_alloc_ucmd()
* [iommufd] Move kcalloc() after validations
* [iommufd] Replace ictx setting with WARN_ON
* [iommufd] Make hw_info's type bidirectional
* [smmu] Add supported_vsmmu_type in impl_ops
* [smmu] Drop impl report in smmu vendor struct
* [tegra] Add IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV
* [tegra] Replace "number of VINTFs" with a note
* [tegra] Drop the redundant lvcmdq pointer setting
* [tegra] Flag IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA
* [tegra] Use "vintf_alloc_vsid" for vdevice_alloc op
v4
https://lore.kernel.org/all/cover.1746757630.git.nicolinc@nvidia.com/
* Rebase on v6.15-rc5
* Add Reviewed-by from Vasant
* Rename "vQUEUE" to "HW QUEUE"
* Use "offset" and "length" for all mmap-related variables
* [iommufd] Use u64 for guest PA
* [iommufd] Fix typo in uAPI doc
* [iommufd] Rename immap_id to offset
* [iommufd] Drop the partial-size mmap support
* [iommufd] Do not replace WARN_ON with WARN_ON_ONCE
* [iommufd] Use "u64 base_addr" for queue base address
* [iommufd] Use u64 base_pfn/num_pfns for immap structure
* [iommufd] Correct the size passed in to mtree_alloc_range()
* [iommufd] Add IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA to viommu_ops
v3
https://lore.kernel.org/all/cover.1746139811.git.nicolinc@nvidia.com/
* Add Reviewed-by from Baolu, Pranjal, and Alok
* Revise kdocs, uAPI docs, and commit logs
* Rename "vCMDQ" back to "vQUEUE" for AMD cases
* [tegra] Add tegra241_vcmdq_hw_flush_timeout()
* [tegra] Rename vsmmu_alloc to alloc_vintf_user
* [tegra] Use writel for SID replacement registers
* [tegra] Move mmap removal call to vsmmu_destroy op
* [tegra] Fix revert in tegra241_vintf_alloc_lvcmdq_user()
* [iommufd] Replace "& ~PAGE_MASK" with PAGE_ALIGNED()
* [iommufd] Add an object-type "owner" to immap structure
* [iommufd] Drop the ictx input in the new for-driver APIs
* [iommufd] Add iommufd_vma_ops to keep track of mmap lifecycle
* [iommufd] Add viommu-based iommufd_viommu_alloc/destroy_mmap helpers
* [iommufd] Rename iommufd_ctx_alloc/free_mmap to
_iommufd_alloc/destroy_mmap
v2
https://lore.kernel.org/all/cover.1745646960.git.nicolinc@nvidia.com/
* Add Reviewed-by from Jason
* [smmu] Fix vsmmu initial value
* [smmu] Support impl for hw_info
* [tegra] Rename "slot" to "vsid"
* [tegra] Update kdocs and commit logs
* [tegra] Map/unmap LVCMDQ dynamically
* [tegra] Refcount the previous LVCMDQ
* [tegra] Return -EEXIST if LVCMDQ exists
* [tegra] Simplify VINTF cleanup routine
* [tegra] Use vmid and s2_domain in vsmmu
* [tegra] Rename "mmap_pgoff" to "immap_id"
* [tegra] Add more addr and length validation
* [iommufd] Add more narrative to mmap's kdoc
* [iommufd] Add iommufd_struct_depend/undepend()
* [iommufd] Rename vcmdq_free op to vcmdq_destroy
* [iommufd] Fix bug in iommu_copy_struct_to_user()
* [iommufd] Drop is_io from iommufd_ctx_alloc_mmap()
* [iommufd] Test the queue memory for its contiguity
* [iommufd] Return -ENXIO if address or length fails
* [iommufd] Do not change @min_last in mock_viommu_alloc()
* [iommufd] Generalize TEGRA241_VCMDQ data in core structure
* [iommufd] Add selftest coverage for IOMMUFD_CMD_VCMDQ_ALLOC
* [iommufd] Add iopt_pin_pages() to prevent queue memory from unmapping
v1
https://lore.kernel.org/all/cover.1744353300.git.nicolinc@nvidia.com/
Thanks
Nicolin
Nicolin Chen (29):
iommufd: Report unmapped bytes in the error path of
iopt_unmap_iova_range
iommufd: Correct virt_id kdoc at struct iommu_vdevice_alloc
iommufd/viommu: Explicitly define vdev->virt_id
iommu: Use enum iommu_hw_info_type for type in hw_info op
iommu: Add iommu_copy_struct_to_user helper
iommu: Pass in a driver-level user data structure to viommu_init op
iommufd/viommu: Allow driver-specific user data for a vIOMMU object
iommufd/selftest: Support user_data in mock_viommu_alloc
iommufd/selftest: Add coverage for viommu data
iommufd/access: Add internal APIs for HW queue to use
iommufd/access: Bypass access->ops->unmap for internal use
iommufd/viommu: Add driver-defined vDEVICE support
iommufd/viommu: Introduce IOMMUFD_OBJ_HW_QUEUE and its related struct
iommufd/viommu: Add IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl
iommufd/driver: Add iommufd_hw_queue_depend/undepend() helpers
iommufd/selftest: Add coverage for IOMMUFD_CMD_HW_QUEUE_ALLOC
iommufd: Add mmap interface
iommufd/selftest: Add coverage for the new mmap interface
Documentation: userspace-api: iommufd: Update HW QUEUE
iommu: Allow an input type in hw_info op
iommufd: Allow an input data_type via iommu_hw_info
iommufd/selftest: Update hw_info coverage for an input data_type
iommu/arm-smmu-v3-iommufd: Add vsmmu_size/type and vsmmu_init impl ops
iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops
iommu/tegra241-cmdqv: Use request_threaded_irq
iommu/tegra241-cmdqv: Simplify deinit flow in
tegra241_cmdqv_remove_vintf()
iommu/tegra241-cmdqv: Do not statically map LVCMDQs
iommu/tegra241-cmdqv: Add user-space use support
iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 22 +-
drivers/iommu/iommufd/io_pagetable.h | 5 +-
drivers/iommu/iommufd/iommufd_private.h | 46 +-
drivers/iommu/iommufd/iommufd_test.h | 20 +
include/linux/iommu.h | 50 +-
include/linux/iommufd.h | 160 ++++++
include/uapi/linux/iommufd.h | 147 +++++-
tools/testing/selftests/iommu/iommufd_utils.h | 89 +++-
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 28 +-
.../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 477 +++++++++++++++++-
drivers/iommu/intel/iommu.c | 7 +-
drivers/iommu/iommufd/device.c | 87 +++-
drivers/iommu/iommufd/driver.c | 82 ++-
drivers/iommu/iommufd/io_pagetable.c | 13 +-
drivers/iommu/iommufd/main.c | 69 +++
drivers/iommu/iommufd/pages.c | 12 +-
drivers/iommu/iommufd/selftest.c | 153 +++++-
drivers/iommu/iommufd/viommu.c | 215 +++++++-
tools/testing/selftests/iommu/iommufd.c | 140 ++++-
.../selftests/iommu/iommufd_fail_nth.c | 15 +-
Documentation/userspace-api/iommufd.rst | 12 +
21 files changed, 1739 insertions(+), 110 deletions(-)
--
2.43.0
diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
index d57a3bea948c..d5d43a1c7708 100644
--- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
+++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
@@ -161,7 +161,7 @@ struct tegra241_vcmdq {
* @lvcmdq_mutex: Lock to serialize user-allocated lvcmdqs
* @base: MMIO base address
* @mmap_offset: Offset argument for mmap() syscall
- * @sids: Stream ID replacement resources
+ * @sids: Stream ID mapping resources
*/
struct tegra241_vintf {
struct arm_vsmmu vsmmu;
@@ -183,11 +183,11 @@ struct tegra241_vintf {
#define viommu_to_vintf(v) container_of(v, struct tegra241_vintf, vsmmu.core)
/**
- * struct tegra241_vintf_sid - Virtual Interface Stream ID Replacement
+ * struct tegra241_vintf_sid - Virtual Interface Stream ID Mapping
* @core: Embedded iommufd_vdevice structure, holding virtual Stream ID
* @vintf: Parent VINTF pointer
* @sid: Physical Stream ID
- * @idx: Replacement index in the VINTF
+ * @idx: Mapping index in the VINTF
*/
struct tegra241_vintf_sid {
struct iommufd_vdevice core;
@@ -207,7 +207,7 @@ struct tegra241_vintf_sid {
* @num_vintfs: Total number of VINTFs
* @num_vcmdqs: Total number of VCMDQs
* @num_lvcmdqs_per_vintf: Number of logical VCMDQs per VINTF
- * @num_sids_per_vintf: Total number of SID replacements per VINTF
+ * @num_sids_per_vintf: Total number of SID mappings per VINTF
* @vintf_ids: VINTF id allocator
* @vintfs: List of VINTFs
*/
@@ -470,12 +470,6 @@ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq)
dev_dbg(vcmdq->cmdqv->dev, "%sdeinited\n", h);
}
-/* This function is for LVCMDQ, so @vcmdq must be mapped prior */
-static void _tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq)
-{
- writeq_relaxed(vcmdq->cmdq.q.q_base, REG_VCMDQ_PAGE1(vcmdq, BASE));
-}
-
/* This function is for LVCMDQ, so @vcmdq must be mapped prior */
static int tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq)
{
@@ -486,7 +480,7 @@ static int tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq)
tegra241_vcmdq_hw_deinit(vcmdq);
/* Configure and enable VCMDQ */
- _tegra241_vcmdq_hw_init(vcmdq);
+ writeq_relaxed(vcmdq->cmdq.q.q_base, REG_VCMDQ_PAGE1(vcmdq, BASE));
ret = vcmdq_write_config(vcmdq, VCMDQ_EN);
if (ret) {
@@ -1077,7 +1071,7 @@ static int tegra241_vcmdq_hw_init_user(struct tegra241_vcmdq *vcmdq)
char header[64];
/* Configure the vcmdq only; User space does the enabling */
- _tegra241_vcmdq_hw_init(vcmdq);
+ writeq_relaxed(vcmdq->cmdq.q.q_base, REG_VCMDQ_PAGE1(vcmdq, BASE));
dev_dbg(vcmdq->cmdqv->dev, "%sinited at host PA 0x%llx size 0x%lx\n",
lvcmdq_error_header(vcmdq, header, 64),
@@ -1259,6 +1253,7 @@ static int tegra241_vintf_init_vsid(struct iommufd_vdevice *vdev)
static struct iommufd_viommu_ops tegra241_cmdqv_viommu_ops = {
.destroy = tegra241_cmdqv_destroy_vintf_user,
.alloc_domain_nested = arm_vsmmu_alloc_domain_nested,
+ /* Non-accelerated commands will be still handled by the kernel */
.cache_invalidate = arm_vsmmu_cache_invalidate,
.vdevice_size = VDEVICE_STRUCT_SIZE(struct tegra241_vintf_sid, core),
.vdevice_init = tegra241_vintf_init_vsid,
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index cbd86aabdd1c..e2ba21c43ad2 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -1245,26 +1245,18 @@ EXPORT_SYMBOL_NS_GPL(iommufd_access_replace, "IOMMUFD");
* run in the future. Due to this a driver must not create locking that prevents
* unmap to complete while iommufd_access_destroy() is running.
*/
-int iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova,
- unsigned long length)
+void iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova,
+ unsigned long length)
{
struct iommufd_ioas *ioas =
container_of(iopt, struct iommufd_ioas, iopt);
struct iommufd_access *access;
unsigned long index;
- int ret = 0;
xa_lock(&ioas->iopt.access_list);
- /* Bypass any unmap if there is an internal access */
xa_for_each(&ioas->iopt.access_list, index, access) {
- if (iommufd_access_is_internal(access)) {
- ret = -EBUSY;
- goto unlock;
- }
- }
-
- xa_for_each(&ioas->iopt.access_list, index, access) {
- if (!iommufd_lock_obj(&access->obj))
+ if (!iommufd_lock_obj(&access->obj) ||
+ iommufd_access_is_internal(access))
continue;
xa_unlock(&ioas->iopt.access_list);
@@ -1273,9 +1265,7 @@ int iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova,
iommufd_put_object(access->ictx, &access->obj);
xa_lock(&ioas->iopt.access_list);
}
-unlock:
xa_unlock(&ioas->iopt.access_list);
- return ret;
}
/**
@@ -1290,6 +1280,7 @@ int iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova,
void iommufd_access_unpin_pages(struct iommufd_access *access,
unsigned long iova, unsigned long length)
{
+ bool internal = iommufd_access_is_internal(access);
struct iopt_area_contig_iter iter;
struct io_pagetable *iopt;
unsigned long last_iova;
@@ -1316,7 +1307,8 @@ void iommufd_access_unpin_pages(struct iommufd_access *access,
area, iopt_area_iova_to_index(area, iter.cur_iova),
iopt_area_iova_to_index(
area,
- min(last_iova, iopt_area_last_iova(area))));
+ min(last_iova, iopt_area_last_iova(area))),
+ internal);
WARN_ON(!iopt_area_contig_done(&iter));
up_read(&iopt->iova_rwsem);
mutex_unlock(&access->ioas_lock);
@@ -1365,6 +1357,7 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova,
unsigned long length, struct page **out_pages,
unsigned int flags)
{
+ bool internal = iommufd_access_is_internal(access);
struct iopt_area_contig_iter iter;
struct io_pagetable *iopt;
unsigned long last_iova;
@@ -1374,8 +1367,7 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova,
/* Driver's ops don't support pin_pages */
if (IS_ENABLED(CONFIG_IOMMUFD_TEST) &&
WARN_ON(access->iova_alignment != PAGE_SIZE ||
- (!iommufd_access_is_internal(access) &&
- !access->ops->unmap)))
+ (!internal && !access->ops->unmap)))
return -EINVAL;
if (!length)
@@ -1409,7 +1401,7 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova,
}
rc = iopt_area_add_access(area, index, last_index, out_pages,
- flags);
+ flags, internal);
if (rc)
goto err_remove;
out_pages += last_index - index + 1;
@@ -1432,7 +1424,8 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova,
iopt_area_iova_to_index(area, iter.cur_iova),
iopt_area_iova_to_index(
area, min(last_iova,
- iopt_area_last_iova(area))));
+ iopt_area_last_iova(area))),
+ internal);
}
up_read(&iopt->iova_rwsem);
mutex_unlock(&access->ioas_lock);
diff --git a/drivers/iommu/iommufd/driver.c b/drivers/iommu/iommufd/driver.c
index 2c9af93217f1..153e4720ee18 100644
--- a/drivers/iommu/iommufd/driver.c
+++ b/drivers/iommu/iommufd/driver.c
@@ -56,8 +56,9 @@ int _iommufd_alloc_mmap(struct iommufd_ctx *ictx, struct iommufd_object *owner,
immap->length = length;
immap->mmio_addr = mmio_addr;
- rc = mtree_alloc_range(&ictx->mt_mmap, &startp, immap, immap->length, 0,
- PHYS_ADDR_MAX, GFP_KERNEL);
+ /* Skip the first page to ease caller identifying the returned offset */
+ rc = mtree_alloc_range(&ictx->mt_mmap, &startp, immap, immap->length,
+ PAGE_SIZE, PHYS_ADDR_MAX, GFP_KERNEL);
if (rc < 0) {
kfree(immap);
return rc;
@@ -76,7 +77,7 @@ void _iommufd_destroy_mmap(struct iommufd_ctx *ictx,
{
struct iommufd_mmap *immap;
- immap = mtree_erase(&ictx->mt_mmap, offset >> PAGE_SHIFT);
+ immap = mtree_erase(&ictx->mt_mmap, offset);
WARN_ON_ONCE(!immap || immap->owner != owner);
kfree(immap);
}
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 6b8477b1f94b..abf4aadca96c 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -719,6 +719,12 @@ static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
goto out_unlock_iova;
}
+ /* The area is locked by an object that has not been destroyed */
+ if (area->num_locks) {
+ rc = -EBUSY;
+ goto out_unlock_iova;
+ }
+
if (area_first < start || area_last > last) {
rc = -ENOENT;
goto out_unlock_iova;
@@ -740,15 +746,7 @@ static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
up_write(&iopt->iova_rwsem);
up_read(&iopt->domains_rwsem);
- rc = iommufd_access_notify_unmap(iopt, area_first,
- length);
- if (rc) {
- down_read(&iopt->domains_rwsem);
- down_write(&iopt->iova_rwsem);
- area->prevent_access = false;
- goto out_unlock_iova;
- }
-
+ iommufd_access_notify_unmap(iopt, area_first, length);
/* Something is not responding to unmap requests. */
tries++;
if (WARN_ON(tries > 100)) {
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index c115a51d9384..b6064f4ce4af 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -48,6 +48,7 @@ struct iopt_area {
int iommu_prot;
bool prevent_access : 1;
unsigned int num_accesses;
+ unsigned int num_locks;
};
struct iopt_allowed {
@@ -238,9 +239,9 @@ void iopt_pages_unfill_xarray(struct iopt_pages *pages, unsigned long start,
int iopt_area_add_access(struct iopt_area *area, unsigned long start,
unsigned long last, struct page **out_pages,
- unsigned int flags);
+ unsigned int flags, bool lock_area);
void iopt_area_remove_access(struct iopt_area *area, unsigned long start,
- unsigned long last);
+ unsigned long last, bool unlock_area);
int iopt_pages_rw_access(struct iopt_pages *pages, unsigned long start_byte,
void *data, unsigned long length, unsigned int flags);
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index ebac6a4b3538..cd14163abdd1 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -125,8 +125,8 @@ int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
int iopt_set_dirty_tracking(struct io_pagetable *iopt,
struct iommu_domain *domain, bool enable);
-int iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova,
- unsigned long length);
+void iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova,
+ unsigned long length);
int iopt_table_add_domain(struct io_pagetable *iopt,
struct iommu_domain *domain);
void iopt_table_remove_domain(struct io_pagetable *iopt,
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index cbdde642d2af..301c232462bd 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -2111,7 +2111,7 @@ iopt_pages_get_exact_access(struct iopt_pages *pages, unsigned long index,
*/
int iopt_area_add_access(struct iopt_area *area, unsigned long start_index,
unsigned long last_index, struct page **out_pages,
- unsigned int flags)
+ unsigned int flags, bool lock_area)
{
struct iopt_pages *pages = area->pages;
struct iopt_pages_access *access;
@@ -2124,6 +2124,8 @@ int iopt_area_add_access(struct iopt_area *area, unsigned long start_index,
access = iopt_pages_get_exact_access(pages, start_index, last_index);
if (access) {
area->num_accesses++;
+ if (lock_area)
+ area->num_locks++;
access->users++;
iopt_pages_fill_from_xarray(pages, start_index, last_index,
out_pages);
@@ -2145,6 +2147,8 @@ int iopt_area_add_access(struct iopt_area *area, unsigned long start_index,
access->node.last = last_index;
access->users = 1;
area->num_accesses++;
+ if (lock_area)
+ area->num_locks++;
interval_tree_insert(&access->node, &pages->access_itree);
mutex_unlock(&pages->mutex);
return 0;
@@ -2166,7 +2170,7 @@ int iopt_area_add_access(struct iopt_area *area, unsigned long start_index,
* must stop using the PFNs before calling this.
*/
void iopt_area_remove_access(struct iopt_area *area, unsigned long start_index,
- unsigned long last_index)
+ unsigned long last_index, bool unlock_area)
{
struct iopt_pages *pages = area->pages;
struct iopt_pages_access *access;
@@ -2177,6 +2181,10 @@ void iopt_area_remove_access(struct iopt_area *area, unsigned long start_index,
goto out_unlock;
WARN_ON(area->num_accesses == 0 || access->users == 0);
+ if (unlock_area) {
+ WARN_ON(area->num_locks == 0);
+ area->num_locks--;
+ }
area->num_accesses--;
access->users--;
if (access->users)
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c
index ce509a827721..00641204efb2 100644
--- a/drivers/iommu/iommufd/viommu.c
+++ b/drivers/iommu/iommufd/viommu.c
@@ -241,13 +241,20 @@ iommufd_hw_queue_alloc_phys(struct iommu_hw_queue_alloc *cmd,
{
struct iommufd_access *access;
struct page **pages;
- int max_npages, i;
+ size_t max_npages;
+ size_t length;
u64 offset;
+ size_t i;
int rc;
offset =
cmd->nesting_parent_iova - PAGE_ALIGN(cmd->nesting_parent_iova);
- max_npages = DIV_ROUND_UP(offset + cmd->length, PAGE_SIZE);
+ /* DIV_ROUND_UP(offset + cmd->length, PAGE_SIZE) */
+ if (check_add_overflow(offset, cmd->length, &length))
+ return ERR_PTR(-ERANGE);
+ if (check_add_overflow(length, PAGE_SIZE - 1, &length))
+ return ERR_PTR(-ERANGE);
+ max_npages = length / PAGE_SIZE;
/*
* Use kvcalloc() to avoid memory fragmentation for a large page array.
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 7ab9e3e928b3..e3a0cd47384d 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -112,7 +112,7 @@ struct iommufd_vdevice {
/*
* Virtual device ID per vIOMMU, e.g. vSID of ARM SMMUv3, vDeviceID of
- * AMD IOMMU, and vRID of a nested Intel VT-d to a Context Table
+ * AMD IOMMU, and vRID of Intel VT-d
*/
u64 virt_id;
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index a2840beefa8c..111ea81f91a2 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -997,7 +997,7 @@ struct iommu_fault_alloc {
* @IOMMU_VIOMMU_TYPE_DEFAULT: Reserved for future use
* @IOMMU_VIOMMU_TYPE_ARM_SMMUV3: ARM SMMUv3 driver specific type
* @IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM
- * SMMUv3) Virtual Interface (VINTF)
+ * SMMUv3) enabled ARM SMMUv3 type
*/
enum iommu_viommu_type {
IOMMU_VIOMMU_TYPE_DEFAULT = 0,
@@ -1065,7 +1065,7 @@ struct iommu_viommu_alloc {
* @dev_id: The physical device to allocate a virtual instance on the vIOMMU
* @out_vdevice_id: Object handle for the vDevice. Pass to IOMMU_DESTORY
* @virt_id: Virtual device ID per vIOMMU, e.g. vSID of ARM SMMUv3, vDeviceID
- * of AMD IOMMU, and vRID of a nested Intel VT-d to a Context Table
+ * of AMD IOMMU, and vRID of Intel VT-d
*
* Allocate a virtual device instance (for a physical device) against a vIOMMU.
* This instance holds the device's information (related to its vIOMMU) in a VM.
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index fda93a195e26..9d5b852d5e19 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -2817,33 +2817,31 @@ TEST_F(iommufd_viommu, viommu_alloc_with_data)
};
uint32_t *test;
- if (self->device_id) {
- test_cmd_viommu_alloc(self->device_id, self->hwpt_id,
- IOMMU_VIOMMU_TYPE_SELFTEST, &data,
- sizeof(data), &self->viommu_id);
- ASSERT_EQ(data.out_data, data.in_data);
-
- /* Negative mmap tests -- offset and length cannot be changed */
- test_err_mmap(ENXIO, data.out_mmap_length,
- data.out_mmap_offset + PAGE_SIZE);
- test_err_mmap(ENXIO, data.out_mmap_length,
- data.out_mmap_offset + PAGE_SIZE * 2);
- test_err_mmap(ENXIO, data.out_mmap_length / 2,
- data.out_mmap_offset);
- test_err_mmap(ENXIO, data.out_mmap_length * 2,
- data.out_mmap_offset);
-
- /* Now do a correct mmap for a loopback test */
- test = mmap(NULL, data.out_mmap_length, PROT_READ | PROT_WRITE,
- MAP_SHARED, self->fd, data.out_mmap_offset);
- ASSERT_NE(MAP_FAILED, test);
- ASSERT_EQ(data.in_data, *test);
-
- /* The owner of the mmap region should be blocked */
- EXPECT_ERRNO(EBUSY,
- _test_ioctl_destroy(self->fd, self->viommu_id));
- munmap(test, data.out_mmap_length);
- }
+ if (!self->device_id)
+ SKIP(return, "Skipping test for variant no_viommu");
+
+ test_cmd_viommu_alloc(self->device_id, self->hwpt_id,
+ IOMMU_VIOMMU_TYPE_SELFTEST, &data, sizeof(data),
+ &self->viommu_id);
+ ASSERT_EQ(data.out_data, data.in_data);
+
+ /* Negative mmap tests -- offset and length cannot be changed */
+ test_err_mmap(ENXIO, data.out_mmap_length,
+ data.out_mmap_offset + PAGE_SIZE);
+ test_err_mmap(ENXIO, data.out_mmap_length,
+ data.out_mmap_offset + PAGE_SIZE * 2);
+ test_err_mmap(ENXIO, data.out_mmap_length / 2, data.out_mmap_offset);
+ test_err_mmap(ENXIO, data.out_mmap_length * 2, data.out_mmap_offset);
+
+ /* Now do a correct mmap for a loopback test */
+ test = mmap(NULL, data.out_mmap_length, PROT_READ | PROT_WRITE,
+ MAP_SHARED, self->fd, data.out_mmap_offset);
+ ASSERT_NE(MAP_FAILED, test);
+ ASSERT_EQ(data.in_data, *test);
+
+ /* The owner of the mmap region should be blocked */
+ EXPECT_ERRNO(EBUSY, _test_ioctl_destroy(self->fd, self->viommu_id));
+ munmap(test, data.out_mmap_length);
}
TEST_F(iommufd_viommu, vdevice_alloc)
@@ -3071,61 +3069,60 @@ TEST_F(iommufd_viommu, vdevice_cache)
TEST_F(iommufd_viommu, hw_queue)
{
+ __u64 iova = MOCK_APERTURE_START, iova2;
uint32_t viommu_id = self->viommu_id;
- __u64 iova = MOCK_APERTURE_START;
uint32_t hw_queue_id[2];
- if (viommu_id) {
- /* Fail IOMMU_HW_QUEUE_TYPE_DEFAULT */
- test_err_hw_queue_alloc(EOPNOTSUPP, viommu_id,
- IOMMU_HW_QUEUE_TYPE_DEFAULT, 0, iova,
- PAGE_SIZE, &hw_queue_id[0]);
- /* Fail queue addr and length */
- test_err_hw_queue_alloc(EINVAL, viommu_id,
- IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, iova,
- 0, &hw_queue_id[0]);
- test_err_hw_queue_alloc(EOVERFLOW, viommu_id,
- IOMMU_HW_QUEUE_TYPE_SELFTEST, 0,
- ~(uint64_t)0, PAGE_SIZE,
- &hw_queue_id[0]);
- /* Fail missing iova */
- test_err_hw_queue_alloc(ENOENT, viommu_id,
- IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, iova,
- PAGE_SIZE, &hw_queue_id[0]);
-
- /* Map iova */
- test_ioctl_ioas_map(buffer, PAGE_SIZE, &iova);
-
- /* Fail index=1 and =MAX; must start from index=0 */
- test_err_hw_queue_alloc(EIO, viommu_id,
- IOMMU_HW_QUEUE_TYPE_SELFTEST, 1, iova,
- PAGE_SIZE, &hw_queue_id[0]);
- test_err_hw_queue_alloc(EINVAL, viommu_id,
- IOMMU_HW_QUEUE_TYPE_SELFTEST,
- IOMMU_TEST_HW_QUEUE_MAX, iova,
- PAGE_SIZE, &hw_queue_id[0]);
-
- /* Allocate index=0, declare ownership of the iova */
- test_cmd_hw_queue_alloc(viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST,
- 0, iova, PAGE_SIZE, &hw_queue_id[0]);
- /* Fail duplicate */
- test_err_hw_queue_alloc(EEXIST, viommu_id,
- IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, iova,
- PAGE_SIZE, &hw_queue_id[0]);
- /* Fail unmap, due to iova ownership */
- test_err_ioctl_ioas_unmap(EBUSY, iova, PAGE_SIZE);
-
- /* Allocate index=1 */
- test_cmd_hw_queue_alloc(viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST,
- 1, iova, PAGE_SIZE, &hw_queue_id[1]);
- /* Fail to destroy, due to dependency */
- EXPECT_ERRNO(EBUSY,
- _test_ioctl_destroy(self->fd, hw_queue_id[0]));
-
- /* Destroy in descending order */
- test_ioctl_destroy(hw_queue_id[1]);
- test_ioctl_destroy(hw_queue_id[0]);
- }
+ if (!viommu_id)
+ SKIP(return, "Skipping test for variant no_viommu");
+
+ /* Fail IOMMU_HW_QUEUE_TYPE_DEFAULT */
+ test_err_hw_queue_alloc(EOPNOTSUPP, viommu_id,
+ IOMMU_HW_QUEUE_TYPE_DEFAULT, 0, iova, PAGE_SIZE,
+ &hw_queue_id[0]);
+ /* Fail queue addr and length */
+ test_err_hw_queue_alloc(EINVAL, viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST,
+ 0, iova, 0, &hw_queue_id[0]);
+ test_err_hw_queue_alloc(EOVERFLOW, viommu_id,
+ IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, ~(uint64_t)0,
+ PAGE_SIZE, &hw_queue_id[0]);
+ /* Fail missing iova */
+ test_err_hw_queue_alloc(ENOENT, viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST,
+ 0, iova, PAGE_SIZE, &hw_queue_id[0]);
+
+ /* Map iova */
+ test_ioctl_ioas_map(buffer, PAGE_SIZE, &iova);
+ test_ioctl_ioas_map(buffer + PAGE_SIZE, PAGE_SIZE, &iova2);
+
+ /* Fail index=1 and =MAX; must start from index=0 */
+ test_err_hw_queue_alloc(EIO, viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST, 1,
+ iova, PAGE_SIZE, &hw_queue_id[0]);
+ test_err_hw_queue_alloc(EINVAL, viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST,
+ IOMMU_TEST_HW_QUEUE_MAX, iova, PAGE_SIZE,
+ &hw_queue_id[0]);
+
+ /* Allocate index=0, declare ownership of the iova */
+ test_cmd_hw_queue_alloc(viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST, 0,
+ iova, PAGE_SIZE, &hw_queue_id[0]);
+ /* Fail duplicate */
+ test_err_hw_queue_alloc(EEXIST, viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST,
+ 0, iova, PAGE_SIZE, &hw_queue_id[0]);
+ /* Fail unmap, due to iova ownership */
+ test_err_ioctl_ioas_unmap(EBUSY, iova, PAGE_SIZE);
+ /* The 2nd page is not pinned, so it can be unmmap */
+ test_ioctl_ioas_unmap(iova + PAGE_SIZE, PAGE_SIZE);
+
+ /* Allocate index=1 */
+ test_cmd_hw_queue_alloc(viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST, 1,
+ iova, PAGE_SIZE, &hw_queue_id[1]);
+ /* Fail to destroy, due to dependency */
+ EXPECT_ERRNO(EBUSY, _test_ioctl_destroy(self->fd, hw_queue_id[0]));
+
+ /* Destroy in descending order */
+ test_ioctl_destroy(hw_queue_id[1]);
+ test_ioctl_destroy(hw_queue_id[0]);
+ /* Now it can unmap the first page */
+ test_ioctl_ioas_unmap(iova, PAGE_SIZE);
}
FIXTURE(iommufd_device_pasid)
This patchset refactors non-composite global variables into a common
struct that can be initialized and passed around per-test instead of
relying on the presence of global variables.
This allows:
- Better encapsulation
- Debugging becomes easier -- local variable state can be viewed per
stack frame, and we can more easily reason about the variable
mutations
Patch 1 needs to be applied first and can be followed by any of the
other patches.
I've ensured that the tests are passing locally (or atleast have the
same output as the code on master).
Ujwal Kundur (4):
selftests/mm/uffd: Refactor non-composite global vars into struct
selftests/mm/uffd: Swap global vars with global test options
selftests/mm/uffd: Swap global variables with global test opts
selftests/mm/uffd: Swap global variables with global test opts
tools/testing/selftests/mm/uffd-common.c | 269 +++++-----
tools/testing/selftests/mm/uffd-common.h | 78 +--
tools/testing/selftests/mm/uffd-stress.c | 226 ++++----
tools/testing/selftests/mm/uffd-unit-tests.c | 523 ++++++++++---------
tools/testing/selftests/mm/uffd-wp-mremap.c | 23 +-
5 files changed, 591 insertions(+), 528 deletions(-)
--
2.20.1
Fix to use the return value of the function 'chdir("/")' and check if the
return is either 0 (ok) or 1 (not ok, so the test stops).
The patch fies the solves the following errors:
mount-notify_test.c:468:17: warning: ignoring return value of ‘chdir’
declared with attribute ‘warn_unused_result’ [-Wunused-result]
468 | chdir("/");
mount-notify_test_ns.c:489:17: warning: ignoring return value of
‘chdir’ declared with attribute ‘warn_unused_result’ [-Wunused-
result]
489 | chdir("/");
To reproduce the issue, use the command:
make kselftest TARGET=filesystems/statmount
Signed-off-by: Alessandro Zanni <alessandro.zanni87(a)gmail.com>
---
.../selftests/filesystems/mount-notify/mount-notify_test.c | 2 +-
.../selftests/filesystems/mount-notify/mount-notify_test_ns.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
index 5a3b0ace1a88..a7f899599d52 100644
--- a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
+++ b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
@@ -458,7 +458,7 @@ TEST_F(fanotify, rmdir)
ASSERT_GE(ret, 0);
if (ret == 0) {
- chdir("/");
+ ASSERT_EQ(0, chdir("/"));
unshare(CLONE_NEWNS);
mount("", "/", NULL, MS_REC|MS_PRIVATE, NULL);
umount2("/a", MNT_DETACH);
diff --git a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c
index d91946e69591..dc9eb3087a1a 100644
--- a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c
+++ b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c
@@ -486,7 +486,7 @@ TEST_F(fanotify, rmdir)
ASSERT_GE(ret, 0);
if (ret == 0) {
- chdir("/");
+ ASSERT_EQ(0, chdir("/"));
unshare(CLONE_NEWNS);
mount("", "/", NULL, MS_REC|MS_PRIVATE, NULL);
umount2("/a", MNT_DETACH);
--
2.43.0
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the DualPI2 patch v21.
This patch serise adds DualPI Improved with a Square (DualPI2) with following features:
* Supports congestion controls that comply with the Prague requirements in RFC9331 (e.g. TCP-Prague)
* Coupled dual-queue that separates the L4S traffic in a low latency queue (L-queue), without harming remaining traffic that is scheduled in classic queue (C-queue) due to congestion-coupling using PI2 as defined in RFC9332
* Configurable overload strategies
* Use of sojourn time to reliably estimate queue delay
* Supports ECN L4S-identifier (IP.ECN==0b*1) to classify traffic into respective queues
For more details of DualPI2, please refer IETF RFC9332 (https://datatracker.ietf.org/doc/html/rfc9332).
Best regards,
Chia-Yu
---
v21 (02-Jul-2025)
- Replace STEP_THRESH and STEP_PACKETS with STEP_THRESH_PKTS and STEP_THRESH_US (Jakub Kicinski <kuba(a)kernel.org>)
- Move READ_ONCE and WRITE_ONCE to later DualPI2 patches (Jakub Kicinski <kuba(a)kernel.org>)
- Replace NLA_POLICY_FULL_RANGE with NLA_POLICY_RANGE (Jakub Kicinski <kuba(a)kernel.org>)
- Set extra error message for dualpi2_change (Jakub Kicinski <kuba(a)kernel.org>)
- Drop redundant else for better readability (Paolo Abeni <pabeni(a)redhat.com>)
- Replace step-thresh and step-packets with step-thresh-pkts and step-thresh-us (Jakub Kicinski <kuba(a)kernel.org>)
- Remove redundant name-prefix and simplify entries of dualpi2 enums (Jakub Kicinski <kuba(a)kernel.org>)
- Fix some typos and format issues of dualpi2 attributes
v20 (21-Jun-2025)
- Add one more commit to fix warning and style check on tdc.sh reported by shellcheck
- Remove double-prefixed of "tc_tc_dualpi2_attrs" in tc-user.h (Donald Hunter <donald.hunter(a)gmail.com>)
v19 (14-Jun-2025)
- Fix one typo in the comment of #1 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update commit message of #4 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Wrap long lines of Documentation/netlink/specs/tc.yaml to within 80 characters (Jakub Kicinski <kuba(a)kernel.org>)
v18 (13-Jun-2025)
- Add the num of enum used by DualPI2 and fix name and name-prefix of DualPI2 enum and attribute
- Replace from_timer() with timer_container_of() (Pedro Tammela <pctammela(a)mojatatu.com>)
v17 (25-May-2025, Resent at 11-Jun-2025)
- Replace 0xffffffff with U32_MAX (Paolo Abeni <pabeni(a)redhat.com>)
- Use helper function qdisc_dequeue_internal() and add new helper function skb_apply_step() (Paolo Abeni <pabeni(a)redhat.com>)
- Add s64 casting when calculating the delta of the PI controller (Paolo Abeni <pabeni(a)redhat.com>)
- Change the drop reason into SKB_DROP_REASON_QDISC_CONGESTED for drop_early (Paolo Abeni <pabeni(a)redhat.com>)
- Modify the condition to remove the original skb when enqueuing multiple GSO segments (Paolo Abeni <pabeni(a)redhat.com>)
- Add READ_ONCE() in dualpi2_dump_stat() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments, brackets, and brackets for readability (Paolo Abeni <pabeni(a)redhat.com>)
v16 (16-MAy-2025)
- Add qdisc_lock() to dualpi2_timer() in dualpi2_timer (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce convert_ns_to_usec() to convert usec to nsec without overflow in #1 (Paolo Abeni <pabeni(a)redhat.com>)
- Update convert_us_tonsec() to convert nsec to usec without overflow in #2 (Paolo Abeni <pabeni(a)redhat.com>)
- Add more descriptions with respect to DualPI2 in the cover ltter and add changelog in each patch (Paolo Abeni <pabeni(a)redhat.com>)
v15 (09-May-2025)
- Add enum of TCA_DUALPI2_ECN_MASK_CLA_ECT to remove potential leakeage in #1 (Simon Horman <horms(a)kernel.org>)
- Fix one typo in comment of #2
- Update tc.yaml in #5 to aligh with the updated enum of pkt_sched.h
v14 (05-May-2025)
- Modify tc.yaml: (1) Replace flags with enum and remove enum-as-flags, (2) Remove credit-queue in xstats, and (3) Change attribute types (Donald Hunter <donald.hun
- Add enum and fix the ordering of variables in pkt_sched.h to align with the modified tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add validators for DROP_OVERLOAD, DROP_EARLY, ECN_MASK, and SPLIT_GSO in sch_dualpi2.c (Donald Hunter <donald.hunter(a)gmail.com>)
- Update dualpi2.json to align with the updated variable order in pkt_sched.h
- Reorder patches (Donald Hunter <donald.hunter(a)gmail.com>)
v13 (26-Apr-2025)
- Use dashes in member names to follow YNL conventions in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Define enumerations separately for flags of drop-early, drop-overload, ecn-mask, credit-queue in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the types of split-gso and step-packets into flag in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Revert to u32/u8 types for tc-dualpi2-xstats members in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add new test cases in tc-tests/qdiscs/dualpi2.json to cover all dualpi2 parameters (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the type of TCA_DUALPI2_STEP_PACKETS into NLA_FLAG (Donald Hunter <donald.hunter(a)gmail.com>)
v12 (22-Apr-2025)
- Remove anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Replace u32/u8 with uint and s32 with int in tc spec document (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce get_memory_limit function to handle potential overflow when multipling limit with MTU (Paolo Abeni <pabeni(a)redhat.com>)
- Double the packet length to further include packet overhead in memory_limit (Paolo Abeni <pabeni(a)redhat.com>)
- Remove the check of qdisc_qlen(sch) when calling qdisc_tree_reduce_backlog (Paolo Abeni <pabeni(a)redhat.com>)
v11 (15-Apr-2025)
- Replace hstimer_init with hstimer_setup in sch_dualpi2.c
v10 (25-Mar-2025)
- Remove leftover include in include/linux/netdevice.h and anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Use kfree_skb_reason() and add SKB_DROP_REASON_DUALPI2_STEP_DROP drop reason (Paolo Abeni <pabeni(a)redhat.com>)
- Split sch_dualpi2.c into 3 patches (and overall 5 patches): Struct definition & parsing, Dump stats & configuration, Enqueue/Dequeue (Paolo Abeni <pabeni(a)redhat.com>)
v9 (16-Mar-2025)
- Fix mem_usage error in previous version
- Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step threshold marking.
In previous versions, this value was fixed to 2, so the step threshold was applied to mark packets in the L queue only when the queue length of the L queue was greater than or equal to 2 packets.
This will cause larger queuing delays for L4S traffic at low rates (<20Mbps). So we parameterize it and change the default value to 0.
Comparison of tcp_1down run 'HTB 20Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 11.55 11.70 ms 350
TCP upload avg : 18.96 N/A Mbits/s 350
TCP upload sum : 18.96 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 10.81 10.70 ms 350
TCP upload avg : 18.91 N/A Mbits/s 350
TCP upload sum : 18.91 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 12.61 12.80 ms 350
TCP upload avg : 9.48 N/A Mbits/s 350
TCP upload sum : 9.48 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.06 10.80 ms 350
TCP upload avg : 9.43 N/A Mbits/s 350
TCP upload sum : 9.43 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 40.86 37.45 ms 350
TCP upload avg : 0.88 N/A Mbits/s 350
TCP upload sum : 0.88 N/A Mbits/s 350
TCP upload::1 : 0.88 0.97 Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.07 10.40 ms 350
TCP upload avg : 0.55 N/A Mbits/s 350
TCP upload sum : 0.55 N/A Mbits/s 350
TCP upload::1 : 0.55 0.59 Mbits/s 350
v8 (11-Mar-2025)
- Fix warning messages in v7
v7 (07-Mar-2025)
- Separate into 3 patches to avoid mixing changes of documentation, selftest, and code. (Cong Wang <xiyou.wangcong(a)gmail.com>)
v6 (04-Mar-2025)
- Add modprobe for dulapi2 in tc-testing script tc-testing/tdc.sh (Jakub Kicinski <kuba(a)kernel.org>)
- Update test cases in dualpi2.json
- Update commit message
v5 (22-Feb-2025)
- A comparison was done between MQ + DUALPI2, MQ + FQ_PIE, MQ + FQ_CODEL:
Unshaped 1gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
- Summary of tcp_4down run 'MQ + FQ_PIE'
avg median # data pts
Ping (ms) ICMP : 1.21 1.37 ms 350
TCP download avg : 235.42 N/A Mbits/s 350
TCP download sum : 941.61 N/A Mbits/s 350
TCP download::1 : 232.54 233.13 Mbits/s 350
TCP download::2 : 232.52 232.80 Mbits/s 350
TCP download::3 : 233.14 233.78 Mbits/s 350
TCP download::4 : 243.41 241.48 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2'
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
Unshaped 1gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
Unshaped 10gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 0.22 0.23 ms 350
TCP download avg : 2354.08 N/A Mbits/s 350
TCP download sum : 9416.31 N/A Mbits/s 350
TCP download::1 : 2353.65 2352.81 Mbits/s 350
TCP download::2 : 2354.54 2354.21 Mbits/s 350
TCP download::3 : 2353.56 2353.78 Mbits/s 350
TCP download::4 : 2354.56 2354.45 Mbits/s 350
- Summary of tcp_4down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 0.20 0.19 ms 350
TCP download avg : 2354.76 N/A Mbits/s 350
TCP download sum : 9419.04 N/A Mbits/s 350
TCP download::1 : 2354.77 2353.89 Mbits/s 350
TCP download::2 : 2353.41 2354.29 Mbits/s 350
TCP download::3 : 2356.18 2354.19 Mbits/s 350
TCP download::4 : 2354.68 2353.15 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 0.24 0.24 ms 350
TCP download avg : 2354.11 N/A Mbits/s 350
TCP download sum : 9416.43 N/A Mbits/s 350
TCP download::1 : 2354.75 2353.93 Mbits/s 350
TCP download::2 : 2353.15 2353.75 Mbits/s 350
TCP download::3 : 2353.49 2353.72 Mbits/s 350
TCP download::4 : 2355.04 2353.73 Mbits/s 350
Unshaped 10gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 7.57 8.69 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9467.82 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 7.82 8.91 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9468.42 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 6.87 7.93 ms 350
TCP download avg : 73.95 N/A Mbits/s 350
TCP download sum : 9465.87 N/A Mbits/s 350
From the results shown above, we see small differences between combinations.
- Update commit message to include results of no_split_gso and split_gso (Dave Taht <dave.taht(a)gmail.com> and Paolo Abeni <pabeni(a)redhat.com>)
- Add memlimit in the dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update note in sch_dualpi2.c related to BBRv3 status (Dave Taht <dave.taht(a)gmail.com>)
- Update license identifier (Dave Taht <dave.taht(a)gmail.com>)
- Add selftest in tools/testing/selftests/tc-testing (Cong Wang <xiyou.wangcong(a)gmail.com>)
- Use netlink policies for parameter checks (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Modify texts & fix typos in Documentation/netlink/specs/tc.yaml (Dave Taht <dave.taht(a)gmail.com>)
- Add descriptions of packet counter statistics and the reset function of sch_dualpi2.c
- Fix step_thresh in packets
- Update code comments in sch_dualpi2.c
v4 (22-Oct-2024)
- Update statement in Kconfig for DualPI2 (Stephen Hemminger <stephen(a)networkplumber.org>)
- Put a blank line after #define in sch_dualpi2.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Fix line length warning.
v3 (19-Oct-2024)
- Fix compilaiton error
- Update Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Oct-2024)
- Add Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
- Use dualpi2 instead of skb prefix (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Replace nla_parse_nested_deprecated with nla_parse_nested (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Fix line length warning
---
Chia-Yu Chang (5):
sched: Struct definition and parsing of dualpi2 qdisc
sched: Dump configuration and statistics of dualpi2 qdisc
selftests/tc-testing: Fix warning and style check on tdc.sh
selftests/tc-testing: Add selftests for qdisc DualPI2
Documentation: netlink: specs: tc: Add DualPI2 specification
Koen De Schepper (1):
sched: Add enqueue/dequeue of dualpi2 qdisc
Documentation/netlink/specs/tc.yaml | 151 ++-
include/net/dropreason-core.h | 6 +
include/uapi/linux/pkt_sched.h | 68 +
net/sched/Kconfig | 12 +
net/sched/Makefile | 1 +
net/sched/sch_dualpi2.c | 1171 +++++++++++++++++
tools/testing/selftests/tc-testing/config | 1 +
.../tc-testing/tc-tests/qdiscs/dualpi2.json | 254 ++++
tools/testing/selftests/tc-testing/tdc.sh | 6 +-
9 files changed, 1665 insertions(+), 5 deletions(-)
create mode 100644 net/sched/sch_dualpi2.c
create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
--
2.34.1
This is series 2b/5 of the migration to `core::ffi::CStr`[0].
20250704-core-cstr-prepare-v1-0-a91524037783(a)gmail.com.
This series depends on the prior series[0] and is intended to go through
the rust tree to reduce the number of release cycles required to
complete the work.
Subsystem maintainers: I would appreciate your `Acked-by`s so that this
can be taken through Miguel's tree (where the other series must go).
[0] https://lore.kernel.org/all/20250704-core-cstr-prepare-v1-0-a91524037783@gm…
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Tamir Duberstein (10):
gpu: nova-core: use `core::ffi::CStr` method names
rust: auxiliary: use `core::ffi::CStr` method names
rust: configfs: use `core::ffi::CStr` method names
rust: cpufreq: use `core::ffi::CStr` method names
rust: drm: use `core::ffi::CStr` method names
rust: firmware: use `core::ffi::CStr` method names
rust: kunit: use `core::ffi::CStr` method names
rust: miscdevice: use `core::ffi::CStr` method names
rust: net: use `core::ffi::CStr` method names
rust: of: use `core::ffi::CStr` method names
drivers/gpu/drm/drm_panic_qr.rs | 2 +-
rust/kernel/auxiliary.rs | 4 ++--
rust/kernel/configfs.rs | 4 ++--
rust/kernel/cpufreq.rs | 2 +-
rust/kernel/drm/device.rs | 4 ++--
rust/kernel/firmware.rs | 2 +-
rust/kernel/kunit.rs | 6 +++---
rust/kernel/miscdevice.rs | 2 +-
rust/kernel/net/phy.rs | 2 +-
rust/kernel/of.rs | 2 +-
samples/rust/rust_configfs.rs | 2 +-
11 files changed, 16 insertions(+), 16 deletions(-)
---
base-commit: 769e324b66b0d92d04f315d0c45a0f72737c7494
change-id: 20250709-core-cstr-fanout-1-f20611832272
prerequisite-change-id: 20250704-core-cstr-prepare-9b9e6a7bd57e:v1
prerequisite-patch-id: 83b1239d1805f206711a5a936bbb61c83227d573
prerequisite-patch-id: a0355dd0efcc945b0565dc4e5a0f42b5a3d29c7e
prerequisite-patch-id: 8585bf441cfab705181f5606c63483c2e88d25aa
prerequisite-patch-id: 04ec344c0bc23f90dbeac10afe26df1a86ce53ec
prerequisite-patch-id: a2fc6cd05fce6d6da8d401e9f8a905bb5c0b2f27
prerequisite-patch-id: f14c099c87562069f25fb7aea6d9aae4086c49a8
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
A few packets may still be sent and received during the termination of
the iperf processes. These late packets cause failures when they arrive
on queues expected to be empty.
Add a one second delay between repeated _send_traffic_check() calls in
rss_ctx tests to ensure such packets are processed before the next
traffic checks are performed.
Example failure observed:
Check failed 2 != 0 traffic on inactive queues (context 1):
[0, 0, 1, 1, 386385, 397196, 0, 0, 0, 0, ...]
Check failed 4 != 0 traffic on inactive queues (context 2):
[0, 0, 0, 0, 2, 2, 247152, 253013, 0, 0, ...]
Check failed 2 != 0 traffic on inactive queues (context 3):
[0, 0, 0, 0, 0, 0, 1, 1, 282434, 283070, ...]
Note: While the `noise` parameter could be used to tolerate these late
packets, it would be inappropriate here. `noise` tolerates far more
traffic than acceptable in this case, risking false positives.
Inactive queues are supposed to see zero traffic.
Fixes: 847aa551fa78 ("selftests: drv-net: rss_ctx: factor out send traffic and check")
Reviewed-by: Gal Pressman <gal(a)nvidia.com>
Reviewed-by: Carolina Jubran <cjubran(a)nvidia.com>
Signed-off-by: Nimrod Oren <noren(a)nvidia.com>
---
tools/testing/selftests/drivers/net/hw/rss_ctx.py | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/drivers/net/hw/rss_ctx.py b/tools/testing/selftests/drivers/net/hw/rss_ctx.py
index 7bb552f8b182..19be69227693 100755
--- a/tools/testing/selftests/drivers/net/hw/rss_ctx.py
+++ b/tools/testing/selftests/drivers/net/hw/rss_ctx.py
@@ -4,6 +4,7 @@
import datetime
import random
import re
+import time
from lib.py import ksft_run, ksft_pr, ksft_exit
from lib.py import ksft_eq, ksft_ne, ksft_ge, ksft_in, ksft_lt, ksft_true, ksft_raises
from lib.py import NetDrvEpEnv
@@ -492,6 +493,7 @@ def test_rss_context(cfg, ctx_cnt=1, create_with_cfg=None):
{ 'target': (2+i*2, 3+i*2),
'noise': (0, 1),
'empty': list(range(2, 2+i*2)) + list(range(4+i*2, 2+2*ctx_cnt)) })
+ time.sleep(1)
if requested_ctx_cnt != ctx_cnt:
raise KsftSkipEx(f"Tested only {ctx_cnt} contexts, wanted {requested_ctx_cnt}")
@@ -559,6 +561,7 @@ def test_rss_context_out_of_order(cfg, ctx_cnt=4):
}
_send_traffic_check(cfg, ports[i], f"context {i}", expected)
+ time.sleep(1)
# Use queues 0 and 1 for normal traffic
ethtool(f"-X {cfg.ifname} equal 2")
--
2.37.1
[All the precursor patches are merged now and AMD/RISCV/VTD conversions
are written]
Currently each of the iommu page table formats duplicates all of the logic
to maintain the page table and perform map/unmap/etc operations. There are
several different versions of the algorithms between all the different
formats. The io-pgtable system provides an interface to help isolate the
page table code from the iommu driver, but doesn't provide tools to
implement the common algorithms.
This makes it very hard to improve the state of the pagetable code under
the iommu domains as any proposed improvement needs to alter a large
number of different driver code paths. Combined with a lack of software
based testing this makes improvement in this area very hard.
iommufd wants several new page table operations:
- More efficient map/unmap operations, using iommufd's batching logic
- unmap that returns the physical addresses into a batch as it progresses
- cut that allows splitting areas so large pages can have holes
poked in them dynamically (ie guestmemfd hitless shared/private
transitions)
- More agressive freeing of table memory to avoid waste
- Fragmenting large pages so that dirty tracking can be more granular
- Reassembling large pages so that VMs can run at full IO performance
in migration/dirty tracking error flows
- KHO integration for kernel live upgrade
Together these are algorithmically complex enough to be a very significant
task to go and implement in all the page table formats we support. Just
the "server" focused drivers use almost all the formats (ARMv8 S1&S2 / x86
PAE / AMDv1 / VT-D SS / RISCV)
Instead of doing the duplicated work, this series takes the first step to
consolidate the algorithms into one places. In spirit it is similar to the
work Christoph did a few years back to pull the redundant get_user_pages()
implementations out of the arch code into core MM. This unlocked a great
deal of improvement in that space in the following years. I would like to
see the same benefit in iommu as well.
My first RFC showed a bigger picture with all most all formats and more
algorithms. This series reorganizes that to be narrowly focused on just
enough to convert the AMD driver to use the new mechanism.
kunit tests are provided that allow good testing of the algorithms and all
formats on x86, nothing is arch specific.
AMD is one of the simpler options as the HW is quite uniform with few
different options/bugs while still requiring the complicated contiguous
pages support. The HW also has a very simple range based invalidation
approach that is easy to implement.
The AMD v1 and AMD v2 page table formats are implemented bit for bit
identical to the current code, tested using a compare kunit test that
checks against the io-pgtable version (on github, see below).
Updating the AMD driver to replace the io-pgtable layer with the new stuff
is fairly straightforward now. The layering is fixed up in the new version
so that all the invalidation goes through function pointers.
Several small fixing patches have come out of this as I've been fixing the
problems that the test suite uncovers in the current code, and
implementing the fixed version in iommupt.
On performance, there is a quite wide variety of implementation designs
across all the drivers. Looking at some key performance across
the main formats:
iommu_map():
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 53,66 , 51,63 , 19.19 (AMDV1)
256*2^12, 386,1909 , 367,1795 , 79.79
256*2^21, 362,1633 , 355,1556 , 77.77
2^12, 56,62 , 52,59 , 11.11 (AMDv2)
256*2^12, 405,1355 , 357,1292 , 72.72
256*2^21, 393,1160 , 358,1114 , 67.67
2^12, 55,65 , 53,62 , 14.14 (VTD second stage)
256*2^12, 391,518 , 332,512 , 35.35
256*2^21, 383,635 , 336,624 , 46.46
2^12, 57,65 , 55,63 , 12.12 (ARM 64 bit)
256*2^12, 380,389 , 361,369 , 2.02
256*2^21, 358,419 , 345,400 , 13.13
iommu_unmap():
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 69,88 , 65,85 , 23.23 (AMDv1)
256*2^12, 353,6498 , 331,6029 , 94.94
256*2^21, 373,6014 , 360,5706 , 93.93
2^12, 71,72 , 66,69 , 4.04 (AMDv2)
256*2^12, 228,891 , 206,871 , 76.76
256*2^21, 254,721 , 245,711 , 65.65
2^12, 69,87 , 65,82 , 20.20 (VTD second stage)
256*2^12, 210,321 , 200,315 , 36.36
256*2^21, 255,349 , 238,342 , 30.30
2^12, 72,77 , 68,74 , 8.08 (ARM 64 bit)
256*2^12, 521,357 , 447,346 , -29.29
256*2^21, 489,358 , 433,345 , -25.25
* Above numbers include additional patches to remove the iommu_pgsize()
overheads. gcc 13.3.0, i7-12700
This version provides fairly consistent performance across formats. ARM
unmap performance is quite different because this version supports
contiguous pages and uses a very different algorithm for unmapping. Though
why it is so worse compared to AMDv1 I haven't figured out yet.
The per-format commits include a more detailed chart.
There is a second branch:
https://github.com/jgunthorpe/linux/commits/iommu_pt_all
Containing supporting work and future steps:
- ARM short descriptor (32 bit), ARM long descriptor (64 bit) formats
- RISCV format and RISCV conversion
https://github.com/jgunthorpe/linux/commits/iommu_pt_riscv
- Support for a DMA incoherent HW page table walker
- VT-D second stage format and VT-D conversion
https://github.com/jgunthorpe/linux/commits/iommu_pt_vtd
- DART v1 & v2 format
- Draft of a iommufd 'cut' operation to break down huge pages
- A compare test that checks the iommupt formats against the iopgtable
interface, including updating AMD to have a working iopgtable and patches
to make VT-D have an iopgtable for testing.
- A performance test to micro-benchmark map and unmap against iogptable
My strategy is to go one by one for the drivers:
- AMD driver conversion
- RISCV page table and driver
- Intel VT-D driver and VTDSS page table
- Flushing improvements for RISCV
- ARM SMMUv3
And concurrently work on the algorithm side:
- debugfs content dump, like VT-D has
- Cut support
- Increase/Decrease page size support
- map/unmap batching
- KHO
As we make more algorithm improvements the value to convert the drivers
increases.
This is on github: https://github.com/jgunthorpe/linux/commits/iommu_pt
v2:
- Rebase on v6.16-rc2
- s/PT_ENTRY_WORD_SIZE/PT_ITEM_WORD_SIZE/s to follow the language better
- Comment and documentation updates
- Add PT_TOP_PHYS_MASK to help manage alignment restrictions on the top
pointer
- Add missed force_aperture = true
- Make pt_iommu_deinit() take care of the not-yet-inited error case
internally as AMD/RISCV/VTD all shared this logic
- Change gather_range() into gather_range_pages() so it also deals with
the page list. This makes the following cache flushing series simpler
- Fix missed update of unmap->unmapped in some error cases
- Change clear_contig() to order the gather more logically
- Remove goto from the error handling in __map_range_leaf()
- s/log2_/oalog2_/ in places where the argument is an oaddr_t
- Pass the pts to pt_table_install64/32()
- Do not use SIGN_EXTEND for the AMDv2 page table because of Vasant's
information on how PASID 0 works.
v1: https://patch.msgid.link/r/0-v2-5c26bde5c22d+58b-iommu_pt_jgg@nvidia.com
- AMD driver only, many code changes
RFC: https://lore.kernel.org/all/0-v1-01fa10580981+1d-iommu_pt_jgg@nvidia.com/
Alejandro Jimenez (1):
iommu/amd: Use the generic iommu page table
Jason Gunthorpe (14):
genpt: Generic Page Table base API
genpt: Add Documentation/ files
iommupt: Add the basic structure of the iommu implementation
iommupt: Add the AMD IOMMU v1 page table format
iommupt: Add iova_to_phys op
iommupt: Add unmap_pages op
iommupt: Add map_pages op
iommupt: Add read_and_clear_dirty op
iommupt: Add a kunit test for Generic Page Table
iommupt: Add a mock pagetable format for iommufd selftest to use
iommufd: Change the selftest to use iommupt instead of xarray
iommupt: Add the x86 64 bit page table format
iommu/amd: Remove AMD io_pgtable support
iommupt: Add a kunit test for the IOMMU implementation
.clang-format | 1 +
Documentation/driver-api/generic_pt.rst | 140 ++
Documentation/driver-api/index.rst | 1 +
drivers/iommu/Kconfig | 2 +
drivers/iommu/Makefile | 1 +
drivers/iommu/amd/Kconfig | 5 +-
drivers/iommu/amd/Makefile | 2 +-
drivers/iommu/amd/amd_iommu.h | 1 -
drivers/iommu/amd/amd_iommu_types.h | 109 +-
drivers/iommu/amd/io_pgtable.c | 560 --------
drivers/iommu/amd/io_pgtable_v2.c | 370 ------
drivers/iommu/amd/iommu.c | 516 ++++----
drivers/iommu/generic_pt/.kunitconfig | 13 +
drivers/iommu/generic_pt/Kconfig | 72 ++
drivers/iommu/generic_pt/fmt/Makefile | 26 +
drivers/iommu/generic_pt/fmt/amdv1.h | 409 ++++++
drivers/iommu/generic_pt/fmt/defs_amdv1.h | 21 +
drivers/iommu/generic_pt/fmt/defs_x86_64.h | 21 +
drivers/iommu/generic_pt/fmt/iommu_amdv1.c | 15 +
drivers/iommu/generic_pt/fmt/iommu_mock.c | 10 +
drivers/iommu/generic_pt/fmt/iommu_template.h | 48 +
drivers/iommu/generic_pt/fmt/iommu_x86_64.c | 11 +
drivers/iommu/generic_pt/fmt/x86_64.h | 248 ++++
drivers/iommu/generic_pt/iommu_pt.h | 1150 +++++++++++++++++
drivers/iommu/generic_pt/kunit_generic_pt.h | 717 ++++++++++
drivers/iommu/generic_pt/kunit_iommu.h | 183 +++
drivers/iommu/generic_pt/kunit_iommu_pt.h | 451 +++++++
drivers/iommu/generic_pt/pt_common.h | 354 +++++
drivers/iommu/generic_pt/pt_defs.h | 323 +++++
drivers/iommu/generic_pt/pt_fmt_defaults.h | 193 +++
drivers/iommu/generic_pt/pt_iter.h | 640 +++++++++
drivers/iommu/generic_pt/pt_log2.h | 130 ++
drivers/iommu/io-pgtable.c | 4 -
drivers/iommu/iommufd/Kconfig | 1 +
drivers/iommu/iommufd/iommufd_test.h | 11 +-
drivers/iommu/iommufd/selftest.c | 439 +++----
include/linux/generic_pt/common.h | 166 +++
include/linux/generic_pt/iommu.h | 270 ++++
include/linux/io-pgtable.h | 2 -
tools/testing/selftests/iommu/iommufd.c | 60 +-
tools/testing/selftests/iommu/iommufd_utils.h | 12 +
41 files changed, 6119 insertions(+), 1589 deletions(-)
create mode 100644 Documentation/driver-api/generic_pt.rst
delete mode 100644 drivers/iommu/amd/io_pgtable.c
delete mode 100644 drivers/iommu/amd/io_pgtable_v2.c
create mode 100644 drivers/iommu/generic_pt/.kunitconfig
create mode 100644 drivers/iommu/generic_pt/Kconfig
create mode 100644 drivers/iommu/generic_pt/fmt/Makefile
create mode 100644 drivers/iommu/generic_pt/fmt/amdv1.h
create mode 100644 drivers/iommu/generic_pt/fmt/defs_amdv1.h
create mode 100644 drivers/iommu/generic_pt/fmt/defs_x86_64.h
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_amdv1.c
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_mock.c
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_template.h
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_x86_64.c
create mode 100644 drivers/iommu/generic_pt/fmt/x86_64.h
create mode 100644 drivers/iommu/generic_pt/iommu_pt.h
create mode 100644 drivers/iommu/generic_pt/kunit_generic_pt.h
create mode 100644 drivers/iommu/generic_pt/kunit_iommu.h
create mode 100644 drivers/iommu/generic_pt/kunit_iommu_pt.h
create mode 100644 drivers/iommu/generic_pt/pt_common.h
create mode 100644 drivers/iommu/generic_pt/pt_defs.h
create mode 100644 drivers/iommu/generic_pt/pt_fmt_defaults.h
create mode 100644 drivers/iommu/generic_pt/pt_iter.h
create mode 100644 drivers/iommu/generic_pt/pt_log2.h
create mode 100644 include/linux/generic_pt/common.h
create mode 100644 include/linux/generic_pt/iommu.h
base-commit: cd76b0248a38645a3e3f8ca4a48bffc591e9da19
--
2.43.0
On Tue, 8 Jul 2025 23:13:01 +0530 Suresh Chandrappa <suresh.k.chandrappa(a)gmail.com> wrote:
> Hi Joshua,
>
> Thanks for the feedback! In the first patch, both shmem and mmap operations
> are present, but I hadn’t introduced any logic to distinguish between them
> yet. That distinction is added in the second patch through a new API.
Hi Suresh,
Yes, this makes sense to me. I think what I was getting at was that we could
still make a conditional statement like
if (type == FILE_SHMEM)
ksft_print_msg("Unable to create shmem file.\n")'
else if (type == FILE_MMAP)
ksft_print_msg("Unable to create mmap file.\n");
(or use a switch statement)
...
And just refactor it in patch 2, as opposed to changing the behavior.
But this is mostly a nit. If you are planning to merge both patches in one
patch in the next version, then all of these comments shouldn't matter : -)
Looking forward to the next version, have a great day!
Joshua
Sent using hkml (https://github.com/sjp38/hackermail)