Nowadays, there is an increasing need to benchmark the performance of dma_map and dma_unmap, particularly while the device is attached to an IOMMU.
This patchset provides the benchmark infrastructure for streaming DMA mapping. The architecture of the code is quite similar to the GUP benchmark:
* mm/gup_benchmark.c provides the kernel interface;
* tools/testing/selftests/vm/gup_benchmark.c provides the user program to call the interface provided by mm/gup_benchmark.c.
In our case, kernel/dma/map_benchmark.c corresponds to mm/gup_benchmark.c, and tools/testing/selftests/dma/dma_map_benchmark.c corresponds to tools/testing/selftests/vm/gup_benchmark.c.
A major difference from the GUP benchmark is that the DMA_MAP benchmark needs to run on a device. Consider one board with the below devices and IOMMUs:
  device A ------- IOMMU 1
  device B ------- IOMMU 2
  device C ------- non-IOMMU
Different devices might be attached to different IOMMUs, or to no IOMMU at all. To make the benchmark run, we can either:
* create a virtual device and hack the kernel code to attach the virtual device to IOMMU 1, IOMMU 2, or the non-IOMMU path; or
* use the existing driver_override mechanism: unbind device A, B, or C from its original driver and bind it to the dma_map_benchmark platform driver or PCI driver for benchmarking.
In this patchset, I prefer to use driver_override and avoid an ugly hack in the kernel. We can dynamically switch devices behind different IOMMUs to measure the performance with an IOMMU or without one.
-v3:
 * fix build issues reported by the 0day kernel test robot
-v2:
 * add PCI support; v1 supported platform devices only
 * replace ssleep with msleep_interruptible() to permit users to exit the benchmark before it completes
 * many changes according to Robin's suggestions, thanks Robin!
   - add standard deviation output to reflect the worst case
   - strictly check user parameters such as the number of threads
   - make the cache dirty before dma_map
   - fix unpaired dma_map_page and dma_unmap_single
   - remove redundant "long long" before ktime_to_ns()
   - use devm_add_action()
Barry Song (2):
  dma-mapping: add benchmark support for streaming DMA APIs
  selftests/dma: add test application for DMA_MAP_BENCHMARK
 MAINTAINERS                                   |   6 +
 kernel/dma/Kconfig                            |   8 +
 kernel/dma/Makefile                           |   1 +
 kernel/dma/map_benchmark.c                    | 296 ++++++++++++++++++
 tools/testing/selftests/dma/Makefile          |   6 +
 tools/testing/selftests/dma/config            |   1 +
 .../testing/selftests/dma/dma_map_benchmark.c |  87 +++++
 7 files changed, 405 insertions(+)
 create mode 100644 kernel/dma/map_benchmark.c
 create mode 100644 tools/testing/selftests/dma/Makefile
 create mode 100644 tools/testing/selftests/dma/config
 create mode 100644 tools/testing/selftests/dma/dma_map_benchmark.c
Nowadays, there is an increasing need to benchmark the performance of dma_map and dma_unmap, particularly while the device is attached to an IOMMU.
This patch enables that support. Users can run a specified number of threads doing dma_map_page and dma_unmap_page on a specific NUMA node for a specified duration. dma_map_benchmark then calculates the average latency for map and unmap.
A difficulty for this benchmark is that the dma_map/unmap APIs must run on a particular device, and each device might have a different backend: an IOMMU, or no IOMMU at all.
So we use driver_override to bind dma_map_benchmark to a particular device.

For platform devices:
echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
echo xxx > /sys/bus/platform/drivers/xxx/unbind
echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
For PCI devices:
echo dma_map_benchmark > /sys/bus/pci/devices/0000:00:01.0/driver_override
echo 0000:00:01.0 > /sys/bus/pci/drivers/xxx/unbind
echo 0000:00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Will Deacon <will@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
-v3:
 * fix build issues reported by the 0day kernel test robot
-v2:
 * add PCI support; v1 supported platform devices only
 * replace ssleep with msleep_interruptible() to permit users to exit the benchmark before it completes
 * many changes according to Robin's suggestions, thanks Robin!
   - add standard deviation output to reflect the worst case
   - strictly check user parameters such as the number of threads
   - make the cache dirty before dma_map
   - fix unpaired dma_map_page and dma_unmap_single
   - remove redundant "long long" before ktime_to_ns()
   - use devm_add_action()
 kernel/dma/Kconfig         |   8 +
 kernel/dma/Makefile        |   1 +
 kernel/dma/map_benchmark.c | 296 +++++++++++++++++++++++++++++++++++++
 3 files changed, 305 insertions(+)
 create mode 100644 kernel/dma/map_benchmark.c
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index c99de4a21458..949c53da5991 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG
 	  is technically out-of-spec.
 
 	  If unsure, say N.
+
+config DMA_MAP_BENCHMARK
+	bool "Enable benchmarking of streaming DMA mapping"
+	help
+	  Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
+	  performance of dma_(un)map_page.
+
+	  See tools/testing/selftests/dma/dma_map_benchmark.c
diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile
index dc755ab68aab..7aa6b26b1348 100644
--- a/kernel/dma/Makefile
+++ b/kernel/dma/Makefile
@@ -10,3 +10,4 @@ obj-$(CONFIG_DMA_API_DEBUG) += debug.o
 obj-$(CONFIG_SWIOTLB) += swiotlb.o
 obj-$(CONFIG_DMA_COHERENT_POOL) += pool.o
 obj-$(CONFIG_DMA_REMAP) += remap.o
+obj-$(CONFIG_DMA_MAP_BENCHMARK) += map_benchmark.o
diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
new file mode 100644
index 000000000000..dc4e5ff48a2d
--- /dev/null
+++ b/kernel/dma/map_benchmark.c
@@ -0,0 +1,296 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Hisilicon Limited.
+ */
+
+#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
+
+#include <linux/debugfs.h>
+#include <linux/delay.h>
+#include <linux/device.h>
+#include <linux/dma-mapping.h>
+#include <linux/kernel.h>
+#include <linux/kthread.h>
+#include <linux/math64.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/platform_device.h>
+#include <linux/slab.h>
+#include <linux/timekeeping.h>
+
+#define DMA_MAP_BENCHMARK	_IOWR('d', 1, struct map_benchmark)
+#define DMA_MAP_MAX_THREADS	1024
+#define DMA_MAP_MAX_SECONDS	300
+
+struct map_benchmark {
+	__u64 avg_map_100ns;	/* average map latency in 100ns */
+	__u64 map_stddev;	/* standard deviation of map latency */
+	__u64 avg_unmap_100ns;	/* as above */
+	__u64 unmap_stddev;
+	__u32 threads;	/* how many threads will do map/unmap in parallel */
+	__u32 seconds;	/* how long the test will last */
+	int node;	/* which NUMA node this benchmark will run on */
+	__u64 expansion[10];	/* for future use */
+};
+
+struct map_benchmark_data {
+	struct map_benchmark bparam;
+	struct device *dev;
+	struct dentry *debugfs;
+	atomic64_t sum_map_100ns;
+	atomic64_t sum_unmap_100ns;
+	atomic64_t sum_square_map;
+	atomic64_t sum_square_unmap;
+	atomic64_t loops;
+};
+
+static int map_benchmark_thread(void *data)
+{
+	void *buf;
+	dma_addr_t dma_addr;
+	struct map_benchmark_data *map = data;
+	int ret = 0;
+
+	buf = (void *)__get_free_page(GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	while (!kthread_should_stop()) {
+		__u64 map_100ns, unmap_100ns, map_square, unmap_square;
+		ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
+
+		/*
+		 * for a non-coherent device, if we don't stain them in the
+		 * cache, this will give an underestimate of the real-world
+		 * overhead of BIDIRECTIONAL or TO_DEVICE mappings;
+		 * 66 means everything goes well! 66 is lucky.
+		 */
+		memset(buf, 0x66, PAGE_SIZE);
+
+		map_stime = ktime_get();
+		dma_addr = dma_map_single(map->dev, buf, PAGE_SIZE, DMA_BIDIRECTIONAL);
+		if (unlikely(dma_mapping_error(map->dev, dma_addr))) {
+			pr_err("dma_map_single failed on %s\n", dev_name(map->dev));
+			ret = -ENOMEM;
+			goto out;
+		}
+		map_etime = ktime_get();
+
+		unmap_stime = ktime_get();
+		dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+		unmap_etime = ktime_get();
+
+		/* calculate sum and sum of squares */
+		map_100ns = div64_ul(ktime_to_ns(ktime_sub(map_etime, map_stime)), 100);
+		unmap_100ns = div64_ul(ktime_to_ns(ktime_sub(unmap_etime, unmap_stime)), 100);
+		map_square = map_100ns * map_100ns;
+		unmap_square = unmap_100ns * unmap_100ns;
+
+		atomic64_add(map_100ns, &map->sum_map_100ns);
+		atomic64_add(unmap_100ns, &map->sum_unmap_100ns);
+		atomic64_add(map_square, &map->sum_square_map);
+		atomic64_add(unmap_square, &map->sum_square_unmap);
+		atomic64_inc(&map->loops);
+	}
+
+out:
+	free_page((unsigned long)buf);
+	return ret;
+}
+
+static int do_map_benchmark(struct map_benchmark_data *map)
+{
+	struct task_struct **tsk;
+	int threads = map->bparam.threads;
+	int node = map->bparam.node;
+	const cpumask_t *cpu_mask = cpumask_of_node(node);
+	__u64 loops;
+	int ret = 0;
+	int i;
+
+	tsk = kmalloc_array(threads, sizeof(*tsk), GFP_KERNEL);
+	if (!tsk)
+		return -ENOMEM;
+
+	get_device(map->dev);
+
+	for (i = 0; i < threads; i++) {
+		tsk[i] = kthread_create_on_node(map_benchmark_thread, map,
+						map->bparam.node, "dma-map-benchmark/%d", i);
+		if (IS_ERR(tsk[i])) {
+			pr_err("create dma_map thread failed\n");
+			ret = PTR_ERR(tsk[i]);
+			goto out;
+		}
+
+		if (node != NUMA_NO_NODE && node_online(node))
+			kthread_bind_mask(tsk[i], cpu_mask);
+	}
+
+	/* clear the old values from the previous benchmark */
+	atomic64_set(&map->sum_map_100ns, 0);
+	atomic64_set(&map->sum_unmap_100ns, 0);
+	atomic64_set(&map->sum_square_map, 0);
+	atomic64_set(&map->sum_square_unmap, 0);
+	atomic64_set(&map->loops, 0);
+
+	for (i = 0; i < threads; i++)
+		wake_up_process(tsk[i]);
+
+	msleep_interruptible(map->bparam.seconds * 1000);
+
+	/* wait for the completion of benchmark threads */
+	for (i = 0; i < threads; i++) {
+		ret = kthread_stop(tsk[i]);
+		if (ret)
+			goto out;
+	}
+
+	loops = atomic64_read(&map->loops);
+	if (likely(loops > 0)) {
+		__u64 map_variance, unmap_variance;
+
+		/* average latency */
+		map->bparam.avg_map_100ns = div64_u64(atomic64_read(&map->sum_map_100ns), loops);
+		map->bparam.avg_unmap_100ns = div64_u64(atomic64_read(&map->sum_unmap_100ns), loops);
+
+		/* standard deviation of latency */
+		map_variance = div64_u64(atomic64_read(&map->sum_square_map), loops) -
+			map->bparam.avg_map_100ns * map->bparam.avg_map_100ns;
+		unmap_variance = div64_u64(atomic64_read(&map->sum_square_unmap), loops) -
+			map->bparam.avg_unmap_100ns * map->bparam.avg_unmap_100ns;
+		map->bparam.map_stddev = int_sqrt64(map_variance);
+		map->bparam.unmap_stddev = int_sqrt64(unmap_variance);
+	}
+
+out:
+	put_device(map->dev);
+	kfree(tsk);
+	return ret;
+}
+
+static long map_benchmark_ioctl(struct file *filep, unsigned int cmd,
+				unsigned long arg)
+{
+	struct map_benchmark_data *map = filep->private_data;
+	int ret;
+
+	if (copy_from_user(&map->bparam, (void __user *)arg, sizeof(map->bparam)))
+		return -EFAULT;
+
+	switch (cmd) {
+	case DMA_MAP_BENCHMARK:
+		if (map->bparam.threads == 0 ||
+		    map->bparam.threads > DMA_MAP_MAX_THREADS) {
+			pr_err("invalid thread number\n");
+			return -EINVAL;
+		}
+		if (map->bparam.seconds == 0 ||
+		    map->bparam.seconds > DMA_MAP_MAX_SECONDS) {
+			pr_err("invalid duration seconds\n");
+			return -EINVAL;
+		}
+
+		ret = do_map_benchmark(map);
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	if (copy_to_user((void __user *)arg, &map->bparam, sizeof(map->bparam)))
+		return -EFAULT;
+
+	return ret;
+}
+
+static const struct file_operations map_benchmark_fops = {
+	.open		= simple_open,
+	.unlocked_ioctl	= map_benchmark_ioctl,
+};
+
+static void map_benchmark_remove_debugfs(void *data)
+{
+	struct map_benchmark_data *map = (struct map_benchmark_data *)data;
+
+	debugfs_remove(map->debugfs);
+}
+
+static int __map_benchmark_probe(struct device *dev)
+{
+	struct dentry *entry;
+	struct map_benchmark_data *map;
+	int ret;
+
+	map = devm_kzalloc(dev, sizeof(*map), GFP_KERNEL);
+	if (!map)
+		return -ENOMEM;
+	map->dev = dev;
+
+	ret = devm_add_action(dev, map_benchmark_remove_debugfs, map);
+	if (ret) {
+		pr_err("Can't add debugfs remove action\n");
+		return ret;
+	}
+
+	/*
+	 * we only permit one device bound to this driver; a 2nd probe
+	 * will fail
+	 */
+	entry = debugfs_create_file("dma_map_benchmark", 0600, NULL, map,
+				    &map_benchmark_fops);
+	if (IS_ERR(entry))
+		return PTR_ERR(entry);
+	map->debugfs = entry;
+
+	return 0;
+}
+
+static int map_benchmark_platform_probe(struct platform_device *pdev)
+{
+	return __map_benchmark_probe(&pdev->dev);
+}
+
+static struct platform_driver map_benchmark_platform_driver = {
+	.driver	= {
+		.name	= "dma_map_benchmark",
+	},
+	.probe	= map_benchmark_platform_probe,
+};
+
+static int map_benchmark_pci_probe(struct pci_dev *pdev,
+				   const struct pci_device_id *id)
+{
+	return __map_benchmark_probe(&pdev->dev);
+}
+
+static struct pci_driver map_benchmark_pci_driver = {
+	.name	= "dma_map_benchmark",
+	.probe	= map_benchmark_pci_probe,
+};
+
+static int __init map_benchmark_init(void)
+{
+	int ret;
+
+	ret = pci_register_driver(&map_benchmark_pci_driver);
+	if (ret)
+		return ret;
+
+	ret = platform_driver_register(&map_benchmark_platform_driver);
+	if (ret) {
+		pci_unregister_driver(&map_benchmark_pci_driver);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void __exit map_benchmark_cleanup(void)
+{
+	platform_driver_unregister(&map_benchmark_platform_driver);
+	pci_unregister_driver(&map_benchmark_pci_driver);
+}
+
+module_init(map_benchmark_init);
+module_exit(map_benchmark_cleanup);
+
+MODULE_AUTHOR("Barry Song <song.bao.hua@hisilicon.com>");
+MODULE_DESCRIPTION("dma_map benchmark driver");
+MODULE_LICENSE("GPL");
On 02/11/2020 08:06, Barry Song wrote:
+config DMA_MAP_BENCHMARK
+	bool "Enable benchmarking of streaming DMA mapping"
+	help
+	  Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
+	  performance of dma_(un)map_page.
Since this is a driver, is there any reason it cannot be loadable? If so, it seems any functionality would depend on DEBUG_FS; I figure that's just how we work for debugfs.
Thanks, John
-----Original Message-----
From: John Garry
Sent: Monday, November 2, 2020 10:19 PM
To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; iommu@lists.linux-foundation.org; hch@lst.de; robin.murphy@arm.com; m.szyprowski@samsung.com
Cc: linux-kselftest@vger.kernel.org; Shuah Khan <shuah@kernel.org>; Joerg Roedel <joro@8bytes.org>; Linuxarm <linuxarm@huawei.com>; xuwei (O) <xuwei5@huawei.com>; Will Deacon <will@kernel.org>
Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs
On 02/11/2020 08:06, Barry Song wrote:
Since this is a driver, is there any reason it cannot be loadable? If so, it seems any functionality would depend on DEBUG_FS; I figure that's just how we work for debugfs.
We depend on kthread_bind_mask, which isn't an exported symbol. Maybe it's worth sending a patch to export it?
Thanks Barry
Hello Robin, Christoph, any further comments? John suggested that "depends on DEBUG_FS" should be added in Kconfig. I am collecting more comments so I can send v4 together with the fix for this minor issue :-)
Thanks Barry
-----Original Message-----
From: Song Bao Hua (Barry Song)
Sent: Monday, November 2, 2020 9:07 PM
To: iommu@lists.linux-foundation.org; hch@lst.de; robin.murphy@arm.com; m.szyprowski@samsung.com
Cc: Linuxarm <linuxarm@huawei.com>; linux-kselftest@vger.kernel.org; xuwei (O) <xuwei5@huawei.com>; Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; Joerg Roedel <joro@8bytes.org>; Will Deacon <will@kernel.org>; Shuah Khan <shuah@kernel.org>
Subject: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs
On 10/11/2020 08:10, Song Bao Hua (Barry Song) wrote:
Hello Robin, Christoph, Any further comment? John suggested that "depends on DEBUG_FS" should be added in Kconfig. I am collecting more comments to send v4 together with fixing this minor issue :-)
Thanks Barry
-----Original Message-----
From: Song Bao Hua (Barry Song)
Sent: Monday, November 2, 2020 9:07 PM
To: iommu@lists.linux-foundation.org; hch@lst.de; robin.murphy@arm.com; m.szyprowski@samsung.com
Cc: Linuxarm <linuxarm@huawei.com>; linux-kselftest@vger.kernel.org; xuwei (O) <xuwei5@huawei.com>; Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; Joerg Roedel <joro@8bytes.org>; Will Deacon <will@kernel.org>; Shuah Khan <shuah@kernel.org>
Subject: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs
Nowadays, there are increasing requirements to benchmark the performance of dma_map and dma_unmap, particularly while the device is attached to an IOMMU.
This patch enables the support. Users can run a specified number of threads to do dma_map_page and dma_unmap_page on a specific NUMA node for a specified duration. Then dma_map_benchmark will calculate the average latency for map and unmap.
A difficulty for this benchmark is that dma_map/unmap APIs must run on a particular device. Each device might have a different backend, either an IOMMU or a direct mapping.
So we use driver_override to bind dma_map_benchmark to a particular device by:

For platform devices:
echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
echo xxx > /sys/bus/platform/drivers/xxx/unbind
echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
Hi Barry,
For PCI devices:
echo dma_map_benchmark > /sys/bus/pci/devices/0000:00:01.0/driver_override
echo 0000:00:01.0 > /sys/bus/pci/drivers/xxx/unbind
echo 0000:00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind
Do we need to check if the device to which we attach actually has DMA mapping capability?
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Will Deacon <will@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Thanks, John
-----Original Message----- From: John Garry Sent: Tuesday, November 10, 2020 9:39 PM To: Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; iommu@lists.linux-foundation.org; hch@lst.de; robin.murphy@arm.com; m.szyprowski@samsung.com Cc: linux-kselftest@vger.kernel.org; Will Deacon will@kernel.org; Joerg Roedel joro@8bytes.org; Linuxarm linuxarm@huawei.com; xuwei (O) xuwei5@huawei.com; Shuah Khan shuah@kernel.org Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs
Hello John,
I'd like to think checking this here would be overdesign. We just give users the freedom to bind any device they care about to the benchmark driver. Usually that means real hardware, either behind an IOMMU or using a direct mapping.

If for any reason users bind a wrong device, that is their choice. In any case, the code below will still handle it properly and users will get a report in which everything is zero.
+static int map_benchmark_thread(void *data)
+{
	...
+	dma_addr = dma_map_single(map->dev, buf, PAGE_SIZE, DMA_BIDIRECTIONAL);
+	if (unlikely(dma_mapping_error(map->dev, dma_addr))) {
+		pr_err("dma_map_single failed on %s\n", dev_name(map->dev));
+		ret = -ENOMEM;
+		goto out;
+	}
	...
+}
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Will Deacon <will@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Thanks, John
Thanks Barry
Lots of > 80 char lines. Please fix up the style.
I think this needs to set a dma mask, as the behavior for an unlimited dma mask vs the default 32-bit one can be very different. I also think you need to be able to pass the direction or have different tests for directions. Bidirectional is not exactly heavily used and pays more cache management penalty.
-----Original Message----- From: Christoph Hellwig [mailto:hch@lst.de] Sent: Sunday, November 15, 2020 5:54 AM To: Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com Cc: iommu@lists.linux-foundation.org; hch@lst.de; robin.murphy@arm.com; m.szyprowski@samsung.com; Linuxarm linuxarm@huawei.com; linux-kselftest@vger.kernel.org; xuwei (O) xuwei5@huawei.com; Joerg Roedel joro@8bytes.org; Will Deacon will@kernel.org; Shuah Khan shuah@kernel.org Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs
Lots of > 80 char lines. Please fix up the style.
Checkpatch has changed 80 to 100. That's probably why my local checkpatch didn't report any warning: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
I am happy to change them to be less than 80 if you like.
I think this needs to set a dma mask as behavior for unlimited dma mask vs the default 32-bit one can be very different.
I actually prefer users binding real devices with their real dma_mask for testing, rather than forcibly changing the dma_mask in this benchmark.

Some devices might have a 32-bit dma_mask while others might have an unlimited one. But both can bind to this driver, and unbind from it after the test is done. So users just need to bind those different real devices, with their different real dma_masks, to dma_benchmark.

This reflects the real performance of the real device better, I think.
I also think you need to be able to pass the direction or have different tests for directions. Bidirectional is not exactly heavily used and pays more cache management penalty.
For this, I'd like to add a direction option to the test app and pass that option to the benchmark driver.
Thanks Barry
On Sun, Nov 15, 2020 at 12:11:15AM +0000, Song Bao Hua (Barry Song) wrote:
Checkpatch has changed 80 to 100. That's probably why my local checkpatch didn't report any warning: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
I am happy to change them to be less than 80 if you like.
Don't rely on checkpatch, it is broken. Look at the coding-style document.
I think this needs to set a dma mask as behavior for unlimited dma mask vs the default 32-bit one can be very different.
I actually prefer users bind real devices with real dma_mask to test rather than force to change the dma_mask in this benchmark.
The mask is set by the driver, not the device. So you need to set it when you bind, real device or not.
-----Original Message----- From: Christoph Hellwig [mailto:hch@lst.de] Sent: Sunday, November 15, 2020 9:45 PM To: Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com Cc: Christoph Hellwig hch@lst.de; iommu@lists.linux-foundation.org; robin.murphy@arm.com; m.szyprowski@samsung.com; Linuxarm linuxarm@huawei.com; linux-kselftest@vger.kernel.org; xuwei (O) xuwei5@huawei.com; Joerg Roedel joro@8bytes.org; Will Deacon will@kernel.org; Shuah Khan shuah@kernel.org Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs
Yep, though it is a little bit tricky.
Sometimes, it is done by the "device" in architecture code, e.g. there is a lot of dma_mask configuration code in arch/arm/mach-xxx:

arch/arm/mach-davinci/da850.c:

static u64 da850_vpif_dma_mask = DMA_BIT_MASK(32);

static struct platform_device da850_vpif_dev = {
	.name		= "vpif",
	.id		= -1,
	.dev		= {
		.dma_mask		= &da850_vpif_dma_mask,
		.coherent_dma_mask	= DMA_BIT_MASK(32),
	},
	.resource	= da850_vpif_resource,
	.num_resources	= ARRAY_SIZE(da850_vpif_resource),
};
Sometimes, it is done by "of" or "acpi", for example:

drivers/acpi/arm64/iort.c:

void iort_dma_setup(struct device *dev, u64 *dma_addr, u64 *dma_size)
{
	u64 end, mask, dmaaddr = 0, size = 0, offset = 0;
	int ret;

	...

	ret = acpi_dma_get_range(dev, &dmaaddr, &offset, &size);
	if (!ret) {
		/*
		 * Limit coherent and dma mask based on size retrieved from
		 * firmware.
		 */
		end = dmaaddr + size - 1;
		mask = DMA_BIT_MASK(ilog2(end) + 1);
		dev->bus_dma_limit = end;
		dev->coherent_dma_mask = mask;
		*dev->dma_mask = mask;
	}
	...
}
Sometimes, it is done by the "bus", for example, ISA:

	isa_dev->dev.coherent_dma_mask = DMA_BIT_MASK(24);
	isa_dev->dev.dma_mask = &isa_dev->dev.coherent_dma_mask;

	error = device_register(&isa_dev->dev);
	if (error) {
		put_device(&isa_dev->dev);
		break;
	}
And in many cases, it is done by the driver. On the ARM64 server platform I am testing, drivers actually rarely set the dma_mask.

So to make the dma benchmark work on all platforms, it seems worth adding a dma_mask_bit parameter. But, in order to avoid breaking the dma_mask of those devices whose dma_mask is set by architecture code, ACPI or the bus, it seems we need to do the below in dma_benchmark:
	u64 old_mask;

	old_mask = dma_get_mask(dev);

	dma_set_mask(dev, new_mask);

	do_map_benchmark();

	/*
	 * Restore the old dma_mask so that the dma_mask of the device is
	 * not changed by the benchmark when it is bound back to its
	 * original driver.
	 */
	dma_set_mask(dev, old_mask);
Thanks Barry
This patch provides the test application for DMA_MAP_BENCHMARK.
Before running the test application, we need to bind a device to the dma_map_benchmark driver. For example, unbind "xxx" from its original driver and bind it to dma_map_benchmark:
echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
echo xxx > /sys/bus/platform/drivers/xxx/unbind
echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
Another example for PCI devices:
echo dma_map_benchmark > /sys/bus/pci/devices/0000:00:01.0/driver_override
echo 0000:00:01.0 > /sys/bus/pci/drivers/xxx/unbind
echo 0000:00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind
The below command will run 16 threads for 10 seconds on NUMA node 0, on the device bound to the dma_map_benchmark platform_driver or pci_driver:

./dma_map_benchmark -t 16 -s 10 -n 0
dma mapping benchmark: threads:16 seconds:10
average map latency(us):1.1 standard deviation:1.9
average unmap latency(us):0.5 standard deviation:0.8
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Will Deacon <will@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 MAINTAINERS                                   |  6 ++
 tools/testing/selftests/dma/Makefile          |  6 ++
 tools/testing/selftests/dma/config            |  1 +
 .../testing/selftests/dma/dma_map_benchmark.c | 87 +++++++++++++++++++
 4 files changed, 100 insertions(+)
 create mode 100644 tools/testing/selftests/dma/Makefile
 create mode 100644 tools/testing/selftests/dma/config
 create mode 100644 tools/testing/selftests/dma/dma_map_benchmark.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 608fc8484c02..a1e38d5e14f6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5247,6 +5247,12 @@ F:	include/linux/dma-mapping.h
 F:	include/linux/dma-map-ops.h
 F:	kernel/dma/
 
+DMA MAPPING BENCHMARK
+M:	Barry Song <song.bao.hua@hisilicon.com>
+L:	iommu@lists.linux-foundation.org
+F:	kernel/dma/map_benchmark.c
+F:	tools/testing/selftests/dma/
+
 DMA-BUF HEAPS FRAMEWORK
 M:	Sumit Semwal <sumit.semwal@linaro.org>
 R:	Benjamin Gaignard <benjamin.gaignard@linaro.org>
diff --git a/tools/testing/selftests/dma/Makefile b/tools/testing/selftests/dma/Makefile
new file mode 100644
index 000000000000..aa8e8b5b3864
--- /dev/null
+++ b/tools/testing/selftests/dma/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+CFLAGS += -I../../../../usr/include/
+
+TEST_GEN_PROGS := dma_map_benchmark
+
+include ../lib.mk
diff --git a/tools/testing/selftests/dma/config b/tools/testing/selftests/dma/config
new file mode 100644
index 000000000000..6102ee3c43cd
--- /dev/null
+++ b/tools/testing/selftests/dma/config
@@ -0,0 +1 @@
+CONFIG_DMA_MAP_BENCHMARK=y
diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c
new file mode 100644
index 000000000000..4778df0c458f
--- /dev/null
+++ b/tools/testing/selftests/dma/dma_map_benchmark.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Hisilicon Limited.
+ */
+
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <linux/types.h>
+
+#define DMA_MAP_BENCHMARK	_IOWR('d', 1, struct map_benchmark)
+#define DMA_MAP_MAX_THREADS	1024
+#define DMA_MAP_MAX_SECONDS	300
+
+struct map_benchmark {
+	__u64 avg_map_100ns;	/* average map latency in 100ns */
+	__u64 map_stddev;	/* standard deviation of map latency */
+	__u64 avg_unmap_100ns;	/* as above */
+	__u64 unmap_stddev;
+	__u32 threads;		/* how many threads will do map/unmap in parallel */
+	__u32 seconds;		/* how long the test will last */
+	int node;		/* which numa node this benchmark will run on */
+	__u64 expansion[10];	/* For future use */
+};
+
+int main(int argc, char **argv)
+{
+	struct map_benchmark map;
+	int fd, opt;
+	/* default single thread, run 20 seconds on NUMA_NO_NODE */
+	int threads = 1, seconds = 20, node = -1;
+	int cmd = DMA_MAP_BENCHMARK;
+	char *p;
+
+	while ((opt = getopt(argc, argv, "t:s:n:")) != -1) {
+		switch (opt) {
+		case 't':
+			threads = atoi(optarg);
+			break;
+		case 's':
+			seconds = atoi(optarg);
+			break;
+		case 'n':
+			node = atoi(optarg);
+			break;
+		default:
+			return -1;
+		}
+	}
+
+	if (threads <= 0 || threads > DMA_MAP_MAX_THREADS) {
+		fprintf(stderr, "invalid number of threads, must be in 1-%d\n",
+			DMA_MAP_MAX_THREADS);
+		exit(1);
+	}
+
+	if (seconds <= 0 || seconds > DMA_MAP_MAX_SECONDS) {
+		fprintf(stderr, "invalid number of seconds, must be in 1-%d\n",
+			DMA_MAP_MAX_SECONDS);
+		exit(1);
+	}
+
+	fd = open("/sys/kernel/debug/dma_map_benchmark", O_RDWR);
+	if (fd == -1) {
+		perror("open");
+		exit(1);
+	}
+
+	map.seconds = seconds;
+	map.threads = threads;
+	map.node = node;
+	if (ioctl(fd, cmd, &map)) {
+		perror("ioctl");
+		exit(1);
+	}
+
+	printf("dma mapping benchmark: threads:%d seconds:%d\n", threads, seconds);
+	printf("average map latency(us):%.1f standard deviation:%.1f\n",
+		map.avg_map_100ns/10.0, map.map_stddev/10.0);
+	printf("average unmap latency(us):%.1f standard deviation:%.1f\n",
+		map.avg_unmap_100ns/10.0, map.unmap_stddev/10.0);
+
+	return 0;
+}