Nowadays, there are increasing requirements to benchmark the performance of dma_map and dma_unmap particually while the device is attached to an IOMMU.
This patchset provides the benchmark infrastruture for streaming DMA mapping. The architecture of the code is pretty much similar with GUP benchmark: * mm/gup_benchmark.c provides kernel interface; * tools/testing/selftests/vm/gup_benchmark.c provides user program to call the interface provided by mm/gup_benchmark.c.
In our case, kernel/dma/map_benchmark.c is like mm/gup_benchmark.c; tools/testing/selftests/dma/dma_map_benchmark.c is like tools/testing/ selftests/vm/gup_benchmark.c
A major difference with GUP benchmark is DMA_MAP benchmark needs to run on a device. Considering one board with below devices and IOMMUs device A ------- IOMMU 1 device B ------- IOMMU 2 device C ------- non-IOMMU
Different devices might attach to different IOMMU or non-IOMMU. To make benchmark run, we can either * create a virtual device and hack the kernel code to attach the virtual device to IOMMU1, IOMMU2 or non-IOMMU. * use the existing driver_override mechinism, unbind device A,B, or c from their original driver and bind them to "dma_map_benchmark" platform_driver or pci_driver for benchmarking.
In this patchset, I prefer to use the driver_override and avoid the various hack in kernel. We can dynamically switch devices behind different IOMMUs to get the performance of dma map on IOMMU or non-IOMMU.
Barry Song (2): dma-mapping: add benchmark support for streaming DMA APIs selftests/dma: add test application for DMA_MAP_BENCHMARK
MAINTAINERS | 6 + kernel/dma/Kconfig | 8 + kernel/dma/Makefile | 1 + kernel/dma/map_benchmark.c | 202 ++++++++++++++++++ tools/testing/selftests/dma/Makefile | 6 + tools/testing/selftests/dma/config | 1 + .../testing/selftests/dma/dma_map_benchmark.c | 72 +++++++ 7 files changed, 296 insertions(+) create mode 100644 kernel/dma/map_benchmark.c create mode 100644 tools/testing/selftests/dma/Makefile create mode 100644 tools/testing/selftests/dma/config create mode 100644 tools/testing/selftests/dma/dma_map_benchmark.c
Nowadays, there are increasing requirements to benchmark the performance of dma_map and dma_unmap particually while the device is attached to an IOMMU.
This patch enables the support. Users can run specified number of threads to do dma_map_page and dma_unmap_page on a specific NUMA node with the specified duration. Then dma_map_benchmark will calculate the average latency for map and unmap.
A difficulity for this benchmark is that dma_map/unmap APIs must run on a particular device. Each device might have different backend of IOMMU or non-IOMMU.
So we use the driver_override to bind dma_map_benchmark to a particual device by: echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override echo xxx > /sys/bus/platform/drivers/xxx/unbind echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
For this moment, it supports platform device only, PCI device will also be supported afterwards.
Cc: Joerg Roedel joro@8bytes.org Cc: Will Deacon will@kernel.org Cc: Shuah Khan shuah@kernel.org Cc: Christoph Hellwig hch@lst.de Cc: Marek Szyprowski m.szyprowski@samsung.com Cc: Robin Murphy robin.murphy@arm.com Signed-off-by: Barry Song song.bao.hua@hisilicon.com --- kernel/dma/Kconfig | 8 ++ kernel/dma/Makefile | 1 + kernel/dma/map_benchmark.c | 202 +++++++++++++++++++++++++++++++++++++ 3 files changed, 211 insertions(+) create mode 100644 kernel/dma/map_benchmark.c
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index c99de4a21458..949c53da5991 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG is technically out-of-spec.
If unsure, say N. + +config DMA_MAP_BENCHMARK + bool "Enable benchmarking of streaming DMA mapping" + help + Provides /sys/kernel/debug/dma_map_benchmark that helps with testing + performance of dma_(un)map_page. + + See tools/testing/selftests/dma/dma_map_benchmark.c diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile index dc755ab68aab..7aa6b26b1348 100644 --- a/kernel/dma/Makefile +++ b/kernel/dma/Makefile @@ -10,3 +10,4 @@ obj-$(CONFIG_DMA_API_DEBUG) += debug.o obj-$(CONFIG_SWIOTLB) += swiotlb.o obj-$(CONFIG_DMA_COHERENT_POOL) += pool.o obj-$(CONFIG_DMA_REMAP) += remap.o +obj-$(CONFIG_DMA_MAP_BENCHMARK) += map_benchmark.o diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c new file mode 100644 index 000000000000..16a5d7779d67 --- /dev/null +++ b/kernel/dma/map_benchmark.c @@ -0,0 +1,202 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2020 Hisilicon Limited. + */ + +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/device.h> +#include <linux/kthread.h> +#include <linux/slab.h> +#include <linux/debugfs.h> +#include <linux/dma-mapping.h> +#include <linux/timekeeping.h> +#include <linux/delay.h> +#include <linux/platform_device.h> + +#define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark) + +struct map_benchmark { + __u64 map_nsec; + __u64 unmap_nsec; + __u32 threads; /* how many threads will do map/unmap in parallel */ + __u32 seconds; /* how long the test will last */ + int node; /* which numa node this benchmark will run on */ + __u64 expansion[10]; /* For future use */ +}; + +struct map_benchmark_data { + struct map_benchmark bparam; + struct device *dev; + struct dentry *debugfs; + atomic64_t total_map_nsecs; + atomic64_t total_map_loops; + atomic64_t total_unmap_nsecs; + atomic64_t total_unmap_loops; +}; + +static int map_benchmark_thread(void *data) +{ + struct page *page; + dma_addr_t dma_addr; + struct map_benchmark_data *map = data; + int ret = 0; + + page = alloc_page(GFP_KERNEL); + if (!page) + return -ENOMEM; + + while (!kthread_should_stop()) { + ktime_t map_stime, map_etime, unmap_stime, unmap_etime; + + map_stime = ktime_get(); + dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL); + if (unlikely(dma_mapping_error(map->dev, dma_addr))) { + dev_err(map->dev, "dma_map_page failed\n"); + ret = -ENOMEM; + goto out; + } + map_etime = ktime_get(); + + unmap_stime = ktime_get(); + dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL); + unmap_etime = ktime_get(); + + atomic64_add((long long)ktime_to_ns(ktime_sub(map_etime, map_stime)), + &map->total_map_nsecs); + atomic64_add((long long)ktime_to_ns(ktime_sub(unmap_etime, unmap_stime)), + &map->total_unmap_nsecs); + atomic64_inc(&map->total_map_loops); + atomic64_inc(&map->total_unmap_loops); + } + +out: + __free_page(page); + return ret; +} + +static int do_map_benchmark(struct map_benchmark_data *map) +{ + struct task_struct **tsk; + int threads = map->bparam.threads; + int node = map->bparam.node; + const cpumask_t *cpu_mask = cpumask_of_node(node); + int ret = 0; + int i; + + tsk = kmalloc_array(threads, sizeof(tsk), GFP_KERNEL); + if (!tsk) + return -ENOMEM; + + get_device(map->dev); + + for (i = 0; i < threads; i++) { + tsk[i] = kthread_create_on_node(map_benchmark_thread, map, + map->bparam.node, "dma-map-benchmark/%d", i); + if (IS_ERR(tsk[i])) { + dev_err(map->dev, "create dma_map thread failed\n"); + return PTR_ERR(tsk[i]); + } + + if (node != NUMA_NO_NODE && node_online(node)) + kthread_bind_mask(tsk[i], cpu_mask); + + wake_up_process(tsk[i]); + } + + ssleep(map->bparam.seconds); + + /* wait for the completion of benchmark threads */ + for (i = 0; i < threads; i++) { + ret = kthread_stop(tsk[i]); + if (ret) + goto out; + } + + /* average map nsec and unmap nsec */ + map->bparam.map_nsec = atomic64_read(&map->total_map_nsecs) / + atomic64_read(&map->total_map_loops); + map->bparam.unmap_nsec = atomic64_read(&map->total_unmap_nsecs) / + atomic64_read(&map->total_unmap_loops); + +out: + put_device(map->dev); + kfree(tsk); + return ret; +} + +static long map_benchmark_ioctl(struct file *filep, unsigned int cmd, + unsigned long arg) +{ + struct map_benchmark_data *map = filep->private_data; + int ret; + + if (copy_from_user(&map->bparam, (void __user *)arg, sizeof(map->bparam))) + return -EFAULT; + + switch (cmd) { + case DMA_MAP_BENCHMARK: + ret = do_map_benchmark(map); + break; + default: + return -EINVAL; + } + + if (copy_to_user((void __user *)arg, &map->bparam, sizeof(map->bparam))) + return -EFAULT; + + return ret; +} + +static const struct file_operations map_benchmark_fops = { + .open = simple_open, + .unlocked_ioctl = map_benchmark_ioctl, +}; + +static int map_benchmark_probe(struct platform_device *pdev) +{ + struct dentry *entry; + struct map_benchmark_data *map; + + map = devm_kzalloc(&pdev->dev, sizeof(*map), GFP_KERNEL); + if (!map) + return -ENOMEM; + + map->dev = &pdev->dev; + platform_set_drvdata(pdev, map); + + /* + * we only permit a device bound with this driver, 2nd probe + * will fail + */ + entry = debugfs_create_file("dma_map_benchmark", 0600, NULL, map, + &map_benchmark_fops); + if (IS_ERR(entry)) + return PTR_ERR(entry); + map->debugfs = entry; + + return 0; +} + +static int map_benchmark_remove(struct platform_device *pdev) +{ + struct map_benchmark_data *map = platform_get_drvdata(pdev); + + debugfs_remove(map->debugfs); + + return 0; +} + +static struct platform_driver map_benchmark_driver = { + .driver = { + .name = "dma_map_benchmark", + }, + .probe = map_benchmark_probe, + .remove = map_benchmark_remove, +}; + +module_platform_driver(map_benchmark_driver); + +MODULE_AUTHOR("Barry Song song.bao.hua@hisilicon.com"); +MODULE_DESCRIPTION("dma_map benchmark driver"); +MODULE_LICENSE("GPL");
This patch provides the test application for DMA_MAP_BENCHMARK.
Before running the test application, we need to bind a device to dma_map_ benchmark driver. For example, unbind "xxx" from its original driver and bind to dma_map_benchmark:
echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override echo xxx > /sys/bus/platform/drivers/xxx/unbind echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
Then, run 10 threads on numa node 1 for 10 seconds on device "xxx": ./dma_map_benchmark -t 10 -s 10 -n 1 dma mapping benchmark: average map_nsec:3619 average unmap_nsec:2423
Cc: Joerg Roedel joro@8bytes.org Cc: Will Deacon will@kernel.org Cc: Shuah Khan shuah@kernel.org Cc: Christoph Hellwig hch@lst.de Cc: Marek Szyprowski m.szyprowski@samsung.com Cc: Robin Murphy robin.murphy@arm.com Signed-off-by: Barry Song song.bao.hua@hisilicon.com --- MAINTAINERS | 6 ++ tools/testing/selftests/dma/Makefile | 6 ++ tools/testing/selftests/dma/config | 1 + .../testing/selftests/dma/dma_map_benchmark.c | 72 +++++++++++++++++++ 4 files changed, 85 insertions(+) create mode 100644 tools/testing/selftests/dma/Makefile create mode 100644 tools/testing/selftests/dma/config create mode 100644 tools/testing/selftests/dma/dma_map_benchmark.c
diff --git a/MAINTAINERS b/MAINTAINERS index f310f0a09904..552389874ca2 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -5220,6 +5220,12 @@ F: include/linux/dma-mapping.h F: include/linux/dma-map-ops.h F: kernel/dma/
+DMA MAPPING BENCHMARK +M: Barry Song song.bao.hua@hisilicon.com +L: iommu@lists.linux-foundation.org +F: kernel/dma/map_benchmark.c +F: tools/testing/selftests/dma/ + DMA-BUF HEAPS FRAMEWORK M: Sumit Semwal sumit.semwal@linaro.org R: Andrew F. Davis afd@ti.com diff --git a/tools/testing/selftests/dma/Makefile b/tools/testing/selftests/dma/Makefile new file mode 100644 index 000000000000..aa8e8b5b3864 --- /dev/null +++ b/tools/testing/selftests/dma/Makefile @@ -0,0 +1,6 @@ +# SPDX-License-Identifier: GPL-2.0 +CFLAGS += -I../../../../usr/include/ + +TEST_GEN_PROGS := dma_map_benchmark + +include ../lib.mk diff --git a/tools/testing/selftests/dma/config b/tools/testing/selftests/dma/config new file mode 100644 index 000000000000..6102ee3c43cd --- /dev/null +++ b/tools/testing/selftests/dma/config @@ -0,0 +1 @@ +CONFIG_DMA_MAP_BENCHMARK=y diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c new file mode 100644 index 000000000000..e03bd03e101e --- /dev/null +++ b/tools/testing/selftests/dma/dma_map_benchmark.c @@ -0,0 +1,72 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2020 Hisilicon Limited. + */ + +#include <fcntl.h> +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <sys/ioctl.h> +#include <sys/mman.h> +#include <linux/types.h> + +#define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark) + +struct map_benchmark { + __u64 map_nsec; + __u64 unmap_nsec; + __u32 threads; /* how many threads will do map/unmap in parallel */ + __u32 seconds; /* how long the test will last */ + int node; /* which numa node this benchmark will run on */ + __u64 expansion[10]; /* For future use */ +}; + +int main(int argc, char **argv) +{ + struct map_benchmark map; + int fd, opt, threads = 0, seconds = 0, node = -1; + int cmd = DMA_MAP_BENCHMARK; + char *p; + + while ((opt = getopt(argc, argv, "t:s:n:")) != -1) { + switch (opt) { + case 't': + threads = atoi(optarg); + break; + case 's': + seconds = atoi(optarg); + break; + case 'n': + node = atoi(optarg); + break; + default: + return -1; + } + } + + if (threads <= 0 || seconds <= 0) { + perror("invalid number of threads or seconds"); + exit(1); + } + + fd = open("/sys/kernel/debug/dma_map_benchmark", O_RDWR); + if (fd == -1) { + perror("open"); + exit(1); + } + + map.seconds = seconds; + map.threads = threads; + map.node = node; + if (ioctl(fd, cmd, &map)) { + perror("ioctl"); + exit(1); + } + + printf("dma mapping benchmark: average map_nsec:%lld average unmap_nsec:%lld\n", + map.map_nsec, + map.unmap_nsec); + + return 0; +}
On 2020-10-27 03:53, Barry Song wrote:
Nowadays, there are increasing requirements to benchmark the performance of dma_map and dma_unmap particually while the device is attached to an IOMMU.
This patch enables the support. Users can run specified number of threads to do dma_map_page and dma_unmap_page on a specific NUMA node with the specified duration. Then dma_map_benchmark will calculate the average latency for map and unmap.
A difficulity for this benchmark is that dma_map/unmap APIs must run on a particular device. Each device might have different backend of IOMMU or non-IOMMU.
So we use the driver_override to bind dma_map_benchmark to a particual device by: echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override echo xxx > /sys/bus/platform/drivers/xxx/unbind echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
For this moment, it supports platform device only, PCI device will also be supported afterwards.
Neat! This is something I've thought about many times, but never got round to attempting :)
I think the basic latency measurement for mapping and unmapping pages is enough to start with, but there are definitely some more things that would be interesting to look into for future enhancements:
- a choice of mapping sizes, both smaller and larger than one page, to help characterise stuff like cache maintenance overhead and bounce buffer/IOVA fragmentation. - alternative allocation patterns like doing lots of maps first, then all their corresponding unmaps (to provoke things like the worst-case IOVA rcache behaviour). - ways to exercise a range of those parameters at once across different threads in a single test.
But let's get a basic framework nailed down first...
Cc: Joerg Roedel joro@8bytes.org Cc: Will Deacon will@kernel.org Cc: Shuah Khan shuah@kernel.org Cc: Christoph Hellwig hch@lst.de Cc: Marek Szyprowski m.szyprowski@samsung.com Cc: Robin Murphy robin.murphy@arm.com Signed-off-by: Barry Song song.bao.hua@hisilicon.com
kernel/dma/Kconfig | 8 ++ kernel/dma/Makefile | 1 + kernel/dma/map_benchmark.c | 202 +++++++++++++++++++++++++++++++++++++ 3 files changed, 211 insertions(+) create mode 100644 kernel/dma/map_benchmark.c
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index c99de4a21458..949c53da5991 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG is technically out-of-spec. If unsure, say N.
+config DMA_MAP_BENCHMARK
- bool "Enable benchmarking of streaming DMA mapping"
- help
Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
performance of dma_(un)map_page.
See tools/testing/selftests/dma/dma_map_benchmark.c
diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile index dc755ab68aab..7aa6b26b1348 100644 --- a/kernel/dma/Makefile +++ b/kernel/dma/Makefile @@ -10,3 +10,4 @@ obj-$(CONFIG_DMA_API_DEBUG) += debug.o obj-$(CONFIG_SWIOTLB) += swiotlb.o obj-$(CONFIG_DMA_COHERENT_POOL) += pool.o obj-$(CONFIG_DMA_REMAP) += remap.o +obj-$(CONFIG_DMA_MAP_BENCHMARK) += map_benchmark.o diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c new file mode 100644 index 000000000000..16a5d7779d67 --- /dev/null +++ b/kernel/dma/map_benchmark.c @@ -0,0 +1,202 @@ +// SPDX-License-Identifier: GPL-2.0-only +/*
- Copyright (C) 2020 Hisilicon Limited.
- */
+#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/device.h> +#include <linux/kthread.h> +#include <linux/slab.h> +#include <linux/debugfs.h> +#include <linux/dma-mapping.h> +#include <linux/timekeeping.h> +#include <linux/delay.h> +#include <linux/platform_device.h>
Nit: alphabetical order is always nice, when there's not an existing precedent of a complete mess...
+#define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark)
+struct map_benchmark {
- __u64 map_nsec;
- __u64 unmap_nsec;
- __u32 threads; /* how many threads will do map/unmap in parallel */
- __u32 seconds; /* how long the test will last */
- int node; /* which numa node this benchmark will run on */
- __u64 expansion[10]; /* For future use */
+};
I'm no expert on userspace ABIs (and what little experience I do have is mostly of Win32...), so hopefully someone else will comment if there's anything of concern here. One thing I wonder is that there's a fair likelihood of functionality evolving here over time, so might it be appropriate to have some sort of explicit versioning parameter for robustness?
+struct map_benchmark_data {
- struct map_benchmark bparam;
- struct device *dev;
- struct dentry *debugfs;
- atomic64_t total_map_nsecs;
- atomic64_t total_map_loops;
- atomic64_t total_unmap_nsecs;
- atomic64_t total_unmap_loops;
+};
+static int map_benchmark_thread(void *data) +{
- struct page *page;
- dma_addr_t dma_addr;
- struct map_benchmark_data *map = data;
- int ret = 0;
- page = alloc_page(GFP_KERNEL);
- if (!page)
return -ENOMEM;
- while (!kthread_should_stop()) {
ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
map_stime = ktime_get();
dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
Note that for a non-coherent device, this will give an underestimate of the real-world overhead of BIDIRECTIONAL or TO_DEVICE mappings, since the page will never be dirty in the cache (except possibly the very first time through).
if (unlikely(dma_mapping_error(map->dev, dma_addr))) {
dev_err(map->dev, "dma_map_page failed\n");
ret = -ENOMEM;
goto out;
}
map_etime = ktime_get();
unmap_stime = ktime_get();
dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
Ahem, dma_map_page() pairs with dma_unmap_page(). Unfortunately DMA_API_DEBUG is unable to shout at you for that...
However, maybe changing the other end to dma_map_single() might make more sense, since you're not allocating highmem pages or anything wacky, and that'll be easier to expand in future.
unmap_etime = ktime_get();
atomic64_add((long long)ktime_to_ns(ktime_sub(map_etime, map_stime)),
&map->total_map_nsecs);
ktime_to_ns() returns s64, which is already long long.
atomic64_add((long long)ktime_to_ns(ktime_sub(unmap_etime, unmap_stime)),
&map->total_unmap_nsecs);
atomic64_inc(&map->total_map_loops);
atomic64_inc(&map->total_unmap_loops);
I think it would be worth keeping track of the variances as well - it can be hard to tell if a reasonable-looking average is hiding terrible worst-case behaviour.
- }
+out:
- __free_page(page);
- return ret;
+}
+static int do_map_benchmark(struct map_benchmark_data *map) +{
- struct task_struct **tsk;
- int threads = map->bparam.threads;
I know it's debugfs, but maybe a bit of parameter validation wouldn't go amiss? I'm already imaginging that sinking feeling when the SSH connection stops responding and you realise you've just inadvertently launched INT_MAX threads...
- int node = map->bparam.node;
- const cpumask_t *cpu_mask = cpumask_of_node(node);
- int ret = 0;
- int i;
- tsk = kmalloc_array(threads, sizeof(tsk), GFP_KERNEL);
- if (!tsk)
return -ENOMEM;
- get_device(map->dev);
- for (i = 0; i < threads; i++) {
tsk[i] = kthread_create_on_node(map_benchmark_thread, map,
map->bparam.node, "dma-map-benchmark/%d", i);
if (IS_ERR(tsk[i])) {
dev_err(map->dev, "create dma_map thread failed\n");
return PTR_ERR(tsk[i]);
}
if (node != NUMA_NO_NODE && node_online(node))
kthread_bind_mask(tsk[i], cpu_mask);
wake_up_process(tsk[i]);
Might it be better to create all the threads first, *then* start kicking them?
- }
- ssleep(map->bparam.seconds);
- /* wait for the completion of benchmark threads */
- for (i = 0; i < threads; i++) {
ret = kthread_stop(tsk[i]);
if (ret)
goto out;
- }
- /* average map nsec and unmap nsec */
- map->bparam.map_nsec = atomic64_read(&map->total_map_nsecs) /
atomic64_read(&map->total_map_loops);
- map->bparam.unmap_nsec = atomic64_read(&map->total_unmap_nsecs) /
atomic64_read(&map->total_unmap_loops);
+out:
- put_device(map->dev);
- kfree(tsk);
- return ret;
+}
+static long map_benchmark_ioctl(struct file *filep, unsigned int cmd,
unsigned long arg)
+{
- struct map_benchmark_data *map = filep->private_data;
- int ret;
- if (copy_from_user(&map->bparam, (void __user *)arg, sizeof(map->bparam)))
return -EFAULT;
- switch (cmd) {
- case DMA_MAP_BENCHMARK:
ret = do_map_benchmark(map);
break;
- default:
return -EINVAL;
- }
- if (copy_to_user((void __user *)arg, &map->bparam, sizeof(map->bparam)))
return -EFAULT;
- return ret;
+}
+static const struct file_operations map_benchmark_fops = {
- .open = simple_open,
- .unlocked_ioctl = map_benchmark_ioctl,
+};
+static int map_benchmark_probe(struct platform_device *pdev) +{
- struct dentry *entry;
- struct map_benchmark_data *map;
- map = devm_kzalloc(&pdev->dev, sizeof(*map), GFP_KERNEL);
- if (!map)
return -ENOMEM;
- map->dev = &pdev->dev;
- platform_set_drvdata(pdev, map);
- /*
* we only permit a device bound with this driver, 2nd probe
* will fail
*/
- entry = debugfs_create_file("dma_map_benchmark", 0600, NULL, map,
&map_benchmark_fops);
- if (IS_ERR(entry))
return PTR_ERR(entry);
- map->debugfs = entry;
- return 0;
+}
+static int map_benchmark_remove(struct platform_device *pdev) +{
- struct map_benchmark_data *map = platform_get_drvdata(pdev);
- debugfs_remove(map->debugfs);
Consider a trivial 3-line wrapper plus a devm_add_action() call ;)
Robin.
- return 0;
+}
+static struct platform_driver map_benchmark_driver = {
- .driver = {
.name = "dma_map_benchmark",
- },
- .probe = map_benchmark_probe,
- .remove = map_benchmark_remove,
+};
+module_platform_driver(map_benchmark_driver);
+MODULE_AUTHOR("Barry Song song.bao.hua@hisilicon.com"); +MODULE_DESCRIPTION("dma_map benchmark driver"); +MODULE_LICENSE("GPL");
-----Original Message----- From: Robin Murphy [mailto:robin.murphy@arm.com] Sent: Friday, October 30, 2020 8:38 AM To: Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; iommu@lists.linux-foundation.org; hch@lst.de; m.szyprowski@samsung.com Cc: joro@8bytes.org; will@kernel.org; shuah@kernel.org; Linuxarm linuxarm@huawei.com; linux-kselftest@vger.kernel.org Subject: Re: [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs
On 2020-10-27 03:53, Barry Song wrote:
Nowadays, there are increasing requirements to benchmark the performance of dma_map and dma_unmap particually while the device is attached to an IOMMU.
This patch enables the support. Users can run specified number of threads to do dma_map_page and dma_unmap_page on a specific NUMA node with
the
specified duration. Then dma_map_benchmark will calculate the average latency for map and unmap.
A difficulity for this benchmark is that dma_map/unmap APIs must run on a particular device. Each device might have different backend of IOMMU or non-IOMMU.
So we use the driver_override to bind dma_map_benchmark to a particual device by: echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override echo xxx > /sys/bus/platform/drivers/xxx/unbind echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
For this moment, it supports platform device only, PCI device will also be supported afterwards.
Neat! This is something I've thought about many times, but never got round to attempting :)
I am happy you have the same needs. When I came to IOMMU area a half year ago, the first thing I've done was writing a rough benchmark. At that time, I hacked kernel to get a device behind an IOMMU.
Recently, I got some time to think about how to get "device" without ugly hacking and then clean up code for sending patches out to provide a common benchmark in order that everybody can use.
I think the basic latency measurement for mapping and unmapping pages is enough to start with, but there are definitely some more things that would be interesting to look into for future enhancements:
- a choice of mapping sizes, both smaller and larger than one page, to
help characterise stuff like cache maintenance overhead and bounce buffer/IOVA fragmentation.
- alternative allocation patterns like doing lots of maps first, then
all their corresponding unmaps (to provoke things like the worst-case IOVA rcache behaviour).
- ways to exercise a range of those parameters at once across
different threads in a single test.
Yes, sure. Once we have a basic framework, we can add more benchmark patterns by using different parameters in the userspace tool: testing/selftests/dma/dma_map_benchmark.c
Similar function extensions have been carried out in GUP_BENCHMARK.
But let's get a basic framework nailed down first...
Sure.
Cc: Joerg Roedel joro@8bytes.org Cc: Will Deacon will@kernel.org Cc: Shuah Khan shuah@kernel.org Cc: Christoph Hellwig hch@lst.de Cc: Marek Szyprowski m.szyprowski@samsung.com Cc: Robin Murphy robin.murphy@arm.com Signed-off-by: Barry Song song.bao.hua@hisilicon.com
kernel/dma/Kconfig | 8 ++ kernel/dma/Makefile | 1 + kernel/dma/map_benchmark.c | 202
+++++++++++++++++++++++++++++++++++++
3 files changed, 211 insertions(+) create mode 100644 kernel/dma/map_benchmark.c
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index c99de4a21458..949c53da5991 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG is technically out-of-spec.
If unsure, say N.
+config DMA_MAP_BENCHMARK
- bool "Enable benchmarking of streaming DMA mapping"
- help
Provides /sys/kernel/debug/dma_map_benchmark that helps with
testing
performance of dma_(un)map_page.
See tools/testing/selftests/dma/dma_map_benchmark.c
diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile index dc755ab68aab..7aa6b26b1348 100644 --- a/kernel/dma/Makefile +++ b/kernel/dma/Makefile @@ -10,3 +10,4 @@ obj-$(CONFIG_DMA_API_DEBUG) += debug.o obj-$(CONFIG_SWIOTLB) += swiotlb.o obj-$(CONFIG_DMA_COHERENT_POOL) += pool.o obj-$(CONFIG_DMA_REMAP) += remap.o +obj-$(CONFIG_DMA_MAP_BENCHMARK) += map_benchmark.o diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c new file mode 100644 index 000000000000..16a5d7779d67 --- /dev/null +++ b/kernel/dma/map_benchmark.c @@ -0,0 +1,202 @@ +// SPDX-License-Identifier: GPL-2.0-only +/*
- Copyright (C) 2020 Hisilicon Limited.
- */
+#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/device.h> +#include <linux/kthread.h> +#include <linux/slab.h> +#include <linux/debugfs.h> +#include <linux/dma-mapping.h> +#include <linux/timekeeping.h> +#include <linux/delay.h> +#include <linux/platform_device.h>
Nit: alphabetical order is always nice, when there's not an existing precedent of a complete mess...
+#define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark)
+struct map_benchmark {
- __u64 map_nsec;
- __u64 unmap_nsec;
- __u32 threads; /* how many threads will do map/unmap in parallel */
- __u32 seconds; /* how long the test will last */
- int node; /* which numa node this benchmark will run on */
- __u64 expansion[10]; /* For future use */
+};
I'm no expert on userspace ABIs (and what little experience I do have is mostly of Win32...), so hopefully someone else will comment if there's anything of concern here. One thing I wonder is that there's a fair likelihood of functionality evolving here over time, so might it be appropriate to have some sort of explicit versioning parameter for robustness?
I copied that from gup_benchmark. There is no this kind of code to compare version. I believe there is a likelihood that kernel module is changed but users are still using old userspace tool, this might lead to the incompatible data structure. But not sure if it is a big problem :-)
+struct map_benchmark_data {
- struct map_benchmark bparam;
- struct device *dev;
- struct dentry *debugfs;
- atomic64_t total_map_nsecs;
- atomic64_t total_map_loops;
- atomic64_t total_unmap_nsecs;
- atomic64_t total_unmap_loops;
+};
+static int map_benchmark_thread(void *data) +{
- struct page *page;
- dma_addr_t dma_addr;
- struct map_benchmark_data *map = data;
- int ret = 0;
- page = alloc_page(GFP_KERNEL);
- if (!page)
return -ENOMEM;
- while (!kthread_should_stop()) {
ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
map_stime = ktime_get();
dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE,
DMA_BIDIRECTIONAL);
Note that for a non-coherent device, this will give an underestimate of the real-world overhead of BIDIRECTIONAL or TO_DEVICE mappings, since the page will never be dirty in the cache (except possibly the very first time through).
Agreed. I'd like to add a DIRECTION parameter like "-d 0", "-d 1" after we have this basic framework.
if (unlikely(dma_mapping_error(map->dev, dma_addr))) {
dev_err(map->dev, "dma_map_page failed\n");
ret = -ENOMEM;
goto out;
}
map_etime = ktime_get();
unmap_stime = ktime_get();
dma_unmap_single(map->dev, dma_addr, PAGE_SIZE,
DMA_BIDIRECTIONAL);
Ahem, dma_map_page() pairs with dma_unmap_page(). Unfortunately DMA_API_DEBUG is unable to shout at you for that...
This is a typo. At the beginning, we used dma_map_single and dma_unmap_single. After that, I changed to alloc_page so moved to dma_map_page. But I missed the unmap part.
However, maybe changing the other end to dma_map_single() might make more sense, since you're not allocating highmem pages or anything wacky, and that'll be easier to expand in future.
Yes. I can either change both to map_page/unmap_page or both to map_single/ unmap_single.
unmap_etime = ktime_get();
atomic64_add((long long)ktime_to_ns(ktime_sub(map_etime,
map_stime)),
&map->total_map_nsecs);
ktime_to_ns() returns s64, which is already long long.
Got it.
atomic64_add((long long)ktime_to_ns(ktime_sub(unmap_etime,
unmap_stime)),
&map->total_unmap_nsecs);
atomic64_inc(&map->total_map_loops);
atomic64_inc(&map->total_unmap_loops);
I think it would be worth keeping track of the variances as well - it can be hard to tell if a reasonable-looking average is hiding terrible worst-case behaviour.
This is a sensible requirement. I believe it is better to be handled by the existing kernel tracing method.
Maybe we need a histogram like: Delay sample count 1-2us 1000 *** 2-3us 2000 ******* 3-4us 100 * ..... This will be more precise than the maximum latency in the worst case.
I'd believe this can be handled by: tracepoint A Map Tracepoint B
Tracepoint C Unmap Tracepoint D
Let the userspace ebpf to draw the histogram for the delta of B-A and D-C.
So I am planning to put this requirement into todo list and write an userspace ebpf/bcc script for histogram and put in tools/ directory.
Please give your comments on this.
- }
+out:
- __free_page(page);
- return ret;
+}
+static int do_map_benchmark(struct map_benchmark_data *map) +{
- struct task_struct **tsk;
- int threads = map->bparam.threads;
I know it's debugfs, but maybe a bit of parameter validation wouldn't go amiss? I'm already imaginging that sinking feeling when the SSH connection stops responding and you realise you've just inadvertently launched INT_MAX threads...
Agreed.
- int node = map->bparam.node;
- const cpumask_t *cpu_mask = cpumask_of_node(node);
- int ret = 0;
- int i;
- tsk = kmalloc_array(threads, sizeof(tsk), GFP_KERNEL);
- if (!tsk)
return -ENOMEM;
- get_device(map->dev);
- for (i = 0; i < threads; i++) {
tsk[i] = kthread_create_on_node(map_benchmark_thread, map,
map->bparam.node, "dma-map-benchmark/%d", i);
if (IS_ERR(tsk[i])) {
dev_err(map->dev, "create dma_map thread failed\n");
return PTR_ERR(tsk[i]);
}
if (node != NUMA_NO_NODE && node_online(node))
kthread_bind_mask(tsk[i], cpu_mask);
wake_up_process(tsk[i]);
Might it be better to create all the threads first, *then* start kicking them?
The difficulty is that we don't know how many threads we should create as the thread number is a parameter to test the contention of IOMMU driver. In my test case, I'd like to test things like One thread Two threads .... 8 threads 12 threads 16 threads...
On the other hand, I think it is better to drop the memory of task_struct of those test threads while we are not testing dma map.
- }
- ssleep(map->bparam.seconds);
- /* wait for the completion of benchmark threads */
- for (i = 0; i < threads; i++) {
ret = kthread_stop(tsk[i]);
if (ret)
goto out;
- }
- /* average map nsec and unmap nsec */
- map->bparam.map_nsec = atomic64_read(&map->total_map_nsecs) /
atomic64_read(&map->total_map_loops);
- map->bparam.unmap_nsec = atomic64_read(&map->total_unmap_nsecs)
/
atomic64_read(&map->total_unmap_loops);
+out:
- put_device(map->dev);
- kfree(tsk);
- return ret;
+}
+static long map_benchmark_ioctl(struct file *filep, unsigned int cmd,
unsigned long arg)
+{
- struct map_benchmark_data *map = filep->private_data;
- int ret;
- if (copy_from_user(&map->bparam, (void __user *)arg,
sizeof(map->bparam)))
return -EFAULT;
- switch (cmd) {
- case DMA_MAP_BENCHMARK:
ret = do_map_benchmark(map);
break;
- default:
return -EINVAL;
- }
- if (copy_to_user((void __user *)arg, &map->bparam,
sizeof(map->bparam)))
return -EFAULT;
- return ret;
+}
+static const struct file_operations map_benchmark_fops = {
- .open = simple_open,
- .unlocked_ioctl = map_benchmark_ioctl,
+};
+static int map_benchmark_probe(struct platform_device *pdev) +{
- struct dentry *entry;
- struct map_benchmark_data *map;
- map = devm_kzalloc(&pdev->dev, sizeof(*map), GFP_KERNEL);
- if (!map)
return -ENOMEM;
- map->dev = &pdev->dev;
- platform_set_drvdata(pdev, map);
- /*
* we only permit a device bound with this driver, 2nd probe
* will fail
*/
- entry = debugfs_create_file("dma_map_benchmark", 0600, NULL, map,
&map_benchmark_fops);
- if (IS_ERR(entry))
return PTR_ERR(entry);
- map->debugfs = entry;
- return 0;
+}
+static int map_benchmark_remove(struct platform_device *pdev) +{
- struct map_benchmark_data *map = platform_get_drvdata(pdev);
- debugfs_remove(map->debugfs);
Consider a trivial 3-line wrapper plus a devm_add_action() call ;)
Yes, I will.
Robin.
- return 0;
+}
+static struct platform_driver map_benchmark_driver = {
- .driver = {
.name = "dma_map_benchmark",
- },
- .probe = map_benchmark_probe,
- .remove = map_benchmark_remove,
+};
+module_platform_driver(map_benchmark_driver);
+MODULE_AUTHOR("Barry Song song.bao.hua@hisilicon.com"); +MODULE_DESCRIPTION("dma_map benchmark driver"); +MODULE_LICENSE("GPL");
Thanks Barry
On 2020-10-29 21:39, Song Bao Hua (Barry Song) wrote: [...]
+struct map_benchmark {
- __u64 map_nsec;
- __u64 unmap_nsec;
- __u32 threads; /* how many threads will do map/unmap in parallel */
- __u32 seconds; /* how long the test will last */
- int node; /* which numa node this benchmark will run on */
- __u64 expansion[10]; /* For future use */
+};
I'm no expert on userspace ABIs (and what little experience I do have is mostly of Win32...), so hopefully someone else will comment if there's anything of concern here. One thing I wonder is that there's a fair likelihood of functionality evolving here over time, so might it be appropriate to have some sort of explicit versioning parameter for robustness?
I copied that from gup_benchmark. There is no this kind of code to compare version. I believe there is a likelihood that kernel module is changed but users are still using old userspace tool, this might lead to the incompatible data structure. But not sure if it is a big problem :-)
Yeah, like I say I don't really have a good feeling for what would be best here, I'm just thinking of what I do know and wary of the potential for a "640 bits ought to be enough for anyone" issue ;)
+struct map_benchmark_data {
- struct map_benchmark bparam;
- struct device *dev;
- struct dentry *debugfs;
- atomic64_t total_map_nsecs;
- atomic64_t total_map_loops;
- atomic64_t total_unmap_nsecs;
- atomic64_t total_unmap_loops;
+};
+static int map_benchmark_thread(void *data) +{
- struct page *page;
- dma_addr_t dma_addr;
- struct map_benchmark_data *map = data;
- int ret = 0;
- page = alloc_page(GFP_KERNEL);
- if (!page)
return -ENOMEM;
- while (!kthread_should_stop()) {
ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
map_stime = ktime_get();
dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE,
DMA_BIDIRECTIONAL);
Note that for a non-coherent device, this will give an underestimate of the real-world overhead of BIDIRECTIONAL or TO_DEVICE mappings, since the page will never be dirty in the cache (except possibly the very first time through).
Agreed. I'd like to add a DIRECTION parameter like "-d 0", "-d 1" after we have this basic framework.
That wasn't so much about the direction itself, just that if it's anything other than FROM_DEVICE, we should probably do something to dirty the buffer by a reasonable amount before each map. Otherwise the measured performance is going to be unrealistic on many systems.
[...]
atomic64_add((long long)ktime_to_ns(ktime_sub(unmap_etime,
unmap_stime)),
&map->total_unmap_nsecs);
atomic64_inc(&map->total_map_loops);
atomic64_inc(&map->total_unmap_loops);
I think it would be worth keeping track of the variances as well - it can be hard to tell if a reasonable-looking average is hiding terrible worst-case behaviour.
This is a sensible requirement. I believe it is better to be handled by the existing kernel tracing method.
Maybe we need a histogram like: Delay sample count 1-2us 1000 *** 2-3us 2000 ******* 3-4us 100 * ..... This will be more precise than the maximum latency in the worst case.
I'd believe this can be handled by: tracepoint A Map Tracepoint B
Tracepoint C Unmap Tracepoint D
Let the userspace ebpf to draw the histogram for the delta of B-A and D-C.
So I am planning to put this requirement into todo list and write an userspace ebpf/bcc script for histogram and put in tools/ directory.
Please give your comments on this.
Right, I wasn't suggesting trying to homebrew a full data collection system here - I agree there are better tools for that already - just that it's basically free to track a sum of squares alongside a sum, so that we can trivially calculate a useful variance (or standard deviation) figure alongside the mean at the end.
[...]
- for (i = 0; i < threads; i++) {
tsk[i] = kthread_create_on_node(map_benchmark_thread, map,
map->bparam.node, "dma-map-benchmark/%d", i);
if (IS_ERR(tsk[i])) {
dev_err(map->dev, "create dma_map thread failed\n");
return PTR_ERR(tsk[i]);
}
if (node != NUMA_NO_NODE && node_online(node))
kthread_bind_mask(tsk[i], cpu_mask);
wake_up_process(tsk[i]);
Might it be better to create all the threads first, *then* start kicking them?
The difficulty is that we don't know how many threads we should create as the thread number is a parameter to test the contention of IOMMU driver. In my test case, I'd like to test things like One thread Two threads .... 8 threads 12 threads 16 threads...
On the other hand, I think it is better to drop the memory of task_struct of those test threads while we are not testing dma map.
I simply meant splitting the loop here into two - one to create the threads and set their affinity, then another to wake them all up - so we don't start unnecessarily thrashing the system while we're still trying to set up the rest of the test ;)
Robin.
-----Original Message----- From: Robin Murphy [mailto:robin.murphy@arm.com] Sent: Saturday, October 31, 2020 4:48 AM To: Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; iommu@lists.linux-foundation.org; hch@lst.de; m.szyprowski@samsung.com Cc: joro@8bytes.org; will@kernel.org; shuah@kernel.org; Linuxarm linuxarm@huawei.com; linux-kselftest@vger.kernel.org Subject: Re: [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs
On 2020-10-29 21:39, Song Bao Hua (Barry Song) wrote: [...]
+struct map_benchmark {
- __u64 map_nsec;
- __u64 unmap_nsec;
- __u32 threads; /* how many threads will do map/unmap in parallel
*/
- __u32 seconds; /* how long the test will last */
- int node; /* which numa node this benchmark will run on */
- __u64 expansion[10]; /* For future use */
+};
I'm no expert on userspace ABIs (and what little experience I do have is mostly of Win32...), so hopefully someone else will comment if there's anything of concern here. One thing I wonder is that there's a fair likelihood of functionality evolving here over time, so might it be appropriate to have some sort of explicit versioning parameter for robustness?
I copied that from gup_benchmark. There is no this kind of code to compare version. I believe there is a likelihood that kernel module is changed but users are still using old userspace tool, this might lead to the incompatible data structure. But not sure if it is a big problem :-)
Yeah, like I say I don't really have a good feeling for what would be best here, I'm just thinking of what I do know and wary of the potential for a "640 bits ought to be enough for anyone" issue ;)
+struct map_benchmark_data {
- struct map_benchmark bparam;
- struct device *dev;
- struct dentry *debugfs;
- atomic64_t total_map_nsecs;
- atomic64_t total_map_loops;
- atomic64_t total_unmap_nsecs;
- atomic64_t total_unmap_loops;
+};
+static int map_benchmark_thread(void *data) {
- struct page *page;
- dma_addr_t dma_addr;
- struct map_benchmark_data *map = data;
- int ret = 0;
- page = alloc_page(GFP_KERNEL);
- if (!page)
return -ENOMEM;
- while (!kthread_should_stop()) {
ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
map_stime = ktime_get();
dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE,
DMA_BIDIRECTIONAL);
Note that for a non-coherent device, this will give an underestimate of the real-world overhead of BIDIRECTIONAL or TO_DEVICE mappings, since the page will never be dirty in the cache (except possibly the very first time through).
Agreed. I'd like to add a DIRECTION parameter like "-d 0", "-d 1" after we have this basic framework.
That wasn't so much about the direction itself, just that if it's anything other than FROM_DEVICE, we should probably do something to dirty the buffer by a reasonable amount before each map. Otherwise the measured performance is going to be unrealistic on many systems.
Maybe put a memset(buf, 0, PAGE_SIZE) before dma_map will help ?
[...]
atomic64_add((long long)ktime_to_ns(ktime_sub(unmap_etime,
unmap_stime)),
&map->total_unmap_nsecs);
atomic64_inc(&map->total_map_loops);
atomic64_inc(&map->total_unmap_loops);
I think it would be worth keeping track of the variances as well - it can be hard to tell if a reasonable-looking average is hiding terrible worst-case behaviour.
This is a sensible requirement. I believe it is better to be handled by the existing kernel tracing method.
Maybe we need a histogram like: Delay sample count 1-2us 1000 *** 2-3us 2000 ******* 3-4us 100 * ..... This will be more precise than the maximum latency in the worst case.
I'd believe this can be handled by: tracepoint A Map Tracepoint B
Tracepoint C Unmap Tracepoint D
Let the userspace ebpf to draw the histogram for the delta of B-A and D-C.
So I am planning to put this requirement into todo list and write an userspace ebpf/bcc script for histogram and put in tools/ directory.
Please give your comments on this.
Right, I wasn't suggesting trying to homebrew a full data collection system here
- I agree there are better tools for that already - just that it's basically free to
track a sum of squares alongside a sum, so that we can trivially calculate a useful variance (or standard deviation) figure alongside the mean at the end.
For this case, I am not sure if it is true. Unless we expose more data such as min, max etc. to userspace, it makes no difference whether total_(un)map_nsecs and total_(un)map_loops are exposed or not.
As total loops = seconds / (avg_map_latency + avg_unmap_latency); total_map_nsecs = total loop count * avg_map_latency total_unmap_nsecs = total loop count * avg_unmap_latency
all of seconds, avg_unmap_latency, avg_unmap_latency are known by userspace tool.
[...]
- for (i = 0; i < threads; i++) {
tsk[i] = kthread_create_on_node(map_benchmark_thread, map,
map->bparam.node, "dma-map-benchmark/%d", i);
if (IS_ERR(tsk[i])) {
dev_err(map->dev, "create dma_map thread failed\n");
return PTR_ERR(tsk[i]);
}
if (node != NUMA_NO_NODE && node_online(node))
kthread_bind_mask(tsk[i], cpu_mask);
wake_up_process(tsk[i]);
Might it be better to create all the threads first, *then* start kicking them?
The difficulty is that we don't know how many threads we should create as the thread number is a parameter to test the contention of IOMMU driver. In my test case, I'd like to test things like One thread Two threads .... 8 threads 12 threads 16 threads...
On the other hand, I think it is better to drop the memory of task_struct of those test threads while we are not testing dma map.
I simply meant splitting the loop here into two - one to create the threads and set their affinity, then another to wake them all up - so we don't start unnecessarily thrashing the system while we're still trying to set up the rest of the test ;)
Agreed.
Robin.
Thanks Barry
-----Original Message----- From: Song Bao Hua (Barry Song) [mailto:song.bao.hua@hisilicon.com] Sent: Saturday, October 31, 2020 10:45 PM To: Robin Murphy robin.murphy@arm.com; iommu@lists.linux-foundation.org; hch@lst.de; m.szyprowski@samsung.com Cc: joro@8bytes.org; will@kernel.org; shuah@kernel.org; Linuxarm linuxarm@huawei.com; linux-kselftest@vger.kernel.org Subject: RE: [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs
-----Original Message----- From: Robin Murphy [mailto:robin.murphy@arm.com] Sent: Saturday, October 31, 2020 4:48 AM To: Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; iommu@lists.linux-foundation.org; hch@lst.de; m.szyprowski@samsung.com Cc: joro@8bytes.org; will@kernel.org; shuah@kernel.org; Linuxarm linuxarm@huawei.com; linux-kselftest@vger.kernel.org Subject: Re: [PATCH 1/2] dma-mapping: add benchmark support for
streaming
DMA APIs
On 2020-10-29 21:39, Song Bao Hua (Barry Song) wrote: [...]
+struct map_benchmark {
- __u64 map_nsec;
- __u64 unmap_nsec;
- __u32 threads; /* how many threads will do map/unmap in parallel
*/
- __u32 seconds; /* how long the test will last */
- int node; /* which numa node this benchmark will run on */
- __u64 expansion[10]; /* For future use */
+};
I'm no expert on userspace ABIs (and what little experience I do have is mostly of Win32...), so hopefully someone else will comment if there's anything of concern here. One thing I wonder is that there's a fair likelihood of functionality evolving here over time, so might it be appropriate to have some sort of explicit versioning parameter for robustness?
I copied that from gup_benchmark. There is no this kind of code to compare version. I believe there is a likelihood that kernel module is changed but users are still using old userspace tool, this might lead to the incompatible data structure. But not sure if it is a big problem :-)
Yeah, like I say I don't really have a good feeling for what would be best here, I'm just thinking of what I do know and wary of the potential for a "640 bits ought to be enough for anyone" issue ;)
+struct map_benchmark_data {
- struct map_benchmark bparam;
- struct device *dev;
- struct dentry *debugfs;
- atomic64_t total_map_nsecs;
- atomic64_t total_map_loops;
- atomic64_t total_unmap_nsecs;
- atomic64_t total_unmap_loops;
+};
+static int map_benchmark_thread(void *data) {
- struct page *page;
- dma_addr_t dma_addr;
- struct map_benchmark_data *map = data;
- int ret = 0;
- page = alloc_page(GFP_KERNEL);
- if (!page)
return -ENOMEM;
- while (!kthread_should_stop()) {
ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
map_stime = ktime_get();
dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE,
DMA_BIDIRECTIONAL);
Note that for a non-coherent device, this will give an underestimate of the real-world overhead of BIDIRECTIONAL or TO_DEVICE mappings, since the page will never be dirty in the cache (except possibly the very first time through).
Agreed. I'd like to add a DIRECTION parameter like "-d 0", "-d 1" after we have this basic framework.
That wasn't so much about the direction itself, just that if it's anything other than FROM_DEVICE, we should probably do something to dirty the buffer by
a
reasonable amount before each map. Otherwise the measured performance
is
going to be unrealistic on many systems.
Maybe put a memset(buf, 0, PAGE_SIZE) before dma_map will help ?
[...]
atomic64_add((long long)ktime_to_ns(ktime_sub(unmap_etime,
unmap_stime)),
&map->total_unmap_nsecs);
atomic64_inc(&map->total_map_loops);
atomic64_inc(&map->total_unmap_loops);
I think it would be worth keeping track of the variances as well - it can be hard to tell if a reasonable-looking average is hiding terrible worst-case behaviour.
This is a sensible requirement. I believe it is better to be handled by the existing kernel tracing method.
Maybe we need a histogram like: Delay sample count 1-2us 1000 *** 2-3us 2000 ******* 3-4us 100 * ..... This will be more precise than the maximum latency in the worst case.
I'd believe this can be handled by: tracepoint A Map Tracepoint B
Tracepoint C Unmap Tracepoint D
Let the userspace ebpf to draw the histogram for the delta of B-A and D-C.
So I am planning to put this requirement into todo list and write an userspace ebpf/bcc script for histogram and put in tools/ directory.
Please give your comments on this.
Right, I wasn't suggesting trying to homebrew a full data collection system
here
- I agree there are better tools for that already - just that it's basically free to
track a sum of squares alongside a sum, so that we can trivially calculate a useful variance (or standard deviation) figure alongside the mean at the end.
For this case, I am not sure if it is true. Unless we expose more data such as min, max etc. to userspace, it makes no difference whether total_(un)map_nsecs and total_(un)map_loops are exposed or not.
As total loops = seconds / (avg_map_latency + avg_unmap_latency); total_map_nsecs = total loop count * avg_map_latency total_unmap_nsecs = total loop count * avg_unmap_latency
all of seconds, avg_unmap_latency, avg_unmap_latency are known by userspace tool.
After second thought, it seems you mean the kernel code can output the below to userspace: 1. total loops 2. sum of map and unmap latencies 3. sum of square of map and unmap latencies
+struct map_benchmark { + __u64 total_loops; + __u64 sum_map_nsec; + __u64 sum_unmap_nsec; + __u64 sum_of_square_map_nsec; + __u64 sum_of_square_unmap_nsec; + __u32 threads; /* how many threads will do map/unmap in parallel */ + __u32 seconds; /* how long the test will last */ + int node; /* which numa node this benchmark will run on */ + __u64 expansion[10]; /* For future use */ +};
Then we calculate average map/unmap nsec and variance in the userspace tool.
[...]
- for (i = 0; i < threads; i++) {
tsk[i] = kthread_create_on_node(map_benchmark_thread, map,
map->bparam.node, "dma-map-benchmark/%d", i);
if (IS_ERR(tsk[i])) {
dev_err(map->dev, "create dma_map thread failed\n");
return PTR_ERR(tsk[i]);
}
if (node != NUMA_NO_NODE && node_online(node))
kthread_bind_mask(tsk[i], cpu_mask);
wake_up_process(tsk[i]);
Might it be better to create all the threads first, *then* start kicking them?
The difficulty is that we don't know how many threads we should create as the thread number is a parameter to test the contention of IOMMU
driver.
In my test case, I'd like to test things like One thread Two threads .... 8 threads 12 threads 16 threads...
On the other hand, I think it is better to drop the memory of task_struct of those test threads while we are not testing dma map.
I simply meant splitting the loop here into two - one to create the threads and set their affinity, then another to wake them all up - so we don't start unnecessarily thrashing the system while we're still trying to set up the rest
of
the test ;)
Agreed.
Robin.
Thanks Barry
linux-kselftest-mirror@lists.linaro.org