-----Original Message-----
From: Song Bao Hua (Barry Song) [mailto:song.bao.hua@hisilicon.com]
Sent: Saturday, October 31, 2020 10:45 PM
To: Robin Murphy robin.murphy@arm.com; iommu@lists.linux-foundation.org; hch@lst.de; m.szyprowski@samsung.com
Cc: joro@8bytes.org; will@kernel.org; shuah@kernel.org; Linuxarm linuxarm@huawei.com; linux-kselftest@vger.kernel.org
Subject: RE: [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs
-----Original Message-----
From: Robin Murphy [mailto:robin.murphy@arm.com]
Sent: Saturday, October 31, 2020 4:48 AM
To: Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; iommu@lists.linux-foundation.org; hch@lst.de; m.szyprowski@samsung.com
Cc: joro@8bytes.org; will@kernel.org; shuah@kernel.org; Linuxarm linuxarm@huawei.com; linux-kselftest@vger.kernel.org
Subject: Re: [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs
On 2020-10-29 21:39, Song Bao Hua (Barry Song) wrote: [...]
+struct map_benchmark {
+	__u64 map_nsec;
+	__u64 unmap_nsec;
+	__u32 threads; /* how many threads will do map/unmap in parallel */
+	__u32 seconds; /* how long the test will last */
+	int node; /* which numa node this benchmark will run on */
+	__u64 expansion[10];	/* For future use */
+};
I'm no expert on userspace ABIs (and what little experience I do have is mostly of Win32...), so hopefully someone else will comment if there's anything of concern here. One thing I wonder is that there's a fair likelihood of functionality evolving here over time, so might it be appropriate to have some sort of explicit versioning parameter for robustness?
I copied that from gup_benchmark. There is no such version-checking code there. I believe there is a likelihood that the kernel module changes while users are still running an old userspace tool, and that could lead to an incompatible data structure. But I'm not sure if it is a big problem :-)
Yeah, like I say I don't really have a good feeling for what would be best here, I'm just thinking of what I do know and wary of the potential for a "640 bits ought to be enough for anyone" issue ;)
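(Purely as an illustration of that idea, not something from the patch: one hypothetical way would be to reserve one field as an explicit interface version that the kernel validates before trusting the rest of the parameters. The macro name and error code below are made up.)

+#define DMA_MAP_BENCHMARK_VERSION	1
+
+struct map_benchmark {
+	__u64 version;	/* must be set to DMA_MAP_BENCHMARK_VERSION */
+	/* ... existing fields unchanged ... */
+};

	/* kernel side, before starting a test run */
	if (map->bparam.version != DMA_MAP_BENCHMARK_VERSION)
		return -EINVAL;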
+struct map_benchmark_data {
+	struct map_benchmark bparam;
+	struct device *dev;
+	struct dentry *debugfs;
+	atomic64_t total_map_nsecs;
+	atomic64_t total_map_loops;
+	atomic64_t total_unmap_nsecs;
+	atomic64_t total_unmap_loops;
+};
+static int map_benchmark_thread(void *data)
+{
+	struct page *page;
+	dma_addr_t dma_addr;
+	struct map_benchmark_data *map = data;
+	int ret = 0;
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		return -ENOMEM;
+
+	while (!kthread_should_stop()) {
+		ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
+
+		map_stime = ktime_get();
+		dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE,
+					DMA_BIDIRECTIONAL);
Note that for a non-coherent device, this will give an underestimate of the real-world overhead of BIDIRECTIONAL or TO_DEVICE mappings, since the page will never be dirty in the cache (except possibly the very first time through).
Agreed. I'd like to add a DIRECTION parameter like "-d 0", "-d 1" after we have this basic framework.
That wasn't so much about the direction itself, just that if it's anything other than FROM_DEVICE, we should probably do something to dirty the buffer by a reasonable amount before each map. Otherwise the measured performance is going to be unrealistic on many systems.
Maybe putting a memset(buf, 0, PAGE_SIZE) before dma_map would help?
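(For illustration only, not code from the patch: a minimal sketch of that suggestion inside the benchmark loop, with the write placed outside the timed window; it assumes the page comes from alloc_page(GFP_KERNEL), so page_address() is valid.)

	while (!kthread_should_stop()) {
		ktime_t map_stime, map_etime, unmap_stime, unmap_etime;

		/*
		 * Dirty the whole buffer so a non-coherent device pays the
		 * cache maintenance cost on every iteration, not only the
		 * first one. Done before ktime_get() so it is not measured.
		 */
		memset(page_address(page), 0, PAGE_SIZE);

		map_stime = ktime_get();
		dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE,
					DMA_BIDIRECTIONAL);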
[...]
+		atomic64_add((long long)ktime_to_ns(ktime_sub(unmap_etime,
+				unmap_stime)), &map->total_unmap_nsecs);
+		atomic64_inc(&map->total_map_loops);
+		atomic64_inc(&map->total_unmap_loops);
I think it would be worth keeping track of the variances as well - it can be hard to tell if a reasonable-looking average is hiding terrible worst-case behaviour.
This is a sensible requirement. I believe it is better handled by the existing kernel tracing infrastructure.
Maybe we need a histogram like:

Delay     sample count
1-2us     1000    ***
2-3us     2000    *******
3-4us     100     *
.....

This will be more precise than just the worst-case maximum latency.
I believe this can be handled by:

	tracepoint A
	map
	tracepoint B

	tracepoint C
	unmap
	tracepoint D

and letting userspace eBPF draw the histogram from the deltas B-A and D-C.
So I am planning to put this requirement on the todo list, write a userspace eBPF/bcc script for the histogram, and put it in the tools/ directory.
Please give your comments on this.
Right, I wasn't suggesting trying to homebrew a full data collection system here - I agree there are better tools for that already - just that it's basically free to track a sum of squares alongside a sum, so that we can trivially calculate a useful variance (or standard deviation) figure alongside the mean at the end.
For this case, I am not sure that is true. Unless we expose more data such as min, max etc. to userspace, it makes no difference whether total_(un)map_nsecs and total_(un)map_loops are exposed or not.

As
	total_loops       = seconds / (avg_map_latency + avg_unmap_latency)
	total_map_nsecs   = total_loops * avg_map_latency
	total_unmap_nsecs = total_loops * avg_unmap_latency

all of seconds, avg_map_latency and avg_unmap_latency are already known to the userspace tool.
On second thought, it seems you mean the kernel code can output the below to userspace:
1. total loops
2. sum of map and unmap latencies
3. sum of squares of map and unmap latencies
+struct map_benchmark {
+	__u64 total_loops;
+	__u64 sum_map_nsec;
+	__u64 sum_unmap_nsec;
+	__u64 sum_of_square_map_nsec;
+	__u64 sum_of_square_unmap_nsec;
+	__u32 threads; /* how many threads will do map/unmap in parallel */
+	__u32 seconds; /* how long the test will last */
+	int node; /* which numa node this benchmark will run on */
+	__u64 expansion[10]; /* For future use */
+};
Then we calculate average map/unmap nsec and variance in the userspace tool.
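(As an illustration only, not part of the patch: the userspace side of that calculation could look roughly like the sketch below, using Var(X) = E[X^2] - E[X]^2. The helper name and the bp parameter are made up here; the struct would come from the uapi header above.)

#include <math.h>
#include <stdio.h>

/* rough sketch: derive mean and variance from the sums exported above */
static void report_map_stats(const struct map_benchmark *bp)
{
	double mean = (double)bp->sum_map_nsec / bp->total_loops;
	double var  = (double)bp->sum_of_square_map_nsec / bp->total_loops -
		      mean * mean;

	printf("map latency: mean %.1f ns, stddev %.1f ns\n", mean, sqrt(var));
}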
[...]
+	for (i = 0; i < threads; i++) {
+		tsk[i] = kthread_create_on_node(map_benchmark_thread, map,
+				map->bparam.node, "dma-map-benchmark/%d", i);
+		if (IS_ERR(tsk[i])) {
+			dev_err(map->dev, "create dma_map thread failed\n");
+			return PTR_ERR(tsk[i]);
+		}
+
+		if (node != NUMA_NO_NODE && node_online(node))
+			kthread_bind_mask(tsk[i], cpu_mask);
+
+		wake_up_process(tsk[i]);
Might it be better to create all the threads first, *then* start kicking them?
The difficulty is that we don't know how many threads we should create, as the thread number is a parameter used to test the contention in the IOMMU driver.
In my test case, I'd like to test things like one thread, two threads, ..., 8 threads, 12 threads, 16 threads...
On the other hand, I think it is better not to keep the task_structs of those test threads in memory while we are not running a dma-map test.
I simply meant splitting the loop here into two - one to create the threads and set their affinity, then another to wake them all up - so we don't start unnecessarily thrashing the system while we're still trying to set up the rest of the test ;)
Agreed.
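(For illustration only, not code from the thread: a sketch of the two-pass split Robin describes, reusing the variables from the quoted hunk and keeping its error handling as-is.)

	/* first pass: create the threads and set their affinity */
	for (i = 0; i < threads; i++) {
		tsk[i] = kthread_create_on_node(map_benchmark_thread, map,
				map->bparam.node, "dma-map-benchmark/%d", i);
		if (IS_ERR(tsk[i])) {
			dev_err(map->dev, "create dma_map thread failed\n");
			return PTR_ERR(tsk[i]);
		}

		if (node != NUMA_NO_NODE && node_online(node))
			kthread_bind_mask(tsk[i], cpu_mask);
	}

	/* second pass: only start them once everything is set up */
	for (i = 0; i < threads; i++)
		wake_up_process(tsk[i]);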
Robin.
Thanks Barry