On 2024/9/9 13:43, Mina Almasry wrote:
> Perf - page-pool benchmark:
> bench_page_pool_simple.ko tests with and without these changes: https://pastebin.com/raw/ncHDwAbn
> AFAIK the number that really matters in the perf tests is the 'tasklet_page_pool01_fast_path Per elem'. This one measures at about 8 cycles without the changes but there is some 1 cycle noise in some results.
> With the patches this regresses to 9 cycles with the changes but there is 1 cycle noise occasionally running this test repeatedly.
> Lastly I tried disable the static_branch_unlikely() in netmem_is_net_iov() check. To my surprise disabling the static_branch_unlikely() check reduces the fast path back to 8 cycles, but the 1 cycle noise remains.
Sorry for the late report. I was adding a page_pool test module based on [1] to avoid introducing a performance regression when fixing the bug in [2], and I also used it to measure the performance impact of the devmem patchset on page_pool. There seems to be a noticeable and quite stable performance impact for the testcases below, about 5%~16% degradation on an arm64 system:
Before the devmem patchset:

 Performance counter stats for 'insmod ./page_pool_test.ko test_push_cpu=16 test_pop_cpu=16 nr_test=100000000 test_napi=1' (100 runs):

         17.167561      task-clock (msec)         #    0.003 CPUs utilized            ( +-  0.40% )
                 8      context-switches          #    0.474 K/sec                    ( +-  0.65% )
                 0      cpu-migrations            #    0.001 K/sec                    ( +-100.00% )
                84      page-faults               #    0.005 M/sec                    ( +-  0.13% )
          44576552      cycles                    #    2.597 GHz                      ( +-  0.40% )
          59627412      instructions              #    1.34  insn per cycle           ( +-  0.03% )
          14370325      branches                  #  837.063 M/sec                    ( +-  0.02% )
             21902      branch-misses             #    0.15% of all branches          ( +-  0.27% )

       6.818873600 seconds time elapsed                                          ( +-  0.02% )

 Performance counter stats for 'insmod ./page_pool_test.ko test_push_cpu=16 test_pop_cpu=16 nr_test=100000000 test_napi=1 test_direct=1' (100 runs):

         17.595423      task-clock (msec)         #    0.004 CPUs utilized            ( +-  0.01% )
                 8      context-switches          #    0.460 K/sec                    ( +-  0.50% )
                 0      cpu-migrations            #    0.000 K/sec
                84      page-faults               #    0.005 M/sec                    ( +-  0.15% )
          45693020      cycles                    #    2.597 GHz                      ( +-  0.01% )
          59676212      instructions              #    1.31  insn per cycle           ( +-  0.00% )
          14385384      branches                  #  817.564 M/sec                    ( +-  0.00% )
             21786      branch-misses             #    0.15% of all branches          ( +-  0.14% )

       4.098627802 seconds time elapsed                                          ( +-  0.11% )

After the devmem patchset:

 Performance counter stats for 'insmod ./page_pool_test.ko test_push_cpu=16 test_pop_cpu=16 nr_test=100000000 test_napi=1' (100 runs):

         17.047973      task-clock (msec)         #    0.002 CPUs utilized            ( +-  0.39% )
                 8      context-switches          #    0.488 K/sec                    ( +-  0.82% )
                 0      cpu-migrations            #    0.001 K/sec                    ( +- 70.35% )
                84      page-faults               #    0.005 M/sec                    ( +-  0.12% )
          44269558      cycles                    #    2.597 GHz                      ( +-  0.39% )
          59594383      instructions              #    1.35  insn per cycle           ( +-  0.02% )
          14362599      branches                  #  842.481 M/sec                    ( +-  0.02% )
             21949      branch-misses             #    0.15% of all branches          ( +-  0.25% )

       7.964890303 seconds time elapsed                                          ( +-  0.16% )

 Performance counter stats for 'insmod ./page_pool_test.ko test_push_cpu=16 test_pop_cpu=16 nr_test=100000000 test_napi=1 test_direct=1' (100 runs):

         17.660975      task-clock (msec)         #    0.004 CPUs utilized            ( +-  0.02% )
                 8      context-switches          #    0.458 K/sec                    ( +-  0.57% )
                 0      cpu-migrations            #    0.003 K/sec                    ( +- 43.81% )
                84      page-faults               #    0.005 M/sec                    ( +-  0.17% )
          45862652      cycles                    #    2.597 GHz                      ( +-  0.02% )
          59764866      instructions              #    1.30  insn per cycle           ( +-  0.01% )
          14404323      branches                  #  815.602 M/sec                    ( +-  0.01% )
             21826      branch-misses             #    0.15% of all branches          ( +-  0.19% )

       4.304644609 seconds time elapsed                                          ( +-  0.75% )
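
For reference, the degradation figure appears to come from the "seconds time elapsed" lines, since the other counters barely move between the two runs. A small sketch of that arithmetic follows. The helper name slowdown_percent and the dictionary layout are only illustrative; the elapsed times are copied from the output above. With these inputs it gives roughly 16.8% for test_napi=1 and roughly 5.0% for test_napi=1 test_direct=1.

```python
# Elapsed wall-clock times in seconds, copied from the perf stat output above:
# (before the devmem patchset, after the devmem patchset).
elapsed = {
    "test_napi=1": (6.818873600, 7.964890303),
    "test_napi=1 test_direct=1": (4.098627802, 4.304644609),
}


def slowdown_percent(before: float, after: float) -> float:
    """Relative increase in elapsed time, as a percentage of the 'before' run."""
    return (after - before) / before * 100.0


# Illustrative helper output: one percentage per test configuration.
results = {name: slowdown_percent(b, a) for name, (b, a) in elapsed.items()}

for name, pct in results.items():
    print(f"{name}: {pct:.1f}% longer elapsed time")
```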

1. https://lore.kernel.org/all/20240906073646.2930809-2-linyunsheng@huawei.com/
2. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org...