Re: [tip: x86/misc] x86/csum: Improve performance of `csum_partial`

4 Jul 2023

      On Sun, 2 Jul 2023 at 23:19, Geert Uytterhoeven geert@linux-m68k.org wrote:
...
Hi Noah,
On Thu, May 25, 2023 at 8:04 PM tip-bot2 for Noah Goldstein
tip-bot2@linutronix.de wrote:
...
The following commit has been merged into the x86/misc branch of tip:
Commit-ID:     688eb8191b475db5acfd48634600b04fd3dda9ad
Gitweb:        https://git.kernel.org/tip/688eb8191b475db5acfd48634600b04fd3dda9ad
Author:        Noah Goldstein goldstein.w.n@gmail.com
AuthorDate:    Wed, 10 May 2023 20:10:02 -05:00
Committer:     Dave Hansen dave.hansen@linux.intel.com
CommitterDate: Thu, 25 May 2023 10:55:18 -07:00
x86/csum: Improve performance of `csum_partial`

Add special case for len == 40 as that is the hottest value. The
nets a ~8-9% latency improvement and a ~30% throughput improvement
in the len == 40 case.

Use multiple accumulators in the 64-byte loop. This dramatically
improves ILP and results in up to a 40% latency/throughput
improvement (better for more iterations).

Results from benchmarking on Icelake. Times measured with rdtsc()
 len   lat_new   lat_old      r    tput_new  tput_old      r
   8      3.58      3.47  1.032        3.58      3.51  1.021
  16      4.14      4.02  1.028        3.96      3.78  1.046
  24      4.99      5.03  0.992        4.23      4.03  1.050
  32      5.09      5.08  1.001        4.68      4.47  1.048
  40      5.57      6.08  0.916        3.05      4.43  0.690
  48      6.65      6.63  1.003        4.97      4.69  1.059
  56      7.74      7.72  1.003        5.22      4.95  1.055
  64      6.65      7.22  0.921        6.38      6.42  0.994
  96      9.43      9.96  0.946        7.46      7.54  0.990
 128      9.39     12.15  0.773        8.90      8.79  1.012
 200     12.65     18.08  0.699       11.63     11.60  1.002
 272     15.82     23.37  0.677       14.43     14.35  1.005
 440     24.12     36.43  0.662       21.57     22.69  0.951
 952     46.20     74.01  0.624       42.98     53.12  0.809
1024     47.12     78.24  0.602       46.36     58.83  0.788
1552     72.01    117.30  0.614       71.92     96.78  0.743
2048     93.07    153.25  0.607       93.28    137.20  0.680
2600    114.73    194.30  0.590      114.28    179.32  0.637
3608    156.34    268.41  0.582      154.97    254.02  0.610
4096    175.01    304.03  0.576      175.89    292.08  0.602
There is no such thing as a free lunch, however, and the special case
for len == 40 does add overhead to the len != 40 cases. This seems to
amount to be ~5% throughput and slightly less in terms of latency.
Testing:
Part of this change is a new kunit test. The tests check all
alignment X length pairs in [0, 64) X [0, 512).
There are three cases.
    1) Precomputed random inputs/seed. The expected results where
       generated use the generic implementation (which is assumed to be
       non-buggy).
    2) An input of all 1s. The goal of this test is to catch any case
       a carry is missing.
    3) An input that never carries. The goal of this test si to catch
       any case of incorrectly carrying.
More exhaustive tests that test all alignment X length pairs in
[0, 8192) X [0, 8192] on random data are also available here:
https://github.com/goldsteinn/csum-reproduction
The reposity also has the code for reproducing the above benchmark
numbers.
Signed-off-by: Noah Goldstein goldstein.w.n@gmail.com
Signed-off-by: Dave Hansen dave.hansen@linux.intel.com
Thanks for your patch, which is now commit 688eb8191b475db5 ("x86/csum:
Improve performance of `csum_partial`") in linus/master stable/master
...
Link: https://lore.kernel.org/all/20230511011002.935690-1-goldstein.w.n%40gmail.co...
This does not seem to be a message sent to a public mailing list
archived at lore (yet).
On m68k (ARAnyM):
KTAP version 1
# Subtest: checksum
1..3
# test_csum_fixed_random_inputs: ASSERTION FAILED at

lib/checksum_kunit.c:243
    Expected result == expec, but
        result == 54991 (0xd6cf)
        expec == 33316 (0x8224)
    not ok 1 test_csum_fixed_random_inputs
    # test_csum_all_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:267
    Expected result == expec, but
        result == 255 (0xff)
        expec == 65280 (0xff00)
Endianness issue in the test?
not ok 2 test_csum_all_carry_inputs
# test_csum_no_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:306
Expected result == expec, but
    result == 64515 (0xfc03)
    expec == 0 (0x0)
not ok 3 test_csum_no_carry_inputs

# checksum: pass:0 fail:3 skip:0 total:3
# Totals: pass:0 fail:3 skip:0 total:3
not ok 1 checksum
This also seems to be broken (albeit slightly differently) on 32-bit
UML and non-UML x86 with CONFIG_M486=y.
In those cases, it's an alignment issue, not an endianness one, but I
suspect this may be the first test to try calling the checksum
functions with unaligned data: certainly the x86 version seems to have
originally been written to assume 2-byte alignment (and the fixes for
unaligned data missed some corner cases like UML and pre-pentium-pro
codepaths).
It definitely looks like there are endianness issues as well, but do
be on the lookout for alignment ones, too.
Patches for the x86 and UML issues, FTR:
https://lore.kernel.org/linux-um/20230704083022.692368-2-davidgow@google.com...
https://lore.kernel.org/linux-kselftest/20230704083206.693155-2-davidgow@goo...
-- David

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [tip: x86/misc] x86/csum: Improve performance of `csum_partial`