Willy Tarreau wrote:
+#if defined(__GNUC__) && (__GNUC__ >= 12)
+__attribute__((optimize("no-tree-loop-distribute-patterns")))
+#endif
 static __attribute__((unused))
-size_t nolibc_strlen(const char *str
I'd suggest to use asm("") in the loop body. It worked in the past to prevent folding division loop back into division instruction.
Or switch to
size_t f(const char *s)
{
	const char *s0 = s;

	while (*s++)
		;
	return s - s0 - 1;
}
which compiles to 1 branch, not 2.
But of course they could recognise this pattern too.
Hi Alexey,
On Sun, Oct 09, 2022 at 06:45:49PM +0300, Alexey Dobriyan wrote:
Willy Tarreau wrote:
+#if defined(__GNUC__) && (__GNUC__ >= 12)
+__attribute__((optimize("no-tree-loop-distribute-patterns")))
+#endif
 static __attribute__((unused))
-size_t nolibc_strlen(const char *str
I'd suggest to use asm("") in the loop body. It worked in the past to prevent folding division loop back into division instruction.
Ah excellent idea! I initially thought about using asm() to hide a variable provenance but didn't like it much because it undermines code optimization. But you're right, with an empty asm() statement alone, the loop will not look like an strlen() anymore. Just tried and it works like a charm, I'll resend a patch so that we can get rid of the ugly ifdef.
Or switch to
size_t f(const char *s)
{
	const char *s0 = s;

	while (*s++)
		;
	return s - s0 - 1;
}
which compiles to 1 branch, not 2.
In fact it depends. That approach was among the ones I had considered for the original code, but it doesn't always result in better code, due to the prologue and epilogue being larger. It's only better at -O1 and -O2, not at -Os, and once you add asm() into it, only -O1 remains better:
$ nm --size len.o|grep O|rev|sort|rev
000000000000001a T len_while_O1
0000000000000022 T len_while_asm_O1
0000000000000026 T len_for_O1
000000000000001a T len_while_O2
000000000000002b T len_while_asm_O2
0000000000000021 T len_for_O2
0000000000000013 T len_while_Os
0000000000000015 T len_while_asm_Os
000000000000000e T len_for_Os
This observation seems consistent for me on x86_64, i386, arm and arm64.
But of course they could recognise this pattern too.
Yes definitely, hence the need for asm() there as well to complete the comparison.
Thanks for the suggestion, I'll send a v2 shortly. Willy
On Sun, Oct 09, 2022 at 07:59:20PM +0200, Willy Tarreau wrote:
Hi Alexey,
On Sun, Oct 09, 2022 at 06:45:49PM +0300, Alexey Dobriyan wrote:
Willy Tarreau wrote:
+#if defined(__GNUC__) && (__GNUC__ >= 12)
+__attribute__((optimize("no-tree-loop-distribute-patterns")))
+#endif
 static __attribute__((unused))
-size_t nolibc_strlen(const char *str
I'd suggest to use asm("") in the loop body. It worked in the past to prevent folding division loop back into division instruction.
Ah excellent idea! I initially thought about using asm() to hide a variable provenance but didn't like it much because it undermines code optimization. But you're right, with an empty asm() statement alone, the loop will not look like an strlen() anymore. Just tried and it works like a charm, I'll resend a patch so that we can get rid of the ugly ifdef.
Or switch to
size_t f(const char *s)
{
	const char *s0 = s;

	while (*s++)
		;
	return s - s0 - 1;
}
which compiles to 1 branch, not 2.
In fact it depends. That approach was among the ones I had considered for the original code, but it doesn't always result in better code, due to the prologue and epilogue being larger. It's only better at -O1 and -O2, not at -Os, and once you add asm() into it, only -O1 remains better:
$ nm --size len.o|grep O|rev|sort|rev
000000000000001a T len_while_O1
0000000000000022 T len_while_asm_O1
0000000000000026 T len_for_O1
000000000000001a T len_while_O2
000000000000002b T len_while_asm_O2
0000000000000021 T len_for_O2
0000000000000013 T len_while_Os
0000000000000015 T len_while_asm_Os
000000000000000e T len_for_Os
This observation seems consistent for me on x86_64, i386, arm and arm64.
By the way, just for the sake of completeness, the one that consistently gives me a better output is this one:
size_t strlen(const char *str)
{
	const char *s0 = str--;

	while (*++str)
		;
	return str - s0;
}
Which gives me this:
0000000000000000 <strlen>:
   0:	48 8d 47 ff          	lea    -0x1(%rdi),%rax
   4:	48 ff c0             	inc    %rax
   7:	80 38 00             	cmpb   $0x0,(%rax)
   a:	75 f8                	jne    4 <len+0x4>
   c:	48 29 f8             	sub    %rdi,%rax
   f:	c3                   	ret
But this is totally ruined by the addition of asm() in the loop. However I suspect that the construct is difficult to match against a real strlen() since it starts on an extra character, thus placing the asm() statement before the loop could durably preserve it. It does work here (the code remains the exact same one), but for how long, that's the question. Maybe we can revisit the various loop-based functions in the future with this in mind.
Cheers, Willy
From: Willy Tarreau w@1wt.eu
Sent: 09 October 2022 19:36
...
By the way, just for the sake of completeness, the one that consistently gives me a better output is this one:
size_t strlen(const char *str)
{
	const char *s0 = str--;

	while (*++str)
		;
	return str - s0;
}
Which gives me this:
0000000000000000 <strlen>:
   0:	48 8d 47 ff          	lea    -0x1(%rdi),%rax
   4:	48 ff c0             	inc    %rax
   7:	80 38 00             	cmpb   $0x0,(%rax)
   a:	75 f8                	jne    4 <len+0x4>
   c:	48 29 f8             	sub    %rdi,%rax
   f:	c3                   	ret
But this is totally ruined by the addition of asm() in the loop. However I suspect that the construct is difficult to match against a real strlen() since it starts on an extra character, thus placing the asm() statement before the loop could durably preserve it. It does work here (the code remains the exact same one), but for how long, that's the question. Maybe we can revisit the various loop-based functions in the future with this in mind.
clang wilfully and persistently generates:
strlen:                                 # @strlen
        movq    $-1, %rax
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        cmpb    $0, 1(%rdi,%rax)
        leaq    1(%rax), %rax
        jne     .LBB0_1
        retq
But feed the C for that into gcc and it generates a 'jmp strlen' at everything above -O1. I suspect that might run with fewer clocks/byte than the code above.
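A C loop matching that clang codegen (a reconstruction, not quoted from the thread; the name is illustrative) would start the index at -1 and post-test the next byte:

```c
#include <stddef.h>

/* Starts with i = -1 (movq $-1, %rax), then each iteration tests
 * s[i + 1] (cmpb $0, 1(%rdi,%rax)) and increments i (leaq 1(%rax)).
 * On loop exit, i is the index of the NUL byte, i.e. the length. */
size_t strlen_clang_style(const char *s)
{
	size_t i = (size_t)-1;

	do {
		i++;
	} while (s[i]);
	return i;
}
```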
Somehow I hate some compiler pessimisations. Substituting a call to strlen() is typical: strlen() is almost certainly optimised for long strings, but if the string is short the coded loop will be faster. The same is true (and probably more so) for memcpy.
David
- Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Mon, Oct 10, 2022 at 10:03:53AM +0000, David Laight wrote:
From: Willy Tarreau w@1wt.eu
Sent: 09 October 2022 19:36
...
By the way, just for the sake of completeness, the one that consistently gives me a better output is this one:
size_t strlen(const char *str)
{
	const char *s0 = str--;

	while (*++str)
		;
	return str - s0;
}
Which gives me this:
0000000000000000 <strlen>:
   0:	48 8d 47 ff          	lea    -0x1(%rdi),%rax
   4:	48 ff c0             	inc    %rax
   7:	80 38 00             	cmpb   $0x0,(%rax)
   a:	75 f8                	jne    4 <len+0x4>
   c:	48 29 f8             	sub    %rdi,%rax
   f:	c3                   	ret
But this is totally ruined by the addition of asm() in the loop. However I suspect that the construct is difficult to match against a real strlen() since it starts on an extra character, thus placing the asm() statement before the loop could durably preserve it. It does work here (the code remains the exact same one), but for how long, that's the question. Maybe we can revisit the various loop-based functions in the future with this in mind.
clang wilfully and persistently generates:
strlen:                                 # @strlen
        movq    $-1, %rax
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        cmpb    $0, 1(%rdi,%rax)
        leaq    1(%rax), %rax
        jne     .LBB0_1
        retq
But feed the C for that into gcc and it generates a 'jmp strlen' at everything above -O1.
Interesting, that's not the case for me here with 12.2 from kernel.org on x86_64, which gives this at -O1, -O2, -O3 and -Ofast:
0000000000000000 <strlen>:
   0:	48 8d 47 ff          	lea    -0x1(%rdi),%rax
   4:	0f 1f 40 00          	nopl   0x0(%rax)
   8:	48 83 c0 01          	add    $0x1,%rax
   c:	80 38 00             	cmpb   $0x0,(%rax)
   f:	75 f7                	jne    8 <strlen+0x8>
  11:	48 29 f8             	sub    %rdi,%rax
  14:	c3                   	ret
Out of curiosity what version were you using ?
I suspect that might run with fewer clocks/byte than the code above.
Certainly for large strings, but not for short ones.
Somehow I hate some compiler pessimisations. Substituting a call to strlen() is typical: strlen() is almost certainly optimised for long strings, but if the string is short the coded loop will be faster.
Yes, and more importantly, if a developer takes the time to explicitly write a loop to do something that matches such a function, it's very likely that they already considered the function and did *not* want to use it for whatever reason. And the problem we're currently having with compilers is that they are not willing to respect the developer's intent because they always know better.
The same is true (and probably more so) for memcpy.
Yes, I think that we'll eventually have to stuff a few asm() statements here and there in such loops as compilers become less and less trustworthy.
Willy
From: Willy Tarreau
Sent: 11 October 2022 07:21
On Mon, Oct 10, 2022 at 10:03:53AM +0000, David Laight wrote:
From: Willy Tarreau w@1wt.eu
Sent: 09 October 2022 19:36
...
By the way, just for the sake of completeness, the one that consistently gives me a better output is this one:
size_t strlen(const char *str)
{
	const char *s0 = str--;

	while (*++str)
		;
	return str - s0;
}
Which gives me this:
0000000000000000 <strlen>:
   0:	48 8d 47 ff          	lea    -0x1(%rdi),%rax
   4:	48 ff c0             	inc    %rax
   7:	80 38 00             	cmpb   $0x0,(%rax)
   a:	75 f8                	jne    4 <len+0x4>
   c:	48 29 f8             	sub    %rdi,%rax
   f:	c3                   	ret
But this is totally ruined by the addition of asm() in the loop. However I suspect that the construct is difficult to match against a real strlen() since it starts on an extra character, thus placing the asm() statement before the loop could durably preserve it. It does work here (the code remains the exact same one), but for how long, that's the question. Maybe we can revisit the various loop-based functions in the future with this in mind.
clang wilfully and persistently generates:
strlen:                                 # @strlen
        movq    $-1, %rax
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        cmpb    $0, 1(%rdi,%rax)
        leaq    1(%rax), %rax
        jne     .LBB0_1
        retq
But feed the C for that into gcc and it generates a 'jmp strlen' at everything above -O1.
Interesting, that's not the case for me here with 12.2 from kernel.org on x86_64, which gives this at -O1, -O2, -O3 and -Ofast:
0000000000000000 <strlen>:
   0:	48 8d 47 ff          	lea    -0x1(%rdi),%rax
   4:	0f 1f 40 00          	nopl   0x0(%rax)
   8:	48 83 c0 01          	add    $0x1,%rax
   c:	80 38 00             	cmpb   $0x0,(%rax)
   f:	75 f7                	jne    8 <strlen+0x8>
  11:	48 29 f8             	sub    %rdi,%rax
  14:	c3                   	ret
Out of curiosity what version were you using ?
Clang 12.0.0 onwards, see https://godbolt.org/z/67Gnzs8js
I suspect that might run with fewer clocks/byte than the code above.
Certainly for large strings, but not for short ones.
For short strings, not needing the final sub and not having the read depend on the increment should make the leaq version faster. (The nop to align the loop label is monumentally pointless.)
For long strings what matters is how many clocks it takes to schedule the 4 uops in the loop. It might be possible to get down to 2 clocks - but I think both the loops are 3 clocks (assuming the adjacent cmp/jne fuse).
I'm not going to try to instrument the loops though!
David