On Sun, Oct 09, 2022 at 07:59:20PM +0200, Willy Tarreau wrote:
Hi Alexey,
On Sun, Oct 09, 2022 at 06:45:49PM +0300, Alexey Dobriyan wrote:
Willy Tarreau wrote:
+#if defined(__GNUC__) && (__GNUC__ >= 12) +__attribute__((optimize("no-tree-loop-distribute-patterns"))) +#endif static __attribute__((unused)) -size_t nolibc_strlen(const char *str
I'd suggest to use asm("") in the loop body. It worked in the past to prevent folding division loop back into division instruction.
Ah excellent idea! I initially thought about using asm() to hide a variable provenance but didn't like it much because it undermines code optimization. But you're right, with an empty asm() statement alone, the loop will not look like an strlen() anymore. Just tried and it works like a charm, I'll resend a patch so that we can get rid of the ugly ifdef.
Or switch to
size_t f(const char *s) { const char *s0 = s; while (*s++) ; return s - s0 - 1; }
which compiles to 1 branch, not 2.
In fact it depends. In the original code that approach was part of the ones I had considered, but it doesn't always in better code due to the prologue and epilogue being larger. It's only better at -O1, and -O2, but not -Os, and once you add asm() into it, only -O1 remains better:
$ nm --size len.o|grep O|rev|sort|rev 000000000000001a T len_while_O1 0000000000000022 T len_while_asm_O1 0000000000000026 T len_for_O1 000000000000001a T len_while_O2 000000000000002b T len_while_asm_O2 0000000000000021 T len_for_O2 0000000000000013 T len_while_Os 0000000000000015 T len_while_asm_Os 000000000000000e T len_for_Os
This observation seems consistent for me on x86_64, i386, arm and arm64.
By the way, just for the sake of completeness, the one that consistently gives me a better output is this one:
size_t strlen(const char *str) { const char *s0 = str--;
while (*++str) ; return str - s0; }
Which gives me this:
0000000000000000 <strlen>: 0: 48 8d 47 ff lea -0x1(%rdi),%rax 4: 48 ff c0 inc %rax 7: 80 38 00 cmpb $0x0,(%rax) a: 75 f8 jne 4 <len+0x4> c: 48 29 f8 sub %rdi,%rax f: c3 ret
But this is totally ruined by the addition of asm() in the loop. However I suspect that the construct is difficult to match against a real strlen() since it starts on an extra character, thus placing the asm() statement before the loop could durably preserve it. It does work here (the code remains the exact same one), but for how long, that's the question. Maybe we can revisit the various loop-based functions in the future with this in mind.
Cheers, Willy