Attached is the user space program I used to see the difference. Usage:

	gcc -O2 -o strscpy strscpy_test.c
	./strscpy {b|w} src_str_len count

	src_str_len - length of the source string, between 1 and 4096
	count - how many strscpy() calls to execute
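
For readers without the attachment, here is a minimal sketch of what a byte-at-a-time copy with strscpy()-like semantics could look like in user space. It is not the attached strscpy_test.c: the names strscpy_bytes() and fill_source() are made up for illustration, and the return convention (bytes copied, or -E2BIG on truncation) is assumed to match the kernel's strscpy().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>	/* ssize_t */

/*
 * Hypothetical byte-at-a-time copy, assumed strscpy()-like semantics:
 * copy until the NUL or until count bytes are used, always terminate
 * dest, return the copied length or -E2BIG if src did not fit.
 */
static ssize_t strscpy_bytes(char *dest, const char *src, size_t count)
{
	size_t res = 0;

	if (count == 0)
		return -E2BIG;

	while (count) {
		char c = src[res];

		dest[res] = c;
		if (!c)
			return (ssize_t)res;
		res++;
		count--;
	}

	/* Ran out of room before the NUL: truncate and report it. */
	dest[res - 1] = '\0';
	return -E2BIG;
}

/*
 * Hypothetical helper: build a source string of exactly len non-NUL
 * bytes (1..4096, as in the usage above). buf must hold len + 1 bytes.
 */
static void fill_source(char *buf, size_t len)
{
	assert(len >= 1 && len <= 4096);
	memset(buf, 'a', len);
	buf[len] = '\0';
}
```

The copy loop takes two conditional branches per source byte (the NUL check and the count check), so a 30-byte source means roughly 60 branches per call in a sketch like this.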
Also, I've noticed something strange. I'm not sure why, but certain src_len values (e.g. 30) drive the branch predictor crazy, causing worse than usual results for the byte-at-a-time copy:
$ perf stat ./strscpy b 29 10000000
Performance counter stats for './strscpy b 29 10000000':
        165.354974      task-clock:u (msec)       #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                48      page-faults:u             #    0.290 K/sec
       640,475,981      cycles:u                  #    3.873 GHz
     2,500,090,080      instructions:u            #    3.90  insn per cycle
       640,017,126      branches:u                # 3870.565 M/sec
             1,589      branch-misses:u           #    0.00% of all branches
0.165568346 seconds time elapsed
Performance counter stats for './strscpy b 30 10000000':
        250.835659      task-clock:u (msec)       #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                46      page-faults:u             #    0.183 K/sec
       974,528,780      cycles:u                  #    3.885 GHz
     2,580,090,165      instructions:u            #    2.65  insn per cycle
       660,017,211      branches:u                # 2631.273 M/sec
        14,488,234      branch-misses:u           #    2.20% of all branches
0.251147341 seconds time elapsed
Performance counter stats for './strscpy b 31 10000000':
        176.598368      task-clock:u (msec)       #    0.997 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                46      page-faults:u             #    0.260 K/sec
       681,367,948      cycles:u                  #    3.858 GHz
     2,660,090,092      instructions:u            #    3.90  insn per cycle
       680,017,138      branches:u                # 3850.642 M/sec
             1,817      branch-misses:u           #    0.00% of all branches
0.177150181 seconds time elapsed
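
To make the three runs easier to compare, the totals can be divided by the call count (10000000 in every run). A tiny helper for that, with per_call() being a made-up name for illustration only:

```c
/*
 * Hypothetical helper: average number of events per strscpy() call,
 * given a perf stat total and the count argument passed to ./strscpy.
 */
static double per_call(unsigned long long total, unsigned long long calls)
{
	return (double)total / (double)calls;
}
```

Plugging in the numbers above: instructions per call grow steadily with length (about 250, 258 and 266 for 29, 30 and 31 bytes), but cycles per call go from about 64 (len 29) to about 97 (len 30) and back to about 68 (len 31). The len 30 run averages about 1.45 branch misses per call, while the other two are effectively at zero.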