strscpy() performs optimistic word-at-a-time reads, so it may access memory past the end of the object. That is perfectly fine, since strscpy() doesn't use that (past-the-end) data and makes sure the optimistic read won't cross a page boundary.
But KASAN doesn't know anything about that so it will complain. There are several possible ways to address this issue, but none are perfect. See https://lkml.kernel.org/r/9f0a9cf6-51f7-cd1f-5dc6-6d510a7b8ec4@virtuozzo.com
It seems the best solution is to simply disable the word-at-a-time optimization. My trivial testing shows that byte-at-a-time could be up to 4.3 times slower than word-at-a-time. It may seem like a lot, but it's actually ~1.2e-10 sec per symbol vs ~4.8e-10 sec per symbol on modern hardware. And we don't use strscpy() in performance-critical paths to copy large amounts of data, so it shouldn't matter anyway.
Fixes: 30035e45753b7 ("string: provide strscpy()")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: <stable@vger.kernel.org>
---
 lib/string.c | 38 --------------------------------------
 1 file changed, 38 deletions(-)
diff --git a/lib/string.c b/lib/string.c
index 64a9e33f1daa..6205dd71aa0f 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -29,7 +29,6 @@
 #include <linux/errno.h>
 #include <asm/byteorder.h>
-#include <asm/word-at-a-time.h>
 #include <asm/page.h>
 
 #ifndef __HAVE_ARCH_STRNCASECMP
@@ -177,45 +176,8 @@ EXPORT_SYMBOL(strlcpy);
  */
 ssize_t strscpy(char *dest, const char *src, size_t count)
 {
-	const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
-	size_t max = count;
 	long res = 0;
 
-	if (count == 0)
-		return -E2BIG;
-
-#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-	/*
-	 * If src is unaligned, don't cross a page boundary,
-	 * since we don't know if the next page is mapped.
-	 */
-	if ((long)src & (sizeof(long) - 1)) {
-		size_t limit = PAGE_SIZE - ((long)src & (PAGE_SIZE - 1));
-		if (limit < max)
-			max = limit;
-	}
-#else
-	/* If src or dest is unaligned, don't do word-at-a-time. */
-	if (((long) dest | (long) src) & (sizeof(long) - 1))
-		max = 0;
-#endif
-
-	while (max >= sizeof(unsigned long)) {
-		unsigned long c, data;
-
-		c = *(unsigned long *)(src+res);
-		if (has_zero(c, &data, &constants)) {
-			data = prep_zero_mask(c, data, &constants);
-			data = create_zero_mask(data);
-			*(unsigned long *)(dest+res) = c & zero_bytemask(data);
-			return res + find_zero(data);
-		}
-		*(unsigned long *)(dest+res) = c;
-		res += sizeof(unsigned long);
-		count -= sizeof(unsigned long);
-		max -= sizeof(unsigned long);
-	}
-
 	while (count) {
 		char c;
Attached user space program I used to see the difference.
Usage:
	gcc -O2 -o strscpy strscpy_test.c
	./strscpy {b|w} src_str_len count
src_str_len - length of the source string, between 1 and 4096
count       - how many strscpy() calls to execute
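The byte-at-a-time path that this patch leaves in place can be modelled in user space roughly like this (a sketch, not the kernel code; strscpy_bytewise and E2BIG_MODEL are illustrative stand-in names):

```c
#include <stddef.h>
#include <sys/types.h>

/* Stand-in for the kernel's -E2BIG return value. */
#define E2BIG_MODEL 7

/*
 * User-space model of the byte-at-a-time strscpy() loop: copy at most
 * count - 1 characters, always NUL-terminate on success or truncation,
 * and return the copied length, or -E2BIG_MODEL if src did not fit.
 */
static ssize_t strscpy_bytewise(char *dest, const char *src, size_t count)
{
	size_t res = 0;

	if (count == 0)
		return -E2BIG_MODEL;

	while (count) {
		char c = src[res];

		dest[res] = c;
		if (!c)
			return res;
		res++;
		count--;
	}

	/* Ran out of room: truncate and report overflow. */
	dest[res - 1] = '\0';
	return -E2BIG_MODEL;
}
```

Note the loop touches src one byte at a time, so it never reads past the terminating NUL — which is exactly why KASAN is happy with it and why it pays with one branch per byte.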
Also I've noticed something strange. I'm not sure why, but certain src_len values (e.g. 30) drive the branch predictor crazy, causing worse-than-usual results for byte-at-a-time copy:
$ perf stat ./strscpy b 29 10000000
Performance counter stats for './strscpy b 29 10000000':
     165.354974      task-clock:u (msec)       #    0.999 CPUs utilized
              0      context-switches:u        #    0.000 K/sec
              0      cpu-migrations:u          #    0.000 K/sec
             48      page-faults:u             #    0.290 K/sec
    640,475,981      cycles:u                  #    3.873 GHz
  2,500,090,080      instructions:u            #    3.90  insn per cycle
    640,017,126      branches:u                # 3870.565 M/sec
          1,589      branch-misses:u           #    0.00% of all branches
0.165568346 seconds time elapsed
Performance counter stats for './strscpy b 30 10000000':
     250.835659      task-clock:u (msec)       #    0.999 CPUs utilized
              0      context-switches:u        #    0.000 K/sec
              0      cpu-migrations:u          #    0.000 K/sec
             46      page-faults:u             #    0.183 K/sec
    974,528,780      cycles:u                  #    3.885 GHz
  2,580,090,165      instructions:u            #    2.65  insn per cycle
    660,017,211      branches:u                # 2631.273 M/sec
     14,488,234      branch-misses:u           #    2.20% of all branches
0.251147341 seconds time elapsed
Performance counter stats for './strscpy b 31 10000000':
     176.598368      task-clock:u (msec)       #    0.997 CPUs utilized
              0      context-switches:u        #    0.000 K/sec
              0      cpu-migrations:u          #    0.000 K/sec
             46      page-faults:u             #    0.260 K/sec
    681,367,948      cycles:u                  #    3.858 GHz
  2,660,090,092      instructions:u            #    3.90  insn per cycle
    680,017,138      branches:u                # 3850.642 M/sec
          1,817      branch-misses:u           #    0.00% of all branches
0.177150181 seconds time elapsed
On 2018-01-09 17:47, Andrey Ryabinin wrote:
Attached user space program I used to see the difference.
Usage:
	gcc -O2 -o strscpy strscpy_test.c
	./strscpy {b|w} src_str_len count
src_str_len - length of the source string, between 1 and 4096
count       - how many strscpy() calls to execute

Also I've noticed something strange. I'm not sure why, but certain src_len values (e.g. 30) drive the branch predictor crazy, causing worse-than-usual results for byte-at-a-time copy:
I see something similar, but at the 30->31 transition, and the branch-misses remain at 1-3% for higher values, until 42 where it drops back to 0%. Anyway, I highly doubt we do a lot of copies of strings longer than 32.
$ perf stat ./strscpy_test b 30 10000000
Performance counter stats for './strscpy_test b 30 10000000':
     156,777082      task-clock (msec)         #    0,999 CPUs utilized
              0      context-switches          #    0,000 K/sec
              0      cpu-migrations            #    0,000 K/sec
             48      page-faults               #    0,306 K/sec
    584.646.177      cycles                    #    3,729 GHz
<not supported>      stalled-cycles-frontend
<not supported>      stalled-cycles-backend
  2.580.599.614      instructions              #    4,41  insns per cycle
    660.114.283      branches                  # 4210,528 M/sec
          4.891      branch-misses             #    0,00% of all branches
0,156970910 seconds time elapsed
$ perf stat ./strscpy_test b 31 10000000
Performance counter stats for './strscpy_test b 31 10000000':
     258,533250      task-clock (msec)         #    0,999 CPUs utilized
              0      context-switches          #    0,000 K/sec
              0      cpu-migrations            #    0,000 K/sec
             50      page-faults               #    0,193 K/sec
    965.505.138      cycles                    #    3,735 GHz
<not supported>      stalled-cycles-frontend
<not supported>      stalled-cycles-backend
  2.660.773.463      instructions              #    2,76  insns per cycle
    680.141.051      branches                  # 2630,768 M/sec
     19.150.367      branch-misses             #    2,82% of all branches
0,258725192 seconds time elapsed
Rasmus
On Wed, Jan 24, 2018 at 12:54 AM, Rasmus Villemoes rasmus.villemoes@prevas.dk wrote:
I see something similar, but at the 30->31 transition, and the branch-misses remain at 1-3% for higher values, until 42 where it drops back to 0%. Anyway, I highly doubt we do a lot of copies of strings longer than 32.
So I really dislike that microbenchmark, because it just has the same length all the time. Which is very wrong, and makes the benchmark pointless. A big part of this all is branch mispredicts, you shouldn't just hand it the pattern on a plate.
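One way to act on this objection in the test program would be to randomize the source length per iteration, so the termination branch has no single pattern to learn (a sketch; copy_str is a stand-in byte-at-a-time copy, not the kernel's strscpy()):

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in byte-at-a-time copy: copies up to count - 1 bytes,
 * NUL-terminates, returns the number of bytes copied. */
static size_t copy_str(char *dest, const char *src, size_t count)
{
	size_t i;

	for (i = 0; i + 1 < count && src[i]; i++)
		dest[i] = src[i];
	dest[i] = '\0';
	return i;
}

/*
 * Benchmark loop with a randomized source length each iteration, so
 * the branch predictor can't lock onto one fixed termination point.
 * Returns the total bytes copied, to keep the work observable.
 */
static size_t bench_varied_lengths(unsigned iterations, unsigned seed)
{
	char src[64], dest[64];
	size_t total = 0;
	unsigned i;

	memset(src, 'a', sizeof(src));
	srand(seed);
	for (i = 0; i < iterations; i++) {
		size_t len = 1 + (unsigned)rand() % 62; /* 1..62 bytes */
		char saved = src[len];

		src[len] = '\0';
		total += copy_str(dest, src, sizeof(dest));
		src[len] = saved;
	}
	return total;
}
```

The length computation itself adds a little overhead per iteration, so a fair comparison would apply the same randomization to both the byte and word variants.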
Anyway, the reason I really dislike the patch is not because I think strscpy() is all that important, but I *do* think that the word-at-a-time thing is conceptually something we do care about, and I hate removing it just because of KASAN not understanding it.
So I'd *much* rather have some way to tell KASAN that word-at-a-time is going on. Because that approach definitely makes a difference in other places.
Linus
On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Wed, Jan 24, 2018 at 12:54 AM, Rasmus Villemoes rasmus.villemoes@prevas.dk wrote:
I see something similar, but at the 30->31 transition, and the branch-misses remain at 1-3% for higher values, until 42 where it drops back to 0%. Anyway, I highly doubt we do a lot of copies of strings longer than 32.
So I really dislike that microbenchmark, because it just has the same length all the time. Which is very wrong, and makes the benchmark pointless. A big part of this all is branch mispredicts, you shouldn't just hand it the pattern on a plate.
Anyway, the reason I really dislike the patch is not because I think strscpy() is all that important, but I *do* think that the word-at-a-time thing is conceptually something we do care about, and I hate removing it just because of KASAN not understanding it.
So I'd *much* rather have some way to tell KASAN that word-at-a-time is going on. Because that approach definitely makes a difference in other places.
The other option was to use READ_ONCE_NOCHECK(). Not sure if the "read once" part will affect codegen here, though. But if the word-at-a-time thing is conceptually something we do care about, we could also introduce something like READ_PARTIALLY_VALID(), which would check that at least the first byte of the read is valid and that it does not cross a heap block boundary (but outside of KASAN it is a normal read).
From: Dmitry Vyukov [mailto:dvyukov@google.com]
Sent: 25 January 2018 08:33
On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Wed, Jan 24, 2018 at 12:54 AM, Rasmus Villemoes rasmus.villemoes@prevas.dk wrote:
I see something similar, but at the 30->31 transition, and the branch-misses remain at 1-3% for higher values, until 42 where it drops back to 0%. Anyway, I highly doubt we do a lot of copies of strings longer than 32.
So I really dislike that microbenchmark, because it just has the same length all the time. Which is very wrong, and makes the benchmark pointless. A big part of this all is branch mispredicts, you shouldn't just hand it the pattern on a plate.
Anyway, the reason I really dislike the patch is not because I think strscpy() is all that important, but I *do* think that the word-at-a-time thing is conceptually something we do care about, and I hate removing it just because of KASAN not understanding it.
So I'd *much* rather have some way to tell KASAN that word-at-a-time is going on. Because that approach definitely makes a difference in other places.
The other option was to use READ_ONCE_NOCHECK(). Not sure if the "read once" part will affect codegen here, though. But if the word-at-a-time thing is conceptually something we do care about, we could also introduce something like READ_PARTIALLY_VALID(), which would check that at least the first byte of the read is valid and that it does not cross a heap block boundary (but outside of KASAN it is a normal read).
The first byte might not have been written either. For example, when doing a strlen() on a misaligned string you might read the aligned word containing the first byte and adjust the value so that the initial byte(s) are not zero. After scanning for a zero byte, the length would be corrected.
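The scenario David describes can be sketched in user space like this (little-endian only, using a GCC builtin; the helper name and constants are illustrative, not the kernel's word-at-a-time API):

```c
#include <stdint.h>
#include <string.h>

#define ONES  0x0101010101010101UL
#define HIGHS 0x8080808080808080UL

/*
 * Little-endian sketch of a misaligned-start strlen(): read the
 * aligned word containing the string's first byte, then force the
 * leading bytes -- which belong to whatever precedes the string --
 * to be non-zero so they can't be mistaken for the terminator.
 */
static size_t strlen_aligned_words(const char *s)
{
	size_t align = (uintptr_t)s & (sizeof(unsigned long) - 1);
	const char *p = s - align;
	unsigned long word;

	memcpy(&word, p, sizeof(word));		/* aligned read */
	if (align)
		word |= (1UL << (8 * align)) - 1;

	for (;;) {
		/* Zero-byte test; the lowest set bit marks the first NUL. */
		unsigned long zeros = (word - ONES) & ~word & HIGHS;

		if (zeros) {
			size_t k = __builtin_ctzl(zeros) / 8;

			return (size_t)(p + k - s);
		}
		p += sizeof(unsigned long);
		memcpy(&word, p, sizeof(word));
	}
}
```

Note the loop still performs the optimistic past-the-end read that this whole thread is about — in the kernel it would need the same KASAN accommodation, and as David points out, even the bytes of the first word before the string start may be unwritten or unallocated.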
David
On Thu, Jan 25, 2018 at 9:42 AM, David Laight David.Laight@aculab.com wrote:
From: Dmitry Vyukov [mailto:dvyukov@google.com]
Sent: 25 January 2018 08:33
On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Wed, Jan 24, 2018 at 12:54 AM, Rasmus Villemoes rasmus.villemoes@prevas.dk wrote:
I see something similar, but at the 30->31 transition, and the branch-misses remain at 1-3% for higher values, until 42 where it drops back to 0%. Anyway, I highly doubt we do a lot of copies of strings longer than 32.
So I really dislike that microbenchmark, because it just has the same length all the time. Which is very wrong, and makes the benchmark pointless. A big part of this all is branch mispredicts, you shouldn't just hand it the pattern on a plate.
Anyway, the reason I really dislike the patch is not because I think strscpy() is all that important, but I *do* think that the word-at-a-time thing is conceptually something we do care about, and I hate removing it just because of KASAN not understanding it.
So I'd *much* rather have some way to tell KASAN that word-at-a-time is going on. Because that approach definitely makes a difference in other places.
The other option was to use READ_ONCE_NOCHECK(). Not sure if the "read once" part will affect codegen here, though. But if the word-at-a-time thing is conceptually something we do care about, we could also introduce something like READ_PARTIALLY_VALID(), which would check that at least the first byte of the read is valid and that it does not cross a heap block boundary (but outside of KASAN it is a normal read).
The first byte might not have been written either. For example, when doing a strlen() on a misaligned string you might read the aligned word containing the first byte and adjust the value so that the initial byte(s) are not zero. After scanning for a zero byte, the length would be corrected.
Was the first byte at least kmalloc-ed? That's what KASAN checks; it does not care about "written". KMSAN can detect uses of uninit data, but that's another story.
On Thu, Jan 25, 2018 at 12:32 AM, Dmitry Vyukov dvyukov@google.com wrote:
On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
So I'd *much* rather have some way to tell KASAN that word-at-a-time is going on. Because that approach definitely makes a difference in other places.
The other option was to use READ_ONCE_NOCHECK().
How about just using the same accessor that we do for the dcache case. That gives a reasonable example of the whole word-at-a-time model, and should be good.
COMPLETELY UNTESTED patch attached. This needs to be verified.
It does limit the word-at-a-time code to the architectures that select both HAVE_EFFICIENT_UNALIGNED_ACCESS and DCACHE_WORD_ACCESS, but that seems a reasonable choice anyway.
Linus
On 01/25/2018 08:55 PM, Linus Torvalds wrote:
On Thu, Jan 25, 2018 at 12:32 AM, Dmitry Vyukov dvyukov@google.com wrote:
On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
So I'd *much* rather have some way to tell KASAN that word-at-a-time is going on. Because that approach definitely makes a difference in other places.
The other option was to use READ_ONCE_NOCHECK().
How about just using the same accessor that we do for the dcache case. That gives a reasonable example of the whole word-at-a-time model, and should be good.
If we also instrument load_unaligned_zeropad() with kasan_check_read(addr, 1), then it should be fine. We don't want a completely unchecked read of the source string.
But I also would like to revert df4c0e36f1b1 ("fs: dcache: manually unpoison dname after allocation to shut up kasan's reports"). So I was going to send something like the hunk below (split into several patches).
Or we could just use the instrumented load_unaligned_zeropad() everywhere, but it seems wrong to use it to load *cs only to shut up KASAN.
---
 fs/dcache.c              |  2 +-
 include/linux/compiler.h | 11 +++++++++++
 lib/string.c             |  2 +-
 3 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 5c7df1df81ff..6aa7be55a96d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -195,7 +195,7 @@ static inline int dentry_string_cmp(const unsigned char *cs, const unsigned char
 	unsigned long a,b,mask;
 
 	for (;;) {
-		a = *(unsigned long *)cs;
+		a = READ_PARTIAL_CHECK(*(unsigned long *)cs);
 		b = load_unaligned_zeropad(ct);
 		if (tcount < sizeof(unsigned long))
 			break;
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 52e611ab9a6c..85b63c2e196e 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -240,6 +240,7 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
  * required ordering.
  */
 #include <asm/barrier.h>
+#include <linux/kasan-checks.h>
 
 #define __READ_ONCE(x, check) \
 ({ \
@@ -259,6 +260,16 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
  */
 #define READ_ONCE_NOCHECK(x) __READ_ONCE(x, 0)
 
+#ifdef CONFIG_KASAN
+#define READ_PARTIAL_CHECK(x)			\
+({						\
+	kasan_check_read(&(x), 1);		\
+	READ_ONCE_NOCHECK(x);			\
+})
+#else
+#define READ_PARTIAL_CHECK(x) (x)
+#endif
+
 #define WRITE_ONCE(x, val) \
 ({ \
 	union { typeof(x) __val; char __c[1]; } __u =	\
diff --git a/lib/string.c b/lib/string.c
index 64a9e33f1daa..2396856e4c56 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -203,7 +203,7 @@ ssize_t strscpy(char *dest, const char *src, size_t count)
 	while (max >= sizeof(unsigned long)) {
 		unsigned long c, data;
 
-		c = *(unsigned long *)(src+res);
+		c = READ_PARTIAL_CHECK(*(unsigned long *)(src+res));
 		if (has_zero(c, &data, &constants)) {
 			data = prep_zero_mask(c, data, &constants);
 			data = create_zero_mask(data);
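For reference, the has_zero()/zero-mask tail step that READ_PARTIAL_CHECK would guard can be illustrated in user space (a little-endian sketch with local stand-ins, not the kernel's word-at-a-time helpers):

```c
#include <stdint.h>
#include <string.h>

#define ONES  0x0101010101010101UL
#define HIGHS 0x8080808080808080UL

/*
 * Little-endian sketch of the word-at-a-time tail step in strscpy():
 * once the word read from src contains a NUL, only the bytes below the
 * NUL are kept and the rest of the stored word is zeroed, so dest never
 * receives the past-the-NUL garbage picked up by the optimistic read.
 * Returns the number of string bytes in the word, or -1 if no NUL.
 */
static int store_tail_word(unsigned char *dest, unsigned long c)
{
	unsigned long zeros = (c - ONES) & ~c & HIGHS;
	unsigned long mask;
	int k;

	if (!zeros)
		return -1;			/* no NUL in this word */

	k = __builtin_ctzl(zeros) / 8;		/* index of the first NUL */
	mask = (1UL << (8 * k)) - 1;		/* keep bytes below the NUL */
	c &= mask;
	memcpy(dest, &c, sizeof(c));
	return k;
}
```

This is why only the load needs the KASAN accommodation: the bytes beyond the NUL are masked off before anything is stored.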
On Thu, Jan 25, 2018 at 8:13 PM, Andrey Ryabinin aryabinin@virtuozzo.com wrote:
On 01/25/2018 08:55 PM, Linus Torvalds wrote:
On Thu, Jan 25, 2018 at 12:32 AM, Dmitry Vyukov dvyukov@google.com wrote:
On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
So I'd *much* rather have some way to tell KASAN that word-at-a-time is going on. Because that approach definitely makes a difference in other places.
The other option was to use READ_ONCE_NOCHECK().
How about just using the same accessor that we do for the dcache case. That gives a reasonable example of the whole word-at-a-time model, and should be good.
If we also instrument load_unaligned_zeropad() with kasan_check_read(addr, 1), then it should be fine. We don't want a completely unchecked read of the source string.
But I also would like to revert df4c0e36f1b1 ("fs: dcache: manually unpoison dname after allocation to shut up kasan's reports"). So I was going to send something like the hunk below (split into several patches).
Or we could just use the instrumented load_unaligned_zeropad() everywhere, but it seems wrong to use it to load *cs only to shut up KASAN.
 fs/dcache.c              |  2 +-
 include/linux/compiler.h | 11 +++++++++++
 lib/string.c             |  2 +-
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 5c7df1df81ff..6aa7be55a96d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -195,7 +195,7 @@ static inline int dentry_string_cmp(const unsigned char *cs, const unsigned char
 	unsigned long a,b,mask;
 
 	for (;;) {
-		a = *(unsigned long *)cs;
+		a = READ_PARTIAL_CHECK(*(unsigned long *)cs);
 		b = load_unaligned_zeropad(ct);
 		if (tcount < sizeof(unsigned long))
 			break;
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 52e611ab9a6c..85b63c2e196e 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -240,6 +240,7 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
  * required ordering.
  */
 #include <asm/barrier.h>
+#include <linux/kasan-checks.h>
 
 #define __READ_ONCE(x, check) \
 ({ \
@@ -259,6 +260,16 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
  */
 #define READ_ONCE_NOCHECK(x) __READ_ONCE(x, 0)
 
+#ifdef CONFIG_KASAN
+#define READ_PARTIAL_CHECK(x)			\
+({						\
+	kasan_check_read(&(x), 1);		\
+	READ_ONCE_NOCHECK(x);		\
+})
+#else
+#define READ_PARTIAL_CHECK(x) (x)
+#endif
+
 #define WRITE_ONCE(x, val) \
 ({ \
 	union { typeof(x) __val; char __c[1]; } __u =	\
diff --git a/lib/string.c b/lib/string.c
index 64a9e33f1daa..2396856e4c56 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -203,7 +203,7 @@ ssize_t strscpy(char *dest, const char *src, size_t count)
 	while (max >= sizeof(unsigned long)) {
 		unsigned long c, data;
 
-		c = *(unsigned long *)(src+res);
+		c = READ_PARTIAL_CHECK(*(unsigned long *)(src+res));
 		if (has_zero(c, &data, &constants)) {
 			data = prep_zero_mask(c, data, &constants);
 			data = create_zero_mask(data);
Looks good to me as a general way to support the word-at-a-time pattern.
This will also get rid of this in fs/dcache.c:
	if (IS_ENABLED(CONFIG_DCACHE_WORD_ACCESS))
		kasan_unpoison_shadow(dname,
			round_up(name->len + 1, sizeof(unsigned long)));
On 2018-01-09 17:37, Andrey Ryabinin wrote:
strscpy() performs optimistic word-at-a-time reads, so it may access memory past the end of the object. That is perfectly fine, since strscpy() doesn't use that (past-the-end) data and makes sure the optimistic read won't cross a page boundary.
But KASAN doesn't know anything about that so it will complain. There are several possible ways to address this issue, but none are perfect. See https://lkml.kernel.org/r/9f0a9cf6-51f7-cd1f-5dc6-6d510a7b8ec4@virtuozzo.com
It seems the best solution is to simply disable the word-at-a-time optimization. My trivial testing shows that byte-at-a-time could be up to 4.3 times slower than word-at-a-time. It may seem like a lot, but it's actually ~1.2e-10 sec per symbol vs ~4.8e-10 sec per symbol on modern hardware. And we don't use strscpy() in performance-critical paths to copy large amounts of data, so it shouldn't matter anyway.
Fixes: 30035e45753b7 ("string: provide strscpy()") Signed-off-by: Andrey Ryabinin aryabinin@virtuozzo.com Cc: stable@vger.kernel.org
Acked-by: Rasmus Villemoes linux@rasmusvillemoes.dk
Your microbenchmark even favours word-at-a-time slightly, since in practice I think at least one of src or dst will be unaligned a lot of the time, and while x86 may HAVE_EFFICIENT_UNALIGNED_ACCESS, it's still a little more expensive than doing aligned access. And since strscpy is not called that often, I expect some of the ~300 bytes of instruction cache it occupies can be put to better use elsewhere.
Rasmus
On Tue, Jan 9, 2018 at 6:37 PM, Andrey Ryabinin aryabinin@virtuozzo.com wrote:
strscpy() performs optimistic word-at-a-time reads, so it may access memory past the end of the object. That is perfectly fine, since strscpy() doesn't use that (past-the-end) data and makes sure the optimistic read won't cross a page boundary.
But KASAN doesn't know anything about that so it will complain. There are several possible ways to address this issue, but none are perfect. See https://lkml.kernel.org/r/9f0a9cf6-51f7-cd1f-5dc6-6d510a7b8ec4@virtuozzo.com
It seems the best solution is to simply disable the word-at-a-time optimization. My trivial testing shows that byte-at-a-time could be up to 4.3 times slower than word-at-a-time. It may seem like a lot, but it's actually ~1.2e-10 sec per symbol vs ~4.8e-10 sec per symbol on modern hardware. And we don't use strscpy() in performance-critical paths to copy large amounts of data, so it shouldn't matter anyway.
What prevents you from reverting the patch? After this one, all the advantages of the function become useless AFAIU.