From: Changbin Du changbin.du@gmail.com
The userspace can ask kprobe to intercept strings at any memory address, including invalid kernel address. In this case, fetch_store_strlen() would crash since it uses general usercopy function, and user access functions are no longer allowed to access kernel memory.
For example, we can crash the kernel by doing something as below:
$ sudo kprobe 'p:do_sys_open +0(+0(%si)):string'
[ 103.620391] BUG: GPF in non-whitelisted uaccess (non-canonical address?) [ 103.622104] general protection fault: 0000 [#1] SMP PTI [ 103.623424] CPU: 10 PID: 1046 Comm: cat Not tainted 5.0.0-rc3-00130-gd73aba1-dirty #96 [ 103.625321] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-2-g628b2e6-dirty-20190104_103505-linux 04/01/2014 [ 103.628284] RIP: 0010:process_fetch_insn+0x1ab/0x4b0 [ 103.629518] Code: 10 83 80 28 2e 00 00 01 31 d2 31 ff 48 8b 74 24 28 eb 0c 81 fa ff 0f 00 00 7f 1c 85 c0 75 18 66 66 90 0f ae e8 48 63 ca 89 f8 <8a> 0c 31 66 66 90 83 c2 01 84 c9 75 dc 89 54 24 34 89 44 24 28 48 [ 103.634032] RSP: 0018:ffff88845eb37ce0 EFLAGS: 00010246 [ 103.635312] RAX: 0000000000000000 RBX: ffff888456c4e5a8 RCX: 0000000000000000 [ 103.637057] RDX: 0000000000000000 RSI: 2e646c2f6374652f RDI: 0000000000000000 [ 103.638795] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [ 103.640556] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 [ 103.642297] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 103.644040] FS: 0000000000000000(0000) GS:ffff88846f000000(0000) knlGS:0000000000000000 [ 103.646019] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 103.647436] CR2: 00007ffc79758038 CR3: 0000000463360006 CR4: 0000000000020ee0 [ 103.649147] Call Trace: [ 103.649781] ? sched_clock_cpu+0xc/0xa0 [ 103.650747] ? do_sys_open+0x5/0x220 [ 103.651635] kprobe_trace_func+0x303/0x380 [ 103.652645] ? do_sys_open+0x5/0x220 [ 103.653528] kprobe_dispatcher+0x45/0x50 [ 103.654682] ? do_sys_open+0x1/0x220 [ 103.655875] kprobe_ftrace_handler+0x90/0xf0 [ 103.657282] ftrace_ops_assist_func+0x54/0xf0 [ 103.658564] ? __call_rcu+0x1dc/0x280 [ 103.659482] 0xffffffffc00000bf [ 103.660384] ? __ia32_sys_open+0x20/0x20 [ 103.661682] ? do_sys_open+0x1/0x220 [ 103.662863] do_sys_open+0x5/0x220 [ 103.663988] do_syscall_64+0x60/0x210 [ 103.665201] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 103.666862] RIP: 0033:0x7fc22fadccdd [ 103.668034] Code: 48 89 54 24 e0 41 83 e2 40 75 32 89 f0 25 00 00 41 00 3d 00 00 41 00 74 24 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 77 33 f3 c3 66 0f 1f 84 00 00 00 00 00 48 8d 44 [ 103.674029] RSP: 002b:00007ffc7972c3a8 EFLAGS: 00000287 ORIG_RAX: 0000000000000101 [ 103.676512] RAX: ffffffffffffffda RBX: 0000562f86147a21 RCX: 00007fc22fadccdd [ 103.678853] RDX: 0000000000080000 RSI: 00007fc22fae1428 RDI: 00000000ffffff9c [ 103.681151] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000 [ 103.683489] R10: 0000000000000000 R11: 0000000000000287 R12: 00007fc22fce90a8 [ 103.685774] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000 [ 103.688056] Modules linked in: [ 103.689131] ---[ end trace 43792035c28984a1 ]---
This can be fixed by using probe_mem_read() instead, as it can handle faulting kernel memory addresses, which kprobes can legitimately do.
Link: http://lkml.kernel.org/r/20190125151051.7381-1-changbin.du@gmail.com
Cc: stable@vger.kernel.org Fixes: 9da3f2b7405 ("x86/fault: BUG() when uaccess helpers fault on kernel addresses") Signed-off-by: Changbin Du changbin.du@gmail.com Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org --- kernel/trace/trace_kprobe.c | 10 +--------- 1 file changed, 1 insertion(+), 9 deletions(-)
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index d5fb09ebba8b..9eaf07f99212 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -861,22 +861,14 @@ static const struct file_operations kprobe_profile_ops = { static nokprobe_inline int fetch_store_strlen(unsigned long addr) { - mm_segment_t old_fs; int ret, len = 0; u8 c;
- old_fs = get_fs(); - set_fs(KERNEL_DS); - pagefault_disable(); - do { - ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1); + ret = probe_mem_read(&c, (u8 *)addr + len, 1); len++; } while (c && ret == 0 && len < MAX_STRING_SIZE);
- pagefault_enable(); - set_fs(old_fs); - return (ret < 0) ? ret : len; }
On Fri, Feb 15, 2019 at 9:49 AM Steven Rostedt rostedt@goodmis.org wrote:
From: Changbin Du changbin.du@gmail.com
The userspace can ask kprobe to intercept strings at any memory address, including invalid kernel address.
Again, this is not about a "kernel address" at all.
It's neither a kernel address _nor_ a user address. It's an invalid address entirely, and there is nothing that makes it "kernel".
Please don't talk about "invalid kernel addresses" when it is no such thing.
Linus
On Fri, 15 Feb 2019 09:55:32 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:
On Fri, Feb 15, 2019 at 9:49 AM Steven Rostedt rostedt@goodmis.org wrote:
From: Changbin Du changbin.du@gmail.com
The userspace can ask kprobe to intercept strings at any memory address, including invalid kernel address.
Again, this is not about a "kernel address" at all.
It's neither a kernel address _nor_ a user address. It's an invalid address entirely, and there is nothing that makes it "kernel".
Please don't talk about "invalid kernel addresses" when it is no such thing.
Ah, I see what you are saying. The example of the bug that the patch fixed used a non-canonical address, which is "garbage", and not kernel or user space. Point taken.
But the issue is deeper than that. This code never bugged until the following commit was added:
9da3f2b7405 "x86/fault: BUG() when uaccess helpers fault on kernel addresses"
Before that commit, this worked fine.
In fact, we can trigger the same bug (with a slightly different message), if we use a kernel space address.
To test this, I added the following variable somewhere in the code:
void *sdr_ptr = 0xffffffff70112200;
And then did the following:
# echo 'p:fault do_sys_open arg=+0(@sdr_ptr):string' > /debug/tracing/kprobe_events
Which will read the sdr_ptr variable just like the example did with +0(+0(%si)):string. But this time I control the what address it is triggered on.
And it crashed with:
[ 1447.876799] BUG: pagefault on kernel address 0xffffffff70112200 in non-whitelisted uaccess
The above message in the bug in the patch was: "BUG: GPF in non-whitelisted uaccess (non-canonical address?)"
[ 1447.885053] BUG: unable to handle kernel paging request at ffffffff70112200 [ 1447.891997] #PF error: [normal kernel read fault] [ 1447.896695] PGD 8a212067 P4D 8a212067 PUD 0 [ 1447.900679] BUG: pagefault on kernel address 0xffffffff70112200 in non-whitelisted uaccess [ 1447.900967] Oops: 0000 [#1] PREEMPT SMP PTI [ 1447.913394] CPU: 7 PID: 1688 Comm: cat Not tainted 5.0.0-rc5-test+ #28 [ 1447.919905] Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016 [ 1447.928843] RIP: 0010:process_fetch_insn+0x1a0/0x470 [ 1447.933804] Code: ff f0 80 48 03 80 83 80 18 1a 00 00 01 31 c9 eb 10 48 83 c2 01 85 c0 75 1f 81 f9 ff 0f 00 00 7f 17 0f 1f 00 0f ae e8 44 89 e0 <40> 8a 32 0f 1f 00 83 c1 01 40 84 f6 75 d9 65 48 8b 14 25 00 5c 01 [ 1447.952520] RSP: 0018:ffffb77b80673d08 EFLAGS: 00010246 [ 1447.957736] RAX: 0000000000000000 RBX: ffff91bfb7ab2c30 RCX: 0000000000000000 [ 1447.964857] RDX: ffffffff70112200 RSI: ffffffff97267a80 RDI: 00007ffffffff000 [ 1447.971981] RBP: 0000000000000000 R08: ffffffff70112200 R09: ffff91c014518000 [ 1447.979105] R10: ffff91c01451801c R11: ffff91c0185e5800 R12: 0000000000000000 [ 1447.986226] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 1447.993342] FS: 0000000000000000(0000) GS:ffff91c01a9c0000(0000) knlGS:0000000000000000 [ 1448.001417] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1448.007157] CR2: ffffffff70112200 CR3: 00000000b3f32004 CR4: 00000000001606e0 [ 1448.014280] Call Trace: [ 1448.016737] kprobe_trace_func+0x278/0x360 [ 1448.020834] ? preempt_count_sub+0x98/0xe0 [ 1448.024931] ? do_sys_open+0x5/0x220 [ 1448.028503] kprobe_dispatcher+0x36/0x50 [ 1448.032426] ? do_sys_open+0x1/0x220 [ 1448.036002] kprobe_ftrace_handler+0xb5/0x120 [ 1448.040356] ftrace_ops_assist_func+0x87/0x110 [ 1448.044797] 0xffffffffc06a30bf [ 1448.047939] ? __ia32_sys_open+0x20/0x20 [ 1448.051860] ? do_syscall_64+0x1c/0x160 [ 1448.055694] ? do_sys_open+0x1/0x220 [ 1448.059268] do_sys_open+0x5/0x220 [ 1448.062672] do_syscall_64+0x60/0x160 [ 1448.066335] entry_SYSCALL_64_after_hwframe+0x49/0xbe
Which triggered on the address: 0xffffffff70112200
Which is perfectly canonical.
The reason I use "kernel address" is because this became an issue after "x86/fault: BUG() when uaccess helpers fault on kernel addresses" was added and that explicitly states "kernel address".
That patch added:
+ /* + * This is a faulting memory access in kernel space, on a kernel + * address, in a usercopy function. This can e.g. be caused by improper + * use of helpers like __put_user and by improper attempts to access + * userspace addresses in KERNEL_DS regions. + * The one (semi-)legitimate exception are probe_kernel_{read,write}(), + * which can be invoked from places like kgdb, /dev/mem (for reading) + * and privileged BPF code (for reading). + * The probe_kernel_*() functions set the kernel_uaccess_faults_ok flag + * to tell us that faulting on kernel addresses, and even noncanonical + * addresses, in a userspace accessor does not necessarily imply a + * kernel bug, root might just be doing weird stuff. + */ + if (current->kernel_uaccess_faults_ok) + return false; + + /* This is bad. Refuse the fixup so that we go into die(). */ + if (trapnr == X86_TRAP_PF) { + pr_emerg("BUG: pagefault on kernel address 0x%lx in non-whitelisted uaccess\n", + fault_addr); + } else { + pr_emerg("BUG: GPF in non-whitelisted uaccess (non-canonical address?)\n"); + }
Where this path triggered by the kprobe using copy_from_user(). So yeah, it can happen if it triggered on a non-canonical address (which is just garbage), but it can also trigger if it's a kernel address that isn't mapped either.
So the comment in the code is not 100% accurate, because it isn't just a kernel address, but also a non-canonical one. Something tells me that the change log of the commit that checks this isn't written well either. What exactly can't be done now with uaccess code? I'm guessing that it should only be allowed to fault on user space addresses? So should I change this commit log to:
kprobe: Do not use uaccess function to access non-user address that can fault
And change all the "kernel address" mentions to "non-user address" as non-user covers both kernel address and non-canonical?
-- Steve
This patch allows me to modify the sdr_ptr as well from the tracing directory:
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 2cf3c747a357..292fe2ef0a45 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -7928,6 +7928,45 @@ static const struct file_operations buffer_percent_fops = { .llseek = default_llseek, };
+void *sdr_ptr = 0xffffffff70112200; + +static ssize_t +sdr_ptr_read(struct file *filp, char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + char buf[64]; + int r; + + r = sprintf(buf, "%px\n", sdr_ptr); + + return simple_read_from_buffer(ubuf, cnt, ppos, buf, r); +} + +static ssize_t +sdr_ptr_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + unsigned long val; + int ret; + + ret = kstrtoul_from_user(ubuf, cnt, 0, &val); + if (ret) + return ret; + + sdr_ptr = val; + + (*ppos)++; + + return cnt; +} + +static const struct file_operations sdr_ptr_fops = { + .open = tracing_open_generic, + .read = sdr_ptr_read, + .write = sdr_ptr_write, + .llseek = default_llseek, +}; + struct dentry *trace_instance_dir;
static void @@ -8188,6 +8227,9 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer) struct trace_event_file *file; int cpu;
+ trace_create_file("sdr_ptr", 0644, d_tracer, + NULL, &sdr_ptr_fops); + trace_create_file("available_tracers", 0444, d_tracer, tr, &show_traces_fops);
On Feb 15, 2019, at 2:15 PM, Steven Rostedt rostedt@goodmis.org wrote:
On Fri, 15 Feb 2019 09:55:32 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:
On Fri, Feb 15, 2019 at 9:49 AM Steven Rostedt rostedt@goodmis.org wrote:
From: Changbin Du changbin.du@gmail.com
The userspace can ask kprobe to intercept strings at any memory address, including invalid kernel address.
Again, this is not about a "kernel address" at all.
It's neither a kernel address _nor_ a user address. It's an invalid address entirely, and there is nothing that makes it "kernel".
Please don't talk about "invalid kernel addresses" when it is no such thing.
Ah, I see what you are saying. The example of the bug that the patch fixed used a non-canonical address, which is "garbage", and not kernel or user space. Point taken.
But the issue is deeper than that. This code never bugged until the following commit was added:
9da3f2b7405 "x86/fault: BUG() when uaccess helpers fault on kernel addresses"
Before that commit, this worked fine.
In fact, we can trigger the same bug (with a slightly different message), if we use a kernel space address.
To test this, I added the following variable somewhere in the code:
void *sdr_ptr = 0xffffffff70112200;
And then did the following:
# echo 'p:fault do_sys_open arg=+0(@sdr_ptr):string' > /debug/tracing/kprobe_events
Which will read the sdr_ptr variable just like the example did with +0(+0(%si)):string. But this time I control the what address it is triggered on.
And it crashed with:
[ 1447.876799] BUG: pagefault on kernel address 0xffffffff70112200 in non-whitelisted uaccess
The above message in the bug in the patch was: "BUG: GPF in non-whitelisted uaccess (non-canonical address?)"
[ 1447.885053] BUG: unable to handle kernel paging request at ffffffff70112200 [ 1447.891997] #PF error: [normal kernel read fault] [ 1447.896695] PGD 8a212067 P4D 8a212067 PUD 0 [ 1447.900679] BUG: pagefault on kernel address 0xffffffff70112200 in non-whitelisted uaccess [ 1447.900967] Oops: 0000 [#1] PREEMPT SMP PTI [ 1447.913394] CPU: 7 PID: 1688 Comm: cat Not tainted 5.0.0-rc5-test+ #28 [ 1447.919905] Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016 [ 1447.928843] RIP: 0010:process_fetch_insn+0x1a0/0x470 [ 1447.933804] Code: ff f0 80 48 03 80 83 80 18 1a 00 00 01 31 c9 eb 10 48 83 c2 01 85 c0 75 1f 81 f9 ff 0f 00 00 7f 17 0f 1f 00 0f ae e8 44 89 e0 <40> 8a 32 0f 1f 00 83 c1 01 40 84 f6 75 d9 65 48 8b 14 25 00 5c 01 [ 1447.952520] RSP: 0018:ffffb77b80673d08 EFLAGS: 00010246 [ 1447.957736] RAX: 0000000000000000 RBX: ffff91bfb7ab2c30 RCX: 0000000000000000 [ 1447.964857] RDX: ffffffff70112200 RSI: ffffffff97267a80 RDI: 00007ffffffff000 [ 1447.971981] RBP: 0000000000000000 R08: ffffffff70112200 R09: ffff91c014518000 [ 1447.979105] R10: ffff91c01451801c R11: ffff91c0185e5800 R12: 0000000000000000 [ 1447.986226] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 1447.993342] FS: 0000000000000000(0000) GS:ffff91c01a9c0000(0000) knlGS:0000000000000000 [ 1448.001417] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1448.007157] CR2: ffffffff70112200 CR3: 00000000b3f32004 CR4: 00000000001606e0 [ 1448.014280] Call Trace: [ 1448.016737] kprobe_trace_func+0x278/0x360 [ 1448.020834] ? preempt_count_sub+0x98/0xe0 [ 1448.024931] ? do_sys_open+0x5/0x220 [ 1448.028503] kprobe_dispatcher+0x36/0x50 [ 1448.032426] ? do_sys_open+0x1/0x220 [ 1448.036002] kprobe_ftrace_handler+0xb5/0x120 [ 1448.040356] ftrace_ops_assist_func+0x87/0x110 [ 1448.044797] 0xffffffffc06a30bf [ 1448.047939] ? __ia32_sys_open+0x20/0x20 [ 1448.051860] ? do_syscall_64+0x1c/0x160 [ 1448.055694] ? do_sys_open+0x1/0x220 [ 1448.059268] do_sys_open+0x5/0x220 [ 1448.062672] do_syscall_64+0x60/0x160 [ 1448.066335] entry_SYSCALL_64_after_hwframe+0x49/0xbe
Which triggered on the address: 0xffffffff70112200
Which is perfectly canonical.
The reason I use "kernel address" is because this became an issue after "x86/fault: BUG() when uaccess helpers fault on kernel addresses" was added and that explicitly states "kernel address".
That patch added:
/*
* This is a faulting memory access in kernel space, on a kernel
* address, in a usercopy function. This can e.g. be caused by improper
* use of helpers like __put_user and by improper attempts to access
* userspace addresses in KERNEL_DS regions.
* The one (semi-)legitimate exception are probe_kernel_{read,write}(),
* which can be invoked from places like kgdb, /dev/mem (for reading)
* and privileged BPF code (for reading).
* The probe_kernel_*() functions set the kernel_uaccess_faults_ok flag
* to tell us that faulting on kernel addresses, and even noncanonical
* addresses, in a userspace accessor does not necessarily imply a
* kernel bug, root might just be doing weird stuff.
*/
if (current->kernel_uaccess_faults_ok)
return false;
/* This is bad. Refuse the fixup so that we go into die(). */
if (trapnr == X86_TRAP_PF) {
pr_emerg("BUG: pagefault on kernel address 0x%lx in non-whitelisted uaccess\n",
fault_addr);
} else {
pr_emerg("BUG: GPF in non-whitelisted uaccess (non-canonical address?)\n");
}
Where this path triggered by the kprobe using copy_from_user(). So yeah, it can happen if it triggered on a non-canonical address (which is just garbage), but it can also trigger if it's a kernel address that isn't mapped either.
So the comment in the code is not 100% accurate, because it isn't just a kernel address, but also a non-canonical one. Something tells me that the change log of the commit that checks this isn't written well either. What exactly can't be done now with uaccess code? I'm guessing that it should only be allowed to fault on user space addresses? So should I change this commit log to:
kprobe: Do not use uaccess function to access non-user address that can fault
And change all the "kernel address" mentions to "non-user address" as non-user covers both kernel address and non-canonical?
-- Steve
This patch allows me to modify the sdr_ptr as well from the tracing directory:
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 2cf3c747a357..292fe2ef0a45 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -7928,6 +7928,45 @@ static const struct file_operations buffer_percent_fops = { .llseek = default_llseek, };
+void *sdr_ptr = 0xffffffff70112200;
+static ssize_t +sdr_ptr_read(struct file *filp, char __user *ubuf,
size_t cnt, loff_t *ppos)
+{
- char buf[64];
- int r;
- r = sprintf(buf, "%px\n", sdr_ptr);
- return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
+}
+static ssize_t +sdr_ptr_write(struct file *filp, const char __user *ubuf,
size_t cnt, loff_t *ppos)
+{
- unsigned long val;
- int ret;
- ret = kstrtoul_from_user(ubuf, cnt, 0, &val);
- if (ret)
return ret;
- sdr_ptr = val;
- (*ppos)++;
- return cnt;
+}
+static const struct file_operations sdr_ptr_fops = {
- .open = tracing_open_generic,
- .read = sdr_ptr_read,
- .write = sdr_ptr_write,
- .llseek = default_llseek,
+};
struct dentry *trace_instance_dir;
static void @@ -8188,6 +8227,9 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer) struct trace_event_file *file; int cpu;
- trace_create_file("sdr_ptr", 0644, d_tracer,
NULL, &sdr_ptr_fops);
- trace_create_file("available_tracers", 0444, d_tracer, tr, &show_traces_fops);
I’m missing most of the context here, but even probe_kernel_...() is unwise for a totally untrustworthy address. It could be MMIO, for example.
If needed, we could come up with a safe-ish helper for tracing. For direct-map addresses, probe_kernel_...() is probably okay. Same for the current stack. Otherwise we could walk the page tables and check that the address is cacheable, I suppose, although this is slightly dubious if we don’t also check MTRRs. We could also check that the PA is in main memory, I suppose, although this may have unfortunate interactions with the MCE code.
On Fri, 15 Feb 2019 15:49:35 -0800 Andy Lutomirski luto@amacapital.net wrote:
I’m missing most of the context here, but even probe_kernel_...() is unwise for a totally untrustworthy address. It could be MMIO, for example.
True, but kprobes are used like modules, and only allowed by root. They are used to poke literally anywhere one wants. That's the entire purpose of kprobes.
If needed, we could come up with a safe-ish helper for tracing. For direct-map addresses, probe_kernel_...() is probably okay. Same for the current stack. Otherwise we could walk the page tables and check that the address is cacheable, I suppose, although this is slightly dubious if we don’t also check MTRRs. We could also check that the PA is in main memory, I suppose, although this may have unfortunate interactions with the MCE code.
I added you just because I wanted help getting the change log correct, as that's what Linus was complaining about. I kept using "kernel address" when the sample bug used for the patch was really a non-canonical address (as Linus said, it's just garbage. Neither kernel or user space). But I pointed out that this can also bug if the address is canonical and in the kernel address space. The old code didn't complain about non-canonical or kernel address faulting before commit 9da3f2b7405, which only talks about kernel address space faulting (which is why I only mentioned that in my messages).
Would changing all the mention of "kernel address" to "non user space" be accurate?
For reference:
http://lkml.kernel.org/r/20190215174945.557218316@goodmis.org http://lkml.kernel.org/r/20190215142015.860423791@goodmis.org
-- Steve
On Feb 15, 2019, at 4:19 PM, Steven Rostedt rostedt@goodmis.org wrote:
On Fri, 15 Feb 2019 15:49:35 -0800 Andy Lutomirski luto@amacapital.net wrote:
I’m missing most of the context here, but even probe_kernel_...() is unwise for a totally untrustworthy address. It could be MMIO, for example.
True, but kprobes are used like modules, and only allowed by root. They are used to poke literally anywhere one wants. That's the entire purpose of kprobes.
If needed, we could come up with a safe-ish helper for tracing. For direct-map addresses, probe_kernel_...() is probably okay. Same for the current stack. Otherwise we could walk the page tables and check that the address is cacheable, I suppose, although this is slightly dubious if we don’t also check MTRRs. We could also check that the PA is in main memory, I suppose, although this may have unfortunate interactions with the MCE code.
I added you just because I wanted help getting the change log correct, as that's what Linus was complaining about. I kept using "kernel address" when the sample bug used for the patch was really a non-canonical address (as Linus said, it's just garbage. Neither kernel or user space). But I pointed out that this can also bug if the address is canonical and in the kernel address space. The old code didn't complain about non-canonical or kernel address faulting before commit 9da3f2b7405, which only talks about kernel address space faulting (which is why I only mentioned that in my messages).
Would changing all the mention of "kernel address" to "non user space" be accurate?
I think “kernel address” is right. It’s illegal to access anything that isn’t known to be a valid kernel address while in KERNEL_DS.
The old __copy seems likely to have always been a bit bogus.
BTW, what is this probe_mem_read() thing? Some minimal inspection suggests it’s a buggy reimplementation of probe_kernel_read(). Can you delete it and just use probe_kernel_read() directly?
For reference:
http://lkml.kernel.org/r/20190215174945.557218316@goodmis.org http://lkml.kernel.org/r/20190215142015.860423791@goodmis.org
-- Steve
On Fri, 15 Feb 2019 17:32:55 -0800 Andy Lutomirski luto@amacapital.net wrote:
I added you just because I wanted help getting the change log correct, as that's what Linus was complaining about. I kept using "kernel address" when the sample bug used for the patch was really a non-canonical address (as Linus said, it's just garbage. Neither kernel or user space). But I pointed out that this can also bug if the address is canonical and in the kernel address space. The old code didn't complain about non-canonical or kernel address faulting before commit 9da3f2b7405, which only talks about kernel address space faulting (which is why I only mentioned that in my messages).
Would changing all the mention of "kernel address" to "non user space" be accurate?
I think “kernel address” is right. It’s illegal to access anything that isn’t known to be a valid kernel address while in KERNEL_DS.
But an non-canonical address is not a "kernel address", and that will cause a bug too. This patch came about because it was changed that if we do a uaccess on something other than a user space address and take a fault (either because it was a non-canonical address, or a kernel address), we BUG! Where before that one patch, it would just return a fault.
The old __copy seems likely to have always been a bit bogus.
BTW, what is this probe_mem_read() thing? Some minimal inspection suggests it’s a buggy reimplementation of probe_kernel_read(). Can you delete it and just use probe_kernel_read() directly?
Well, the issue is that we have trace_probe_tmpl.h in that same directory, which does the work for kprobes and uprobes. The trace_kprobes.c defines all the functions for handling kprobes, and trace_uprobes.c does all the handling of uprobes, then they include trace_probe_tmpl.h which does the bulk of the work.
In the uprobes case, we have:
static nokprobe_inline int probe_mem_read(void *dest, void *src, size_t size) { void __user *vaddr = (void __force __user *)src;
return copy_from_user(dest, vaddr, size) ? -EFAULT : 0; }
Because that is adding probes on userspace code.
-- Steve
On Feb 15, 2019, at 6:08 PM, Steven Rostedt rostedt@goodmis.org wrote:
On Fri, 15 Feb 2019 17:32:55 -0800 Andy Lutomirski luto@amacapital.net wrote:
I added you just because I wanted help getting the change log correct, as that's what Linus was complaining about. I kept using "kernel address" when the sample bug used for the patch was really a non-canonical address (as Linus said, it's just garbage. Neither kernel or user space). But I pointed out that this can also bug if the address is canonical and in the kernel address space. The old code didn't complain about non-canonical or kernel address faulting before commit 9da3f2b7405, which only talks about kernel address space faulting (which is why I only mentioned that in my messages).
Would changing all the mention of "kernel address" to "non user space" be accurate?
I think “kernel address” is right. It’s illegal to access anything that isn’t known to be a valid kernel address while in KERNEL_DS.
But an non-canonical address is not a "kernel address", and that will cause a bug too.
Indeed. A non-canonical address is not known to be a valid kernel address. I stand by my slightly pedantic statement :)
This patch came about because it was changed that if we do a uaccess on something other than a user space address and take a fault (either because it was a non-canonical address, or a kernel address), we BUG! Where before that one patch, it would just return a fault.
The old __copy seems likely to have always been a bit bogus.
BTW, what is this probe_mem_read() thing? Some minimal inspection suggests it’s a buggy reimplementation of probe_kernel_read(). Can you delete it and just use probe_kernel_read() directly?
Well, the issue is that we have trace_probe_tmpl.h in that same directory, which does the work for kprobes and uprobes. The trace_kprobes.c defines all the functions for handling kprobes, and trace_uprobes.c does all the handling of uprobes, then they include trace_probe_tmpl.h which does the bulk of the work.
In the uprobes case, we have:
static nokprobe_inline int probe_mem_read(void *dest, void *src, size_t size) { void __user *vaddr = (void __force __user *)src;
return copy_from_user(dest, vaddr, size) ? -EFAULT : 0; }
Because that is adding probes on userspace code.
Can the kprobe case call probe_kernel_read? Maybe it does already?
On Fri, 15 Feb 2019 18:14:21 -0800 Andy Lutomirski luto@amacapital.net wrote:
In the uprobes case, we have:
static nokprobe_inline int probe_mem_read(void *dest, void *src, size_t size) { void __user *vaddr = (void __force __user *)src;
return copy_from_user(dest, vaddr, size) ? -EFAULT : 0; }
Because that is adding probes on userspace code.
Can the kprobe case call probe_kernel_read? Maybe it does already?
Yes, the probe_mem_read() is only used in the trace_probe_tmpl.h which for uprobes is the above "copy_from_user()" and for the kprobes case it is probe_kernel_read().
-- Steve
[ Sorry for not looking at this earlier, I have family in town so I was mostly busy over the weekend... ]
On Fri, Feb 15, 2019 at 4:19 PM Steven Rostedt rostedt@goodmis.org wrote:
Would changing all the mention of "kernel address" to "non user space" be accurate?
That would have worked, but in the meantime I just decided to pull the existing tag, because it's not _horribly_ misleading any more and now we're just talking details of what is a "kernel address" etc. And the patch itself is better than what we used to have.
That said, I do agree with Andy that both the old _copy_from_user_inatomic() thing and the new probe_mem_read() are just fundamentally broken, nasty and slow. It just so happens that probe_mem_read() works _reasonably_ well in practice on x86.
Basically, probe_mem_read() -> probe_kernel_read() really only works on true kernel addresses. And some of that is very fundamental: some architectures can have two entirely different address spaces for user and kernel memory, so if you give _any_ function an "try to read this address", it fundamentally has to be one or the other.
The fact that on x86, there is a unified address space for kernel/user, and it can work for one or the other, is just happenstance (and admittedly the common case).
So I've pulled the existing pull request, because it papers over one particular issue, but the real fix would require:
- knowing whether it's kernel or user space you access
- actually using that knowledge to then limit the addresses we are willing to probe, and _how_ we probe them.
The user-space case is fairly easy: just check the address with "access_ok()", and then use _copy_user_atomic() without any set_fs() games. That should "JustWork(tm)".
And if it's truly supposed to be a kernel address, then we probably need to add a "is this possibly a valid kernel data pointer" interface. Before we then do what "probe_kernel_read()" currently does.
Linus
On Mon, Feb 18, 2019 at 9:58 AM Linus Torvalds torvalds@linux-foundation.org wrote:
So I've pulled the existing pull request, because it papers over one particular issue, but the real fix would require:
- knowing whether it's kernel or user space you access
Actually, it would be good to split the "kernel space" access into finer granularities, because it would be good to limit those addresses by what you expect to happen.
For example, on x86-64, we have a separate virtual mapping for kernel text. If you're following a data pointer, you probably shouldn't even look at that area, and just disallow it up-front.
And the vmalloc space, we should probably look up the vmalloc descriptor for the address, to make sure we don't go into ioremap'ed areas (but we'd generally *do* want to follow pages for module data, or for vmap'ed stacks).
And fixmap and/or percpu pointers may or may not be something that we'd want to follow.
Etc.
So it would be good to not just say "user or kernel", but actually say what *kind* of kernel access it expects.
Linus
[ Added Masami too. Start of thread is here: http://lkml.kernel.org/r/20190215174712.372898450@goodmis.org ]
On Mon, 18 Feb 2019 10:23:44 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:
So it would be good to not just say "user or kernel", but actually say what *kind* of kernel access it expects.
Note, kprobes are a different kind of beast. I've used kprobes to probe userspace information as well as kernel. Heck, I could see someone even using kprobes to probe IO memory to check if a device is doing what they expect it's doing.
Basically, a kprobe is mostly used for debugging what's happening in a live kernel, to read any address. But for those that are more security minded, perhaps we could add "layers" via CONFIG or /sys files that prevent kprobes from doing certain kinds of probes?
-- Steve
On Tue, Feb 19, 2019 at 8:18 AM Steven Rostedt rostedt@goodmis.org wrote:
So it would be good to not just say "user or kernel", but actually say what *kind* of kernel access it expects.
Note, kprobes are a different kind of beast. I've used kprobes to probe userspace information as well as kernel. Heck, I could see someone even using kprobes to probe IO memory to check if a device is doing what they expect it's doing.
Note that even if that is the case, you _need_ to special "user vs kernel" information.
Because the exact same address might exist in both.
Right now I think that only happens on sparc32, but vendors used to have that issue on x86-32 too (if they had the 4G:4G patches).
Basically, a kprobe is mostly used for debugging what's happening in a live kernel, to read any address.
My point is that "any address" is not sufficient to begin with. You need "kernel or user".
Having a flag for what _kind_ of kernel address is ok might then be required for other cases if they might not be ok with following page tables to IO space..
Linus
On Tue, 19 Feb 2019 10:43:36 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:
On Tue, Feb 19, 2019 at 8:18 AM Steven Rostedt rostedt@goodmis.org wrote:
So it would be good to not just say "user or kernel", but actually say what *kind* of kernel access it expects.
Note, kprobes are a different kind of beast. I've used kprobes to probe userspace information as well as kernel. Heck, I could see someone even using kprobes to probe IO memory to check if a device is doing what they expect it's doing.
Note that even if that is the case, you _need_ to special "user vs kernel" information.
Because the exact same address might exist in both.
Right now I think that only happens on sparc32, but vendors used to have that issue on x86-32 too (if they had the 4G:4G patches).
Hmm, I didn't realize that Linux allowed the same address to be different depending on being in kernel or user space (learn something new everyday). I guess it makes sense, and even easier with the switch of cr3 from user to kernel. And I even knew of 4G:4G, but never used it, nor put too much thought about its implementation.
Basically, a kprobe is mostly used for debugging what's happening in a live kernel, to read any address.
My point is that "any address" is not sufficient to begin with. You need "kernel or user".
Having a flag for what _kind_ of kernel address is ok might then be required for other cases if they might not be ok with following page tables to IO space..
Good point. Looks like we should add a new flag for kprobe trace parameters, that tell kprobes if the address is expected to be user or kernel. That would be good regardless of the duplicate meanings, as we could use copy_from_user without touching KERNEL_DS, if the probe argument specifically states "this is user space". For example, when probing do_sys_open, and you want to read what path string was passed into the kernel.
Masami, thoughts?
-- Steve
On Tue, 19 Feb 2019 14:03:30 -0500 Steven Rostedt rostedt@goodmis.org wrote:
Basically, a kprobe is mostly used for debugging what's happening in a live kernel, to read any address.
My point is that "any address" is not sufficient to begin with. You need "kernel or user".
Having a flag for what _kind_ of kernel address is ok might then be required for other cases if they might not be ok with following page tables to IO space..
Good point. Looks like we should add a new flag for kprobe trace parameters, that tell kprobes if the address is expected to be user or kernel. That would be good regardless of the duplicate meanings, as we could use copy_from_user without touching KERNEL_DS, if the probe argument specifically states "this is user space". For example, when probing do_sys_open, and you want to read what path string was passed into the kernel.
Masami, thoughts?
Let me ensure what you want. So you want to access a "string" in user-space, not a data structure? In that case, it is very easy to me. It is enough to add a "ustring" type to kprobe events. For example, do_sys_opsn's path variable is one example. That will be +0(+0(%si)):ustring, and fetcher finally copy the string using strncpy_from_user() instead of strncpy_from_unsafe(). (*)
But if you consider to access a field in a data-structure in user space, it might need some more work (E.g. ioctl's parameter), becase if the __user pointer to the data structure is on the memory, we have to dereference the address inside kernel using probe_kernel_read(), but after getting the data strucutre address, we have to dereference the address with copy_from_user(). At this moment, we have no such strong syntax...
To solve that, maybe we need to introduce something like "back reference" of arguments in the event, e.g.
p somewhere user_data=+0(%si) field_val=+8(\user_data):u32:user
or
p somewhere +0(%si) field_val=+8(\1):u32:user
This ":user" additional suffix tells kprobe events to change fetching method to fetch the data by copy_from_user().
(*) BTW, there is another concern to use _from_user APIs in kprobe. Are those APIs might sleep??
Thank you,
On Wed, Feb 20, 2019 at 9:10 AM Masami Hiramatsu mhiramat@kernel.org wrote:
On Tue, 19 Feb 2019 14:03:30 -0500 Steven Rostedt rostedt@goodmis.org wrote:
Basically, a kprobe is mostly used for debugging what's happening in a live kernel, to read any address.
My point is that "any address" is not sufficient to begin with. You need "kernel or user".
Having a flag for what _kind_ of kernel address is ok might then be required for other cases if they might not be ok with following page tables to IO space..
Good point. Looks like we should add a new flag for kprobe trace parameters, that tell kprobes if the address is expected to be user or kernel. That would be good regardless of the duplicate meanings, as we could use copy_from_user without touching KERNEL_DS, if the probe argument specifically states "this is user space". For example, when probing do_sys_open, and you want to read what path string was passed into the kernel.
Masami, thoughts?
Let me ensure what you want. So you want to access a "string" in user-space, not a data structure? In that case, it is very easy to me. It is enough to add a "ustring" type to kprobe events. For example, do_sys_opsn's path variable is one example. That will be +0(+0(%si)):ustring, and fetcher finally copy the string using strncpy_from_user() instead of strncpy_from_unsafe(). (*)
[...]
(*) BTW, there is another concern to use _from_user APIs in kprobe. Are those APIs might sleep??
If you want to access userspace without sleeping, and ignore data in non-present pages, you can do `pagefault_disable(); err = __copy_from_user_inatomic(...); pagefault_enable();`. (Actually, maybe the kernel should have a helper for that...)
On Wed, 20 Feb 2019 14:57:31 +0100 Jann Horn jannh@google.com wrote:
(*) BTW, there is another concern to use _from_user APIs in kprobe. Are those APIs might sleep??
If you want to access userspace without sleeping, and ignore data in non-present pages, you can do `pagefault_disable(); err = __copy_from_user_inatomic(...); pagefault_enable();`. (Actually, maybe the kernel should have a helper for that...)
I was about to say pretty much the same thing (about using _inatomic()), but yeah, we should also have a helper function if we don't already to include the page_fault_disable/enable() too.
-- Steve
Hi Jann,
On Wed, 20 Feb 2019 14:57:31 +0100 Jann Horn jannh@google.com wrote:
On Wed, Feb 20, 2019 at 9:10 AM Masami Hiramatsu mhiramat@kernel.org wrote:
On Tue, 19 Feb 2019 14:03:30 -0500 Steven Rostedt rostedt@goodmis.org wrote:
Basically, a kprobe is mostly used for debugging what's happening in a live kernel, to read any address.
My point is that "any address" is not sufficient to begin with. You need "kernel or user".
Having a flag for what _kind_ of kernel address is ok might then be required for other cases if they might not be ok with following page tables to IO space..
Good point. Looks like we should add a new flag for kprobe trace parameters, that tell kprobes if the address is expected to be user or kernel. That would be good regardless of the duplicate meanings, as we could use copy_from_user without touching KERNEL_DS, if the probe argument specifically states "this is user space". For example, when probing do_sys_open, and you want to read what path string was passed into the kernel.
Masami, thoughts?
Let me ensure what you want. So you want to access a "string" in user-space, not a data structure? In that case, it is very easy to me. It is enough to add a "ustring" type to kprobe events. For example, do_sys_opsn's path variable is one example. That will be +0(+0(%si)):ustring, and fetcher finally copy the string using strncpy_from_user() instead of strncpy_from_unsafe(). (*)
[...]
(*) BTW, there is another concern to use _from_user APIs in kprobe. Are those APIs might sleep??
If you want to access userspace without sleeping, and ignore data in non-present pages, you can do `pagefault_disable(); err = __copy_from_user_inatomic(...); pagefault_enable();`. (Actually, maybe the kernel should have a helper for that...)
Ok, we are going back to the start point of this thread :)
http://lkml.kernel.org/r/20190215174712.372898450@goodmis.org
So, if user tells kprobe it is user-pointer, we check it with access_ok(), and will do something similar to the strnlen_user() and strncpy_from_user(), but using __copy_from_user_inatomic() and pagefault_disable() for kprobes.
Thank you!
On Wed, 20 Feb 2019 17:10:19 +0900 Masami Hiramatsu mhiramat@kernel.org wrote:
Let me ensure what you want. So you want to access a "string" in user-space, not a data structure? In that case, it is very easy to me. It is enough to add a "ustring" type to kprobe events. For example, do_sys_opsn's path variable is one example. That will be +0(+0(%si)):ustring, and fetcher finally copy the string using strncpy_from_user() instead of strncpy_from_unsafe(). (*)
ustring would be good.
But if you consider to access a field in a data-structure in user space, it might need some more work (E.g. ioctl's parameter), becase if the __user pointer to the data structure is on the memory, we have to dereference the address inside kernel using probe_kernel_read(), but after getting the data strucutre address, we have to dereference the address with copy_from_user(). At this moment, we have no such strong syntax...
To solve that, maybe we need to introduce something like "back reference" of arguments in the event, e.g.
p somewhere user_data=+0(%si) field_val=+8(\user_data):u32:user
or
p somewhere +0(%si) field_val=+8(\1):u32:user
This ":user" additional suffix tells kprobe events to change fetching method to fetch the data by copy_from_user().
What about just adding 'u' to the end of the offset? Say you have a data structure in kernel space that has a field in user space you want to reference?
field_val=+8u(+0(%si))
Although, I would say having ways to access current parameters may also be a nice touch ;-)
-- Steve
On Wed, 20 Feb 2019 09:49:26 -0500 Steven Rostedt rostedt@goodmis.org wrote:
On Wed, 20 Feb 2019 17:10:19 +0900 Masami Hiramatsu mhiramat@kernel.org wrote:
Let me ensure what you want. So you want to access a "string" in user-space, not a data structure? In that case, it is very easy to me. It is enough to add a "ustring" type to kprobe events. For example, do_sys_opsn's path variable is one example. That will be +0(+0(%si)):ustring, and fetcher finally copy the string using strncpy_from_user() instead of strncpy_from_unsafe(). (*)
ustring would be good.
OK.
But if you consider to access a field in a data-structure in user space, it might need some more work (E.g. ioctl's parameter), becase if the __user pointer to the data structure is on the memory, we have to dereference the address inside kernel using probe_kernel_read(), but after getting the data strucutre address, we have to dereference the address with copy_from_user(). At this moment, we have no such strong syntax...
To solve that, maybe we need to introduce something like "back reference" of arguments in the event, e.g.
p somewhere user_data=+0(%si) field_val=+8(\user_data):u32:user
or
p somewhere +0(%si) field_val=+8(\1):u32:user
This ":user" additional suffix tells kprobe events to change fetching method to fetch the data by copy_from_user().
What about just adding 'u' to the end of the offset? Say you have a data structure in kernel space that has a field in user space you want to reference?
field_val=+8u(+0(%si))
Ah, that looks good :~) thank you for this idea!
Although, I would say having ways to access current parameters may also be a nice touch ;-)
Hehe, OK, let's implement both.
Thanks!
-- Steve
On Thu, 21 Feb 2019 01:04:53 +0900 Masami Hiramatsu mhiramat@kernel.org wrote:
What about just adding 'u' to the end of the offset? Say you have a data structure in kernel space that has a field in user space you want to reference?
field_val=+8u(+0(%si))
Ah, that looks good :~) thank you for this idea!
<bike-shed>
Hmm, I wonder if we should make it +u8 or u+8? as +8u may be confused as unsigned? Like 8ULL. I don't know. Kernel developers suck at naming :-p
</bike-shed>
-- Steve
On Wed, 20 Feb 2019 11:42:17 -0500 Steven Rostedt rostedt@goodmis.org wrote:
On Thu, 21 Feb 2019 01:04:53 +0900 Masami Hiramatsu mhiramat@kernel.org wrote:
What about just adding 'u' to the end of the offset? Say you have a data structure in kernel space that has a field in user space you want to reference?
field_val=+8u(+0(%si))
Ah, that looks good :~) thank you for this idea!
<bike-shed>
Hmm, I wonder if we should make it +u8 or u+8? as +8u may be confused as unsigned? Like 8ULL. I don't know. Kernel developers suck at naming :-p
I like +u8 since it is easier to implement :-p.
Thank you,
</bike-shed>
-- Steve
Hi Steve,
On Wed, 20 Feb 2019 09:49:26 -0500 Steven Rostedt rostedt@goodmis.org wrote:
On Wed, 20 Feb 2019 17:10:19 +0900 Masami Hiramatsu mhiramat@kernel.org wrote:
Let me ensure what you want. So you want to access a "string" in user-space, not a data structure? In that case, it is very easy to me. It is enough to add a "ustring" type to kprobe events. For example, do_sys_opsn's path variable is one example. That will be +0(+0(%si)):ustring, and fetcher finally copy the string using strncpy_from_user() instead of strncpy_from_unsafe(). (*)
ustring would be good.
I've tried to implement ustring and u-offsets, but I got some issues.
- access_ok() warns if it is called in IRQ context (kprobes is.) - copy_from_user uses access_ok(), so it is not designed for irq handler.
Moreover, if we have different kernel/user address spaces, we have to assign target user-pages to kernel vma. Can we do that (doesn't it involve mutex locks)?
If not, I think what we can do "in kprobes" is only probe_kernel_read() and strncpy_from_unsafe(). This means, on the architechture whoes kernel address space doesn't map user space, we can not support user-space data fetching in kprobe evevnts.
Thank you,
On Fri, 22 Feb 2019 17:27:45 +0900 Masami Hiramatsu mhiramat@kernel.org wrote:
Hi Steve,
On Wed, 20 Feb 2019 09:49:26 -0500 Steven Rostedt rostedt@goodmis.org wrote:
On Wed, 20 Feb 2019 17:10:19 +0900 Masami Hiramatsu mhiramat@kernel.org wrote:
Let me ensure what you want. So you want to access a "string" in user-space, not a data structure? In that case, it is very easy to me. It is enough to add a "ustring" type to kprobe events. For example, do_sys_opsn's path variable is one example. That will be +0(+0(%si)):ustring, and fetcher finally copy the string using strncpy_from_user() instead of strncpy_from_unsafe(). (*)
ustring would be good.
I've tried to implement ustring and u-offsets, but I got some issues.
- access_ok() warns if it is called in IRQ context (kprobes is.)
- copy_from_user uses access_ok(), so it is not designed for irq handler.
Moreover, if we have different kernel/user address spaces, we have to assign target user-pages to kernel vma. Can we do that (doesn't it involve mutex locks)?
Or, can we do this?
long __probe_user_read(void *dst, const void *src, size_t size) { long ret; mm_segment_t old_fs = get_fs();
set_fs(USER_DS); /* Only this is changed */ pagefault_disable(); current->kernel_uaccess_faults_ok++; ret = __copy_from_user_inatomic(dst, (__force const void __user *)src, size); current->kernel_uaccess_faults_ok--; pagefault_enable(); set_fs(old_fs);
return ret ? -EFAULT : 0; }
On Fri, Feb 22, 2019 at 12:35 AM Masami Hiramatsu mhiramat@kernel.org wrote:
Or, can we do this?
long __probe_user_read(void *dst, const void *src, size_t size) {
Add a
if (!access_ok(src, size)) ret = -EFAULT; else { .. do the pagefault_disable() etc .. }
to after the "set_fs()", and it looks good to me. Make it clear that yes, this works _only_ for user reads.
Adn that makes all the games with "kernel_uaccess_faults_ok" pointless, so you can just remove them.
(note that the "access_ok()" has to come after we've done "set_fs()", because it takes the address limit from that).
Also, since normally we'd expect that we already have USER_DS, it might be worthwhile to do this with a wrapper, something along the lines of
mm_segment_t old_fs = get_fs();
if (segment_eq(old_fs, USER_DS)) return __normal_probe_user_read(); set_fs(USER_DS); ret = __normal_probe_user_read(); set_fs(old_fs); return ret;
and have that __normal_probe_user_read() just do
if (!access_ok(src, size)) return -EFAULT; pagefault_disable(); ret = __copy_from_user_inatomic(dst, ...); pagefault_enable(); return ret ? -EFAULT : 0;
which looks more obvious.
Also, I would suggest that you just make the argument type be "const void __user *", since the whole point is that this takes a user pointer, and nothing else.
Then we should still probably fix up "__probe_kernel_read()" to not allow user accesses. The easiest way to do that is actually likely to use the "unsafe_get_user()" functions *without* doing a uaccess_begin(), which will mean that modern CPU's will simply fault on a kernel access to user space.
The nice thing about that is that usually developers will have access to exactly those modern boxes, so the people who notice that it doesn't work are the right people.
Alternatively, we should just make it be architecture-specific, so that architectures can decide "this address cannot be a kernel address" and refuse to do it.
Linus
On Feb 22, 2019, at 9:43 AM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Fri, Feb 22, 2019 at 12:35 AM Masami Hiramatsu mhiramat@kernel.org wrote:
Or, can we do this?
long __probe_user_read(void *dst, const void *src, size_t size) {
Add a
if (!access_ok(src, size)) ret = -EFAULT; else {
.. do the pagefault_disable() etc .. }
to after the "set_fs()", and it looks good to me. Make it clear that yes, this works _only_ for user reads.
Adn that makes all the games with "kernel_uaccess_faults_ok" pointless, so you can just remove them.
(note that the "access_ok()" has to come after we've done "set_fs()", because it takes the address limit from that).
Also, since normally we'd expect that we already have USER_DS, it might be worthwhile to do this with a wrapper, something along the lines of
mm_segment_t old_fs = get_fs(); if (segment_eq(old_fs, USER_DS)) return __normal_probe_user_read(); set_fs(USER_DS); ret = __normal_probe_user_read(); set_fs(old_fs); return ret;
and have that __normal_probe_user_read() just do
if (!access_ok(src, size)) return -EFAULT; pagefault_disable(); ret = __copy_from_user_inatomic(dst, ...); pagefault_enable(); return ret ? -EFAULT : 0;
which looks more obvious.
Also, I would suggest that you just make the argument type be "const void __user *", since the whole point is that this takes a user pointer, and nothing else.
Then we should still probably fix up "__probe_kernel_read()" to not allow user accesses. The easiest way to do that is actually likely to use the "unsafe_get_user()" functions *without* doing a uaccess_begin(), which will mean that modern CPU's will simply fault on a kernel access to user space.
The nice thing about that is that usually developers will have access to exactly those modern boxes, so the people who notice that it doesn't work are the right people.
We use probe_kernel_read() from oops code. I’d rather it return -EFAULT than oops harder and kill the first oops.
On Fri, Feb 22, 2019 at 9:48 AM Andy Lutomirski luto@amacapital.net wrote:
On Feb 22, 2019, at 9:43 AM, Linus Torvalds torvalds@linux-foundation.org wrote:
Then we should still probably fix up "__probe_kernel_read()" to not allow user accesses. The easiest way to do that is actually likely to use the "unsafe_get_user()" functions *without* doing a uaccess_begin(), which will mean that modern CPU's will simply fault on a kernel access to user space.
The nice thing about that is that usually developers will have access to exactly those modern boxes, so the people who notice that it doesn't work are the right people.
We use probe_kernel_read() from oops code. I’d rather it return -EFAULT than oops harder and kill the first oops.
It would still do that.
Using the unsafe_get_user() macros doesn't remove the exception handling, and we wouldn't remove the whole "pagefault_disable()" either. So it would work exactly the same way it does now, except on a modern CPU it would return -EFAULT for a user space access due to AC not being set.
So no "oops harder", only "safer accesses".
Linus
On Feb 22, 2019, at 10:28 AM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Fri, Feb 22, 2019 at 9:48 AM Andy Lutomirski luto@amacapital.net wrote:
On Feb 22, 2019, at 9:43 AM, Linus Torvalds torvalds@linux-foundation.org wrote:
Then we should still probably fix up "__probe_kernel_read()" to not allow user accesses. The easiest way to do that is actually likely to use the "unsafe_get_user()" functions *without* doing a uaccess_begin(), which will mean that modern CPU's will simply fault on a kernel access to user space.
The nice thing about that is that usually developers will have access to exactly those modern boxes, so the people who notice that it doesn't work are the right people.
We use probe_kernel_read() from oops code. I’d rather it return -EFAULT than oops harder and kill the first oops.
It would still do that.
Using the unsafe_get_user() macros doesn't remove the exception handling, and we wouldn't remove the whole "pagefault_disable()" either. So it would work exactly the same way it does now, except on a modern CPU it would return -EFAULT for a user space access due to AC not being set.
Hmm. I misunderstood you. I thought you wanted the oops.
We’d have to check that we don’t trip the “SMAP violation, egads!” check.
On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
Then we should still probably fix up "__probe_kernel_read()" to not allow user accesses. The easiest way to do that is actually likely to use the "unsafe_get_user()" functions *without* doing a uaccess_begin(), which will mean that modern CPU's will simply fault on a kernel access to user space.
On bpf side the bpf_probe_read() helper just calls probe_kernel_read() and users pass both user and kernel addresses into it and expect that the helper will actually try to read from that address.
If __probe_kernel_read will suddenly start failing on all user addresses it will break the expectations. How do we solve it in bpf_probe_read? Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte in the loop? That's doable, but people already complain that bpf_probe_read() is slow and shows up in their perf report.
On Fri, 22 Feb 2019 11:27:05 -0800 Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
Then we should still probably fix up "__probe_kernel_read()" to not allow user accesses. The easiest way to do that is actually likely to use the "unsafe_get_user()" functions *without* doing a uaccess_begin(), which will mean that modern CPU's will simply fault on a kernel access to user space.
On bpf side the bpf_probe_read() helper just calls probe_kernel_read() and users pass both user and kernel addresses into it and expect that the helper will actually try to read from that address.
If __probe_kernel_read will suddenly start failing on all user addresses it will break the expectations. How do we solve it in bpf_probe_read? Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte in the loop? That's doable, but people already complain that bpf_probe_read() is slow and shows up in their perf report.
We're changing kprobes to add a specific flag to say that we want to differentiate between kernel or user reads. Can this be done with bpf_probe_read()? If it's showing up in perf report, I doubt a single check is going to cause an issue. In fact, it may actually help speed things up as the read will be optimized for either user or kernel address reading.
-- Steve
On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
On Fri, 22 Feb 2019 11:27:05 -0800 Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
Then we should still probably fix up "__probe_kernel_read()" to not allow user accesses. The easiest way to do that is actually likely to use the "unsafe_get_user()" functions *without* doing a uaccess_begin(), which will mean that modern CPU's will simply fault on a kernel access to user space.
On bpf side the bpf_probe_read() helper just calls probe_kernel_read() and users pass both user and kernel addresses into it and expect that the helper will actually try to read from that address.
If __probe_kernel_read will suddenly start failing on all user addresses it will break the expectations. How do we solve it in bpf_probe_read? Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte in the loop? That's doable, but people already complain that bpf_probe_read() is slow and shows up in their perf report.
We're changing kprobes to add a specific flag to say that we want to differentiate between kernel or user reads. Can this be done with bpf_probe_read()? If it's showing up in perf report, I doubt a single
so you're saying you will break existing kprobe scripts? I don't think it's a good idea. It's not acceptable to break bpf_probe_read uapi.
On Fri, 22 Feb 2019 11:34:58 -0800 Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
so you're saying you will break existing kprobe scripts?
Yes we may.
I don't think it's a good idea. It's not acceptable to break bpf_probe_read uapi.
Then you may need to add more code to determine if the address is user space or not in the kernel, and then go the appropriate route, before calling probe_kernel_read().
-- Steve
On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote: On Fri, 22 Feb 2019 11:27:05 -0800 Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
Then we should still probably fix up "__probe_kernel_read()" to not allow user accesses. The easiest way to do that is actually likely to use the "unsafe_get_user()" functions *without* doing a uaccess_begin(), which will mean that modern CPU's will simply fault on a kernel access to user space.
On bpf side the bpf_probe_read() helper just calls probe_kernel_read() and users pass both user and kernel addresses into it and expect that the helper will actually try to read from that address.
If __probe_kernel_read will suddenly start failing on all user addresses it will break the expectations. How do we solve it in bpf_probe_read? Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte in the loop? That's doable, but people already complain that bpf_probe_read() is slow and shows up in their perf report.
We're changing kprobes to add a specific flag to say that we want to differentiate between kernel or user reads. Can this be done with bpf_probe_read()? If it's showing up in perf report, I doubt a single
so you're saying you will break existing kprobe scripts? I don't think it's a good idea. It's not acceptable to break bpf_probe_read uapi.
If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd. I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
What to do about existing scripts is a different question.
On Fri, Feb 22, 2019 at 11:27 AM Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
On bpf side the bpf_probe_read() helper just calls probe_kernel_read() and users pass both user and kernel addresses into it and expect that the helper will actually try to read from that address.
As mentioned earlier in the thread, that's actually fundamentally broken.
There are architectures that have physically separate address spaces, with the same pointer value in both kernel and user space.
They are rare, but they exist. At least sparc32 and the old 4G:4G split x86.
So a pointer really should always unambiguously always be explicitly _either_ a kernel pointer, or a user pointer. You can't have "this is a pointer", and then try to figure it out by looking at the value. That may happen to work on x86-64, but it's literally a "happen to work on the most common architectures", not a design thing.
Linus
From: Linus Torvalds torvalds@linux-foundation.org Date: Fri, 22 Feb 2019 13:20:58 -0800
On Fri, Feb 22, 2019 at 11:27 AM Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
On bpf side the bpf_probe_read() helper just calls probe_kernel_read() and users pass both user and kernel addresses into it and expect that the helper will actually try to read from that address.
As mentioned earlier in the thread, that's actually fundamentally broken.
There are architectures that have physically separate address spaces, with the same pointer value in both kernel and user space.
They are rare, but they exist. At least sparc32 and the old 4G:4G split x86.
And sparc64.
So a pointer really should always unambiguously always be explicitly _either_ a kernel pointer, or a user pointer. You can't have "this is a pointer", and then try to figure it out by looking at the value. That may happen to work on x86-64, but it's literally a "happen to work on the most common architectures", not a design thing.
Don't be surprised if we see more separation like this in the future too.
So it's not a smart thing to code against even if you can discount all of the examples Linus gives above.
On Fri, Feb 22, 2019 at 1:38 PM David Miller davem@davemloft.net wrote:
Don't be surprised if we see more separation like this in the future too.
Yes, with the whole meltdown fiasco, there's actually more pressure to add more support for separation of kernel/user address spaces. As Andy pointed out, it's been discussed as a future wish-list for x86-64 too.
But yeah, right now the *common* architectures all distinguish kernel and user space by pointers (ie x86-64, arm64 and powerpc).
Linus
On Fri, Feb 22, 2019 at 01:59:10PM -0800, Linus Torvalds wrote:
On Fri, Feb 22, 2019 at 1:38 PM David Miller davem@davemloft.net wrote:
Don't be surprised if we see more separation like this in the future too.
Yes, with the whole meltdown fiasco, there's actually more pressure to add more support for separation of kernel/user address spaces. As Andy pointed out, it's been discussed as a future wish-list for x86-64 too.
But yeah, right now the *common* architectures all distinguish kernel and user space by pointers (ie x86-64, arm64 and powerpc).
That's all fine. I'm missing rationale for making probe_kernel_read() fail on user addresses. What is fundamentally wrong with a function probe_any_address_read() ?
For context, typical bpf kprobe program looks like this: #define probe_read(P) \ ({typeof(P) val = 0; bpf_probe_read(&val, sizeof(val), &P); val;}) SEC("kprobe/__set_task_comm") int bpf_prog(struct pt_regs *ctx) { struct signal_struct *signal; struct task_struct *tsk; char oldcomm[16] = {}; char newcomm[16] = {}; u16 oom_score_adj; u32 pid;
tsk = (void *)PT_REGS_PARM1(ctx); pid = probe_read(tsk->pid); bpf_probe_read(oldcomm, sizeof(oldcomm), &tsk->comm); bpf_probe_read(newcomm, sizeof(newcomm), (void *)PT_REGS_PARM2(ctx)); signal = probe_read(tsk->signal); oom_score_adj = probe_read(signal->oom_score_adj); ... }
where PT_REGS_PARMx macros are defined per architecture. On x86 it's #define PT_REGS_PARM1(x) ((x)->di)
The program writer has to know the meaning of function arguments. In this example they need to know that __set_task_comm is defined as void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec) in the kernel.
Right now these programs just call bpf_probe_read() on whatever data they need to access and not differentiating whether it's user or kernel.
One idea we discussed is to split bpf_probe_read() into kernel_read and user_read helpers, but in the BPF verifier we cannot determine which address space the program wants to access. The prog writer needs to manually analyze the program to use correct one. But mistakes are possible and cannot be fatal. On the kernel side we have to be safe. Both probe_kernel_read and probe_user_read must not panic if a pointer from wrong address space was passed.
Hence my preference is to keep probe_kernel_read as "try read any address". The function can be renamed to indicate so.
On Fri, Feb 22, 2019 at 11:51 PM Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
On Fri, Feb 22, 2019 at 01:59:10PM -0800, Linus Torvalds wrote:
On Fri, Feb 22, 2019 at 1:38 PM David Miller davem@davemloft.net wrote:
Don't be surprised if we see more separation like this in the future too.
Yes, with the whole meltdown fiasco, there's actually more pressure to add more support for separation of kernel/user address spaces. As Andy pointed out, it's been discussed as a future wish-list for x86-64 too.
But yeah, right now the *common* architectures all distinguish kernel and user space by pointers (ie x86-64, arm64 and powerpc).
That's all fine. I'm missing rationale for making probe_kernel_read() fail on user addresses. What is fundamentally wrong with a function probe_any_address_read() ?
I think what Linus is saying is: There are some scenarios (like a system with the old 4G/4G X86 patch) where *the same* address can refer to two different pieces of memory, depending on whether you interpret it as a kernel pointer or a user pointer. So for example, if your BPF program tries to read tsk->comm, it works, but if the BPF program then tries to read from PT_REGS_PARM2(ctx), it's going to actually interpret the userspace address as a kernel address and read completely different memory.
On top of that, from the security angle, this means that if a user passes a kernel pointer into a syscall, they can trick a tracing BPF program into looking at random kernel memory instead of the user's memory. That may or may not be problematic, depending on what you do afterwards with the data you've read. (For example, if this is a service that collects performance data and then saves it to some world-readable location on disk because the data it is collecting (including comm strings) isn't supposed to be sensitive, you might have a problem.)
For context, typical bpf kprobe program looks like this: #define probe_read(P) \ ({typeof(P) val = 0; bpf_probe_read(&val, sizeof(val), &P); val;}) SEC("kprobe/__set_task_comm") int bpf_prog(struct pt_regs *ctx) { struct signal_struct *signal; struct task_struct *tsk; char oldcomm[16] = {}; char newcomm[16] = {}; u16 oom_score_adj; u32 pid;
tsk = (void *)PT_REGS_PARM1(ctx); pid = probe_read(tsk->pid); bpf_probe_read(oldcomm, sizeof(oldcomm), &tsk->comm); bpf_probe_read(newcomm, sizeof(newcomm), (void *)PT_REGS_PARM2(ctx)); signal = probe_read(tsk->signal); oom_score_adj = probe_read(signal->oom_score_adj); ...
}
where PT_REGS_PARMx macros are defined per architecture. On x86 it's #define PT_REGS_PARM1(x) ((x)->di)
The program writer has to know the meaning of function arguments. In this example they need to know that __set_task_comm is defined as void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec) in the kernel.
Right now these programs just call bpf_probe_read() on whatever data they need to access and not differentiating whether it's user or kernel.
One idea we discussed is to split bpf_probe_read() into kernel_read and user_read helpers, but in the BPF verifier we cannot determine which address space the program wants to access. The prog writer needs to manually analyze the program to use correct one. But mistakes are possible and cannot be fatal. On the kernel side we have to be safe. Both probe_kernel_read and probe_user_read must not panic if a pointer from wrong address space was passed.
Hence my preference is to keep probe_kernel_read as "try read any address". The function can be renamed to indicate so.
From: Jann Horn jannh@google.com Date: Sat, 23 Feb 2019 00:11:58 +0100
I think what Linus is saying is: There are some scenarios (like a system with the old 4G/4G X86 patch) where *the same* address can refer to two different pieces of memory, depending on whether you interpret it as a kernel pointer or a user pointer.
Exactly.
On sparc64 the kernel is mapped exactly at the same virtual addresses as userspace processes usually are mapped, even 32-bit ones. The difference is the MMU context only.
On Fri, Feb 22, 2019 at 2:51 PM Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
That's all fine. I'm missing rationale for making probe_kernel_read() fail on user addresses.
Because it already WON'T WORK in general!
What is fundamentally wrong with a function probe_any_address_read() ?
What part of "the same pointer value can be a user address and a kernel address" is not getting through?
The user address space and the kernel address space have separate page tables on some architectures. We used to avoid it on x86, because switching address spaces was expensive, but even on x86 some vendors did it on 32-bit simply to get 4GB of user (and kernel) address space. And now we end up doing it anyway just because of meltdown.
So a kernel pointer value of 0x12345678 could be a value kernel pointer pointing to some random kmalloc'ed kernel memory, and a user pointer value of 0x12345678 could be a valid _user_ pointer pointing to some user mapping.
See?
If you access a user pointer, you need to use a user accessor function (eg "get_user()"), while if you access a kernel pointer you need to just dereference it directly (unless you can't trust it, in which case you need to use a _different_ accessor function).
The fact that user and kernel pointers happen to be distinct on x86-64 (right now) is just a random implementation detail.
Really.
I didn't realize how many people seem to have been confused about this. But it's always been true. It's just that the common architectures have had that "one single address space for both kernel and user pointers" in practice.
In fact, the *very* first kernel version had separate address spaces for kernel and user mode even on x86 (using segments, not paging). So it has literally been true since day one in Linux that a kernel address can be indistinguishable from a user address from a pure value standpoint.
Linus
On Fri, Feb 22, 2019 at 03:16:35PM -0800, Linus Torvalds wrote:
So a kernel pointer value of 0x12345678 could be a value kernel pointer pointing to some random kmalloc'ed kernel memory, and a user pointer value of 0x12345678 could be a valid _user_ pointer pointing to some user mapping.
See?
If you access a user pointer, you need to use a user accessor function (eg "get_user()"), while if you access a kernel pointer you need to just dereference it directly (unless you can't trust it, in which case you need to use a _different_ accessor function).
that was clear already. Reading 0x12345678 via probe_kernel_read can return valid value and via get_user() can return another valid value on _some_ architectures.
The fact that user and kernel pointers happen to be distinct on x86-64 (right now) is just a random implementation detail.
yes and my point that people already rely on this implementation detail. Say we implement int bpf_probe_read(void *val, void *unsafe_ptr) { if (probe_kernel_read(val, unsafe_ptr) == OK) { return 0; } else (get_user(val, unsafe_ptr) == OK) { return 0; } else { *val = 0; return -EFAULT; } } It will preserve existing bpf_probe_read() behavior on x86. If x86 implementation changes tomorrow then progs that read user addresses may start failing randomly because first probe_kernel_read() will be returning random values from kernel memory and that's no good, but at least we won't be breaking them today, so we have time to introduce bpf_user_read and bpf_kernel_read and folks have time to adopt them.
Imo that's much better than making current bpf_probe_read() fail on user addresses today and not providing a non disruptive path forward.
On Fri, Feb 22, 2019 at 3:56 PM Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
It will preserve existing bpf_probe_read() behavior on x86.
... but that's the worst possible situation.
It appears that people haven't understood that kernel and user addresses are distinct, and may have written programs that are fundamentally buggy.
And we _want_ to make it clear that they are buggy on x86-64, exactly because x86-64 is the one that gets the most testing - by far.
So if x86-64 continues working for buggy programs, then that only means that those bugs never get fixed.
It would be much better to try to get those things fixed, and make the x86-64 implementation stricter, exactly so that people end up _realizing_ that they can't just think "a pointer is a pointer, and the context doesn't matter".
From a pure functional safety standpoint, I thought bpf already knew
what kind of a pointer it had?
Linus
On Fri, Feb 22, 2019 at 04:08:59PM -0800, Linus Torvalds wrote:
On Fri, Feb 22, 2019 at 3:56 PM Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
It will preserve existing bpf_probe_read() behavior on x86.
... but that's the worst possible situation.
It appears that people haven't understood that kernel and user addresses are distinct, and may have written programs that are fundamentally buggy.
And we _want_ to make it clear that they are buggy on x86-64, exactly because x86-64 is the one that gets the most testing - by far.
So if x86-64 continues working for buggy programs, then that only means that those bugs never get fixed.
It would be much better to try to get those things fixed, and make the x86-64 implementation stricter, exactly so that people end up _realizing_ that they can't just think "a pointer is a pointer, and the context doesn't matter".
From a pure functional safety standpoint, I thought bpf already knew what kind of a pointer it had?
when bpf verifier knows the type of pointer it allows direct access to it. That's the case for skb, socket, packet data, hash maps, arrays, stack, etc. Networking progs cannot call bpf_probe_read(). It's available to tracing progs only and their goal is to walk the kernel and user memory with addresses that cannot be statically verified at program load time. We are working on adding type information (BTF) to vmlinux. Soon we'll be able to tell the type of every kernel function argument. Right now arg1 and arg2 in a kprobed function are just u64 pt_regs->di, si. Soon we'll be able to precisely identify their C type.
I completely agree on the direction that x86 is the architecture that sets an example and users need to learn the difference in pointers. I only disagree on timing. Right now users don't have an alternative. In our repo I counted ~400 calls to bpf_probe_read and about 10 times more 'indirect' calls. What's happening with 'indirect' calls is BCC toolchain using clang to automatically replace struct_a->field_foo access with bpf_probe_read(struct_a + offsetof(typeof(struct_a), field_foo));
If we had __user vs __kernel annotation available to clang we could have taught BCC to replace this '->' dereference with appropriate kernel vs user helper. Also we need to teach GCC to recognize __user and store into dwarf, so we can take it further into BTF and verify later.
Also I think disallowing bpf_probe_read() to read user pointer will not make a desired teaching effect. It will only cause painful debugging to folks when their progs will stop working. It's better to remove bpf_probe_read() completely. imo the process of teaching the users of kernel vs user pointer difference needs to be gradual. First we introduce bpf_probe_kernel_read and bpf_probe_user_read and introduce clang/gcc tooling to catch the mistakes. Going over this 400+ places and manually grepping kernel sources for __user keyword is not a great proposal if we want to keep those users. Once we have this working we can remove bpf_probe_read() altogether. Rejecting bpf prog at load time is a clear signal that user has to fix it (instead of changing run-time behavior). When the verifier gets even smarter it could potentially replace prob_read with probe_kernel_read and probe_user_read when it has that type info.
imo this kernel release should finish as-is and in the next cycle we can change probe_kernel_read() to reject user address, have temporary workaround in bpf_probe_read() with probe_kernel_read+get_user hack, introduce new bpf helpers, new tooling and eventually remove buggy bpf_probe_read.
On Fri, Feb 22, 2019 at 6:29 PM Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
imo this kernel release should finish as-is and in the next cycle we can change probe_kernel_read() to reject user address [..]
Absolutely. Nothing is going to change right now for 5.0, which is imminent.
It's really a "long-term we really need to fix this", where the only question is how soon "long-term" is.
Linus
On Fri, 22 Feb 2019 18:28:53 -0800 Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
First we introduce bpf_probe_kernel_read and bpf_probe_user_read and introduce clang/gcc tooling to catch the mistakes. Going over this 400+ places and manually grepping kernel sources for __user keyword is not a great proposal if we want to keep those users. Once we have this working we can remove bpf_probe_read() altogether. Rejecting bpf prog at load time is a clear signal that user has to fix it (instead of changing run-time behavior). When the verifier gets even smarter it could potentially replace prob_read with probe_kernel_read and probe_user_read when it has that type info.
I was about to suggest this approach. Document that bpf_probe_read() is known to be buggy and will be deprecated in the future, and that all new bpf scripts should start using bpf_probe_kernel/user_read() instead (after they have been implemented of course). And give time for people to fix their current scripts.
Perhaps in the near future, trigger some kind of warning for users that use bpf_probe_read().
-- Steve
On Fri, 22 Feb 2019 15:56:20 -0800 Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
On Fri, Feb 22, 2019 at 03:16:35PM -0800, Linus Torvalds wrote:
So a kernel pointer value of 0x12345678 could be a value kernel pointer pointing to some random kmalloc'ed kernel memory, and a user pointer value of 0x12345678 could be a valid _user_ pointer pointing to some user mapping.
See?
If you access a user pointer, you need to use a user accessor function (eg "get_user()"), while if you access a kernel pointer you need to just dereference it directly (unless you can't trust it, in which case you need to use a _different_ accessor function).
that was clear already. Reading 0x12345678 via probe_kernel_read can return valid value and via get_user() can return another valid value on _some_ architectures.
The fact that user and kernel pointers happen to be distinct on x86-64 (right now) is just a random implementation detail.
yes and my point that people already rely on this implementation detail. Say we implement int bpf_probe_read(void *val, void *unsafe_ptr) { if (probe_kernel_read(val, unsafe_ptr) == OK) { return 0; } else (get_user(val, unsafe_ptr) == OK) { return 0; } else { *val = 0; return -EFAULT; } }
Note that we can not use get_user() form kprobe handler. If you use it, you have to prepare fault_handler() and make bpf itself can be aborted. So, maybe you can use probe_user_read().
Hmm, however, it still doesn't work correctly on "some" architecture, since whether a pointer (address) points user-space or kernel-space depends on the context. In kprobe/bpf, the context means where you put the probe and which pointer you record.
I think only "__user" tag tells us which one is user-space. But unfortunately, that "__user" tag is only for compiler or checker, not for runtime binary. Such useful attribute goes away when we execute it.
So, even if we introduce "ustring", ftrace/perf users has to decide to use it by themselves. As far as I know, DWARF(debuginfo) also doesn't have that attribute. So perf-probe can not help it from debuginfo. (Maybe if we introduce C parser, it might be detected...)
It will preserve existing bpf_probe_read() behavior on x86. If x86 implementation changes tomorrow then progs that read user addresses may start failing randomly because first probe_kernel_read() will be returning random values from kernel memory and that's no good, but at least we won't be breaking them today, so we have time to introduce bpf_user_read and bpf_kernel_read and folks have time to adopt them.
I see. I think bpf also has to introduce new bpf_probe_read_user() and keep bpf_probe_read() for kernel dataa only.
Imo that's much better than making current bpf_probe_read() fail on user addresses today and not providing a non disruptive path forward.
Agreed.
Thank you,
On Fri, Feb 22, 2019 at 01:59:10PM -0800, Linus Torvalds wrote:
On Fri, Feb 22, 2019 at 1:38 PM David Miller davem@davemloft.net wrote:
Don't be surprised if we see more separation like this in the future too.
Yes, with the whole meltdown fiasco, there's actually more pressure to add more support for separation of kernel/user address spaces. As Andy pointed out, it's been discussed as a future wish-list for x86-64 too.
S/390 is another example.
I've also proposed a RISC-V extension for it, including prototypes for Rocketchip and Qemu, and a Linux kernel support, but it didn't go any way.
On Fri, Feb 22, 2019 at 11:27:05AM -0800, Alexei Starovoitov wrote:
On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
Then we should still probably fix up "__probe_kernel_read()" to not allow user accesses. The easiest way to do that is actually likely to use the "unsafe_get_user()" functions *without* doing a uaccess_begin(), which will mean that modern CPU's will simply fault on a kernel access to user space.
On bpf side the bpf_probe_read() helper just calls probe_kernel_read() and users pass both user and kernel addresses into it and expect that the helper will actually try to read from that address.
Slightly related and FWIW, BCC's eBPF-based opensnoop tool [1] installs a kprobe on do_sys_open to monitor calls to the open syscall globally.
do_sys_open() has prototype:
long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode);
This causes a "blank" filename to be displayed by opensnoop when I run it on my Pixel 3 (arm64), possibly because this is a user pointer. However, it works fine on x86-64.
So it seems to me that on arm64, reading user pointers directly still doesn't work even if there is a distinction between user/kernel addresses. In that case reading the user pointer using user accessors (possibly using bpf_probe_user_read helper) should be needed to fix this issue (as Yonghong also privately discussed with me).
[1] https://github.com/iovisor/bcc/blob/master/tools/opensnoop.py#L140
thanks!
- Joel
If __probe_kernel_read will suddenly start failing on all user addresses it will break the expectations. How do we solve it in bpf_probe_read? Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte in the loop? That's doable, but people already complain that bpf_probe_read() is slow and shows up in their perf report.
On Tue, 26 Feb 2019 10:24:47 -0500 Joel Fernandes joel@joelfernandes.org wrote:
On Fri, Feb 22, 2019 at 11:27:05AM -0800, Alexei Starovoitov wrote:
On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
Then we should still probably fix up "__probe_kernel_read()" to not allow user accesses. The easiest way to do that is actually likely to use the "unsafe_get_user()" functions *without* doing a uaccess_begin(), which will mean that modern CPU's will simply fault on a kernel access to user space.
On bpf side the bpf_probe_read() helper just calls probe_kernel_read() and users pass both user and kernel addresses into it and expect that the helper will actually try to read from that address.
Slightly related and FWIW, BCC's eBPF-based opensnoop tool [1] installs a kprobe on do_sys_open to monitor calls to the open syscall globally.
do_sys_open() has prototype:
long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode);
This causes a "blank" filename to be displayed by opensnoop when I run it on my Pixel 3 (arm64), possibly because this is a user pointer. However, it works fine on x86-64.
So it seems to me that on arm64, reading user pointers directly still doesn't work even if there is a distinction between user/kernel addresses. In that case reading the user pointer using user accessors (possibly using bpf_probe_user_read helper) should be needed to fix this issue (as Yonghong also privately discussed with me).
OK, it sounds like the same issue. Please add a bpf_user_read() and use it for __user pointer.
Thank you,
On Thu, Feb 28, 2019 at 09:29:13PM +0900, Masami Hiramatsu wrote:
On Tue, 26 Feb 2019 10:24:47 -0500 Joel Fernandes joel@joelfernandes.org wrote:
On Fri, Feb 22, 2019 at 11:27:05AM -0800, Alexei Starovoitov wrote:
On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
Then we should still probably fix up "__probe_kernel_read()" to not allow user accesses. The easiest way to do that is actually likely to use the "unsafe_get_user()" functions *without* doing a uaccess_begin(), which will mean that modern CPU's will simply fault on a kernel access to user space.
On bpf side the bpf_probe_read() helper just calls probe_kernel_read() and users pass both user and kernel addresses into it and expect that the helper will actually try to read from that address.
Slightly related and FWIW, BCC's eBPF-based opensnoop tool [1] installs a kprobe on do_sys_open to monitor calls to the open syscall globally.
do_sys_open() has prototype:
long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode);
This causes a "blank" filename to be displayed by opensnoop when I run it on my Pixel 3 (arm64), possibly because this is a user pointer. However, it works fine on x86-64.
So it seems to me that on arm64, reading user pointers directly still doesn't work even if there is a distinction between user/kernel addresses. In that case reading the user pointer using user accessors (possibly using bpf_probe_user_read helper) should be needed to fix this issue (as Yonghong also privately discussed with me).
OK, it sounds like the same issue. Please add a bpf_user_read() and use it for __user pointer.
CC'd Yonghong who said eariler to me he would add it, but I could add it too if he wants me to.
thanks,
- Joel
On Fri, 22 Feb 2019 09:43:14 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:
On Fri, Feb 22, 2019 at 12:35 AM Masami Hiramatsu mhiramat@kernel.org wrote:
Or, can we do this?
long __probe_user_read(void *dst, const void *src, size_t size) {
Add a
if (!access_ok(src, size)) ret = -EFAULT; else {
.. do the pagefault_disable() etc .. }
Since kprobes handler runs in IRQ context, we can not use access_ok() in it. (only on x86 + CONFIG_DEBUG_ATOMIC_SLEEP=y)
In arch/x86/include/asm/uaccess.h:
#define access_ok(addr, size) \ ({ \ WARN_ON_IN_IRQ(); \ likely(!__range_not_ok(addr, size, user_addr_max())); \ })
Do we need acccess_ok_inatomic()?
BTW, it seems a bit strange that this WARN_ON_IN_IRQ() is only in x86 access_ok() implementation, since CONFIG_DEBUG_ATOMIC_SLEEP(which defines WARN_ON_IN_IRQ) doesn't depend on x86, and access_ok() is widely used in kernel. I think it would be better that each arch provides __access_ok() and include/linux/uaccess.h provides access_ok() with WARN_ON_IN_IRQ().
to after the "set_fs()", and it looks good to me. Make it clear that yes, this works _only_ for user reads.
Adn that makes all the games with "kernel_uaccess_faults_ok" pointless, so you can just remove them.
OK.
(note that the "access_ok()" has to come after we've done "set_fs()", because it takes the address limit from that).
Also, since normally we'd expect that we already have USER_DS, it might be worthwhile to do this with a wrapper, something along the lines of
mm_segment_t old_fs = get_fs(); if (segment_eq(old_fs, USER_DS)) return __normal_probe_user_read(); set_fs(USER_DS); ret = __normal_probe_user_read(); set_fs(old_fs); return ret;
and have that __normal_probe_user_read() just do
if (!access_ok(src, size)) return -EFAULT; pagefault_disable(); ret = __copy_from_user_inatomic(dst, ...); pagefault_enable(); return ret ? -EFAULT : 0;
which looks more obvious.
OK.
Also, I would suggest that you just make the argument type be "const void __user *", since the whole point is that this takes a user pointer, and nothing else.
Ah, right.
Then we should still probably fix up "__probe_kernel_read()" to not allow user accesses. The easiest way to do that is actually likely to use the "unsafe_get_user()" functions *without* doing a uaccess_begin(), which will mean that modern CPU's will simply fault on a kernel access to user space.
Or, use __chk_user_ptr(ptr) to check it?
Thank you,
The nice thing about that is that usually developers will have access to exactly those modern boxes, so the people who notice that it doesn't work are the right people.
Alternatively, we should just make it be architecture-specific, so that architectures can decide "this address cannot be a kernel address" and refuse to do it.
Linus
On Sat, 23 Feb 2019 12:47:46 +0900 Masami Hiramatsu mhiramat@kernel.org wrote:
Since kprobes handler runs in IRQ context, we can not use access_ok() in it. (only on x86 + CONFIG_DEBUG_ATOMIC_SLEEP=y)
Is it really IRQ context or exception context? That is, one (interrupts) happen for any task, but exceptions happen because of the software that is executed (like a breakpoint). Although you can have a kprobe trigger in an interrupt handler (where user access wouldn't make sense anyway). But there should be no problem with user access from an exception handler.
-- Steve
On Sat, Feb 23, 2019 at 4:44 PM Steven Rostedt rostedt@goodmis.org wrote:
On Sat, 23 Feb 2019 12:47:46 +0900 Masami Hiramatsu mhiramat@kernel.org wrote:
Since kprobes handler runs in IRQ context, we can not use access_ok() in it. (only on x86 + CONFIG_DEBUG_ATOMIC_SLEEP=y)
Is it really IRQ context or exception context? That is, one (interrupts) happen for any task, but exceptions happen because of the software that is executed (like a breakpoint). Although you can have a kprobe trigger in an interrupt handler (where user access wouldn't make sense anyway). But there should be no problem with user access from an exception handler.
Can we just get rid of this might_sleep()? access_ok() doesn't sleep as far as I know.
On Sat, 23 Feb 2019 20:38:03 -0800 Andy Lutomirski luto@kernel.org wrote:
On Sat, Feb 23, 2019 at 4:44 PM Steven Rostedt rostedt@goodmis.org wrote:
On Sat, 23 Feb 2019 12:47:46 +0900 Masami Hiramatsu mhiramat@kernel.org wrote:
Since kprobes handler runs in IRQ context, we can not use access_ok() in it. (only on x86 + CONFIG_DEBUG_ATOMIC_SLEEP=y)
Is it really IRQ context or exception context? That is, one (interrupts) happen for any task, but exceptions happen because of the software that is executed (like a breakpoint). Although you can have a kprobe trigger in an interrupt handler (where user access wouldn't make sense anyway). But there should be no problem with user access from an exception handler.
Can we just get rid of this might_sleep()? access_ok() doesn't sleep as far as I know.
Hmm, which might_sleep() would you pointed? What I talked was a WARN_ON_ONCE(!in_task()) in access_ok() on x86 (only!), and in_task() just checks preempt_count.
I guess PeterZ assumed that access_ok() is used only with user space access APIs (e.g. copy_from_user) which can cause page-fault and locks mm (and might sleep :)), but now we are trying to use access_ok() with new functions which disables page-fault and just return -EFAULT.
Thank you,
On Sun, Feb 24, 2019 at 7:18 AM Masami Hiramatsu mhiramat@kernel.org wrote:
On Sat, 23 Feb 2019 20:38:03 -0800 Andy Lutomirski luto@kernel.org wrote:
Can we just get rid of this might_sleep()? access_ok() doesn't sleep as far as I know.
Hmm, which might_sleep() would you pointed? What I talked was a WARN_ON_ONCE(!in_task()) in access_ok() on x86 (only!), and in_task() just checks preempt_count.
So the in_task() check does kind of make sense. Using "access_ok()" outside of task context is certainly an odd thing, for several reasons. The main one being simply that outside of task context, the whole "which task" question is open, and you don't know if the task is the active one, and so it's not clear if whatever task you interrupt might have done "set_fs()" or not.
So PeterZ isn't wrong:
I guess PeterZ assumed that access_ok() is used only with user space access APIs (e.g. copy_from_user) which can cause page-fault and locks mm (and might sleep :)), but now we are trying to use access_ok() with new functions which disables page-fault and just return -EFAULT.
.. but in this case, if we do it all *within* code that saves and restores the user access flag with get_fs/set_fs, access_ok() would be ok and it doesn't have the above issue.
So access_ok() in _general_ is absolutely not safe to do from interrupts, but within the context of probing user memory from a tracing event it just happens to be ok.
It would be lovely to have a special macro for this, and keep the warning for the general case, but because this is a "every architecture needs to build their own" it's probably too painful.
PeterZ, do you remember the particular use case that triggered that commit 7c4788950ba5 ("x86/uaccess, sched/preempt: Verify access_ok() context")?
Linus
On Sun, 24 Feb 2019 09:26:45 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:
On Sun, Feb 24, 2019 at 7:18 AM Masami Hiramatsu mhiramat@kernel.org wrote:
On Sat, 23 Feb 2019 20:38:03 -0800 Andy Lutomirski luto@kernel.org wrote:
Can we just get rid of this might_sleep()? access_ok() doesn't sleep as far as I know.
Hmm, which might_sleep() would you pointed? What I talked was a WARN_ON_ONCE(!in_task()) in access_ok() on x86 (only!), and in_task() just checks preempt_count.
So the in_task() check does kind of make sense. Using "access_ok()" outside of task context is certainly an odd thing, for several reasons. The main one being simply that outside of task context, the whole "which task" question is open, and you don't know if the task is the active one, and so it's not clear if whatever task you interrupt might have done "set_fs()" or not.
Ah I got it. Usual case access_ok() in IRQ handler is strange.
So PeterZ isn't wrong:
I guess PeterZ assumed that access_ok() is used only with user space access APIs (e.g. copy_from_user) which can cause page-fault and locks mm (and might sleep :)), but now we are trying to use access_ok() with new functions which disables page-fault and just return -EFAULT.
.. but in this case, if we do it all *within* code that saves and restores the user access flag with get_fs/set_fs, access_ok() would be ok and it doesn't have the above issue.
So access_ok() in _general_ is absolutely not safe to do from interrupts, but within the context of probing user memory from a tracing event it just happens to be ok.
Hmm, but user can specify user-memory access from the tracing event which is located in interrupt handler. So I understand that it is safe only if we correctly setup access flag with get_fs/set_fs, is that correct?
It would be lovely to have a special macro for this, and keep the warning for the general case, but because this is a "every architecture needs to build their own" it's probably too painful.
Agreed.
PeterZ, do you remember the particular use case that triggered that commit 7c4788950ba5 ("x86/uaccess, sched/preempt: Verify access_ok() context")?
Linus
Thank you,
On Sun, Feb 24, 2019 at 6:40 PM Masami Hiramatsu mhiramat@kernel.org wrote:
On Sun, 24 Feb 2019 09:26:45 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:
On Sun, Feb 24, 2019 at 7:18 AM Masami Hiramatsu mhiramat@kernel.org wrote:
On Sat, 23 Feb 2019 20:38:03 -0800 Andy Lutomirski luto@kernel.org wrote:
Can we just get rid of this might_sleep()? access_ok() doesn't sleep as far as I know.
Hmm, which might_sleep() would you pointed? What I talked was a WARN_ON_ONCE(!in_task()) in access_ok() on x86 (only!), and in_task() just checks preempt_count.
So the in_task() check does kind of make sense. Using "access_ok()" outside of task context is certainly an odd thing, for several reasons. The main one being simply that outside of task context, the whole "which task" question is open, and you don't know if the task is the active one, and so it's not clear if whatever task you interrupt might have done "set_fs()" or not.
Ah I got it. Usual case access_ok() in IRQ handler is strange.
So PeterZ isn't wrong:
I guess PeterZ assumed that access_ok() is used only with user space access APIs (e.g. copy_from_user) which can cause page-fault and locks mm (and might sleep :)), but now we are trying to use access_ok() with new functions which disables page-fault and just return -EFAULT.
.. but in this case, if we do it all *within* code that saves and restores the user access flag with get_fs/set_fs, access_ok() would be ok and it doesn't have the above issue.
So access_ok() in _general_ is absolutely not safe to do from interrupts, but within the context of probing user memory from a tracing event it just happens to be ok.
Hmm, but user can specify user-memory access from the tracing event which is located in interrupt handler. So I understand that it is safe only if we correctly setup access flag with get_fs/set_fs, is that correct?
It would be lovely to have a special macro for this, and keep the warning for the general case, but because this is a "every architecture needs to build their own" it's probably too painful.
Agreed.
This should probably go with whatever effort makes nmi_uaccess_ok() available on all architectures. That being said, how about just making copy_from_user_nmi() work on all architectures, even if it just fails unconditionally on some of them?
On Sun, 24 Feb 2019 20:49:45 -0800 Andy Lutomirski luto@kernel.org wrote:
On Sun, Feb 24, 2019 at 6:40 PM Masami Hiramatsu mhiramat@kernel.org wrote:
On Sun, 24 Feb 2019 09:26:45 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:
On Sun, Feb 24, 2019 at 7:18 AM Masami Hiramatsu mhiramat@kernel.org wrote:
On Sat, 23 Feb 2019 20:38:03 -0800 Andy Lutomirski luto@kernel.org wrote:
Can we just get rid of this might_sleep()? access_ok() doesn't sleep as far as I know.
Hmm, which might_sleep() would you pointed? What I talked was a WARN_ON_ONCE(!in_task()) in access_ok() on x86 (only!), and in_task() just checks preempt_count.
So the in_task() check does kind of make sense. Using "access_ok()" outside of task context is certainly an odd thing, for several reasons. The main one being simply that outside of task context, the whole "which task" question is open, and you don't know if the task is the active one, and so it's not clear if whatever task you interrupt might have done "set_fs()" or not.
Ah I got it. Usual case access_ok() in IRQ handler is strange.
So PeterZ isn't wrong:
I guess PeterZ assumed that access_ok() is used only with user space access APIs (e.g. copy_from_user) which can cause page-fault and locks mm (and might sleep :)), but now we are trying to use access_ok() with new functions which disables page-fault and just return -EFAULT.
.. but in this case, if we do it all *within* code that saves and restores the user access flag with get_fs/set_fs, access_ok() would be ok and it doesn't have the above issue.
So access_ok() in _general_ is absolutely not safe to do from interrupts, but within the context of probing user memory from a tracing event it just happens to be ok.
Hmm, but user can specify user-memory access from the tracing event which is located in interrupt handler. So I understand that it is safe only if we correctly setup access flag with get_fs/set_fs, is that correct?
It would be lovely to have a special macro for this, and keep the warning for the general case, but because this is a "every architecture needs to build their own" it's probably too painful.
Agreed.
This should probably go with whatever effort makes nmi_uaccess_ok() available on all architectures. That being said, how about just making copy_from_user_nmi() work on all architectures, even if it just fails unconditionally on some of them?
I think even if we have copy_from_user_nmi(), we need something like nmi_uaccess_ok() because without it we can not correctly use __copy_from_user_inatomic()...
Thank you,
On Mon, 25 Feb 2019 17:09:45 +0900 Masami Hiramatsu mhiramat@kernel.org wrote:
This should probably go with whatever effort makes nmi_uaccess_ok() available on all architectures. That being said, how about just making copy_from_user_nmi() work on all architectures, even if it just fails unconditionally on some of them?
I think even if we have copy_from_user_nmi(), we need something like nmi_uaccess_ok() because without it we can not correctly use __copy_from_user_inatomic()...
But wouldn't that just be part of the implementation of "copy_from_user_nmi()" as being in an NMI just assumes being inatomic?
-- Steve
On Mon, 25 Feb 2019 11:40:18 -0500 Steven Rostedt rostedt@goodmis.org wrote:
On Mon, 25 Feb 2019 17:09:45 +0900 Masami Hiramatsu mhiramat@kernel.org wrote:
This should probably go with whatever effort makes nmi_uaccess_ok() available on all architectures. That being said, how about just making copy_from_user_nmi() work on all architectures, even if it just fails unconditionally on some of them?
I think even if we have copy_from_user_nmi(), we need something like nmi_uaccess_ok() because without it we can not correctly use __copy_from_user_inatomic()...
But wouldn't that just be part of the implementation of "copy_from_user_nmi()" as being in an NMI just assumes being inatomic?
Yes for copy_from_user_nmi(). But there are some other fundamental functions, like __get_user(). And when we optimize the loop in strncpy/strnlen from user in atomic, I think one nmi_access_ok() at entry is enough.
Thank you,
On Sun, Feb 24, 2019 at 09:26:45AM -0800, Linus Torvalds wrote:
PeterZ, do you remember the particular use case that triggered that commit 7c4788950ba5 ("x86/uaccess, sched/preempt: Verify access_ok() context")?
This one, if I'm not mistaken.
---
commit ae31fe51a3cceaa0cabdb3058f69669ecb47f12e Author: Johannes Weiner hannes@cmpxchg.org Date: Tue Nov 22 10:57:42 2016 +0100
perf/x86: Restore TASK_SIZE check on frame pointer
The following commit:
75925e1ad7f5 ("perf/x86: Optimize stack walk user accesses")
... switched from copy_from_user_nmi() to __copy_from_user_nmi() with a manual access_ok() check.
Unfortunately, copy_from_user_nmi() does an explicit check against TASK_SIZE, whereas the access_ok() uses whatever the current address limit of the task is.
We are getting NMIs when __probe_kernel_read() has switched to KERNEL_DS, and then see vmalloc faults when we access what looks like pointers into vmalloc space:
[] WARNING: CPU: 3 PID: 3685731 at arch/x86/mm/fault.c:435 vmalloc_fault+0x289/0x290 [] CPU: 3 PID: 3685731 Comm: sh Tainted: G W 4.6.0-5_fbk1_223_gdbf0f40 #1 [] Call Trace: [] <NMI> [<ffffffff814717d1>] dump_stack+0x4d/0x6c [] [<ffffffff81076e43>] __warn+0xd3/0xf0 [] [<ffffffff81076f2d>] warn_slowpath_null+0x1d/0x20 [] [<ffffffff8104a899>] vmalloc_fault+0x289/0x290 [] [<ffffffff8104b5a0>] __do_page_fault+0x330/0x490 [] [<ffffffff8104b70c>] do_page_fault+0xc/0x10 [] [<ffffffff81794e82>] page_fault+0x22/0x30 [] [<ffffffff81006280>] ? perf_callchain_user+0x100/0x2a0 [] [<ffffffff8115124f>] get_perf_callchain+0x17f/0x190 [] [<ffffffff811512c7>] perf_callchain+0x67/0x80 [] [<ffffffff8114e750>] perf_prepare_sample+0x2a0/0x370 [] [<ffffffff8114e840>] perf_event_output+0x20/0x60 [] [<ffffffff8114aee7>] ? perf_event_update_userpage+0xc7/0x130 [] [<ffffffff8114ea01>] __perf_event_overflow+0x181/0x1d0 [] [<ffffffff8114f484>] perf_event_overflow+0x14/0x20 [] [<ffffffff8100a6e3>] intel_pmu_handle_irq+0x1d3/0x490 [] [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10 [] [<ffffffff81197191>] ? vunmap_page_range+0x1a1/0x2f0 [] [<ffffffff811972f1>] ? unmap_kernel_range_noflush+0x11/0x20 [] [<ffffffff814f2056>] ? ghes_copy_tofrom_phys+0x116/0x1f0 [] [<ffffffff81040d1d>] ? x2apic_send_IPI_self+0x1d/0x20 [] [<ffffffff8100411d>] perf_event_nmi_handler+0x2d/0x50 [] [<ffffffff8101ea31>] nmi_handle+0x61/0x110 [] [<ffffffff8101ef94>] default_do_nmi+0x44/0x110 [] [<ffffffff8101f13b>] do_nmi+0xdb/0x150 [] [<ffffffff81795187>] end_repeat_nmi+0x1a/0x1e [] [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10 [] [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10 [] [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10 [] <<EOE>> <IRQ> [<ffffffff8115d05e>] ? __probe_kernel_read+0x3e/0xa0
Fix this by moving the valid_user_frame() check to before the uaccess that loads the return address and the pointer to the next frame.
Signed-off-by: Johannes Weiner hannes@cmpxchg.org Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Jiri Olsa jolsa@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Stephane Eranian eranian@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Cc: linux-kernel@vger.kernel.org Fixes: 75925e1ad7f5 ("perf/x86: Optimize stack walk user accesses") Signed-off-by: Ingo Molnar mingo@kernel.org
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index d31735f37ed7..9d4bf3ab049e 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -2352,7 +2352,7 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent frame.next_frame = 0; frame.return_address = 0;
- if (!access_ok(VERIFY_READ, fp, 8)) + if (!valid_user_frame(fp, sizeof(frame))) break;
bytes = __copy_from_user_nmi(&frame.next_frame, fp, 4); @@ -2362,9 +2362,6 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent if (bytes != 0) break;
- if (!valid_user_frame(fp, sizeof(frame))) - break; - perf_callchain_store(entry, cs_base + frame.return_address); fp = compat_ptr(ss_base + frame.next_frame); } @@ -2413,7 +2410,7 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs frame.next_frame = NULL; frame.return_address = 0;
- if (!access_ok(VERIFY_READ, fp, sizeof(*fp) * 2)) + if (!valid_user_frame(fp, sizeof(frame))) break;
bytes = __copy_from_user_nmi(&frame.next_frame, fp, sizeof(*fp)); @@ -2423,9 +2420,6 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs if (bytes != 0) break;
- if (!valid_user_frame(fp, sizeof(frame))) - break; - perf_callchain_store(entry, frame.return_address); fp = (void __user *)frame.next_frame; }
On Mon, Feb 25, 2019 at 09:33:09AM +0100, Peter Zijlstra wrote:
On Sun, Feb 24, 2019 at 09:26:45AM -0800, Linus Torvalds wrote:
PeterZ, do you remember the particular use case that triggered that commit 7c4788950ba5 ("x86/uaccess, sched/preempt: Verify access_ok() context")?
This one, if I'm not mistaken.
commit ae31fe51a3cceaa0cabdb3058f69669ecb47f12e Author: Johannes Weiner hannes@cmpxchg.org Date: Tue Nov 22 10:57:42 2016 +0100
perf/x86: Restore TASK_SIZE check on frame pointer
The following commit: 75925e1ad7f5 ("perf/x86: Optimize stack walk user accesses") ... switched from copy_from_user_nmi() to __copy_from_user_nmi() with a manual access_ok() check. Unfortunately, copy_from_user_nmi() does an explicit check against TASK_SIZE, whereas the access_ok() uses whatever the current address limit of the task is. We are getting NMIs when __probe_kernel_read() has switched to KERNEL_DS, and then see vmalloc faults when we access what looks like pointers into vmalloc space:
Also note that this was before we did:
commit 88b0193d9418c00340e45e0a913a0813bc6c8c96 Author: Will Deacon will.deacon@arm.com Date: Tue May 9 18:00:04 2017 +0100
perf/callchain: Force USER_DS when invoking perf_callchain_user()
Perf can generate and record a user callchain in response to a synchronous request, such as a tracepoint firing. If this happens under set_fs(KERNEL_DS), then we can end up walking the user stack (and dereferencing/saving whatever we find there) without the protections usually afforded by checks such as access_ok.
Rather than play whack-a-mole with each architecture's stack unwinding implementation, fix the root of the problem by ensuring that we force USER_DS when invoking perf_callchain_user from the perf core.
Reported-by: Al Viro viro@ZenIV.linux.org.uk Signed-off-by: Will Deacon will.deacon@arm.com Acked-by: Peter Zijlstra peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Jiri Olsa jolsa@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Thomas Gleixner tglx@linutronix.de Signed-off-by: Ingo Molnar mingo@kernel.org
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c index c04917cad1bf..1b2be63c8528 100644 --- a/kernel/events/callchain.c +++ b/kernel/events/callchain.c @@ -229,12 +229,18 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user, }
if (regs) { + mm_segment_t fs; + if (crosstask) goto exit_put;
if (add_mark) perf_callchain_store_context(&ctx, PERF_CONTEXT_USER); + + fs = get_fs(); + set_fs(USER_DS); perf_callchain_user(&ctx, regs); + set_fs(fs); } }
On Sat, Feb 23, 2019 at 8:38 PM Andy Lutomirski luto@kernel.org wrote:
On Sat, Feb 23, 2019 at 4:44 PM Steven Rostedt rostedt@goodmis.org wrote:
On Sat, 23 Feb 2019 12:47:46 +0900 Masami Hiramatsu mhiramat@kernel.org wrote:
Since kprobes handler runs in IRQ context, we can not use access_ok() in it. (only on x86 + CONFIG_DEBUG_ATOMIC_SLEEP=y)
Is it really IRQ context or exception context? That is, one (interrupts) happen for any task, but exceptions happen because of the software that is executed (like a breakpoint). Although you can have a kprobe trigger in an interrupt handler (where user access wouldn't make sense anyway). But there should be no problem with user access from an exception handler.
Can we just get rid of this might_sleep()? access_ok() doesn't sleep as far as I know.
We do need to be aware of the userfaultfd case of getting held by userspace in the middle of a copy_*_user()... that's a whole other problem.
On Mon, Feb 25, 2019 at 8:48 AM Kees Cook keescook@chromium.org wrote:
On Sat, Feb 23, 2019 at 8:38 PM Andy Lutomirski luto@kernel.org wrote:
On Sat, Feb 23, 2019 at 4:44 PM Steven Rostedt rostedt@goodmis.org wrote:
On Sat, 23 Feb 2019 12:47:46 +0900 Masami Hiramatsu mhiramat@kernel.org wrote:
Since kprobes handler runs in IRQ context, we can not use access_ok() in it. (only on x86 + CONFIG_DEBUG_ATOMIC_SLEEP=y)
Is it really IRQ context or exception context? That is, one (interrupts) happen for any task, but exceptions happen because of the software that is executed (like a breakpoint). Although you can have a kprobe trigger in an interrupt handler (where user access wouldn't make sense anyway). But there should be no problem with user access from an exception handler.
Can we just get rid of this might_sleep()? access_ok() doesn't sleep as far as I know.
We do need to be aware of the userfaultfd case of getting held by userspace in the middle of a copy_*_user()... that's a whole other problem.
I sure hope that pagefault_disable() already takes care of this. Otherwise we have major problems already.
On Mon, Feb 25, 2019 at 8:58 AM Andy Lutomirski luto@kernel.org wrote:
I sure hope that pagefault_disable() already takes care of this. Otherwise we have major problems already.
Okay, cool. I missed that bit. :)
On Fri, 15 Feb 2019 12:47:13 -0500 Steven Rostedt rostedt@goodmis.org wrote:
From: Changbin Du changbin.du@gmail.com
The userspace can ask kprobe to intercept strings at any memory address, including invalid kernel address. In this case, fetch_store_strlen() would crash since it uses general usercopy function, and user access functions are no longer allowed to access kernel memory.
For example, we can crash the kernel by doing something as below:
$ sudo kprobe 'p:do_sys_open +0(+0(%si)):string'
[ 103.620391] BUG: GPF in non-whitelisted uaccess (non-canonical address?) [ 103.622104] general protection fault: 0000 [#1] SMP PTI [ 103.623424] CPU: 10 PID: 1046 Comm: cat Not tainted 5.0.0-rc3-00130-gd73aba1-dirty #96 [ 103.625321] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-2-g628b2e6-dirty-20190104_103505-linux 04/01/2014 [ 103.628284] RIP: 0010:process_fetch_insn+0x1ab/0x4b0 [ 103.629518] Code: 10 83 80 28 2e 00 00 01 31 d2 31 ff 48 8b 74 24 28 eb 0c 81 fa ff 0f 00 00 7f 1c 85 c0 75 18 66 66 90 0f ae e8 48 63 ca 89 f8 <8a> 0c 31 66 66 90 83 c2 01 84 c9 75 dc 89 54 24 34 89 44 24 28 48 [ 103.634032] RSP: 0018:ffff88845eb37ce0 EFLAGS: 00010246 [ 103.635312] RAX: 0000000000000000 RBX: ffff888456c4e5a8 RCX: 0000000000000000 [ 103.637057] RDX: 0000000000000000 RSI: 2e646c2f6374652f RDI: 0000000000000000 [ 103.638795] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [ 103.640556] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 [ 103.642297] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 103.644040] FS: 0000000000000000(0000) GS:ffff88846f000000(0000) knlGS:0000000000000000 [ 103.646019] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 103.647436] CR2: 00007ffc79758038 CR3: 0000000463360006 CR4: 0000000000020ee0 [ 103.649147] Call Trace: [ 103.649781] ? sched_clock_cpu+0xc/0xa0 [ 103.650747] ? do_sys_open+0x5/0x220 [ 103.651635] kprobe_trace_func+0x303/0x380 [ 103.652645] ? do_sys_open+0x5/0x220 [ 103.653528] kprobe_dispatcher+0x45/0x50 [ 103.654682] ? do_sys_open+0x1/0x220 [ 103.655875] kprobe_ftrace_handler+0x90/0xf0 [ 103.657282] ftrace_ops_assist_func+0x54/0xf0 [ 103.658564] ? __call_rcu+0x1dc/0x280 [ 103.659482] 0xffffffffc00000bf [ 103.660384] ? __ia32_sys_open+0x20/0x20 [ 103.661682] ? do_sys_open+0x1/0x220 [ 103.662863] do_sys_open+0x5/0x220 [ 103.663988] do_syscall_64+0x60/0x210 [ 103.665201] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 103.666862] RIP: 0033:0x7fc22fadccdd [ 103.668034] Code: 48 89 54 24 e0 41 83 e2 40 75 32 89 f0 25 00 00 41 00 3d 00 00 41 00 74 24 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 77 33 f3 c3 66 0f 1f 84 00 00 00 00 00 48 8d 44 [ 103.674029] RSP: 002b:00007ffc7972c3a8 EFLAGS: 00000287 ORIG_RAX: 0000000000000101 [ 103.676512] RAX: ffffffffffffffda RBX: 0000562f86147a21 RCX: 00007fc22fadccdd [ 103.678853] RDX: 0000000000080000 RSI: 00007fc22fae1428 RDI: 00000000ffffff9c [ 103.681151] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000 [ 103.683489] R10: 0000000000000000 R11: 0000000000000287 R12: 00007fc22fce90a8 [ 103.685774] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000 [ 103.688056] Modules linked in: [ 103.689131] ---[ end trace 43792035c28984a1 ]---
This can be fixed by using probe_mem_read() instead, as it can handle faulting kernel memory addresses, which kprobes can legitimately do.
Basically OK to me. Could you use probe_kernel_read() in this context, since probe_mem_read() is a wrapper function for template code.
With that change,
Acked-by: Masami Hiramatsu mhiramat@kernel.org
And for the long term, I need to find more efficient (or smarter) way to do it, like strnlen_user() does.
Thank you,
Link: http://lkml.kernel.org/r/20190125151051.7381-1-changbin.du@gmail.com
Cc: stable@vger.kernel.org Fixes: 9da3f2b7405 ("x86/fault: BUG() when uaccess helpers fault on kernel addresses") Signed-off-by: Changbin Du changbin.du@gmail.com Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org
kernel/trace/trace_kprobe.c | 10 +--------- 1 file changed, 1 insertion(+), 9 deletions(-)
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index d5fb09ebba8b..9eaf07f99212 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -861,22 +861,14 @@ static const struct file_operations kprobe_profile_ops = { static nokprobe_inline int fetch_store_strlen(unsigned long addr) {
- mm_segment_t old_fs; int ret, len = 0; u8 c;
- old_fs = get_fs();
- set_fs(KERNEL_DS);
- pagefault_disable();
- do {
ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1);
len++; } while (c && ret == 0 && len < MAX_STRING_SIZE);ret = probe_mem_read(&c, (u8 *)addr + len, 1);
- pagefault_enable();
- set_fs(old_fs);
- return (ret < 0) ? ret : len;
} -- 2.20.1
On Thu, 21 Feb 2019 16:52:52 +0900 Masami Hiramatsu mhiramat@kernel.org wrote:
Basically OK to me. Could you use probe_kernel_read() in this context, since probe_mem_read() is a wrapper function for template code.
With that change,
Acked-by: Masami Hiramatsu mhiramat@kernel.org
This already hit Linus's tree. I was able to reproduce the crash, so I streamlined it. I should have still pushed more for your ack first. Sorry about that.
For some reason, I thought the change was in the generic probe code, and accepted the probe_mem_read(). Anyway, did you want to send a patch to change it to probe_kernel_read(), for the merge window?
And for the long term, I need to find more efficient (or smarter) way to do it, like strnlen_user() does.
Agreed.
Thanks,
-- Steve
On Thu, 21 Feb 2019 09:36:25 -0500 Steven Rostedt rostedt@goodmis.org wrote:
On Thu, 21 Feb 2019 16:52:52 +0900 Masami Hiramatsu mhiramat@kernel.org wrote:
Basically OK to me. Could you use probe_kernel_read() in this context, since probe_mem_read() is a wrapper function for template code.
With that change,
Acked-by: Masami Hiramatsu mhiramat@kernel.org
This already hit Linus's tree. I was able to reproduce the crash, so I streamlined it. I should have still pushed more for your ack first. Sorry about that.
Oh, never mind. That seems urgent issue for kprobe event. Thank you very much for fixing it!
For some reason, I thought the change was in the generic probe code, and accepted the probe_mem_read(). Anyway, did you want to send a patch to change it to probe_kernel_read(), for the merge window?
No problem.
And for the long term, I need to find more efficient (or smarter) way to do it, like strnlen_user() does.
Agreed.
Thank you,
Hi Steve,
On Fri, 22 Feb 2019 00:58:12 +0900 Masami Hiramatsu mhiramat@kernel.org wrote:
On Thu, 21 Feb 2019 09:36:25 -0500 Steven Rostedt rostedt@goodmis.org wrote:
On Thu, 21 Feb 2019 16:52:52 +0900 Masami Hiramatsu mhiramat@kernel.org wrote:
Basically OK to me. Could you use probe_kernel_read() in this context, since probe_mem_read() is a wrapper function for template code.
With that change,
Acked-by: Masami Hiramatsu mhiramat@kernel.org
This already hit Linus's tree. I was able to reproduce the crash, so I streamlined it. I should have still pushed more for your ack first. Sorry about that.
Oh, never mind. That seems urgent issue for kprobe event. Thank you very much for fixing it!
For some reason, I thought the change was in the generic probe code, and accepted the probe_mem_read(). Anyway, did you want to send a patch to change it to probe_kernel_read(), for the merge window?
No problem.
Oops, I mean No, not yet. but it is a simple and cosmetic patch like below. Feel free to merge it to ftrace/core.
--------- tracing/kprobes: Use probe_kernel_read instead of probe_mem_read
From: Masami Hiramatsu mhiramat@kernel.org
Use probe_kernel_read() instead of probe_mem_read() because probe_mem_read() is a kind of wrapper for switching memory read function between uprobes and kprobes.
Signed-off-by: Masami Hiramatsu mhiramat@kernel.org --- kernel/trace/trace_kprobe.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 9eaf07f99212..99592c27465e 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -865,7 +865,7 @@ fetch_store_strlen(unsigned long addr) u8 c;
do { - ret = probe_mem_read(&c, (u8 *)addr + len, 1); + ret = probe_kernel_read(&c, (u8 *)addr + len, 1); len++; } while (c && ret == 0 && len < MAX_STRING_SIZE);
[ Linus, you may want to read this ]
On Fri, 22 Feb 2019 01:16:43 +0900 Masami Hiramatsu mhiramat@kernel.org wrote:
Yep, I'll take this patch. Hmm, my for-next isn't based on my urgent branch, and this would go on top of urgent. I can do one of three things to push this to Linus for the merge window:
1) Create a separate branch to add updates for on my urgent branch. This would have me do two pull requests to Linus.
2) Cherry pick the urgent commit that this updates. But that has a stable tag, which could confuse things as it would create the same commit twice, both with stable tags (I could take the stable tag off of the cherry pick though).
3) Merge the urgent branch into my for-next branch at that commit, with a message to why I'm doing this. The urgent branch is based off of a older tag in Linus's tree, so that would only pull in tracing changes (my stuff), and wont pull in anyone else's work.
I may go with option 3, because I believe this may be one of the cases that is allowed to have merges in pull requests.
a) Only my stuff got merged b) Have a dependency on something that already went into Linus's tree (prevent me from having to rebase already tested work). c) Have it documented in the merge commit to why its being merged
Unless Linus feels otherwise (use one of the other options?), I may go ahead and do that.
-- Steve
tracing/kprobes: Use probe_kernel_read instead of probe_mem_read
From: Masami Hiramatsu mhiramat@kernel.org
Use probe_kernel_read() instead of probe_mem_read() because probe_mem_read() is a kind of wrapper for switching memory read function between uprobes and kprobes.
Signed-off-by: Masami Hiramatsu mhiramat@kernel.org
kernel/trace/trace_kprobe.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 9eaf07f99212..99592c27465e 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -865,7 +865,7 @@ fetch_store_strlen(unsigned long addr) u8 c; do {
ret = probe_mem_read(&c, (u8 *)addr + len, 1);
len++; } while (c && ret == 0 && len < MAX_STRING_SIZE);ret = probe_kernel_read(&c, (u8 *)addr + len, 1);
Hi,
On Thu, 21 Feb 2019 16:52:52 +0900 Masami Hiramatsu mhiramat@kernel.org wrote:
kernel/trace/trace_kprobe.c | 10 +--------- 1 file changed, 1 insertion(+), 9 deletions(-)
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index d5fb09ebba8b..9eaf07f99212 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -861,22 +861,14 @@ static const struct file_operations kprobe_profile_ops = { static nokprobe_inline int fetch_store_strlen(unsigned long addr) {
- mm_segment_t old_fs; int ret, len = 0; u8 c;
- old_fs = get_fs();
- set_fs(KERNEL_DS);
- pagefault_disable();
BTW, compared with probe_kernel_read() implementation, this function lacks current->kernel_uaccess_faults_ok modification here.
I would like to know whether we can avoid this issue if we tweak this flag.
Thank you,
do {
ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1);
len++; } while (c && ret == 0 && len < MAX_STRING_SIZE);ret = probe_mem_read(&c, (u8 *)addr + len, 1);
- pagefault_enable();
- set_fs(old_fs);
- return (ret < 0) ? ret : len;
} -- 2.20.1
-- Masami Hiramatsu mhiramat@kernel.org
linux-stable-mirror@lists.linaro.org