Hi Guys,
This second version of patch to flush icache and dcache after uprobes xol write to make written instruction available in icache. Please see [1] for initial discussion.
This patch follows Russell's suggestion, and function that does cache flush after xol slot instruction write is shared/reused with similar one implemented already for ptrace code.
In order to reuse common implementation but to avoid vma use by xol_get_insn_slot I split flush_ptrace_access into two functions. Where first part retrieves all required conditions from vma and places them into flags variable and then calls second function which is common code.
Also I had to change xol_get_insn_slot function to map page into kernel explicitly within function without use of copy_to_page helper because ARM cache flush code need both kernel address through which instruction write happens and virtual address of user-land process where instruction will end up. I hope this call back is universal enough so other CPU could implement their cache invalidation/sync after uprobes xol instruction write logic based on provided parameters.
I've tested it on Arndale board with my SystemTap test case that had cache problem before. Disassemble of flush_uprobe_xol_access in case of Arndale shows that compiler does good job and optimizes out all flags check effectively leaving on this cpu call to flush_icache_alias or call to v7_coherent_user_range (__cpuc_coherent_kern_range).
Also tested basic user-level debugging.
Wondering on what ARM boards/cpus could we test cache_is_vivt() and cache_is_vipt_aliasing cases ...
Just to summarize, please note on [1] there were couple other suggestions:
Oleg suggested to use flush_icache_user_range but Russell argument was that meaning of the function is lost and on ARM it is not implemented in such way that it could address the issue anyway. Please see [2] for details. Note it would has vma problem use or not, that should be hacked.
Dave Martin suggested to use flush_icache_range, which is effectively better way to call __cpuc_coherent_[kern|user]_range(s,e), that was originally suggested. But Russell explained that it won't be enough in case of user-land process pages and variety of cache types have to be covered. Note for kernel pages it would be OK and it is used in multiple places like kprobes, modules, etc.
Thanks, Victor
[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-April/245595.html http://lists.infradead.org/pipermail/linux-arm-kernel/2014-April/245427.html
[2] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-April/245605.html
Victor Kamensky (1): ARM: uprobes need icache flush after xol write
arch/arm/include/asm/cacheflush.h | 2 ++ arch/arm/kernel/uprobes.c | 6 ++++++ arch/arm/mm/flush.c | 41 +++++++++++++++++++++++++++++++++------ include/linux/uprobes.h | 3 +++ kernel/events/uprobes.c | 33 +++++++++++++++++++++++++------ 5 files changed, 73 insertions(+), 12 deletions(-)
After instruction write into xol area, on ARM V7 architecture code need to flush dcache and icache to sync them up for given set of addresses. Having just 'flush_dcache_page(page)' call is not enough - it is possible to have stale instruction sitting in icache for given xol area slot address.
Introduce arch_uprobe_flush_xol_access weak function that by default calls 'flush_dcache_page(page)' and on ARM define new one that calls flush_uprobe_xol_access function.
flush_uprobe_xol_access function shares/reuses implementation with/of flush_ptrace_access function and takes care of writing instruction to user land address space on given variety of different cache types on ARM CPUs. Because flush_uprobe_xol_access does not have vma around flush_ptrace_access was split into two parts. First that retrieves set of condition from vma and common that receives those conditions as flags.
Because on ARM cache flush function need kernel address through which instruction write happened, changed xol_get_insn_slot to explicitly map page and do memcpy rather than use copy_to_page helper. In this case xol_get_insn_slot knows kernel address and passes it to arch_uprobe_flush_xol_access function.
Signed-off-by: Victor Kamensky victor.kamensky@linaro.org --- arch/arm/include/asm/cacheflush.h | 2 ++ arch/arm/kernel/uprobes.c | 6 ++++++ arch/arm/mm/flush.c | 41 +++++++++++++++++++++++++++++++++------ include/linux/uprobes.h | 3 +++ kernel/events/uprobes.c | 33 +++++++++++++++++++++++++------ 5 files changed, 73 insertions(+), 12 deletions(-)
diff --git a/arch/arm/include/asm/cacheflush.h b/arch/arm/include/asm/cacheflush.h index 8b8b616..e02712a 100644 --- a/arch/arm/include/asm/cacheflush.h +++ b/arch/arm/include/asm/cacheflush.h @@ -487,4 +487,6 @@ int set_memory_rw(unsigned long addr, int numpages); int set_memory_x(unsigned long addr, int numpages); int set_memory_nx(unsigned long addr, int numpages);
+void flush_uprobe_xol_access(struct page *page, unsigned long uaddr, + void *kaddr, unsigned long len); #endif diff --git a/arch/arm/kernel/uprobes.c b/arch/arm/kernel/uprobes.c index f9bacee..a0339a6 100644 --- a/arch/arm/kernel/uprobes.c +++ b/arch/arm/kernel/uprobes.c @@ -113,6 +113,12 @@ int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, return 0; }
+void arch_uprobe_flush_xol_access(struct page *page, unsigned long vaddr, + void *kaddr, unsigned long len) +{ + flush_uprobe_xol_access(page, vaddr, kaddr, len); +} + int arch_uprobe_pre_xol(struct arch_uprobe *auprobe, struct pt_regs *regs) { struct uprobe_task *utask = current->utask; diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c index 3387e60..69a0bd08 100644 --- a/arch/arm/mm/flush.c +++ b/arch/arm/mm/flush.c @@ -104,17 +104,21 @@ void flush_cache_page(struct vm_area_struct *vma, unsigned long user_addr, unsig #define flush_icache_alias(pfn,vaddr,len) do { } while (0) #endif
+#define FLAG_UA_IS_EXEC 1 +#define FLAG_UA_CORE_IN_MM 2 +#define FLAG_UA_BROADCAST 4 + static void flush_ptrace_access_other(void *args) { __flush_icache_all(); }
-static -void flush_ptrace_access(struct vm_area_struct *vma, struct page *page, - unsigned long uaddr, void *kaddr, unsigned long len) +static inline +void __flush_ptrace_access(struct page *page, unsigned long uaddr, void *kaddr, + unsigned long len, unsigned int flags) { if (cache_is_vivt()) { - if (cpumask_test_cpu(smp_processor_id(), mm_cpumask(vma->vm_mm))) { + if (flags & FLAG_UA_CORE_IN_MM) { unsigned long addr = (unsigned long)kaddr; __cpuc_coherent_kern_range(addr, addr + len); } @@ -128,18 +132,43 @@ void flush_ptrace_access(struct vm_area_struct *vma, struct page *page, }
/* VIPT non-aliasing D-cache */ - if (vma->vm_flags & VM_EXEC) { + if (flags & FLAG_UA_IS_EXEC) { unsigned long addr = (unsigned long)kaddr; if (icache_is_vipt_aliasing()) flush_icache_alias(page_to_pfn(page), uaddr, len); else __cpuc_coherent_kern_range(addr, addr + len); - if (cache_ops_need_broadcast()) + if (flags & FLAG_UA_BROADCAST) smp_call_function(flush_ptrace_access_other, NULL, 1); } }
+static +void flush_ptrace_access(struct vm_area_struct *vma, struct page *page, + unsigned long uaddr, void *kaddr, unsigned long len) +{ + unsigned int flags = 0; + if (cpumask_test_cpu(smp_processor_id(), mm_cpumask(vma->vm_mm))) { + flags |= FLAG_UA_CORE_IN_MM; + } + if (vma->vm_flags & VM_EXEC) { + flags |= FLAG_UA_IS_EXEC; + } + if (cache_ops_need_broadcast()) { + flags |= FLAG_UA_BROADCAST; + } + __flush_ptrace_access(page, uaddr, kaddr, len, flags); +} + +void flush_uprobe_xol_access(struct page *page, unsigned long uaddr, + void *kaddr, unsigned long len) +{ + unsigned int flags = FLAG_UA_CORE_IN_MM|FLAG_UA_IS_EXEC; + + __flush_ptrace_access(page, uaddr, kaddr, len, flags); +} + /* * Copy user data from/to a page which is mapped into a different * processes address space. Really, we want to allow our "user diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index edff2b9..534e083 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -32,6 +32,7 @@ struct vm_area_struct; struct mm_struct; struct inode; struct notifier_block; +struct page;
#define UPROBE_HANDLER_REMOVE 1 #define UPROBE_HANDLER_MASK 1 @@ -127,6 +128,8 @@ extern int arch_uprobe_exception_notify(struct notifier_block *self, unsigned l extern void arch_uprobe_abort_xol(struct arch_uprobe *aup, struct pt_regs *regs); extern unsigned long arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct pt_regs *regs); extern bool __weak arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs); +extern void __weak arch_uprobe_flush_xol_access(struct page *page, unsigned long vaddr, + void *kaddr, unsigned long len); #else /* !CONFIG_UPROBES */ struct uprobes_state { }; diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 04709b6..b9142d5 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1287,6 +1287,7 @@ static unsigned long xol_get_insn_slot(struct uprobe *uprobe) { struct xol_area *area; unsigned long xol_vaddr; + void *xol_page_kaddr;
area = get_xol_area(); if (!area) @@ -1296,14 +1297,22 @@ static unsigned long xol_get_insn_slot(struct uprobe *uprobe) if (unlikely(!xol_vaddr)) return 0;
- /* Initialize the slot */ - copy_to_page(area->page, xol_vaddr, - &uprobe->arch.ixol, sizeof(uprobe->arch.ixol)); /* - * We probably need flush_icache_user_range() but it needs vma. - * This should work on supported architectures too. + * We don't use copy_to_page here because we need kernel page + * addr to invalidate caches correctly */ - flush_dcache_page(area->page); + xol_page_kaddr = kmap_atomic(area->page); + + /* Initialize the slot */ + memcpy(xol_page_kaddr + (xol_vaddr & ~PAGE_MASK), + &uprobe->arch.ixol, + sizeof(uprobe->arch.ixol)); + + arch_uprobe_flush_xol_access(area->page, xol_vaddr, + xol_page_kaddr + (xol_vaddr & ~PAGE_MASK), + sizeof(uprobe->arch.ixol)); + + kunmap_atomic(xol_page_kaddr);
return xol_vaddr; } @@ -1346,6 +1355,18 @@ static void xol_free_insn_slot(struct task_struct *tsk) } }
+void __weak arch_uprobe_flush_xol_access(struct page *page, unsigned long vaddr, + void *kaddr, unsigned long len) +{ + /* + * We probably need flush_icache_user_range() but it needs vma. + * This should work on most of architectures by default. If + * architecture needs to do something different it can define + * its own version of the function. + */ + flush_dcache_page(page); +} + /** * uprobe_get_swbp_addr - compute address of swbp given post-swbp regs * @regs: Reflects the saved state of the task after it has hit a breakpoint
On 04/09/14 01:58, Victor Kamensky wrote:
After instruction write into xol area, on ARM V7 architecture code need to flush dcache and icache to sync them up for given set of addresses. Having just 'flush_dcache_page(page)' call is not enough - it is possible to have stale instruction sitting in icache for given xol area slot address.
Introduce arch_uprobe_flush_xol_access weak function that by default calls 'flush_dcache_page(page)' and on ARM define new one that calls flush_uprobe_xol_access function.
flush_uprobe_xol_access function shares/reuses implementation with/of flush_ptrace_access function and takes care of writing instruction to user land address space on given variety of different cache types on ARM CPUs. Because flush_uprobe_xol_access does not have vma around flush_ptrace_access was split into two parts. First that retrieves set of condition from vma and common that receives those conditions as flags.
Because on ARM cache flush function need kernel address through which instruction write happened, changed xol_get_insn_slot to explicitly map page and do memcpy rather than use copy_to_page helper. In this case xol_get_insn_slot knows kernel address and passes it to arch_uprobe_flush_xol_access function.
Signed-off-by: Victor Kamensky victor.kamensky@linaro.org
arch/arm/include/asm/cacheflush.h | 2 ++ arch/arm/kernel/uprobes.c | 6 ++++++ arch/arm/mm/flush.c | 41 +++++++++++++++++++++++++++++++++------ include/linux/uprobes.h | 3 +++ kernel/events/uprobes.c | 33 +++++++++++++++++++++++++------ 5 files changed, 73 insertions(+), 12 deletions(-)
diff --git a/arch/arm/include/asm/cacheflush.h b/arch/arm/include/asm/cacheflush.h index 8b8b616..e02712a 100644 --- a/arch/arm/include/asm/cacheflush.h +++ b/arch/arm/include/asm/cacheflush.h @@ -487,4 +487,6 @@ int set_memory_rw(unsigned long addr, int numpages); int set_memory_x(unsigned long addr, int numpages); int set_memory_nx(unsigned long addr, int numpages);
+void flush_uprobe_xol_access(struct page *page, unsigned long uaddr,
#endifvoid *kaddr, unsigned long len);
diff --git a/arch/arm/kernel/uprobes.c b/arch/arm/kernel/uprobes.c index f9bacee..a0339a6 100644 --- a/arch/arm/kernel/uprobes.c +++ b/arch/arm/kernel/uprobes.c @@ -113,6 +113,12 @@ int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, return 0; }
+void arch_uprobe_flush_xol_access(struct page *page, unsigned long vaddr,
void *kaddr, unsigned long len)
+{
- flush_uprobe_xol_access(page, vaddr, kaddr, len);
+}
- int arch_uprobe_pre_xol(struct arch_uprobe *auprobe, struct pt_regs *regs) { struct uprobe_task *utask = current->utask;
diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c index 3387e60..69a0bd08 100644 --- a/arch/arm/mm/flush.c +++ b/arch/arm/mm/flush.c @@ -104,17 +104,21 @@ void flush_cache_page(struct vm_area_struct *vma, unsigned long user_addr, unsig #define flush_icache_alias(pfn,vaddr,len) do { } while (0) #endif
+#define FLAG_UA_IS_EXEC 1 +#define FLAG_UA_CORE_IN_MM 2 +#define FLAG_UA_BROADCAST 4
- static void flush_ptrace_access_other(void *args) { __flush_icache_all(); }
-static -void flush_ptrace_access(struct vm_area_struct *vma, struct page *page,
unsigned long uaddr, void *kaddr, unsigned long len)
+static inline +void __flush_ptrace_access(struct page *page, unsigned long uaddr, void *kaddr,
{ if (cache_is_vivt()) {unsigned long len, unsigned int flags)
if (cpumask_test_cpu(smp_processor_id(), mm_cpumask(vma->vm_mm))) {
}if (flags & FLAG_UA_CORE_IN_MM) { unsigned long addr = (unsigned long)kaddr; __cpuc_coherent_kern_range(addr, addr + len);
@@ -128,18 +132,43 @@ void flush_ptrace_access(struct vm_area_struct *vma, struct page *page, }
/* VIPT non-aliasing D-cache */
- if (vma->vm_flags & VM_EXEC) {
- if (flags & FLAG_UA_IS_EXEC) { unsigned long addr = (unsigned long)kaddr; if (icache_is_vipt_aliasing()) flush_icache_alias(page_to_pfn(page), uaddr, len); else __cpuc_coherent_kern_range(addr, addr + len);
if (cache_ops_need_broadcast())
} }if (flags & FLAG_UA_BROADCAST) smp_call_function(flush_ptrace_access_other, NULL, 1);
+static +void flush_ptrace_access(struct vm_area_struct *vma, struct page *page,
unsigned long uaddr, void *kaddr, unsigned long len)
+{
- unsigned int flags = 0;
- if (cpumask_test_cpu(smp_processor_id(), mm_cpumask(vma->vm_mm))) {
flags |= FLAG_UA_CORE_IN_MM;
- }
- if (vma->vm_flags & VM_EXEC) {
flags |= FLAG_UA_IS_EXEC;
- }
- if (cache_ops_need_broadcast()) {
flags |= FLAG_UA_BROADCAST;
- }
- __flush_ptrace_access(page, uaddr, kaddr, len, flags);
+}
+void flush_uprobe_xol_access(struct page *page, unsigned long uaddr,
void *kaddr, unsigned long len)
+{
- unsigned int flags = FLAG_UA_CORE_IN_MM|FLAG_UA_IS_EXEC;
- __flush_ptrace_access(page, uaddr, kaddr, len, flags);
+}
I guess this is cleaner than duplicating some part of flush_ptrace_access inside your new flush function.
/*
- Copy user data from/to a page which is mapped into a different
- processes address space. Really, we want to allow our "user
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index edff2b9..534e083 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -32,6 +32,7 @@ struct vm_area_struct; struct mm_struct; struct inode; struct notifier_block; +struct page;
#define UPROBE_HANDLER_REMOVE 1 #define UPROBE_HANDLER_MASK 1 @@ -127,6 +128,8 @@ extern int arch_uprobe_exception_notify(struct notifier_block *self, unsigned l extern void arch_uprobe_abort_xol(struct arch_uprobe *aup, struct pt_regs *regs); extern unsigned long arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct pt_regs *regs); extern bool __weak arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs); +extern void __weak arch_uprobe_flush_xol_access(struct page *page, unsigned long vaddr,
#else /* !CONFIG_UPROBES */ struct uprobes_state { };void *kaddr, unsigned long len);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 04709b6..b9142d5 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1287,6 +1287,7 @@ static unsigned long xol_get_insn_slot(struct uprobe *uprobe) { struct xol_area *area; unsigned long xol_vaddr;
void *xol_page_kaddr;
area = get_xol_area(); if (!area)
@@ -1296,14 +1297,22 @@ static unsigned long xol_get_insn_slot(struct uprobe *uprobe) if (unlikely(!xol_vaddr)) return 0;
- /* Initialize the slot */
- copy_to_page(area->page, xol_vaddr,
/*&uprobe->arch.ixol, sizeof(uprobe->arch.ixol));
* We probably need flush_icache_user_range() but it needs vma.
* This should work on supported architectures too.
* We don't use copy_to_page here because we need kernel page
*/* addr to invalidate caches correctly
- flush_dcache_page(area->page);
xol_page_kaddr = kmap_atomic(area->page);
/* Initialize the slot */
memcpy(xol_page_kaddr + (xol_vaddr & ~PAGE_MASK),
&uprobe->arch.ixol,
sizeof(uprobe->arch.ixol));
arch_uprobe_flush_xol_access(area->page, xol_vaddr,
xol_page_kaddr + (xol_vaddr & ~PAGE_MASK),
sizeof(uprobe->arch.ixol));
kunmap_atomic(xol_page_kaddr);
return xol_vaddr; }
@@ -1346,6 +1355,18 @@ static void xol_free_insn_slot(struct task_struct *tsk) } }
+void __weak arch_uprobe_flush_xol_access(struct page *page, unsigned long vaddr,
void *kaddr, unsigned long len)
+{
- /*
* We probably need flush_icache_user_range() but it needs vma.
* This should work on most of architectures by default. If
* architecture needs to do something different it can define
* its own version of the function.
*/
- flush_dcache_page(page);
+}
- /**
- uprobe_get_swbp_addr - compute address of swbp given post-swbp regs
- @regs: Reflects the saved state of the task after it has hit a breakpoint
I can confirm this fixes the problem I was seeing on arndale. x86 builds but I have not tested it.
It looks to me like this is the best way to solve the problem. You should run checkpatch.pl on your patch though, there are some warnings from unnecessary curly braces on if statements and a couple lines over 80 characters long. Minor issues, but easily fixed. Assuming you fix those you can add:
Reviewed-by: David A. Long dave.long@linaro.org
-dl
Victor, sorry, I can't even read this thread now, will try to reply tomorrow... But at first glance,
On 04/08, Victor Kamensky wrote:
Because on ARM cache flush function need kernel address
...
--- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1287,6 +1287,7 @@ static unsigned long xol_get_insn_slot(struct uprobe *uprobe) { struct xol_area *area; unsigned long xol_vaddr;
- void *xol_page_kaddr;
area = get_xol_area(); if (!area) @@ -1296,14 +1297,22 @@ static unsigned long xol_get_insn_slot(struct uprobe *uprobe) if (unlikely(!xol_vaddr)) return 0;
- /* Initialize the slot */
- copy_to_page(area->page, xol_vaddr,
/*&uprobe->arch.ixol, sizeof(uprobe->arch.ixol));
* We probably need flush_icache_user_range() but it needs vma.
* This should work on supported architectures too.
* We don't use copy_to_page here because we need kernel page
*/* addr to invalidate caches correctly
- flush_dcache_page(area->page);
- xol_page_kaddr = kmap_atomic(area->page);
- /* Initialize the slot */
- memcpy(xol_page_kaddr + (xol_vaddr & ~PAGE_MASK),
&uprobe->arch.ixol,
sizeof(uprobe->arch.ixol));
- arch_uprobe_flush_xol_access(area->page, xol_vaddr,
xol_page_kaddr + (xol_vaddr & ~PAGE_MASK),
sizeof(uprobe->arch.ixol));
- kunmap_atomic(xol_page_kaddr);
return xol_vaddr; } @@ -1346,6 +1355,18 @@ static void xol_free_insn_slot(struct task_struct *tsk) } } +void __weak arch_uprobe_flush_xol_access(struct page *page, unsigned long vaddr,
void *kaddr, unsigned long len)
+{
- /*
* We probably need flush_icache_user_range() but it needs vma.
* This should work on most of architectures by default. If
* architecture needs to do something different it can define
* its own version of the function.
*/
- flush_dcache_page(page);
+}
In this case I'd suggest to move this kmap/memcpy horror into arch_ helper. IOW, something like below.
If arm needs the kernel address we are writing to, let it do kmap/etc in its implementation.
Oleg.
--- x/kernel/events/uprobes.c +++ x/kernel/events/uprobes.c @@ -1291,18 +1291,16 @@ static unsigned long xol_get_insn_slot(s if (unlikely(!xol_vaddr)) return 0;
- /* Initialize the slot */ - copy_to_page(area->page, xol_vaddr, - &uprobe->arch.ixol, sizeof(uprobe->arch.ixol)); - /* - * We probably need flush_icache_user_range() but it needs vma. - * This should work on supported architectures too. - */ - flush_dcache_page(area->page); - + arch_uprobe_xxx(area->page, xol_vaddr, ...); return xol_vaddr; }
+void __weak arch_uprobe_xxx(page, xol_vaddr, ...) +{ + copy_to_page(page, xol_vaddr, ...) + flush_dcache_page(page); +} + /* * xol_free_insn_slot - If slot was earlier allocated by * @xol_get_insn_slot(), make the slot available for
On 9 April 2014 11:45, Oleg Nesterov oleg@redhat.com wrote:
Victor, sorry, I can't even read this thread now, will try to reply tomorrow... But at first glance,
Sure, np. Will wait for you to free up.
On 04/08, Victor Kamensky wrote:
Because on ARM cache flush function need kernel address
...
--- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1287,6 +1287,7 @@ static unsigned long xol_get_insn_slot(struct uprobe *uprobe) { struct xol_area *area; unsigned long xol_vaddr;
void *xol_page_kaddr; area = get_xol_area(); if (!area)
@@ -1296,14 +1297,22 @@ static unsigned long xol_get_insn_slot(struct uprobe *uprobe) if (unlikely(!xol_vaddr)) return 0;
/* Initialize the slot */
copy_to_page(area->page, xol_vaddr,
&uprobe->arch.ixol, sizeof(uprobe->arch.ixol)); /*
* We probably need flush_icache_user_range() but it needs vma.
* This should work on supported architectures too.
* We don't use copy_to_page here because we need kernel page
* addr to invalidate caches correctly */
flush_dcache_page(area->page);
xol_page_kaddr = kmap_atomic(area->page);
/* Initialize the slot */
memcpy(xol_page_kaddr + (xol_vaddr & ~PAGE_MASK),
&uprobe->arch.ixol,
sizeof(uprobe->arch.ixol));
arch_uprobe_flush_xol_access(area->page, xol_vaddr,
xol_page_kaddr + (xol_vaddr & ~PAGE_MASK),
sizeof(uprobe->arch.ixol));
kunmap_atomic(xol_page_kaddr); return xol_vaddr;
} @@ -1346,6 +1355,18 @@ static void xol_free_insn_slot(struct task_struct *tsk) } }
+void __weak arch_uprobe_flush_xol_access(struct page *page, unsigned long vaddr,
void *kaddr, unsigned long len)
+{
/*
* We probably need flush_icache_user_range() but it needs vma.
* This should work on most of architectures by default. If
* architecture needs to do something different it can define
* its own version of the function.
*/
flush_dcache_page(page);
+}
In this case I'd suggest to move this kmap/memcpy horror into arch_ helper. IOW, something like below.
If arm needs the kernel address we are writing to, let it do kmap/etc in its implementation.
Fair enough. Can do that, as well as address Dave's comment about checkpatch.
But before I do that let's reach some conclusion wrt latest Russell's remarks about copy_{to,from}_user_page usage in uprobes. It is bigger question covering more than ARM issue and possibly having bigger implication on uprobes design. I have some comments/thoughts about it, but since I am not uprobes maintainer, will wait for you to address it first.
Thanks, Victor
Oleg.
--- x/kernel/events/uprobes.c +++ x/kernel/events/uprobes.c @@ -1291,18 +1291,16 @@ static unsigned long xol_get_insn_slot(s if (unlikely(!xol_vaddr)) return 0;
/* Initialize the slot */
copy_to_page(area->page, xol_vaddr,
&uprobe->arch.ixol, sizeof(uprobe->arch.ixol));
/*
* We probably need flush_icache_user_range() but it needs vma.
* This should work on supported architectures too.
*/
flush_dcache_page(area->page);
arch_uprobe_xxx(area->page, xol_vaddr, ...); return xol_vaddr;
}
+void __weak arch_uprobe_xxx(page, xol_vaddr, ...) +{
copy_to_page(page, xol_vaddr, ...)
flush_dcache_page(page);
+}
/*
- xol_free_insn_slot - If slot was earlier allocated by
- @xol_get_insn_slot(), make the slot available for
On Wed, Apr 09, 2014 at 12:13:56PM -0700, Victor Kamensky wrote:
Fair enough. Can do that, as well as address Dave's comment about checkpatch.
But before I do that let's reach some conclusion wrt latest Russell's remarks about copy_{to,from}_user_page usage in uprobes. It is bigger question covering more than ARM issue and possibly having bigger implication on uprobes design. I have some comments/thoughts about it, but since I am not uprobes maintainer, will wait for you to address it first.
I encourage folk to read this thread:
http://lkml.iu.edu//hypermail/linux/kernel/1404.1/00725.html
and involve David in the loop - I suspect you need to sell whatever solution you agree on to David (as one of the other arch maintainers) as well. Thanks.
OK, here is an alternative solution for the kernel-dcache / user-icache flush issue in uprobes which I think follows Dave Miller's suggested approach. As a reminder: the goal is to make sure the user-space icache does not have stale data after the kernel rewrite of an instruction in the user's uprobe "execute out of line" (xol) page. It seems only ARM currently finds the flush_dcache_page() call insufficient, but then apparently only two architectures (other than ARM) support uprobes.
I've modified events/uprobes.c to simply call the copy_to_user_page() function instead of doing a memcpy() followed by a flush_dcache_page() call. This results in a net reduction of one line of code in that file. Then I modified copy_to_user_page() and/or flushing function(s) it calls to treat a NULL vma pointer to mean: "assume the user icache address range is now invalid". In the majority of cases this is pretty basic and should be safe as nothing could have been doing this previously. In some cases this now results in flushing more icache than is necessary. For the mips, sh, sparc, and alpha architectures something more complicated is necessary and I have not currently done that. I am not certain this approach can be made to work cleanly for those architectures, although there is probably always the last resort of flushing all icache. On the other hand, it appears only x86, powerpc, and (god-willing) ARM currently support uprobes.
I have only tested this on ARM (arndale) at this point.
The preliminary patch follows in my next email.
(BTW, is depending on the C compiler short-circuiting conditonals acceptable style?)
-dl
Replace memcpy and dcache flush in generic uprobes with a call to copy_to_user_page(), which will do a proper flushing of kernel and user cache. Also modify the inmplementation of copy_to_user_page to assume a NULL vma pointer means the user icache corresponding to this right is stale and needs to be flushed. Note that this patch does not fix copy_to_user page for the sh, alpha, sparc, or mips architectures (which do not currently support uprobes).
Signed-off-by: David A. Long dave.long@linaro.org --- arch/arc/include/asm/cacheflush.h | 2 +- arch/arm/mm/flush.c | 4 ++-- arch/arm64/mm/flush.c | 2 +- arch/avr32/mm/cache.c | 2 +- arch/hexagon/include/asm/cacheflush.h | 2 +- arch/m68k/include/asm/cacheflush_mm.h | 2 +- arch/microblaze/include/asm/cacheflush.h | 2 +- arch/mips/mm/init.c | 2 +- arch/parisc/include/asm/tlbflush.h | 4 ++++ arch/parisc/kernel/cache.c | 4 ++-- arch/score/include/asm/cacheflush.h | 2 +- arch/score/mm/cache.c | 2 +- arch/sh/mm/cache.c | 2 +- arch/sparc/mm/leon_mm.c | 2 +- arch/tile/include/asm/cacheflush.h | 2 +- arch/unicore32/mm/flush.c | 2 +- arch/xtensa/mm/cache.c | 4 ++-- kernel/events/uprobes.c | 7 +------ 18 files changed, 24 insertions(+), 25 deletions(-)
diff --git a/arch/arc/include/asm/cacheflush.h b/arch/arc/include/asm/cacheflush.h index 6abc497..64d67e4 100644 --- a/arch/arc/include/asm/cacheflush.h +++ b/arch/arc/include/asm/cacheflush.h @@ -110,7 +110,7 @@ static inline int cache_is_vipt_aliasing(void) #define copy_to_user_page(vma, page, vaddr, dst, src, len) \ do { \ memcpy(dst, src, len); \ - if (vma->vm_flags & VM_EXEC) \ + if (!vma || vma->vm_flags & VM_EXEC) \ __sync_icache_dcache((unsigned long)(dst), vaddr, len); \ } while (0)
diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c index 3387e60..dd19ad4 100644 --- a/arch/arm/mm/flush.c +++ b/arch/arm/mm/flush.c @@ -114,7 +114,7 @@ void flush_ptrace_access(struct vm_area_struct *vma, struct page *page, unsigned long uaddr, void *kaddr, unsigned long len) { if (cache_is_vivt()) { - if (cpumask_test_cpu(smp_processor_id(), mm_cpumask(vma->vm_mm))) { + if (!vma || cpumask_test_cpu(smp_processor_id(), mm_cpumask(vma->vm_mm))) { unsigned long addr = (unsigned long)kaddr; __cpuc_coherent_kern_range(addr, addr + len); } @@ -128,7 +128,7 @@ void flush_ptrace_access(struct vm_area_struct *vma, struct page *page, }
/* VIPT non-aliasing D-cache */ - if (vma->vm_flags & VM_EXEC) { + if (!vma || vma->vm_flags & VM_EXEC) { unsigned long addr = (unsigned long)kaddr; if (icache_is_vipt_aliasing()) flush_icache_alias(page_to_pfn(page), uaddr, len); diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c index e4193e3..cde3cb4 100644 --- a/arch/arm64/mm/flush.c +++ b/arch/arm64/mm/flush.c @@ -38,7 +38,7 @@ static void flush_ptrace_access(struct vm_area_struct *vma, struct page *page, unsigned long uaddr, void *kaddr, unsigned long len) { - if (vma->vm_flags & VM_EXEC) { + if (!vma || vma->vm_flags & VM_EXEC) { unsigned long addr = (unsigned long)kaddr; if (icache_is_aliasing()) { __flush_dcache_area(kaddr, len); diff --git a/arch/avr32/mm/cache.c b/arch/avr32/mm/cache.c index 6a46ecd..cd3d378 100644 --- a/arch/avr32/mm/cache.c +++ b/arch/avr32/mm/cache.c @@ -156,7 +156,7 @@ void copy_to_user_page(struct vm_area_struct *vma, struct page *page, unsigned long len) { memcpy(dst, src, len); - if (vma->vm_flags & VM_EXEC) + if (!vma || vma->vm_flags & VM_EXEC) flush_icache_range((unsigned long)dst, (unsigned long)dst + len); } diff --git a/arch/hexagon/include/asm/cacheflush.h b/arch/hexagon/include/asm/cacheflush.h index 49e0896..9bea768 100644 --- a/arch/hexagon/include/asm/cacheflush.h +++ b/arch/hexagon/include/asm/cacheflush.h @@ -86,7 +86,7 @@ static inline void copy_to_user_page(struct vm_area_struct *vma, void *dst, void *src, int len) { memcpy(dst, src, len); - if (vma->vm_flags & VM_EXEC) { + if (!vma || vma->vm_flags & VM_EXEC) { flush_icache_range((unsigned long) dst, (unsigned long) dst + len); } diff --git a/arch/m68k/include/asm/cacheflush_mm.h b/arch/m68k/include/asm/cacheflush_mm.h index fa2c3d6..afefdeb 100644 --- a/arch/m68k/include/asm/cacheflush_mm.h +++ b/arch/m68k/include/asm/cacheflush_mm.h @@ -212,7 +212,7 @@ static inline void flush_cache_range(struct vm_area_struct *vma,
static inline void flush_cache_page(struct vm_area_struct *vma, unsigned long vmaddr, unsigned long pfn) { - if (vma->vm_mm == current->mm) + if (!vma || vma->vm_mm == current->mm) __flush_cache_030(); }
diff --git a/arch/microblaze/include/asm/cacheflush.h b/arch/microblaze/include/asm/cacheflush.h index ffea82a..9eef956 100644 --- a/arch/microblaze/include/asm/cacheflush.h +++ b/arch/microblaze/include/asm/cacheflush.h @@ -108,7 +108,7 @@ static inline void copy_to_user_page(struct vm_area_struct *vma, { u32 addr = virt_to_phys(dst); memcpy(dst, src, len); - if (vma->vm_flags & VM_EXEC) { + if (!vma || vma->vm_flags & VM_EXEC) { invalidate_icache_range(addr, addr + PAGE_SIZE); flush_dcache_range(addr, addr + PAGE_SIZE); } diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c index 6b59617..e428551 100644 --- a/arch/mips/mm/init.c +++ b/arch/mips/mm/init.c @@ -232,7 +232,7 @@ void copy_to_user_page(struct vm_area_struct *vma, if (cpu_has_dc_aliases) SetPageDcacheDirty(page); } - if ((vma->vm_flags & VM_EXEC) && !cpu_has_ic_fills_f_dc) + if ((!vma || vma->vm_flags & VM_EXEC) && !cpu_has_ic_fills_f_dc) flush_cache_page(vma, vaddr, page_to_pfn(page)); }
diff --git a/arch/parisc/include/asm/tlbflush.h b/arch/parisc/include/asm/tlbflush.h index 9d086a5..7aad1b6 100644 --- a/arch/parisc/include/asm/tlbflush.h +++ b/arch/parisc/include/asm/tlbflush.h @@ -68,6 +68,10 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, /* For one page, it's not worth testing the split_tlb variable */
mb(); + if (!vma) { + flush_tlb_all(); + return; + } sid = vma->vm_mm->context; purge_tlb_start(flags); mtsp(sid, 1); diff --git a/arch/parisc/kernel/cache.c b/arch/parisc/kernel/cache.c index ac87a40..ff09f05 100644 --- a/arch/parisc/kernel/cache.c +++ b/arch/parisc/kernel/cache.c @@ -278,7 +278,7 @@ __flush_cache_page(struct vm_area_struct *vma, unsigned long vmaddr, { preempt_disable(); flush_dcache_page_asm(physaddr, vmaddr); - if (vma->vm_flags & VM_EXEC) + if (!vma || vma->vm_flags & VM_EXEC) flush_icache_page_asm(physaddr, vmaddr); preempt_enable(); } @@ -574,7 +574,7 @@ void flush_cache_range(struct vm_area_struct *vma, void flush_cache_page(struct vm_area_struct *vma, unsigned long vmaddr, unsigned long pfn) { - BUG_ON(!vma->vm_mm->context); + BUG_ON(vma && !vma->vm_mm->context);
if (pfn_valid(pfn)) { flush_tlb_page(vma, vmaddr); diff --git a/arch/score/include/asm/cacheflush.h b/arch/score/include/asm/cacheflush.h index 1d545d0..63e7b4e 100644 --- a/arch/score/include/asm/cacheflush.h +++ b/arch/score/include/asm/cacheflush.h @@ -41,7 +41,7 @@ static inline void flush_icache_page(struct vm_area_struct *vma, #define copy_to_user_page(vma, page, vaddr, dst, src, len) \ do { \ memcpy(dst, src, len); \ - if ((vma->vm_flags & VM_EXEC)) \ + if (!vma || (vma->vm_flags & VM_EXEC)) \ flush_cache_page(vma, vaddr, page_to_pfn(page));\ } while (0)
diff --git a/arch/score/mm/cache.c b/arch/score/mm/cache.c index f85ec1a..8464575 100644 --- a/arch/score/mm/cache.c +++ b/arch/score/mm/cache.c @@ -210,7 +210,7 @@ void flush_cache_range(struct vm_area_struct *vma, void flush_cache_page(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn) { - int exec = vma->vm_flags & VM_EXEC; + int exec = !vma || vma->vm_flags & VM_EXEC; unsigned long kaddr = 0xa0000000 | (pfn << PAGE_SHIFT);
flush_dcache_range(kaddr, kaddr + PAGE_SIZE); diff --git a/arch/sh/mm/cache.c b/arch/sh/mm/cache.c index 616966a..ba2313a 100644 --- a/arch/sh/mm/cache.c +++ b/arch/sh/mm/cache.c @@ -70,7 +70,7 @@ void copy_to_user_page(struct vm_area_struct *vma, struct page *page, clear_bit(PG_dcache_clean, &page->flags); }
- if (vma->vm_flags & VM_EXEC) + if (!vma || vma->vm_flags & VM_EXEC) flush_cache_page(vma, vaddr, page_to_pfn(page)); }
diff --git a/arch/sparc/mm/leon_mm.c b/arch/sparc/mm/leon_mm.c index 5bed085..dca5e18 100644 --- a/arch/sparc/mm/leon_mm.c +++ b/arch/sparc/mm/leon_mm.c @@ -192,7 +192,7 @@ void leon_flush_dcache_all(void)
void leon_flush_pcache_all(struct vm_area_struct *vma, unsigned long page) { - if (vma->vm_flags & VM_EXEC) + if (!vma || vma->vm_flags & VM_EXEC) leon_flush_icache_all(); leon_flush_dcache_all(); } diff --git a/arch/tile/include/asm/cacheflush.h b/arch/tile/include/asm/cacheflush.h index 92ee4c8..7b7022c 100644 --- a/arch/tile/include/asm/cacheflush.h +++ b/arch/tile/include/asm/cacheflush.h @@ -66,7 +66,7 @@ static inline void copy_to_user_page(struct vm_area_struct *vma, void *dst, void *src, int len) { memcpy(dst, src, len); - if (vma->vm_flags & VM_EXEC) { + if (!vma || vma->vm_flags & VM_EXEC) { flush_icache_range((unsigned long) dst, (unsigned long) dst + len); } diff --git a/arch/unicore32/mm/flush.c b/arch/unicore32/mm/flush.c index 6d4c096..10ddab3 100644 --- a/arch/unicore32/mm/flush.c +++ b/arch/unicore32/mm/flush.c @@ -36,7 +36,7 @@ static void flush_ptrace_access(struct vm_area_struct *vma, struct page *page, unsigned long uaddr, void *kaddr, unsigned long len) { /* VIPT non-aliasing D-cache */ - if (vma->vm_flags & VM_EXEC) { + if (!vma || vma->vm_flags & VM_EXEC) { unsigned long addr = (unsigned long)kaddr;
__cpuc_coherent_kern_range(addr, addr + len); diff --git a/arch/xtensa/mm/cache.c b/arch/xtensa/mm/cache.c index ba4c47f..d34c06c 100644 --- a/arch/xtensa/mm/cache.c +++ b/arch/xtensa/mm/cache.c @@ -221,10 +221,10 @@ void copy_to_user_page(struct vm_area_struct *vma, struct page *page, unsigned long t = TLBTEMP_BASE_1 + (vaddr & DCACHE_ALIAS_MASK);
__flush_invalidate_dcache_range((unsigned long) dst, len); - if ((vma->vm_flags & VM_EXEC) != 0) + if (!vma || (vma->vm_flags & VM_EXEC) != 0) __invalidate_icache_page_alias(t, phys);
- } else if ((vma->vm_flags & VM_EXEC) != 0) { + } else if (!vma || (vma->vm_flags & VM_EXEC) != 0) { __flush_dcache_range((unsigned long)dst,len); __invalidate_icache_range((unsigned long) dst, len); } diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 04709b6..2e976fb 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -241,7 +241,7 @@ static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, in static void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len) { void *kaddr = kmap_atomic(page); - memcpy(kaddr + (vaddr & ~PAGE_MASK), src, len); + copy_to_user_page(NULL, page, vaddr, kaddr + (vaddr & ~PAGE_MASK), src, len); kunmap_atomic(kaddr); }
@@ -1299,11 +1299,6 @@ static unsigned long xol_get_insn_slot(struct uprobe *uprobe) /* Initialize the slot */ copy_to_page(area->page, xol_vaddr, &uprobe->arch.ixol, sizeof(uprobe->arch.ixol)); - /* - * We probably need flush_icache_user_range() but it needs vma. - * This should work on supported architectures too. - */ - flush_dcache_page(area->page);
return xol_vaddr; }
From: David Long dave.long@linaro.org Date: Thu, 10 Apr 2014 23:45:31 -0400
Replace memcpy and dcache flush in generic uprobes with a call to copy_to_user_page(), which will do a proper flushing of kernel and user cache. Also modify the inmplementation of copy_to_user_page to assume a NULL vma pointer means the user icache corresponding to this right is stale and needs to be flushed. Note that this patch does not fix copy_to_user page for the sh, alpha, sparc, or mips architectures (which do not currently support uprobes).
Signed-off-by: David A. Long dave.long@linaro.org
You really need to pass the proper VMA down to the call site rather than pass NULL, that's extremely ugly and totally unnecesary.
On 10 April 2014 21:36, David Miller davem@davemloft.net wrote:
From: David Long dave.long@linaro.org Date: Thu, 10 Apr 2014 23:45:31 -0400
Replace memcpy and dcache flush in generic uprobes with a call to copy_to_user_page(), which will do a proper flushing of kernel and user cache. Also modify the inmplementation of copy_to_user_page to assume a NULL vma pointer means the user icache corresponding to this right is stale and needs to be flushed. Note that this patch does not fix copy_to_user page for the sh, alpha, sparc, or mips architectures (which do not currently support uprobes).
Signed-off-by: David A. Long dave.long@linaro.org
You really need to pass the proper VMA down to the call site rather than pass NULL, that's extremely ugly and totally unnecesary.
Agreed that VMA is really needed.
Here is variant that I tried while waiting for Oleg's response:
From 4a6a9043e0910041dd8842835a528cbdc39fad34 Mon Sep 17 00:00:00 2001
From: Victor Kamensky victor.kamensky@linaro.org Date: Thu, 10 Apr 2014 17:06:39 -0700 Subject: [PATCH] uprobes: use copy_to_user_page function to copy instr to xol area
Use copy_to_user_page function to copy instruction into xol area. copy_to_user_page function guarantee that all caches are correctly flushed during such write (including icache as well if needed).
Because copy_to_user_page needs vm_area_struct, vma field was added into struct xol_area. It holds cached vma value for xol_area.
Also using copy_to_user_page we make sure that we use the same code that ptrace write code uses.
Signed-off-by: Victor Kamensky victor.kamensky@linaro.org --- kernel/events/uprobes.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 04709b6..1ae4563 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -117,6 +117,7 @@ struct xol_area { * the vma go away, and we must handle that reasonably gracefully. */ unsigned long vaddr; /* Page(s) of instruction slots */ + struct vm_area_struct *vma; /* VMA that holds above address */ };
/* @@ -1150,6 +1151,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
ret = install_special_mapping(mm, area->vaddr, PAGE_SIZE, VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO, &area->page); + area->vma = find_vma(mm, area->vaddr); if (ret) goto fail;
@@ -1287,6 +1289,7 @@ static unsigned long xol_get_insn_slot(struct uprobe *uprobe) { struct xol_area *area; unsigned long xol_vaddr; + void *xol_page_kaddr;
area = get_xol_area(); if (!area) @@ -1297,8 +1300,11 @@ static unsigned long xol_get_insn_slot(struct uprobe *uprobe) return 0;
/* Initialize the slot */ - copy_to_page(area->page, xol_vaddr, - &uprobe->arch.ixol, sizeof(uprobe->arch.ixol)); + xol_page_kaddr = kmap_atomic(area->page); + copy_to_user_page(area->vma, area->page, xol_vaddr, + xol_page_kaddr + (xol_vaddr & ~PAGE_MASK), + &uprobe->arch.ixol, sizeof(uprobe->arch.ixol)); + kunmap_atomic(xol_page_kaddr); /* * We probably need flush_icache_user_range() but it needs vma. * This should work on supported architectures too.
On 04/11, Victor Kamensky wrote:
On 10 April 2014 21:36, David Miller davem@davemloft.net wrote:
You really need to pass the proper VMA down to the call site rather than pass NULL, that's extremely ugly and totally unnecesary.
Agreed that VMA is really needed.
I do not ;) but I am still trying to finish my email...
index 04709b6..1ae4563 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -117,6 +117,7 @@ struct xol_area { * the vma go away, and we must handle that reasonably gracefully. */ unsigned long vaddr; /* Page(s) of instruction slots */
- struct vm_area_struct *vma; /* VMA that holds above address */
};
/* @@ -1150,6 +1151,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
ret = install_special_mapping(mm, area->vaddr, PAGE_SIZE, VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO, &area->page);
- area->vma = find_vma(mm, area->vaddr);
No, this can't work. This vma can be unmapped/freed/etc.
Oleg.
On 11 April 2014 07:26, Victor Kamensky victor.kamensky@linaro.org wrote:
On 10 April 2014 21:36, David Miller davem@davemloft.net wrote:
From: David Long dave.long@linaro.org Date: Thu, 10 Apr 2014 23:45:31 -0400
Replace memcpy and dcache flush in generic uprobes with a call to copy_to_user_page(), which will do a proper flushing of kernel and user cache. Also modify the inmplementation of copy_to_user_page to assume a NULL vma pointer means the user icache corresponding to this right is stale and needs to be flushed. Note that this patch does not fix copy_to_user page for the sh, alpha, sparc, or mips architectures (which do not currently support uprobes).
Signed-off-by: David A. Long dave.long@linaro.org
You really need to pass the proper VMA down to the call site rather than pass NULL, that's extremely ugly and totally unnecesary.
Agreed that VMA is really needed.
Here is variant that I tried while waiting for Oleg's response:
From 4a6a9043e0910041dd8842835a528cbdc39fad34 Mon Sep 17 00:00:00 2001 From: Victor Kamensky victor.kamensky@linaro.org Date: Thu, 10 Apr 2014 17:06:39 -0700 Subject: [PATCH] uprobes: use copy_to_user_page function to copy instr to xol area
Use copy_to_user_page function to copy instruction into xol area. copy_to_user_page function guarantee that all caches are correctly flushed during such write (including icache as well if needed).
Because copy_to_user_page needs vm_area_struct, vma field was added into struct xol_area. It holds cached vma value for xol_area.
Also using copy_to_user_page we make sure that we use the same code that ptrace write code uses.
Signed-off-by: Victor Kamensky victor.kamensky@linaro.org
kernel/events/uprobes.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 04709b6..1ae4563 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -117,6 +117,7 @@ struct xol_area { * the vma go away, and we must handle that reasonably gracefully. */ unsigned long vaddr; /* Page(s) of instruction slots */
- struct vm_area_struct *vma; /* VMA that holds above address */
};
/* @@ -1150,6 +1151,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
ret = install_special_mapping(mm, area->vaddr, PAGE_SIZE, VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO, &area->page);
- area->vma = find_vma(mm, area->vaddr); if (ret) goto fail;
@@ -1287,6 +1289,7 @@ static unsigned long xol_get_insn_slot(struct uprobe *uprobe) { struct xol_area *area; unsigned long xol_vaddr;
void *xol_page_kaddr;
area = get_xol_area(); if (!area)
@@ -1297,8 +1300,11 @@ static unsigned long xol_get_insn_slot(struct uprobe *uprobe) return 0;
/* Initialize the slot */
- copy_to_page(area->page, xol_vaddr,
&uprobe->arch.ixol, sizeof(uprobe->arch.ixol));
- xol_page_kaddr = kmap_atomic(area->page);
- copy_to_user_page(area->vma, area->page, xol_vaddr,
xol_page_kaddr + (xol_vaddr & ~PAGE_MASK),
&uprobe->arch.ixol, sizeof(uprobe->arch.ixol));
- kunmap_atomic(xol_page_kaddr); /*
- We probably need flush_icache_user_range() but it needs vma.
- This should work on supported architectures too.
Above comment and following call to 'flush_dcache_page(area->page);' should be removed of course. Just noticed that ...
Thanks, Victor
-- 1.8.1.4
Now while waiting for Oleg's response I am trying to run some performance tests between different solution to understand situation better.
One of my concerns was ARM smp function call that could be reached from copy_to_user_page. But I have hard time finding CPU that really doing it ... It does not look such call will happen on relatively modern ARM CPUs (tried Arndale, and Pandaboard ES). So my issues with use of copy_to_user_page are quite elevated.
Thanks, Victor
First of all: I do not pretend I really understand the problems with icache/etc coherency and how flush_icache_range() actually works on alpha. Help.
For those who were not cc'ed. A probed task has a special "xol" vma, it is used to execute the probed insn out of line.
Initial implementation was x86 only, so we simply copied the probed insn into this vma and everything was fine, flush_icache_range() is nop on x86.
Then we added flush_dcache_page() for powerpc,
// this is just kmap() + memcpy() copy_to_page(area->page, xol_vaddr, &uprobe->arch.ixol, sizeof(uprobe->arch.ixol));
/* * We probably need flush_icache_user_range() but it needs vma. * This should work on supported architectures too. */ flush_dcache_page(area->page);
but this doesn't work on arm. So we need another fix.
On 04/11, David Miller wrote:
From: David Long dave.long@linaro.org Date: Thu, 10 Apr 2014 23:45:31 -0400
Replace memcpy and dcache flush in generic uprobes with a call to copy_to_user_page(), which will do a proper flushing of kernel and user cache. Also modify the inmplementation of copy_to_user_page to assume a NULL vma pointer means the user icache corresponding to this right is stale and needs to be flushed. Note that this patch does not fix copy_to_user page for the sh, alpha, sparc, or mips architectures (which do not currently support uprobes).
Signed-off-by: David A. Long dave.long@linaro.org
You really need to pass the proper VMA down to the call site rather than pass NULL, that's extremely ugly and totally unnecesary.
I agree that we should not change copy_to_user_page(), but I am not sure the code above should use copy_to_user_page().
Because the code above really differs in my opinion from the only user of copy_to_user_page(), __access_remote_vm().
1. First of all, we do not know vma.
OK, we can down_read(mmap_sem) and do find_vma() of course. This is a bit unfortunate, especially because the architectures we currently support do not need this.
But,
2. The problem is, it would be very nice to remove this vma, or at least hide it somehow from find_vma/etc. This is the special mapping we do not want to expose to user-space.
In fact I even have the patches which remove this vma, but they do not work with compat tasks unfortunately.
3. Unlike __access_remote_vm() we always use current->mm, and the memory range changed by the code above can only be used by "current" thread.
So (perhaps) flush_icache_user_range() can even treat this case as "mm->mm_users) <= 1" and avoid ipi_flush_icache_page (just in case, of course I do not know if this is actually possible).
So can't we do something else? Lets forget about arch/arm for the moment, suppose that we want to support uprobes on alpha. Can't we do _something_ like below?
And perhap we can do something like this on arm? it can do kmap(page) / page_address(page) itself.
Oleg.
--- x/arch/alpha/kernel/smp.c +++ x/arch/alpha/kernel/smp.c @@ -751,15 +751,9 @@ ipi_flush_icache_page(void *x) flush_tlb_other(mm); }
-void -flush_icache_user_range(struct vm_area_struct *vma, struct page *page, - unsigned long addr, int len) -{ - struct mm_struct *mm = vma->vm_mm; - - if ((vma->vm_flags & VM_EXEC) == 0) - return;
+void __flush_icache_page_xxx(struct mm_struct *mm, struct page *page) // addr, len ? +{ preempt_disable();
if (mm == current->active_mm) { @@ -783,3 +777,24 @@ flush_icache_user_range(struct vm_area_s
preempt_enable(); } + +void flush_icache_page_xxx(struct mm_struct *mm, struct page *page) +{ + struct mm_struct *mm = current->mm; + + down_read(&mm->mmap_sem); + __flush_icache_page_xxx(mm, page); + up_read(&mm->mmap_sem); +} + +void +flush_icache_user_range(struct vm_area_struct *vma, struct page *page, + unsigned long addr, int len) +{ + struct mm_struct *mm = vma->vm_mm; + + if ((vma->vm_flags & VM_EXEC) == 0) + return; + + __flush_icache_page_xxx(mm, page); +}
On 04/11, Oleg Nesterov wrote:
Can't we do _something_ like below?
If not, I'd propose the patch below.
I can be easily wrong, but it seems that arch/arm can reimplement arch_uprobe_flush_xol_icache() and do flush_ptrace_access()-like code. It needs kaddr, but this is not a problem.
Btw. From arch/arm/include/asm/cacheflush.h
#define flush_icache_user_range(vma,page,addr,len) \ flush_dcache_page(page)
but it has no users?
And I am just curious, why arm's copy_to_user_page() disables premption before memcpy?
Oleg.
--- x/kernel/events/uprobes.c +++ x/kernel/events/uprobes.c @@ -1274,6 +1274,17 @@ static unsigned long xol_take_insn_slot( return slot_addr; }
+void __weak arch_uprobe_flush_xol_icache(struct page *page, + unsigned long vaddr, int len) +{ + /* + * We need copy_to_user_page/flush_icache_user_range but this + * needs vma. If this doesn't work on your arch, reimplement. + */ + flush_dcache_page(area->page); + +} + /* * xol_get_insn_slot - allocate a slot for xol. * Returns the allocated slot address or 0. @@ -1294,11 +1305,8 @@ static unsigned long xol_get_insn_slot(s /* Initialize the slot */ copy_to_page(area->page, xol_vaddr, &uprobe->arch.ixol, sizeof(uprobe->arch.ixol)); - /* - * We probably need flush_icache_user_range() but it needs vma. - * This should work on supported architectures too. - */ - flush_dcache_page(area->page); + arch_uprobe_flush_xol_icache(area->page, xol_vaddr, + sizeof(uprobe->arch.ixol));
return xol_vaddr; }
On Fri, Apr 11, 2014 at 05:22:07PM +0200, Oleg Nesterov wrote:
On 04/11, Oleg Nesterov wrote:
Can't we do _something_ like below?
If not, I'd propose the patch below.
I can be easily wrong, but it seems that arch/arm can reimplement arch_uprobe_flush_xol_icache() and do flush_ptrace_access()-like code. It needs kaddr, but this is not a problem.
Btw. From arch/arm/include/asm/cacheflush.h
#define flush_icache_user_range(vma,page,addr,len) \ flush_dcache_page(page)
but it has no users?
I wonder whether you've read this yet:
http://lkml.iu.edu//hypermail/linux/kernel/1404.1/00725.html
where I proposed removing flush_icache_user_range() since it's not used on a great many architectures.
And I am just curious, why arm's copy_to_user_page() disables premption before memcpy?
flush_ptrace_access() needs to run on the CPU which ended up with the dirty cache line(s) to cope with those which do not have hardware broadcasting of cache maintanence operations.
This is why the hacks that you're doing are just that - they're hacks and are all broken in some way.
So, let me re-quote what I asked in a previous thread:
| Given that we've already solved that problem, wouldn't it be a good idea | if the tracing code would stop trying to reinvent broken solutions to | problems we have already solved?
I fail to see what your problem is with keeping the vma around, and using that infrastructure. If it needs optimisation for uprobes, then let's optimise it. Let's not go inventing a whole new interface solving the same problem.
On 04/11, Russell King - ARM Linux wrote:
On Fri, Apr 11, 2014 at 05:22:07PM +0200, Oleg Nesterov wrote:
I wonder whether you've read this yet:
http://lkml.iu.edu//hypermail/linux/kernel/1404.1/00725.html
it seems that the only result of this discussion is "stop trying to reinvent" you already quoted. Thanks.
where I proposed removing flush_icache_user_range() since it's not used on a great many architectures.
Or at least it and its usage can be cleanuped somehow...
And I am just curious, why arm's copy_to_user_page() disables premption before memcpy?
flush_ptrace_access() needs to run on the CPU which ended up with the dirty cache line(s) to cope with those which do not have hardware broadcasting of cache maintanence operations.
Aha, thanks.
But you know, perhaps I'll ask you another stupid question later. Because it still seems to me that we can do something better/cheaper in uprobe case. Nevermind.
This is why the hacks that you're doing are just that - they're hacks and are all broken in some way.
OK.
I fail to see what your problem is with keeping the vma around,
We can't pin vm_area_struct.
Let's not go inventing a whole new interface solving the same problem.
OK. How about the patch below?
Oleg. ---
index 2adbc97..9d45a4a 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1274,6 +1274,33 @@ static unsigned long xol_take_insn_slot(struct xol_area *area) return slot_addr; }
+static void arch_uprobe_copy_ixol(struct xol_area *area, unsigned long vaddr, + struct arch_uprobe *auprobe) +{ +#ifndef ARCH_UPROBE_XXX + copy_to_page(area->page, vaddr, &auprobe->ixol, sizeof(&auprobe->ixol)); + /* + * We probably need flush_icache_user_range() but it needs vma. + * If this doesn't work define ARCH_UPROBE_XXX. + */ + flush_dcache_page(area->page); +#else + struct mm_struct *mm = current->mm; + struct vm_area_struct *vma; + + down_read(&mm->mmap_sem); + vma = find_exact_vma(mm, area->vaddr, area->vaddr + PAGE_SIZE); + if (vma) { + void *kaddr = kmap_atomic(area->page); + copy_to_user_page(vma, area->page, + vaddr, kaddr + (vaddr & ~PAGE_MASK), + &auprobe->ixol, sizeof(&auprobe->ixol)); + kunmap_atomic(kaddr); + } + up_read(&mm->mmap_sem); +#endif +} + /* * xol_get_insn_slot - allocate a slot for xol. * Returns the allocated slot address or 0. @@ -1291,15 +1318,7 @@ static unsigned long xol_get_insn_slot(struct uprobe *uprobe) if (unlikely(!xol_vaddr)) return 0;
- /* Initialize the slot */ - copy_to_page(area->page, xol_vaddr, - &uprobe->arch.ixol, sizeof(uprobe->arch.ixol)); - /* - * We probably need flush_icache_user_range() but it needs vma. - * This should work on supported architectures too. - */ - flush_dcache_page(area->page); - + arch_uprobe_copy_ixol(area, xol_vaddr, &uprobe->arch); return xol_vaddr; }
On 04/11, Oleg Nesterov wrote:
+static void arch_uprobe_copy_ixol(struct xol_area *area, unsigned long vaddr,
struct arch_uprobe *auprobe)
+{ +#ifndef ARCH_UPROBE_XXX
- copy_to_page(area->page, vaddr, &auprobe->ixol, sizeof(&auprobe->ixol));
- /*
* We probably need flush_icache_user_range() but it needs vma.
* If this doesn't work define ARCH_UPROBE_XXX.
*/
- flush_dcache_page(area->page);
+#else
- struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma;
- down_read(&mm->mmap_sem);
- vma = find_exact_vma(mm, area->vaddr, area->vaddr + PAGE_SIZE);
- if (vma) {
void *kaddr = kmap_atomic(area->page);
copy_to_user_page(vma, area->page,
vaddr, kaddr + (vaddr & ~PAGE_MASK),
&auprobe->ixol, sizeof(&auprobe->ixol));
kunmap_atomic(kaddr);
- }
- up_read(&mm->mmap_sem);
+#endif
And perhaps the patch is not complete. "if (vma)" is not enough, a probed task can mmap something else at this vaddr.
copy_to_user_page() should only change the contents of area->page, so memcpy should be fine. But I am not sure that flush_icache_user_range() or flush_ptrace_access() is always safe on every arch if "struct page *page" doesn't match vma.
Oleg.
From: Oleg Nesterov oleg@redhat.com Date: Fri, 11 Apr 2014 19:38:53 +0200
And perhaps the patch is not complete. "if (vma)" is not enough, a probed task can mmap something else at this vaddr.
copy_to_user_page() should only change the contents of area->page, so memcpy should be fine. But I am not sure that flush_icache_user_range() or flush_ptrace_access() is always safe on every arch if "struct page *page" doesn't match vma.
The architectures want the VMA for two reasons:
1) To get at the 'mm'. The 'mm' is absolutely essential so that we can look at the MM cpumask and therefore determine what cpus this address space has executed upon, and therefore what cpus need the flush broadcast to.
2) To determine if the VMA is executable, in order to avoid the I-cache flush if possible.
I think you can get at the 'mm' trivially in this uprobes path, and we can just as well assume that the VMA is executable since this thing is always writing instructions.
So we could create a __copy_to_user_page() that takes an 'mm' and a boolean 'executable' which uprobes could unconditionally set true, and copy_to_user_page() would then be implemented in terms of __copy_to_user_page().
On 04/11, David Miller wrote:
From: Oleg Nesterov oleg@redhat.com Date: Fri, 11 Apr 2014 19:38:53 +0200
And perhaps the patch is not complete. "if (vma)" is not enough, a probed task can mmap something else at this vaddr.
copy_to_user_page() should only change the contents of area->page, so memcpy should be fine. But I am not sure that flush_icache_user_range() or flush_ptrace_access() is always safe on every arch if "struct page *page" doesn't match vma.
The architectures want the VMA for two reasons:
To get at the 'mm'. The 'mm' is absolutely essential so that we can look at the MM cpumask and therefore determine what cpus this address space has executed upon, and therefore what cpus need the flush broadcast to.
To determine if the VMA is executable, in order to avoid the I-cache flush if possible.
Yes, thanks, this is clear.
I think you can get at the 'mm' trivially in this uprobes path,
sure, it is always current->mm.
and we can just as well assume that the VMA is executable since this thing is always writing instructions.
yes.
So we could create a __copy_to_user_page() that takes an 'mm' and a boolean 'executable' which uprobes could unconditionally set true, and copy_to_user_page() would then be implemented in terms of __copy_to_user_page().
This needs a lot of per-arch changes. Plus, it seems, in general VM_EXEC is not the only thing __copy_to_user_page() should take into account...
But at least we are starting to agree that we need something else ;)
Oleg.
On Fri, Apr 11, 2014 at 10:24 AM, Oleg Nesterov oleg@redhat.com wrote:
+static void arch_uprobe_copy_ixol(struct xol_area *area, unsigned long vaddr,
struct arch_uprobe *auprobe)
+{ +#ifndef ARCH_UPROBE_XXX
copy_to_page(area->page, vaddr, &auprobe->ixol, sizeof(&auprobe->ixol));
/*
* We probably need flush_icache_user_range() but it needs vma.
* If this doesn't work define ARCH_UPROBE_XXX.
*/
flush_dcache_page(area->page);
+#else
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
down_read(&mm->mmap_sem);
vma = find_exact_vma(mm, area->vaddr, area->vaddr + PAGE_SIZE);
if (vma) {
void *kaddr = kmap_atomic(area->page);
copy_to_user_page(vma, area->page,
vaddr, kaddr + (vaddr & ~PAGE_MASK),
&auprobe->ixol, sizeof(&auprobe->ixol));
kunmap_atomic(kaddr);
}
up_read(&mm->mmap_sem);
+#endif
Yeah, no, this is wrong.
the fact is, the *only* possible use for the whole "vma" argument is the "can this be executable" optimization, afaik.
So I really think we should just have a fixed "flush_icache_page(page,vaddr)" function. Maybe add a "len" argument, to allow architectures that have to loop over cachelines to do just a minimal loop.
Then, to do the vma optimization, let's introduce a new
arch_needs_icache_flush(vma, page)
function, which on cache coherent architectures can just be zero (or one, since the icache flush itself will be a no-op, so it doesn't really matter), and on others it can do the "have we executed from this page", which may involve just looking at the vma..
Then the uprobe case can just do
copy_to_page() flush_dcache_page() flush_icache_page()
and be done with it.
Hmm?
Linus
From: Linus Torvalds torvalds@linux-foundation.org Date: Fri, 11 Apr 2014 10:50:31 -0700
So I really think we should just have a fixed "flush_icache_page(page,vaddr)" function. Maybe add a "len" argument, to allow architectures that have to loop over cachelines to do just a minimal loop.
It's not enough, we need to have the 'mm' so we can know what cpu's this address space has executed upon, and therefore what cpus need the broadcast flush.
See my other reply, we can just make a __copy_to_user_page() that takes 'mm' and a boolean 'executable' which uprobes can unconditionally pass as true.
On Fri, Apr 11, 2014 at 11:02 AM, David Miller davem@davemloft.net wrote:
It's not enough, we need to have the 'mm' so we can know what cpu's this address space has executed upon, and therefore what cpus need the broadcast flush.
Ok. But still, it shouldn't need "vma".
See my other reply, we can just make a __copy_to_user_page() that takes 'mm' and a boolean 'executable' which uprobes can unconditionally pass as true.
Sure, that doesn't look disgusting. That said, I thought at least one architecture (powerpc) did more than just check the executable bit: I think somebody actually does a page-per-page "has this been mapped executably" thing because their icache flush is *so* expensive. So that boolean "executable" bit is potentially architecture-specific.
And quite frankly, using the "vma->vm_flags" sounds potentially *incorrect* to me, since it really isn't about the vma. If you change a page through a non-executable vma, you'd want to flush the icache entry for that page mapped in a totally different vma. So I really get the feeling that passing in "vma" is actively *wrong*. The vma interface really makes little to no sense.
Hmm?
Linus
From: Linus Torvalds torvalds@linux-foundation.org Date: Fri, 11 Apr 2014 11:11:33 -0700
And quite frankly, using the "vma->vm_flags" sounds potentially *incorrect* to me, since it really isn't about the vma. If you change a page through a non-executable vma, you'd want to flush the icache entry for that page mapped in a totally different vma. So I really get the feeling that passing in "vma" is actively *wrong*. The vma interface really makes little to no sense.
Hmm?
The vm_flags check is about "could it have gotten into the I-cache via this VMA".
If the VMA protections change, we'd do a flush of some sort during that change.
On Fri, Apr 11, 2014 at 11:19 AM, David Miller davem@davemloft.net wrote:
The vm_flags check is about "could it have gotten into the I-cache via this VMA".
.. and that's obviously complete bullshit and wrong. Which is my point.
Now, it's possible that doing things right is just too much work for architectures that don't even matter, but dammit, it's still wrong. If you change a page, and it's executably mapped into some other vma, the icache is possibly stale there. The whole _point_ of our cache flushing is to make caches coherent, and anything that uses "vma" to do so is *wrong*.
So your argument makes no sense. You're just re-stating that "it's wrong", but you're re-stating it in a way that makes it sounds like it could be right.
The "this page has been mapped executably" approach, in contrast, is *correct*. It has a chance in hell of actually making caches coherent.
Linus
From: Linus Torvalds torvalds@linux-foundation.org Date: Fri, 11 Apr 2014 11:24:58 -0700
On Fri, Apr 11, 2014 at 11:19 AM, David Miller davem@davemloft.net wrote:
The vm_flags check is about "could it have gotten into the I-cache via this VMA".
.. and that's obviously complete bullshit and wrong. Which is my point.
Now, it's possible that doing things right is just too much work for architectures that don't even matter, but dammit, it's still wrong. If you change a page, and it's executably mapped into some other vma, the icache is possibly stale there. The whole _point_ of our cache flushing is to make caches coherent, and anything that uses "vma" to do so is *wrong*.
So your argument makes no sense. You're just re-stating that "it's wrong", but you're re-stating it in a way that makes it sounds like it could be right.
The "this page has been mapped executably" approach, in contrast, is *correct*. It has a chance in hell of actually making caches coherent.
You're right that using VMA as a hint during ptrace accesses is bogus. If it's writeable, shared, and executable, we won't do the right thing.
Since we do most of the cache flushing stuff during normal operations at the PTE modification point, perhaps a piece of page state could be used to handle this. We already use such a thing for D-cache alias flushing.
On Fri, Apr 11, 2014 at 11:58 AM, David Miller davem@davemloft.net wrote:
Since we do most of the cache flushing stuff during normal operations at the PTE modification point, perhaps a piece of page state could be used to handle this. We already use such a thing for D-cache alias flushing.
So looking at the powerpc code, I thought ppc already did this, but it seems to do something different: it lazily does the icache flush at page fault time if the page has been marked by dcache flush (with the PG_arch_1 bit indicating whether the page is coherent in the I$).
But I don't see it trying to actually flush the icache of already mapped processes when modifying the dcache.
So while we *could* do that, apparently no architecture does this. Even the one architecture that I thought did it doesn'r really try to make things globally coherent.
(My "analysis" was mainly using "git grep", so maybe I missed something).
Linus
On 11 April 2014 11:02, David Miller davem@davemloft.net wrote:
From: Linus Torvalds torvalds@linux-foundation.org Date: Fri, 11 Apr 2014 10:50:31 -0700
So I really think we should just have a fixed "flush_icache_page(page,vaddr)" function. Maybe add a "len" argument, to allow architectures that have to loop over cachelines to do just a minimal loop.
It's not enough, we need to have the 'mm' so we can know what cpu's this address space has executed upon, and therefore what cpus need the broadcast flush.
But in uprobes case xol slot where instruction write happened will be used only by current CPU. The way I read uprobes code other core when it hit the same uprobe address will use different xol slot. Xol slot size is cache line so it will not be moved around. So as long as we know for sure that while tasks performs single step on uprobe xol area instruction it won't be migrated to another core we don't need to do broadcast to any other cores.
Thanks, Victor
See my other reply, we can just make a __copy_to_user_page() that takes 'mm' and a boolean 'executable' which uprobes can unconditionally pass as true.
On 04/11, Victor Kamensky wrote:
On 11 April 2014 11:02, David Miller davem@davemloft.net wrote: But in uprobes case xol slot where instruction write happened will be used only by current CPU. The way I read uprobes code other core when it hit the same uprobe address will use different xol slot. Xol slot size is cache line so it will not be moved around.
Yes.
So as long as we know for sure that while tasks performs single step on uprobe xol area instruction it won't be migrated to another core we don't need to do broadcast to any other cores.
It can migrate to another CPU before it does single-step.
Oleg.
On 04/11, Linus Torvalds wrote:
On Fri, Apr 11, 2014 at 10:24 AM, Oleg Nesterov oleg@redhat.com wrote:
+static void arch_uprobe_copy_ixol(struct xol_area *area, unsigned long vaddr,
struct arch_uprobe *auprobe)
+{ +#ifndef ARCH_UPROBE_XXX
copy_to_page(area->page, vaddr, &auprobe->ixol, sizeof(&auprobe->ixol));
/*
* We probably need flush_icache_user_range() but it needs vma.
* If this doesn't work define ARCH_UPROBE_XXX.
*/
flush_dcache_page(area->page);
+#else
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
down_read(&mm->mmap_sem);
vma = find_exact_vma(mm, area->vaddr, area->vaddr + PAGE_SIZE);
if (vma) {
void *kaddr = kmap_atomic(area->page);
copy_to_user_page(vma, area->page,
vaddr, kaddr + (vaddr & ~PAGE_MASK),
&auprobe->ixol, sizeof(&auprobe->ixol));
kunmap_atomic(kaddr);
}
up_read(&mm->mmap_sem);
+#endif
Yeah, no, this is wrong.
Yesss, agreed.
So I really think we should just have a fixed "flush_icache_page(page,vaddr)" function. ... Then the uprobe case can just do
copy_to_page() flush_dcache_page() flush_icache_page()
And I obviously like this idea because (iiuc) it more or less matches flush_icache_page_xxx() I tried to suggest.
But we need a short term solution for arm. And unless I misunderstood Russell (this is quite possible), arm needs to disable preemption around copy + flush.
Russel, so what do you think we can do for arm right now? Does the patch above (and subsequent discussion) answer the "why reinvent" question ?
Oleg.
On 14 April 2014 11:59, Oleg Nesterov oleg@redhat.com wrote:
On 04/11, Linus Torvalds wrote:
On Fri, Apr 11, 2014 at 10:24 AM, Oleg Nesterov oleg@redhat.com wrote:
+static void arch_uprobe_copy_ixol(struct xol_area *area, unsigned long vaddr,
struct arch_uprobe *auprobe)
+{ +#ifndef ARCH_UPROBE_XXX
copy_to_page(area->page, vaddr, &auprobe->ixol, sizeof(&auprobe->ixol));
/*
* We probably need flush_icache_user_range() but it needs vma.
* If this doesn't work define ARCH_UPROBE_XXX.
*/
flush_dcache_page(area->page);
+#else
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
down_read(&mm->mmap_sem);
vma = find_exact_vma(mm, area->vaddr, area->vaddr + PAGE_SIZE);
if (vma) {
void *kaddr = kmap_atomic(area->page);
copy_to_user_page(vma, area->page,
vaddr, kaddr + (vaddr & ~PAGE_MASK),
&auprobe->ixol, sizeof(&auprobe->ixol));
kunmap_atomic(kaddr);
}
up_read(&mm->mmap_sem);
+#endif
Yeah, no, this is wrong.
Yesss, agreed.
So I really think we should just have a fixed "flush_icache_page(page,vaddr)" function. ... Then the uprobe case can just do
copy_to_page() flush_dcache_page() flush_icache_page()
And I obviously like this idea because (iiuc) it more or less matches flush_icache_page_xxx() I tried to suggest.
Would not page granularity to be too expensive? Note you need to do that on each probe hit and you flushing whole data and instruction page every time. IMHO it will work correctly when you flush just few dcache/icache lines that correspond to xol slot that got modified. Note copy_to_user_page takes len that describes size of area that has to be flushed. Given that we are flushing xol area page at this case; and nothing except one xol slot is any interest for current task, and if CPU can flush one dcache and icache page as quickly as it can flush range, my remark may not matter.
I personally would prefer if we could have function like copy_to_user_page but without requirement to pass vma to it.
Thanks, Victor
But we need a short term solution for arm. And unless I misunderstood Russell (this is quite possible), arm needs to disable preemption around copy + flush.
Russel, so what do you think we can do for arm right now? Does the patch above (and subsequent discussion) answer the "why reinvent" question ?
Oleg.
On 14 April 2014 13:05, Victor Kamensky victor.kamensky@linaro.org wrote:
On 14 April 2014 11:59, Oleg Nesterov oleg@redhat.com wrote:
On 04/11, Linus Torvalds wrote:
On Fri, Apr 11, 2014 at 10:24 AM, Oleg Nesterov oleg@redhat.com wrote:
+static void arch_uprobe_copy_ixol(struct xol_area *area, unsigned long vaddr,
struct arch_uprobe *auprobe)
+{ +#ifndef ARCH_UPROBE_XXX
copy_to_page(area->page, vaddr, &auprobe->ixol, sizeof(&auprobe->ixol));
/*
* We probably need flush_icache_user_range() but it needs vma.
* If this doesn't work define ARCH_UPROBE_XXX.
*/
flush_dcache_page(area->page);
+#else
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
down_read(&mm->mmap_sem);
vma = find_exact_vma(mm, area->vaddr, area->vaddr + PAGE_SIZE);
if (vma) {
void *kaddr = kmap_atomic(area->page);
copy_to_user_page(vma, area->page,
vaddr, kaddr + (vaddr & ~PAGE_MASK),
&auprobe->ixol, sizeof(&auprobe->ixol));
kunmap_atomic(kaddr);
}
up_read(&mm->mmap_sem);
+#endif
Yeah, no, this is wrong.
Yesss, agreed.
So I really think we should just have a fixed "flush_icache_page(page,vaddr)" function. ... Then the uprobe case can just do
copy_to_page() flush_dcache_page() flush_icache_page()
And I obviously like this idea because (iiuc) it more or less matches flush_icache_page_xxx() I tried to suggest.
Would not page granularity to be too expensive? Note you need to do that on each probe hit and you flushing whole data and instruction page every time. IMHO it will work correctly when you flush just few dcache/icache lines that correspond to xol slot that got modified. Note copy_to_user_page takes len that describes size of area that has to be flushed. Given that we are flushing xol area page at this case; and nothing except one xol slot is any interest for current task, and if CPU can flush one dcache and icache page as quickly as it can flush range, my remark may not matter.
I personally would prefer if we could have function like copy_to_user_page but without requirement to pass vma to it.
I was trying to collect some experimental data around this discussion. I did not find anything super surprising and I am not sure how it would matter, but since I collected it already, I will just share anyway. The result covers only one architecture so they should be taken with grain of salt.
Test was conducted on ARM h/w. Arndale with Exynos 5250 2 cores CPU and Pandaboard ES with OMAP 4460 were tested.
The uporbes/systemtap test was arranged in the following way. SystemTap module was counting number of times functions was called. The SystemTap action is very simple, counter increment, is close to noop operation that allows to see tracing overhead.
Traced user-land function had approximately 8000 instructions (unoptimized empty loop of 1000 iterations, with each interaction is 8 instructions). That function was constantly called in the loop 1 million times, and that interval was timed. SystemTap/uprobes testing was enabled and it was observed how targeted user-land execution time changed.
Test scenarios and variations -----------------------------
Here is scenarios where measurements took place:
vanilla - no tracing, 1 million calls of function that executes 8000 instructions
Oleg's fix - Oleg's fix proposed on [1]. Basically it uses copy_to_user_page and it does dynamic look-up of xol area vma every tracing
my arm specific fix - this one was proposed on as [2]. It is close to discussed possible solution if we could have something similar to copy_to_user_page function but which does not required vma. My code tried to shared ARM backend of copy_to_user_page and xol access flush function
Oleg's fix + forced broadcast - one of concerns I had is situation where smp function call broadcast has to be happen to flush icache. On both of my board that was not needed. So to simulated such situation I've changed code ARM backend of copy_to_user_page to do smp_call_function(flush_ptrace_access_other
Tested application had two possible dimensions:
1) number of threads that runs loop over traced function to see how tracing would cope with multicore, default is only one thread, but test could run another loop on second core.
2) number of mapping in target process, target process could have 1000 files mapping to create bunch of vmas. It is to test how much dynamic look-up of xol area vma matters.
Results -------
Number shown in the table is time in microseconds to execute tested function with and without tracing presence.
Please note well tracing overhead include all related to tracing pieces, not only cache flush that is under discussion. Those pieces will be: taking arch exception layer; uprobes layer; uprobes arch specific layer (before/after); xol look-up, update and cache flush; uprobes single stepping logic, systemtap module callback that generated for .stp tracing script, etc
Arndale Pandaboard ES
vanilla 5.0 (100%) 11.5 (100%)
Oleg's fix 9.8 (196%) 28.1 (244%)
Oleg's fix 10.0 (200%) 28.7 (250%) + 1000 mappings
Victor's fix 9.4 (188%) 26.4 (230%)
Oleg's fix 13.7 (274%) 39.8 (346%) + broadcast 1 thread
Oleg's fix 14.1 (282%) 41.6 (361%) + broadcast 2 threads
Observations ------------
x) generally uprobes tracing is a bit expensive 1 trace roughly would cost around 10000 instructions/cycles
x) the way how cache is flushed matters somewhat, but because of big overall tracing overhead those differences may not matter
x) looking at 'Oleg's fix' vs 'Oleg's fix + 1000 mappings' shows that vma look-up noticeable but the difference is marginal. No surprise here: rb tree search works fast.
x) need to broadcast icache flush has most noticeable impact. I am not sure how essential is that. Both tested platforms Exynos 5250 and OMAP 4460 did not need that operation. I am not sure what CPU would really have this issue ...
x) fix that I did for ARM that shares ARM code with copy_to_user_page but does not need vma performs best. Essentially it differs from 'Oleg's fix' that no vma lookup at all. But I have to admit resulting code looks a bit ugly. I wonder whether the gain matters enough ... maybe not.
Dynamic xol slots vs cached to uprobe xol slots -----------------------------------------------
This section could be a bit off topic, but introduce interesting data point.
When I looked at current uprobes single step out line code and compared it with code that was in the past (utrace times) I noticed main essential difference how xol slots are handled: Currently for each hit uprobes code allocate xol slot and needs dcache/icache flush. But in the past xol slot was attached/cached to uprobe entry and if there were enough xol slots dcache/icache flush would happen only once and latter tracing would not need it. For cases where there were not enough xol slots lru algorithm was used to rotate xol slots vs uprobes.
Previous uprobes mechanism was more immune to cost of modifying instruction stream, because modifying instruction stream was one time operation and after that under normal circumstances traced apps did not touch instructions at all. I am quite sure that semi-static xol slot allocation scheme had it is own set of issues.
I've tried to hack cached xol slot scheme and measure time difference it brings.
Arndale
hack static 8.9 (188%) xol slot
8.9 microseconds gives idea about all other overheads during uprobes tracing except of xol allocation and icache/dcache flush. I.e cost of dynamically allocating xol slot, dcache/icache flush and impact of cache flush on application is around 0.5 and 1.1 microsecond as long as no cache operations broadcast is involved. I.e cost is not that big, as long as modern CPU that does not need cache flush broadcasts, dynamic xol scheme looks OK to me.
Raw Results and Test Source Code --------------------------------
I don't publish my test source code and raw results here because it is quite big. Raw results were collected with target test running under perf, so it could be seen how different schemes affect cache and tlb misses. If anyone interested in source and raw data please let me know I will post it here.
Thanks, Victor
[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-April/246595.html
[2] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-April/245743.html
Thanks, Victor
But we need a short term solution for arm. And unless I misunderstood Russell (this is quite possible), arm needs to disable preemption around copy + flush.
Russel, so what do you think we can do for arm right now? Does the patch above (and subsequent discussion) answer the "why reinvent" question ?
Oleg.
On 04/14, Victor Kamensky wrote:
Oleg's fix - Oleg's fix proposed on [1]. Basically it uses copy_to_user_page and it does dynamic look-up of xol area vma every tracing
I guess I was not clear. No, I didn't really try to propose this change, I do not like it ;)
I showed this hack in reply to multiple and persistent requests to reuse the ptrace solution we already have.
my arm specific fix - this one was proposed on as [2].
I didn't even try to read the changes in arm/, I can't understand them anyway. I leave this to you and Russel.
But, once again, please do not add arch_uprobe_flush_xol_access(), add arch_uprobe_copy_ixol().
x) fix that I did for ARM that shares ARM code with copy_to_user_page but does not need vma performs best.
The patch which adds copy_to_user_page(vma => NULL) into copy_to_page() ? Please see the comments in my previous email.
When I looked at current uprobes single step out line code and compared it with code that was in the past (utrace times) I noticed main essential difference how xol slots are handled: Currently for each hit uprobes code allocate xol slot and needs dcache/icache flush. But in the past xol slot was attached/cached to uprobe entry
Can't comment, I am not familiar with the old implementation.
But yes, the current implementation is not perfect. Once again, it would be nice to remove this vma. Even if this is not possible, we can try to share this memory. We do not even need lru, we can make it "per cpu" and avoid the broadcasts. On x86 this is simple, we have __switch_to_xtra() which can re-copy ->ixol[] and do flush_icache_range() if UTASK_SSTEP. Not sure this is possible on arm and other arch'es. But lets not discuss this right now, this is a bit off-topic currently.
Oleg.
On 04/14, Victor Kamensky wrote:
On 14 April 2014 11:59, Oleg Nesterov oleg@redhat.com wrote:
On 04/11, Linus Torvalds wrote:
So I really think we should just have a fixed "flush_icache_page(page,vaddr)" function. ... Then the uprobe case can just do
copy_to_page() flush_dcache_page() flush_icache_page()
And I obviously like this idea because (iiuc) it more or less matches flush_icache_page_xxx() I tried to suggest.
Would not page granularity to be too expensive? Note you need to do that on each probe hit and you flushing whole data and instruction page every time. IMHO it will work correctly when you flush just few dcache/icache lines that correspond to xol slot that got modified. Note copy_to_user_page takes len that describes size of area that has to be flushed. Given that we are flushing xol area page at this case; and nothing except one xol slot is any interest for current task, and if CPU can flush one dcache and icache page as quickly as it can flush range, my remark may not matter.
We can add "vaddr, len" to the argument list.
I personally would prefer if we could have function like copy_to_user_page but without requirement to pass vma to it.
I won't argue, but you need to convince maintainers.
And to remind, we need something simple/nonintrusive for arm right now. Again, I won't argue if we turn copy_to_page() + flush_dcache_page() into __weak arch_uprobe_copy_ixol(), and add the necessary hacks into arm's implementatiion. This is up to you and Russel.
But. Please do not add copy_to_user_page() into copy_to_page() (as your patch did). This is certainly not what uprobe_write_opcode() wants, we do not want or need "flush" in this case. The same for __create_xol_area().
Note also that currently copy_to_user_page() is always called under mmap_sem. I do not know if arm actually needs ->mmap_sem, but if you propose it as a generic solution we should probably take this lock.
Also. Even if we have copy_to_user_page_no_vma() or change copy_to_user_page() to accept vma => NULL, I am not sure this will work fine on arm when the probed application unmaps xol_area and mmaps something else at the same vaddr. I mean, in this case we do not care if the application crashes, but please verify that something really bad can't happen.
Let me repeat just in case, I know nothing about arm/. I can't even understand how, say, flush_pfn_alias() works, and how it should work if 2 threads call it at the same time with the same vaddr (or CACHE_COLOUR(vaddr)).
Oleg.
On 15 April 2014 08:46, Oleg Nesterov oleg@redhat.com wrote:
On 04/14, Victor Kamensky wrote:
On 14 April 2014 11:59, Oleg Nesterov oleg@redhat.com wrote:
On 04/11, Linus Torvalds wrote:
So I really think we should just have a fixed "flush_icache_page(page,vaddr)" function. ... Then the uprobe case can just do
copy_to_page() flush_dcache_page() flush_icache_page()
And I obviously like this idea because (iiuc) it more or less matches flush_icache_page_xxx() I tried to suggest.
Would not page granularity to be too expensive? Note you need to do that on each probe hit and you flushing whole data and instruction page every time. IMHO it will work correctly when you flush just few dcache/icache lines that correspond to xol slot that got modified. Note copy_to_user_page takes len that describes size of area that has to be flushed. Given that we are flushing xol area page at this case; and nothing except one xol slot is any interest for current task, and if CPU can flush one dcache and icache page as quickly as it can flush range, my remark may not matter.
We can add "vaddr, len" to the argument list.
I personally would prefer if we could have function like copy_to_user_page but without requirement to pass vma to it.
I won't argue, but you need to convince maintainers.
And to remind, we need something simple/nonintrusive for arm right now. Again, I won't argue if we turn copy_to_page() + flush_dcache_page() into __weak arch_uprobe_copy_ixol(), and add the necessary hacks into arm's implementatiion. This is up to you and Russel.
For short term arm specific solution I will follow up on [1]. Yes, I will incorporate your request to make arch_uprobe_copy_ixol() instead of arch_uprobe_flush_xol_access, will address Dave Long's comments about checkpatch and will remove special handling for broadcast situation (FLAG_UA_BROADCAST) since in further discussion it was established that task can migrate while doing uprobes xol single stepping.
I don't think my patch does those things that you describe below. Anyway, I will repost new version of short term arm specific fix proposal today PST and we will see what Russell, you and all will say about it.
[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-April/245952.html http://lists.infradead.org/pipermail/linux-arm-kernel/2014-April/245743.html
Thanks, Victor
But. Please do not add copy_to_user_page() into copy_to_page() (as your patch did). This is certainly not what uprobe_write_opcode() wants, we do not want or need "flush" in this case. The same for __create_xol_area().
Note also that currently copy_to_user_page() is always called under mmap_sem. I do not know if arm actually needs ->mmap_sem, but if you propose it as a generic solution we should probably take this lock.
Also. Even if we have copy_to_user_page_no_vma() or change copy_to_user_page() to accept vma => NULL, I am not sure this will work fine on arm when the probed application unmaps xol_area and mmaps something else at the same vaddr. I mean, in this case we do not care if the application crashes, but please verify that something really bad can't happen.
Let me repeat just in case, I know nothing about arm/. I can't even understand how, say, flush_pfn_alias() works, and how it should work if 2 threads call it at the same time with the same vaddr (or CACHE_COLOUR(vaddr)).
Oleg.
On 04/15/14 11:46, Oleg Nesterov wrote:
But. Please do not add copy_to_user_page() into copy_to_page() (as your patch did). This is certainly not what uprobe_write_opcode() wants, we do not want or need "flush" in this case. The same for __create_xol_area().
It looked me like a call to a new __copy_to_user_page(current->mm, ...) in xol_get_insn_slot() would be in line with David Miller's suggestion and would cure the problem on ARM (and hopefuly be more philosophically correct for all architectures):
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 04709b6..b418626 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1287,6 +1287,7 @@ static unsigned long xol_get_insn_slot(struct uprobe *uprobe) { struct xol_area *area; unsigned long xol_vaddr; + void *kaddr;
area = get_xol_area(); if (!area) @@ -1297,13 +1298,11 @@ static unsigned long xol_get_insn_slot(struct uprobe *uprobe) return 0;
/* Initialize the slot */ - copy_to_page(area->page, xol_vaddr, - &uprobe->arch.ixol, sizeof(uprobe->arch.ixol)); - /* - * We probably need flush_icache_user_range() but it needs vma. - * This should work on supported architectures too. - */ - flush_dcache_page(area->page); + kaddr = kmap_atomic(area->page); + __copy_to_user_page(current->mm, area->page, xol_vaddr, + kaddr + (xol_vaddr & ~PAGE_MASK), + &uprobe->arch.ixol, sizeof(uprobe->arch.ixol), true); + kunmap_atomic(kaddr);
return xol_vaddr; }
Opinions? It's possible this approach isn't good enough. Cache operations and VM are not my strong suit.
-dl
From: David Long dave.long@linaro.org Date: Tue, 15 Apr 2014 13:19:27 -0400
On 04/15/14 11:46, Oleg Nesterov wrote:
But. Please do not add copy_to_user_page() into copy_to_page() (as your patch did). This is certainly not what uprobe_write_opcode() wants, we do not want or need "flush" in this case. The same for __create_xol_area().
It looked me like a call to a new __copy_to_user_page(current->mm, ...) in xol_get_insn_slot() would be in line with David Miller's suggestion and would cure the problem on ARM (and hopefuly be more philosophically correct for all architectures):
It occurs to me that because of the specific environment in which this executes we can make the interface take advantage of the invariants at this call site:
1) We are copying into userspace and thus current->mm
2) We are only storing an instruction or two
So it would be just like the dynamic linker lazy resolving a PLT slot in userland.
Furthermore, we can do the stores using something akin to put_user(), directly store them into userspace.
This avoids completely any D-cache aliasing issues. The kernel stores to the same address, and therefore the same cache lines, as userspace would.
So for example, since even on the most braindamaged sparc64 cpus the remote cpus will have their I-cache snoop the store we do locally we just need to do a simple flush on the local cpu where this store is happening. So I'd like to do this something like:
stw %r1, [%r2 + 0x0] flush %r2
PowerPC could probably do something similar.
The copy_to_user_page() interface has to deal with storing into a non-current 'mm' and therefore via the kernel side copy of the page and that's why the D-cache aliasing issues are so painful to deal with.
Just a thought...
On 04/15, David Miller wrote:
From: David Long dave.long@linaro.org Date: Tue, 15 Apr 2014 13:19:27 -0400
On 04/15/14 11:46, Oleg Nesterov wrote:
But. Please do not add copy_to_user_page() into copy_to_page() (as your patch did). This is certainly not what uprobe_write_opcode() wants, we do not want or need "flush" in this case. The same for __create_xol_area().
It looked me like a call to a new __copy_to_user_page(current->mm, ...) in xol_get_insn_slot() would be in line with David Miller's suggestion and would cure the problem on ARM (and hopefuly be more philosophically correct for all architectures):
It occurs to me that because of the specific environment in which this executes we can make the interface take advantage of the invariants at this call site:
We are copying into userspace and thus current->mm
We are only storing an instruction or two
Not sure this can help, but also
3) nobody else will even read/execute this peace of memory
Furthermore, we can do the stores using something akin to put_user(), directly store them into userspace.
This avoids completely any D-cache aliasing issues. The kernel stores to the same address, and therefore the same cache lines, as userspace would.
Heh. Of course I thought about this. This can not work afaics.
We do not want to write to, say, page cache if the probed application mmaps a file at the same vaddr.
Oleg.
From: Oleg Nesterov oleg@redhat.com Date: Tue, 15 Apr 2014 19:49:00 +0200
We do not want to write to, say, page cache if the probed application mmaps a file at the same vaddr.
If user has write access to that page... not our problem.
And if he doesn't we'll fault.
On 04/15, David Miller wrote:
From: Oleg Nesterov oleg@redhat.com Date: Tue, 15 Apr 2014 19:49:00 +0200
We do not want to write to, say, page cache if the probed application mmaps a file at the same vaddr.
If user has write access to that page... not our problem.
Well, I am not sure.
Yes, this won't allow you to write the exploit. But if this actually happens because an application is buggy, this can lead to really hard-to-debug problems.
Oleg.
From: Oleg Nesterov oleg@redhat.com Date: Tue, 15 Apr 2014 20:07:00 +0200
On 04/15, David Miller wrote:
From: Oleg Nesterov oleg@redhat.com Date: Tue, 15 Apr 2014 19:49:00 +0200
We do not want to write to, say, page cache if the probed application mmaps a file at the same vaddr.
If user has write access to that page... not our problem.
Well, I am not sure.
Yes, this won't allow you to write the exploit. But if this actually happens because an application is buggy, this can lead to really hard-to-debug problems.
And this is different from a stray userland scribble onto such a page how?
This is all so much simpler if you don't insist on doing all of this hand-holding for the user, that's not the kernel's job.
On 04/15, David Miller wrote:
From: Oleg Nesterov oleg@redhat.com Date: Tue, 15 Apr 2014 20:07:00 +0200
On 04/15, David Miller wrote:
From: Oleg Nesterov oleg@redhat.com Date: Tue, 15 Apr 2014 19:49:00 +0200
We do not want to write to, say, page cache if the probed application mmaps a file at the same vaddr.
If user has write access to that page... not our problem.
Well, I am not sure.
Yes, this won't allow you to write the exploit. But if this actually happens because an application is buggy, this can lead to really hard-to-debug problems.
And this is different from a stray userland scribble onto such a page how?
Sure. But I think that the kernel should not write to the file just because a buggy application did the extra munmap() or passed a wrong addres to mmap/munmap.
This is all so much simpler if you don't insist on doing all of this hand-holding for the user, that's not the kernel's job.
Well, I do not agree, but perhaps because I do not really understand what do you mean.
Oleg.
On 04/15, David Long wrote:
On 04/15/14 11:46, Oleg Nesterov wrote:
But. Please do not add copy_to_user_page() into copy_to_page() (as your patch did). This is certainly not what uprobe_write_opcode() wants, we do not want or need "flush" in this case. The same for __create_xol_area().
It looked me like a call to a new __copy_to_user_page(current->mm, ...) in xol_get_insn_slot() would be in line with David Miller's suggestion and would cure the problem on ARM (and hopefuly be more philosophically correct for all architectures):
David, let me repeat. I am not going to argue even if I obviously like the Linus's suggestion more.
I only argued with the suggestions to follow the __access_remote_vm() logic.
@@ -1297,13 +1298,11 @@ static unsigned long xol_get_insn_slot(struct uprobe *uprobe) return 0;
/* Initialize the slot */
copy_to_page(area->page, xol_vaddr,
&uprobe->arch.ixol, sizeof(uprobe->arch.ixol));
/*
* We probably need flush_icache_user_range() but it needs vma.
* This should work on supported architectures too.
*/
flush_dcache_page(area->page);
kaddr = kmap_atomic(area->page);
__copy_to_user_page(current->mm, area->page, xol_vaddr,
kaddr + (xol_vaddr & ~PAGE_MASK),
&uprobe->arch.ixol, sizeof(uprobe->arch.ixol), true);
kunmap_atomic(kaddr);
This all is fine, but you need to introduce __copy_to_user_page() first.
And you need to do this for every arch. And you need to verify that it does not need mmap_sem. It seems that at least alpha needs this lock. And as Russel pointed out (again, again, I could easily misunderstood him) you need preempt_disable() around memcpy + flush, so down_read(mmap_sem) should probably go into __copy_to_user_page(). And I am not sure this helper needs "struct mm_struct *mm" argument, but this is minor.
Finally, let me repeat, you should verify that this __copy_to_user_page(page, uaddr, kaddr) will not something bad if uaddr is not mmapped, or its mapping do not match area->page.
Good luck ;)
Oleg.
From: Oleg Nesterov oleg@redhat.com Date: Tue, 15 Apr 2014 19:43:30 +0200
Finally, let me repeat, you should verify that this __copy_to_user_page(page, uaddr, kaddr) will not something bad if uaddr is not mmapped, or its mapping do not match area->page.
Just directly access userspace with the usual exception mechanism we use for copy_to_user(), put_user(), et al. and if it faults you'll get -EFAULT and handle it.
This also avoids the D-cache aliasing issues entirely as I explained in my other reply.
On 04/15, David Miller wrote:
From: Oleg Nesterov oleg@redhat.com Date: Tue, 15 Apr 2014 19:43:30 +0200
Finally, let me repeat, you should verify that this __copy_to_user_page(page, uaddr, kaddr) will not something bad if uaddr is not mmapped, or its mapping do not match area->page.
Just directly access userspace with the usual exception mechanism we use for copy_to_user(), put_user(), et al. and if it faults you'll get -EFAULT and handle it.
This also avoids the D-cache aliasing issues entirely as I explained in my other reply.
Yes, yes, this is obvious.
But I have no idea what else we should do to take care of icache.
Oleg.
From: Oleg Nesterov oleg@redhat.com Date: Tue, 15 Apr 2014 20:03:11 +0200
On 04/15, David Miller wrote:
From: Oleg Nesterov oleg@redhat.com Date: Tue, 15 Apr 2014 19:43:30 +0200
Finally, let me repeat, you should verify that this __copy_to_user_page(page, uaddr, kaddr) will not something bad if uaddr is not mmapped, or its mapping do not match area->page.
Just directly access userspace with the usual exception mechanism we use for copy_to_user(), put_user(), et al. and if it faults you'll get -EFAULT and handle it.
This also avoids the D-cache aliasing issues entirely as I explained in my other reply.
Yes, yes, this is obvious.
But I have no idea what else we should do to take care of icache.
Russell, if for example userland stores into the instruction stream to patch an instruction, what needs to be done on ARM?
On Tue, Apr 15, 2014 at 02:30:54PM -0400, David Miller wrote:
Russell, if for example userland stores into the instruction stream to patch an instruction, what needs to be done on ARM?
Hi David,
I'm afraid I won't be too responsive this week, but I'll try to provide input where I can. So I'll try to cover all points in the previous discussion in this reply.
I think your suggestion that we should be writing directly to userspace from uprobes is definitely the correct way forward, as that nicely takes any d-cache aliasing issues completely out of the picture.
However, unlike your "most braindead sparc" CPU, the i-cache doesn't snoop d-cache stores at all. However, this is something that we already deal with since self-modifying code has to work, so (from userspace) we have a syscall that is used to sort that out. Internally in the kernel, this translates to:
ret = flush_cache_user_range(start, end)
This deals with whatever the CPU requires to be able to correctly execute code which has been previously written in the range - and only actions on the currently mapped userspace.
I hope this helps.
From: Russell King - ARM Linux linux@arm.linux.org.uk Date: Tue, 15 Apr 2014 19:47:26 +0100
However, unlike your "most braindead sparc" CPU, the i-cache doesn't snoop d-cache stores at all.
All that matters is remote snooping on SMP, not local snooping.
However, this is something that we already deal with since self-modifying code has to work, so (from userspace) we have a syscall that is used to sort that out. Internally in the kernel, this translates to:
ret = flush_cache_user_range(start, end)
This deals with whatever the CPU requires to be able to correctly execute code which has been previously written in the range - and only actions on the currently mapped userspace.
Looking around it seems the I-cache line mcr operation should do the right thing for most chips.
You could simply make a new cpuc op for writing an instruction or two to userspace and doing the I-cache line mcr op afterwards.
From: David Miller davem@davemloft.net Date: Tue, 15 Apr 2014 14:30:54 -0400 (EDT)
Russell, if for example userland stores into the instruction stream to patch an instruction, what needs to be done on ARM?
Looking around I suspect something like:
mcrne p15, 0, INSN_ADDR, c7, c5, 1
after the instruction stores will do it.
On Tue, Apr 15, 2014 at 02:50:06PM -0400, David Miller wrote:
From: David Miller davem@davemloft.net Date: Tue, 15 Apr 2014 14:30:54 -0400 (EDT)
Russell, if for example userland stores into the instruction stream to patch an instruction, what needs to be done on ARM?
Looking around I suspect something like:
mcrne p15, 0, INSN_ADDR, c7, c5, 1
after the instruction stores will do it.
It does still need to be pushed out of the D-cache first though. So, for ARMv7 for example:
str NEW_INSN, [INSN_ADDR] @ store new instruction mcr p15, 0, INSN_ADDR, c7, c11, 1 @ clean d line mcr p15, 0, INSN_ADDR, c7, c5, 1 @ flush i line
would do it. We of course need the user access marking on that (so that any fault doesn't oops the kernel) - not only for the store, but also the following two instructions which could fault (and oops unless they're marked with a fixup) if someone were to munmap() this page in another thread. All those fixups can just do the "lets return -EFAULT" from the operation.
From: Russell King - ARM Linux linux@arm.linux.org.uk Date: Tue, 15 Apr 2014 20:29:15 +0100
On Tue, Apr 15, 2014 at 02:50:06PM -0400, David Miller wrote:
From: David Miller davem@davemloft.net Date: Tue, 15 Apr 2014 14:30:54 -0400 (EDT)
Russell, if for example userland stores into the instruction stream to patch an instruction, what needs to be done on ARM?
Looking around I suspect something like:
mcrne p15, 0, INSN_ADDR, c7, c5, 1
after the instruction stores will do it.
It does still need to be pushed out of the D-cache first though. So, for ARMv7 for example:
str NEW_INSN, [INSN_ADDR] @ store new instruction mcr p15, 0, INSN_ADDR, c7, c11, 1 @ clean d line mcr p15, 0, INSN_ADDR, c7, c5, 1 @ flush i line
would do it. We of course need the user access marking on that (so that any fault doesn't oops the kernel) - not only for the store, but also the following two instructions which could fault (and oops unless they're marked with a fixup) if someone were to munmap() this page in another thread. All those fixups can just do the "lets return -EFAULT" from the operation.
Right, you'd need the exception table business used in put_user() et al.
On 04/15/14 13:43, Oleg Nesterov wrote:
On 04/15, David Long wrote:
On 04/15/14 11:46, Oleg Nesterov wrote:
But. Please do not add copy_to_user_page() into copy_to_page() (as your patch did). This is certainly not what uprobe_write_opcode() wants, we do not want or need "flush" in this case. The same for __create_xol_area().
It looked me like a call to a new __copy_to_user_page(current->mm, ...) in xol_get_insn_slot() would be in line with David Miller's suggestion and would cure the problem on ARM (and hopefuly be more philosophically correct for all architectures):
David, let me repeat. I am not going to argue even if I obviously like the Linus's suggestion more.
I only argued with the suggestions to follow the __access_remote_vm() logic.
@@ -1297,13 +1298,11 @@ static unsigned long xol_get_insn_slot(struct uprobe *uprobe) return 0;
/* Initialize the slot */
copy_to_page(area->page, xol_vaddr,
&uprobe->arch.ixol, sizeof(uprobe->arch.ixol));
/*
* We probably need flush_icache_user_range() but it needs vma.
* This should work on supported architectures too.
*/
flush_dcache_page(area->page);
kaddr = kmap_atomic(area->page);
__copy_to_user_page(current->mm, area->page, xol_vaddr,
kaddr + (xol_vaddr & ~PAGE_MASK),
&uprobe->arch.ixol, sizeof(uprobe->arch.ixol), true);
kunmap_atomic(kaddr);
This all is fine, but you need to introduce __copy_to_user_page() first.
Yes, I have that coded for 14 out of 24 architectures (the easy ones). The remaining ones need more work. Some may prove problematic. The whole approach could yet prove impractical. More recent suggested approaches may be better too (or Linus's relatively simple change).
And you need to do this for every arch. And you need to verify that it does not need mmap_sem. It seems that at least alpha needs this lock. And as Russel pointed out (again, again, I could easily misunderstood him) you need preempt_disable() around memcpy + flush, so down_read(mmap_sem) should probably go into __copy_to_user_page(). And I am not sure this helper needs "struct mm_struct *mm" argument, but this is minor.
I will look at the locking. I'm not sure yet if anything needs struct mm_struct either.
Finally, let me repeat, you should verify that this __copy_to_user_page(page, uaddr, kaddr) will not something bad if uaddr is not mmapped, or its mapping do not match area->page.
Good luck ;)
Oleg.
From: David Long dave.long@linaro.org Date: Tue, 15 Apr 2014 15:39:18 -0400
Yes, I have that coded for 14 out of 24 architectures (the easy ones). The remaining ones need more work. Some may prove problematic. The whole approach could yet prove impractical. More recent suggested approaches may be better too (or Linus's relatively simple change).
BTW, another reason perhaps to write directly to userspace and have a uprobes_*() specific interface is that you'll only need to do 3 architectures, the ones that support uprobes.
That way, this is just part of what every arch needs to implement in order to support uprobes.
On 15 April 2014 12:53, David Miller davem@davemloft.net wrote:
From: David Long dave.long@linaro.org Date: Tue, 15 Apr 2014 15:39:18 -0400
Yes, I have that coded for 14 out of 24 architectures (the easy ones). The remaining ones need more work. Some may prove problematic. The whole approach could yet prove impractical. More recent suggested approaches may be better too (or Linus's relatively simple change).
BTW, another reason perhaps to write directly to userspace and have a uprobes_*() specific interface is that you'll only need to do 3 architectures, the ones that support uprobes.
That way, this is just part of what every arch needs to implement in order to support uprobes.
David, Russell, I hope I correctly understood your idea of writing directly into user space. Please check patch below. On my tests it works ok. Look for arch_uprobe_copy_ixol in arch/arm/kernel/uprobes.c.
However, I've run into the issue while I've tried that - I had to add VM_WRITE to xol area mapping. I.e area should be writable and executable in order for put_user or __copy_to_user to work. This does not seem right to have such mapping inside of user-space process. It seems as possible exploitation point. Especially it seems that xol page is left around even when tracing is complete. I feel very uneasy about this direction, unless I am missing something and there is other way to do that.
Another concern about this approach: will flush_cache_user_range function take care of smp case were remove snooping is not supported. I think use case that Russell mentioned about self modified code and special system call implies yes.
From e325a1a1bdddc7bd95301f0031d868bc69ddcddb Mon Sep 17 00:00:00 2001
From: Victor Kamensky victor.kamensky@linaro.org Date: Mon, 7 Apr 2014 17:57:36 -0700 Subject: [PATCH] ARM: uprobes need icache flush after xol write
After instruction write into xol area, on ARM V7 architecture code need to flush dcache and icache to sync them up for given set of addresses. Having just 'flush_dcache_page(page)' call is not enough - it is possible to have stale instruction sitting in icache for given xol area slot address.
Introduce arch_uprobe_ixol_copy weak function that by default calls uprobes copy_to_page function and than flush_dcache_page function and on ARM define new one that handles xol slot copy in ARM specific way
Arm implementation of arch_uprobe_ixol_copy just makes __copy_to_user which does not have dcache aliasing issues and then flush_cache_user_range to push dcache out and invalidate corresponding icache entries.
Note in order to write into uprobes xol area had to add VM_WRITE to xol area mapping.
Signed-off-by: Victor Kamensky victor.kamensky@linaro.org --- arch/arm/kernel/uprobes.c | 8 ++++++++ include/linux/uprobes.h | 3 +++ kernel/events/uprobes.c | 27 ++++++++++++++++++--------- 3 files changed, 29 insertions(+), 9 deletions(-)
diff --git a/arch/arm/kernel/uprobes.c b/arch/arm/kernel/uprobes.c index f9bacee..4836e54 100644 --- a/arch/arm/kernel/uprobes.c +++ b/arch/arm/kernel/uprobes.c @@ -113,6 +113,14 @@ int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, return 0; }
+void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr, + void *src, unsigned long len) +{ + if (!__copy_to_user((void *) vaddr, src, len)) + flush_cache_user_range(vaddr, len); +} + + int arch_uprobe_pre_xol(struct arch_uprobe *auprobe, struct pt_regs *regs) { struct uprobe_task *utask = current->utask; diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index edff2b9..c52f827 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -32,6 +32,7 @@ struct vm_area_struct; struct mm_struct; struct inode; struct notifier_block; +struct page;
#define UPROBE_HANDLER_REMOVE 1 #define UPROBE_HANDLER_MASK 1 @@ -127,6 +128,8 @@ extern int arch_uprobe_exception_notify(struct notifier_block *self, unsigned l extern void arch_uprobe_abort_xol(struct arch_uprobe *aup, struct pt_regs *regs); extern unsigned long arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct pt_regs *regs); extern bool __weak arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs); +extern void __weak arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr, + void *src, unsigned long len); #else /* !CONFIG_UPROBES */ struct uprobes_state { }; diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 04709b6..9e22002 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1149,7 +1149,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area) }
ret = install_special_mapping(mm, area->vaddr, PAGE_SIZE, - VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO, &area->page); + VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|VM_WRITE, &area->page); if (ret) goto fail;
@@ -1296,14 +1296,8 @@ static unsigned long xol_get_insn_slot(struct uprobe *uprobe) if (unlikely(!xol_vaddr)) return 0;
- /* Initialize the slot */ - copy_to_page(area->page, xol_vaddr, - &uprobe->arch.ixol, sizeof(uprobe->arch.ixol)); - /* - * We probably need flush_icache_user_range() but it needs vma. - * This should work on supported architectures too. - */ - flush_dcache_page(area->page); + arch_uprobe_copy_ixol(area->page, xol_vaddr, + &uprobe->arch.ixol, sizeof(uprobe->arch.ixol));
return xol_vaddr; } @@ -1346,6 +1340,21 @@ static void xol_free_insn_slot(struct task_struct *tsk) } }
+void __weak arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr, + void *src, unsigned long len) +{ + /* Initialize the slot */ + copy_to_page(page, vaddr, src, len); + + /* + * We probably need flush_icache_user_range() but it needs vma. + * This should work on most of architectures by default. If + * architecture needs to do something different it can define + * its own version of the function. + */ + flush_dcache_page(page); +} + /** * uprobe_get_swbp_addr - compute address of swbp given post-swbp regs * @regs: Reflects the saved state of the task after it has hit a breakpoint
From: Victor Kamensky victor.kamensky@linaro.org Date: Tue, 15 Apr 2014 18:42:39 -0700
On 15 April 2014 12:53, David Miller davem@davemloft.net wrote:
From: David Long dave.long@linaro.org Date: Tue, 15 Apr 2014 15:39:18 -0400
Yes, I have that coded for 14 out of 24 architectures (the easy ones). The remaining ones need more work. Some may prove problematic. The whole approach could yet prove impractical. More recent suggested approaches may be better too (or Linus's relatively simple change).
BTW, another reason perhaps to write directly to userspace and have a uprobes_*() specific interface is that you'll only need to do 3 architectures, the ones that support uprobes.
That way, this is just part of what every arch needs to implement in order to support uprobes.
David, Russell, I hope I correctly understood your idea of writing directly into user space. Please check patch below.
Just FYI, your mail client corrupted the patch, at a minimum chopping up long lines.
From: Victor Kamensky victor.kamensky@linaro.org Date: Tue, 15 Apr 2014 18:42:39 -0700
+void __weak arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
void *src, unsigned long len)
+{
- /* Initialize the slot */
- copy_to_page(page, vaddr, src, len);
- /*
* We probably need flush_icache_user_range() but it needs vma.
* This should work on most of architectures by default. If
* architecture needs to do something different it can define
* its own version of the function.
*/
- flush_dcache_page(page);
+}
I would say that, if anything, flush_dcache_page() is unnecessary if you just copy straight to userspace.
The default implementation should be copy_to_user(), and that's what every architecture can use if it needs no I-cache flushing.
On 15 April 2014 19:24, David Miller davem@davemloft.net wrote:
From: Victor Kamensky victor.kamensky@linaro.org Date: Tue, 15 Apr 2014 18:42:39 -0700
+void __weak arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
void *src, unsigned long len)
+{
- /* Initialize the slot */
- copy_to_page(page, vaddr, src, len);
- /*
* We probably need flush_icache_user_range() but it needs vma.
* This should work on most of architectures by default. If
* architecture needs to do something different it can define
* its own version of the function.
*/
- flush_dcache_page(page);
+}
I would say that, if anything, flush_dcache_page() is unnecessary if you just copy straight to userspace. The default implementation should be copy_to_user(), and that's what every architecture can use if it needs no I-cache flushing.
OK, got it. I tried not to touch existing cases (x86 and ppc), but yes it would benefit here as well.
But don't you think that writable and executable uprobes xol page is show stopper for this approach?
Thanks, Victor
From: Victor Kamensky victor.kamensky@linaro.org Date: Tue, 15 Apr 2014 20:06:57 -0700
But don't you think that writable and executable uprobes xol page is show stopper for this approach?
Not really.
From: Russell King - ARM Linux linux@arm.linux.org.uk Date: Fri, 11 Apr 2014 16:30:41 +0100
| Given that we've already solved that problem, wouldn't it be a good idea | if the tracing code would stop trying to reinvent broken solutions to | problems we have already solved?
I fail to see what your problem is with keeping the vma around, and using that infrastructure. If it needs optimisation for uprobes, then let's optimise it. Let's not go inventing a whole new interface solving the same problem.
+1
On Fri, Apr 11, 2014 at 05:22:07PM +0200, Oleg Nesterov wrote:
And I am just curious, why arm's copy_to_user_page() disables premption before memcpy?
Without looking, I suspect its because the VIVT caches, they need to get shot down on every context switch.
On Fri, Apr 11, 2014 at 05:32:49PM +0200, Peter Zijlstra wrote:
On Fri, Apr 11, 2014 at 05:22:07PM +0200, Oleg Nesterov wrote:
And I am just curious, why arm's copy_to_user_page() disables premption before memcpy?
Without looking, I suspect its because the VIVT caches, they need to get shot down on every context switch.
So... let's think about that for a moment... if we have a preemption event, then that's a context switch, which means...
No, this is obviously not the reason, because such an event on a fully VIVT system would result in the caches being flushed, meaning that we wouldn't need to do anything if we could be predicably preempted at that point.
On Fri, Apr 11, 2014 at 05:00:29PM +0100, Russell King - ARM Linux wrote:
On Fri, Apr 11, 2014 at 05:32:49PM +0200, Peter Zijlstra wrote:
On Fri, Apr 11, 2014 at 05:22:07PM +0200, Oleg Nesterov wrote:
And I am just curious, why arm's copy_to_user_page() disables premption before memcpy?
Without looking, I suspect its because the VIVT caches, they need to get shot down on every context switch.
So... let's think about that for a moment... if we have a preemption event, then that's a context switch, which means...
No, this is obviously not the reason, because such an event on a fully VIVT system would result in the caches being flushed, meaning that we wouldn't need to do anything if we could be predicably preempted at that point.
Yeah; I've since realized I was completely wrong about that. Thanks for explaining though.
On 11 April 2014 07:56, Oleg Nesterov oleg@redhat.com wrote:
First of all: I do not pretend I really understand the problems with icache/etc coherency and how flush_icache_range() actually works on alpha. Help.
For those who were not cc'ed. A probed task has a special "xol" vma, it is used to execute the probed insn out of line.
Initial implementation was x86 only, so we simply copied the probed insn into this vma and everything was fine, flush_icache_range() is nop on x86.
Then we added flush_dcache_page() for powerpc,
// this is just kmap() + memcpy() copy_to_page(area->page, xol_vaddr, &uprobe->arch.ixol, sizeof(uprobe->arch.ixol)); /* * We probably need flush_icache_user_range() but it needs vma. * This should work on supported architectures too. */ flush_dcache_page(area->page);
but this doesn't work on arm. So we need another fix.
On 04/11, David Miller wrote:
From: David Long dave.long@linaro.org Date: Thu, 10 Apr 2014 23:45:31 -0400
Replace memcpy and dcache flush in generic uprobes with a call to copy_to_user_page(), which will do a proper flushing of kernel and user cache. Also modify the inmplementation of copy_to_user_page to assume a NULL vma pointer means the user icache corresponding to this right is stale and needs to be flushed. Note that this patch does not fix copy_to_user page for the sh, alpha, sparc, or mips architectures (which do not currently support uprobes).
Signed-off-by: David A. Long dave.long@linaro.org
You really need to pass the proper VMA down to the call site rather than pass NULL, that's extremely ugly and totally unnecesary.
I agree that we should not change copy_to_user_page(), but I am not sure the code above should use copy_to_user_page().
Because the code above really differs in my opinion from the only user of copy_to_user_page(), __access_remote_vm().
1. First of all, we do not know vma. OK, we can down_read(mmap_sem) and do find_vma() of course. This is a bit unfortunate, especially because the architectures we currently support do not need this.
Question, maybe silly one but I don't know the answer, why can't we just do look up for vma once and cache results in place like xol_area (along with xol_area.vaddr) and use it all the time. IOW under what circumstances vma for xol area can disappear change so we need constant lookup for it? Comment in xol_area
/*
- We keep the vma's vm_start rather than a pointer to the vma
- itself. The probed process or a naughty kernel module could make
- the vma go away, and we must handle that reasonably gracefully.
*/ unsigned long vaddr; /* Page(s) of instruction slots */
alludes to some of those conditions, but I don't quite follow. Should not we go after "probed process" ability to unmap xol area. xol area is like vdso, signal page and other mapping injected by kernel into user process address space, mmap call should ignore those.. I wonder what would happen if process would try to unmap vdso region.
But, 2. The problem is, it would be very nice to remove this vma, or at least hide it somehow from find_vma/etc. This is the special mapping we do not want to expose to user-space. In fact I even have the patches which remove this vma, but they do not work with compat tasks unfortunately.
I don't think it is right route. Xol area as well as vdso, signal page, etc should be visible as regular VMAs. There are other aspects of the system where they needed. Like core file collection - I would like to have xol area present in my core file if traced process crashed. Like /porc/<pid>/maps - I would like to see my memory layout through this interface and I would like to see xol area there because I can see xol area addresses by some other means.
3. Unlike __access_remote_vm() we always use current->mm, and the memory range changed by the code above can only be used by "current" thread. So (perhaps) flush_icache_user_range() can even treat this case as "mm->mm_users) <= 1" and avoid ipi_flush_icache_page (just in case, of course I do not know if this is actually possible).
So can't we do something else? Lets forget about arch/arm for the moment, suppose that we want to support uprobes on alpha. Can't we do _something_ like below?
Appeal of copy_to_user_page approach is that I don't need to know how to handle sync up of icache and dcache on that architecture, it is already done by someone else when they programmed basic ptrace breakpoint write behavior.
Thanks, Victor
And perhap we can do something like this on arm? it can do kmap(page) / page_address(page) itself.
Oleg.
--- x/arch/alpha/kernel/smp.c +++ x/arch/alpha/kernel/smp.c @@ -751,15 +751,9 @@ ipi_flush_icache_page(void *x) flush_tlb_other(mm); }
-void -flush_icache_user_range(struct vm_area_struct *vma, struct page *page,
unsigned long addr, int len)
-{
struct mm_struct *mm = vma->vm_mm;
if ((vma->vm_flags & VM_EXEC) == 0)
return;
+void __flush_icache_page_xxx(struct mm_struct *mm, struct page *page) // addr, len ? +{ preempt_disable();
if (mm == current->active_mm) {
@@ -783,3 +777,24 @@ flush_icache_user_range(struct vm_area_s
preempt_enable();
}
+void flush_icache_page_xxx(struct mm_struct *mm, struct page *page) +{
struct mm_struct *mm = current->mm;
down_read(&mm->mmap_sem);
__flush_icache_page_xxx(mm, page);
up_read(&mm->mmap_sem);
+}
+void +flush_icache_user_range(struct vm_area_struct *vma, struct page *page,
unsigned long addr, int len)
+{
struct mm_struct *mm = vma->vm_mm;
if ((vma->vm_flags & VM_EXEC) == 0)
return;
__flush_icache_page_xxx(mm, page);
+}
On 04/11, Victor Kamensky wrote:
On 11 April 2014 07:56, Oleg Nesterov oleg@redhat.com wrote:
1. First of all, we do not know vma. OK, we can down_read(mmap_sem) and do find_vma() of course. This is a bit unfortunate, especially because the architectures we currently support do not need this.
Question, maybe silly one but I don't know the answer, why can't we just do look up for vma once and cache results in place like xol_area (along with xol_area.vaddr) and use it all the time. IOW under what circumstances vma for xol area can disappear change so we need constant lookup for it? Comment in xol_area
/*
- We keep the vma's vm_start rather than a pointer to the vma
- itself. The probed process or a naughty kernel module could make
- the vma go away, and we must handle that reasonably gracefully.
*/ unsigned long vaddr; /* Page(s) of instruction slots */
alludes to some of those conditions, but I don't quite follow. Should not we go after "probed process" ability to unmap xol area. xol area is like vdso,
But it is not like vdso. And (unlike vsyscall page) vdso can be unmapped too (unless it is FIX_VDSO).
mmap call should ignore those..
This is not that simple, this means more ugly uprobe_ hooks in mm/. And I think we simply do not want/need this.
I didn't write the comment above, but "reasonably gracefully" should mean "we should not allow unmap/remap/etc(xol_area) crash the kernel, the task can crashif it does this, we do not care".
The same for vdso, except in this case the kernel can simply forget about this area after it does setup_additional_pages().
2. The problem is, it would be very nice to remove this vma, or at least hide it somehow from find_vma/etc. This is the special mapping we do not want to expose to user-space. In fact I even have the patches which remove this vma, but they do not work with compat tasks unfortunately.
I don't think it is right route. Xol area as well as vdso, signal page, etc should be visible as regular VMAs. There are other aspects of the system where they needed. Like core file collection - I would like to have xol area present in my core file if traced process crashed.
It must never crash in xol_area, or we have a kernel bug. (we do have such a bug which I am trying to fix right now ;)
/porc/<pid>/maps - I would like to see my memory layout through this interface and I would like to see xol area there because I can see xol area addresses by some other means.
But it is not "your memory", to some degree. I mean, it would be nice if it was not.
This should be more like vsyscall page. And indeed, we can move this into FIXMAP area. The only problem, 32bit task can't use this area in 64-bit machine.
Appeal of copy_to_user_page approach is that I don't need to know how to handle sync up of icache and dcache on that architecture,
Yes, sure, this is true.
it is already done by someone else when they programmed basic ptrace breakpoint write behavior.
Yes, but (rightly or not) I still think that uprobes differs from ptrace. Perhaps we do not have other choice though.
Oleg.
On Fri, Apr 11, 2014 at 7:56 AM, Oleg Nesterov oleg@redhat.com wrote:
First of all: I do not pretend I really understand the problems with icache/etc coherency and how flush_icache_range() actually works on alpha. Help.
According to the alpha architecture rules, the instruction cache can be completely virtual, and is not only not coherent with the data cache, it's not even necessarily coherent with TLB mapping changes (ie it's purely virtual, and you need to flush it if you change instruction mappings). The virtual caches do have an address space number, so you can have multiple separate virtual address spaces.
The way to flush it is with the "imb" instruction (which is not actually an instruction at all, it's a jump to PAL-code, alpha's "explicit microcode")
That means that when you modify data that could be code, you do need to do an "imb" _and_ you do need to do it cross-cpu even for thread-local cases in case your thread migrates to another CPU with stale I$ data (the ASN will be the same). You can use the usual VM cpu-mask to tell which other CPU's you'd need to do it on, though.
But alpha does not need page or addr/len, because "imb" is "make the whole instruction cache coherent".
Your patch looks correct for alpha, afaik.
Linus
On 04/10, David Long wrote:
static void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len) { void *kaddr = kmap_atomic(page);
- memcpy(kaddr + (vaddr & ~PAGE_MASK), src, len);
- copy_to_user_page(NULL, page, vaddr, kaddr + (vaddr & ~PAGE_MASK), src, len);
No, no, this is not what we want... And I do not think we should change copy_to_user_page().
I'll write another email...
Oleg.
On Thu, Apr 10, 2014 at 11:45:31PM -0400, David Long wrote:
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 04709b6..2e976fb 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -241,7 +241,7 @@ static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, in static void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len) { void *kaddr = kmap_atomic(page);
- memcpy(kaddr + (vaddr & ~PAGE_MASK), src, len);
- copy_to_user_page(NULL, page, vaddr, kaddr + (vaddr & ~PAGE_MASK), src, len); kunmap_atomic(kaddr);
}
Rather than changing all the architectures to be able to pass a NULL vma to copy_to_user_page(), you can create a dummy vma on the stack with the VM_EXEC flag and pass a pointer to it.
linaro-kernel@lists.linaro.org