Hi,
We've been thinking about adding support for the GCC built-in functions for 64-bit atomic memory access and I'd like to know if this is of any interest. Currently the main use of these functions seems to be to implement (SMP-safe) locking mechanisms, where the existing 32-bit memory operations are sufficient. However, there might be code out there that implements a parallel algorithm using 64-bit atomic memory operations.
Currently the GCC ARM backend doesn't provide a pattern to inline the 64-bit __sync_* functions, so the compiler emits __sync_*_8 function calls [1]. libgcc does not provide these symbols via the usual thin wrapper around the kernel helper [2] because the ARM Linux __kernel_cmpxchg supports 32-bit values only. My understanding is that for ARMv7 the GCC backend could be enhanced to inline the __sync_* functions using the LDREXD and STREXD instructions, but for ARMv5 we would still rely on a new kernel helper.
Any ideas/thoughts are appreciated.
[1] https://wiki.linaro.org/WorkingGroups/ToolChain/AtomicMemoryOperations#GCC [2] https://wiki.linaro.org/WorkingGroups/ToolChain/AtomicMemoryOperations#imple...
Regards Ken
On Fri, 2011-05-06 at 17:06 +0200, Ken Werner wrote:
I've been told (though I've not verified it) that the Membase NoSQL database needs 64-bit atomic operations to work effectively (http://www.membase.org/downloadsrc)
R.
On 6 May 2011 16:06, Ken Werner ken.werner@linaro.org wrote:
It's a bit tricky when you want to use the kernel helper for v5t, so we've got to find a way of turning this on only with special knobs and not by default. Think new user space and old kernel and a jump into never-never land. For v7-a you could inline the function using ldrexd / strexd.
cheers Ramana
On Fri, 6 May 2011, Ramana Radhakrishnan wrote:
What is the problem with v5t?
Think new user space and old kernel and a jump into never-never land.
The kernel helpers are "versioned". At 0xffff0ffc you can read the number of helpers currently available. If a program uses a new helper, then at startup this value can be verified to equal or exceed the expected value, and the program can bail out otherwise.
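A sketch of how user space could perform such a check from C. The 0xffff0ffc version word is the documented ARM location; the function names and the guard for non-ARM builds are illustrative only:

```c
#include <stdint.h>

/* On ARM Linux the number of kuser helpers currently available is
   published as a word at 0xffff0ffc in the vectors page.  That address
   is ARM-specific and will fault anywhere else, hence the guard. */
static int32_t kuser_helper_version(void)
{
#if defined(__arm__) && defined(__linux__)
    return *(volatile const int32_t *)0xffff0ffc;
#else
    return 0;   /* stand-in so this sketch builds on other targets */
#endif
}

/* Pure comparison: does the running kernel publish at least the
   required number of helpers? */
int version_ok(int32_t available, int32_t required)
{
    return available >= required;
}

/* A program needing helper number N would call this once at startup
   and bail out instead of jumping into never-never land. */
int kuser_helpers_ok(int32_t required)
{
    return version_ok(kuser_helper_version(), required);
}
```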
Nicolas
On Sat, May 7, 2011 at 12:39 AM, Nicolas Pitre nicolas.pitre@linaro.org wrote:
Hi, when I run the following on my jaunty (for armv5), I get the following
echo "void f(){__sync_synchronize();}" | gcc -O2 -x c -Wall -dA -S - -o -
	.arch armv6
	.eabi_attribute 27, 3
	.fpu vfp
	.eabi_attribute 20, 1
	.eabi_attribute 21, 1
	.eabi_attribute 23, 3
	.eabi_attribute 24, 1
	.eabi_attribute 25, 1
	.eabi_attribute 26, 2
	.eabi_attribute 30, 2
	.eabi_attribute 18, 4
	.file	""
	.text
	.align	2
	.global	f
	.type	f, %function
f:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	@ basic block 2
	bx	lr
	.size	f, .-f
	.ident	"GCC: (Ubuntu 4.4.1-4ubuntu9) 4.4.1"
	.section	.note.GNU-stack,"",%progbits
but I can see with a debugger that the kernel helper functions are reached. Any idea how?
saeed
On Wed, May 11, 2011 at 6:14 PM, saeed bishara saeed.bishara@gmail.com wrote:
There was a compiler bug some time ago which mistakenly generated no code for __sync_synchronize(): https://bugs.launchpad.net/ubuntu/+source/gcc-4.4/+bug/491872
I suspect your Jaunty-era compiler may be affected by this -- you probably need to upgrade it.
If you're observing calls to the helpers, they may be coming from glibc etc. instead of your function.
You _should_ be seeing code something like this:
$ echo 'void f(void) { __sync_synchronize(); }' | arm-linux-gnueabi-gcc -march=armv7-a -O2 -S -o - -x c -
[...]
f:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	dmb	sy
	bx	lr
[...]
(If -march is set high enough, the dmb is explicit and inline.)
$ echo 'void f(void) { __sync_synchronize(); }' | arm-linux-gnueabi-gcc -march=armv5t -O2 -S -o - -x c -
[...]
f:
	push	{r3, lr}
	bl	__sync_synchronize
	@ sp needed for prologue
	pop	{r3, pc}
(-march=armv5t indicates an architecture which doesn't have the barrier instructions, so the kuser helper is needed instead)
Cheers ---Dave
On Fri, May 6, 2011 at 10:39 PM, Nicolas Pitre nicolas.pitre@linaro.org wrote:
To what extent do we think that userspace code is actually checking this?
I may suggest a patch to the documentation text in entry-armv.S to make this requirement clearer, as well as getting rid of the example assembler code (which I consider mis-educative; more importantly, it's also Thumb-incompatible and will likely suffer poor branch prediction on recent CPUs. This code was the source of a TLS bug in bionic, and may have been inappropriately pasted elsewhere too...)
Cheers ---Dave
On Tue, 17 May 2011, Dave Martin wrote:
I think right now none of it does, simply because most of the helpers were added almost all at once. But if future helpers are added then it would be a good idea to check this, though only if the new helper is actually invoked by a given application.
If you have suggestions for improving this then please do so.
What bug?
Nicolas
On Tue, May 17, 2011 at 5:30 PM, Nicolas Pitre nicolas.pitre@linaro.org wrote:
I have a patch which I'll suggest at some point, but it's not high priority.
Actually, to be fair I think I may be mis-remembering here... I can't seem to find the exact bug now.
Cheers ---Dave
On 05/19/2011 12:40 PM, David Rusling wrote:
Is this going to end up in a blueprint? This is the last loose end of SMP / atomic memory operations work and I'd like to see it happen
Hi,
Yep, there is one (kind of a skeleton) in place at: https://blueprints.launchpad.net/linaro-toolchain-misc/+spec/64-bit-sync-pri...
Regards Ken
On 19 May 2011 16:49, Ken Werner ken@linux.vnet.ibm.com wrote:
Which I'll be filling out in the next few days.
Dave
On Tue, May 24, 2011 at 10:33 AM, Michael Casadevall mcasadevall@ubuntu.com wrote:
Hi Michael. The topics for this planning cycle are listed here: https://wiki.linaro.org/Cycles/1111/TechnicalTopics/Toolchain
64 bit sync primitives are medium priority so they will be achieved in the next six months.
A draft of what is in whose queue is at: https://wiki.linaro.org/Cycles/1111/TechnicalTopics/Toolchain/Planning
The primitives are second in Dave's queue so should be started in the next three months.
-- Michael
On Tue, 24 May 2011, Michael Hope wrote:
FWIW, here's what the kernel part might look like, i.e. for compatibility with pre ARMv6k systems (beware, only compile tested):
diff --git a/arch/arm/kernel/entry-armv.S b/arch/arm/kernel/entry-armv.S
index e8d8856..53830a7 100644
--- a/arch/arm/kernel/entry-armv.S
+++ b/arch/arm/kernel/entry-armv.S
@@ -383,7 +383,7 @@ ENDPROC(__pabt_svc)
 	.endm

 	.macro	kuser_cmpxchg_check
-#if __LINUX_ARM_ARCH__ < 6 && !defined(CONFIG_NEEDS_SYSCALL_FOR_CMPXCHG)
+#if !defined(CONFIG_CPU_32v6K) && !defined(CONFIG_NEEDS_SYSCALL_FOR_CMPXCHG)
 #ifndef CONFIG_MMU
 #warning "NPTL on non MMU needs fixing"
 #else
@@ -392,7 +392,7 @@ ENDPROC(__pabt_svc)
 	@ perform a quick test inline since it should be false
 	@ 99.9999% of the time.  The rest is done out of line.
 	cmp	r2, #TASK_SIZE
-	blhs	kuser_cmpxchg_fixup
+	blhs	kuser_cmpxchg64_fixup
 #endif
 #endif
 	.endm
@@ -797,6 +797,139 @@ __kuser_helper_start:
 /*
  * Reference prototype:
  *
+ * int __kernel_cmpxchgd64(int64_t *oldval, int64_t *newval, int64_t *ptr)
+ *
+ * Input:
+ *
+ * r0 = pointer to oldval
+ * r1 = pointer to newval
+ * r2 = pointer to target value
+ * lr = return address
+ *
+ * Output:
+ *
+ * r0 = returned value (zero or non-zero)
+ * C flag = set if r0 == 0, clear if r0 != 0
+ *
+ * Clobbered:
+ *
+ * r3, flags
+ *
+ * Definition and user space usage example:
+ *
+ * typedef int (__kernel_cmpxchg64_t)(const int64_t *oldval,
+ *                                    const int64_t *newval,
+ *                                    volatile int64_t *ptr);
+ * #define __kernel_cmpxchg64 (*(__kernel_cmpxchg64_t *)0xffff0f60)
+ *
+ * Atomically store newval in *ptr if *ptr is equal to oldval for user space.
+ * Return zero if *ptr was changed or non-zero if no exchange happened.
+ * The C flag is also set if *ptr was changed to allow for assembly
+ * optimization in the calling code.
+ *
+ * Notes:
+ *
+ * - This routine already includes memory barriers as needed.
+ *
+ * - Due to the length of some sequences, this spans 2 regular kuser
+ *   "slots" so 0xffff0f80 is not used as a valid entry point.
+ */
+
+__kuser_cmpxchg64:				@ 0xffff0f60
+
+#if defined(CONFIG_NEEDS_SYSCALL_FOR_CMPXCHG)
+
+	/*
+	 * Poor you.  No fast solution possible...
+	 * The kernel itself must perform the operation.
+	 * A special ghost syscall is used for that (see traps.c).
+	 */
+	stmfd	sp!, {r7, lr}
+	ldr	r7, 1f			@ it's 20 bits
+	swi	__ARM_NR_cmpxchg64
+	ldmfd	sp!, {r7, pc}
+1:	.word	__ARM_NR_cmpxchg64
+
+#elif defined(CONFIG_CPU_32v6K)
+
+	stmfd	sp!, {r4, r5, r6, r7}
+	ldrd	r4, r5, [r0]			@ load old val
+	ldrd	r6, r7, [r1]			@ load new val
+	smp_dmb	arm
+1:	ldrexd	r0, r1, [r2]			@ load current val
+	eors	r3, r0, r4			@ compare with oldval (1)
+	eoreqs	r3, r1, r5			@ compare with oldval (2)
+	strexdeq r3, r6, r7, [r2]		@ store newval if eq
+	teqeq	r3, #1				@ success?
+	beq	1b				@ if no then retry
+	smp_dmb	arm
+	rsbs	r0, r3, #0			@ set returned val and C flag
+	ldmfd	sp!, {r4, r5, r6, r7}
+	usr_ret	lr
+
+#elif !defined(CONFIG_SMP)
+
+#ifdef CONFIG_MMU
+
+	/*
+	 * The only thing that can break atomicity in this cmpxchg64
+	 * implementation is either an IRQ or a data abort exception
+	 * causing another process/thread to be scheduled in the middle
+	 * of the critical sequence.  To prevent this, code is added to
+	 * the IRQ and data abort exception handlers to set the pc back
+	 * to the beginning of the critical section if it is found to be
+	 * within that critical section (see kuser_cmpxchg_fixup64).
+	 */
+	stmfd	sp!, {r4, r5, r6, lr}
+	ldmia	r0, {r4, r5}			@ load old val
+	ldmia	r1, {r6, lr}			@ load new val
+1:	ldmia	r2, {r0, r1}			@ load current val
+	eors	r3, r0, r4			@ compare with oldval (1)
+	eoreqs	r3, r1, r5			@ compare with oldval (2)
+2:	stmeqia	r2, {r6, lr}			@ store newval if eq
+	rsbs	r0, r3, #0			@ set return val and C flag
+	ldmfd	sp!, {r4, r5, r6, pc}
+
+	.text
+kuser_cmpxchg64_fixup:
+	@ Called from kuser_cmpxchg_fixup.
+	@ r2 = address of interrupted insn (must be preserved).
+	@ sp = saved regs. r7 and r8 are clobbered.
+	@ 1b = first critical insn, 2b = last critical insn.
+	@ If r2 >= 1b and r2 <= 2b then saved pc_usr is set to 1b.
+	mov	r7, #0xffff0fff
+	sub	r7, r7, #(0xffff0fff - (0xffff0f60 + (1b - __kuser_cmpxchg64)))
+	subs	r8, r2, r7
+	rsbcss	r8, r8, #(2b - 1b)
+	strcs	r7, [sp, #S_PC]
+#if __LINUX_ARM_ARCH__ < 6
+	b	kuser_cmpxchg32_fixup
+#else
+	mov	pc, lr
+#endif
+	.previous
+
+#else
+#warning "NPTL on non MMU needs fixing"
+	mov	r0, #-1
+	adds	r0, r0, #0
+	usr_ret	lr
+#endif
+
+#else
+#error "incoherent kernel configuration"
+#endif
+
+	/* pad to next slot */
+	.rept	(16 - (. - __kuser_cmpxchg64)/4)
+	.word	0
+	.endr
+
+	.align	5
+
+/*
+ * Reference prototype:
+ *
  * void __kernel_memory_barrier(void)
  *
  * Input:
@@ -921,7 +1054,7 @@ __kuser_cmpxchg:	@ 0xffff0fc0
 	usr_ret	lr

 	.text
-kuser_cmpxchg_fixup:
+kuser_cmpxchg32_fixup:
 	@ Called from kuser_cmpxchg_check macro.
 	@ r2 = address of interrupted insn (must be preserved).
 	@ sp = saved regs. r7 and r8 are clobbered.
On Tue, May 24, 2011 at 11:45:04PM -0400, Nicolas Pitre wrote:
* Do not attempt to call this function unless __kernel_helper_version >= 5. *
Can we just have movcs pc,lr here, and put kuser_cmpxchg32_fixup immediately after?
This would allow us to skip the branch, and the initial "mov r7" in the kuser_cmpxchg32_fixup code.
There's a fair amount of duplicated logic from the 32-bit case. Is it worth trying to merge the two?
Since we can reasonably expect there to be no additional cmpxchg helpers in the future, the answer may be "no"...
Cheers ---Dave
On Wed, 25 May 2011, Dave Martin wrote:
Yep. I will queue your other patch prior to this one and make this blurb consistent with the rest.
The 'mov r7' is still needed unless the second instruction uses another target register.
I thought about the possibility of "movcs pc, lr", especially since the .text segments are simply concatenated and therefore the branch is effectively branching to the very next instruction. So there could be a common preamble, then a list of concatenated fixup segments (only two of them now), and finally a postamble which would simply be "mov pc, lr". This would all be made contiguous at link time. However, I'm not sure yet whether it's worth optimizing that far, since this code is far from being hot, and clarity would also suffer.
There's a fair amount of duplicated logic from the 32-bit case. Is it worth trying to merge the two?
The core logic spans 5 instructions. Only 3 of them are actually the same in both cases i.e. those with no references to 1b or 2b, and they're perfectly interlaced in between the others. Trying to make this into common runtime code would result in even bigger code I'm afraid. This could be made into common source code via a macro though.
Thanks for the review.
Nicolas
On Wed, May 25, 2011 at 10:21:43AM -0400, Nicolas Pitre wrote:
Oh, OK. I hadn't had feedback on that yet, so I wasn't sure whether anyone was picking it up. I will repost to lakml, but I was waiting for the merge window to close since this was just a tidy-up rather than something urgent.
This warning is more important for the new helper than for the existing ones (where really we could omit it without much risk).
What I meant is that at the end of the above sequence, r7 = 1b. So we can just fall through into the 32-bit fixup code and add the appropriate small offset to r7 so that it then points to the corresponding 1b label for __kuser_cmpxchg32, without needing to re-derive the address using mov+sub.
It's a pretty minor tweak though.
Also, probably the number of fixups is never going to grow very large.
Fair enough -- a macro might be worth a try _if_ it simplifies things in the source, but I think you're right that merging the actual code probably isn't worth it just to save a few words in the vectors page (which eats up 4K regardless of what we put in it) and a few cycles per fault (which already costs many, many cycles).
Cheers ---Dave
On Thu, 26 May 2011, Dave Martin wrote:
In the normal cases, there is no additional cycles per fault as the inline check remains unchanged, and it goes like this:
	@ Make sure our user space atomic helper is restarted
	@ if it was interrupted in a critical region.  Here we
	@ perform a quick test inline since it should be false
	@ 99.9999% of the time.  The rest is done out of line.
	cmp	r2, #TASK_SIZE
	blhs	kuser_cmpxchg_fixup
In most cases the branch is not taken.
Nicolas
On 25 May 2011 04:45, Nicolas Pitre nicolas.pitre@linaro.org wrote:
FWIW, here's what the kernel part might look like, i.e. for compatibility with pre ARMv6k systems (beware, only compile tested):
OK, so that makes the eglibc part of this pretty easy. For things like fetch_and_add (which I can see membase needs), would you expect an implementation based on this cmpxchg, so that it has a fallback, or one that just uses ldrexd directly, which I assume would be somewhat more efficient?
(Question holds for both eglibc and gcc's __sync_*)
Hmm I wonder whether something like the atomic add primitives from user space need that or not; it depends whether people are using them to build larger data structures or just trying to keep a consistent count somewhere.
Dave
On Wed, May 25, 2011 at 12:58:30PM +0100, David Gilbert wrote:
It depends on the baseline architecture for the build.
An eglibc built for ARMv6 and above would need to call the helper by default, though it could also use ldrexd/strexd if it determines at runtime that these are supported by the CPU.
Similarly, if GCC is building for -march=armv7-a it can inline the atomics directly using ldrex/strex and friends, but for -march=armv6 it will need to call helpers via libgcc.
The GCC __sync_* primitives must mostly be full barriers. This is what the Itanium ABI specifies (this is the spec GCC follows for these).
The kernel user helper itself could omit the barriers, but this would deviate from the existing 32-bit implementation, and also slow down the (common) case where at least one of the barriers really is needed.
_just_ updating a counter doesn't need the barriers. But if it's important that a counter can be atomically updated by multiple threads, the synchronisation of that update against other data structures usually turns out to be important too...
larger data structures or just trying to keep a consistent count somewhere.
GCC's atomic primitives don't address the full/partial barrier distinction. Some other atomics APIs, like Qt's for example, do express different flavours of barrier behaviour and so in principle can be more optimal for cases where it makes a difference.
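Within the __sync family itself there are two documented exceptions to the full-barrier rule: __sync_lock_test_and_set is only an acquire barrier, and __sync_lock_release only a release barrier. A minimal spinlock sketch using them:

```c
/* __sync_lock_test_and_set is an atomic exchange with acquire
   semantics; __sync_lock_release stores 0 with release semantics.
   These are the two non-full-barrier members of GCC's __sync family. */

static volatile int lock;

void spin_lock(void)
{
    /* Spin while the previous value was 1 (lock already held). */
    while (__sync_lock_test_and_set(&lock, 1))
        ;
}

void spin_unlock(void)
{
    /* Release the lock: store 0 with release semantics. */
    __sync_lock_release(&lock);
}
```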
Cheers ---Dave
On 05/25/2011 03:17 PM, Dave Martin wrote:
I would have thought that the libc does not decide this directly but just calls the GCC __sync_* routines (if built with a GCC that supports them). Then GCC decides whether to inline them using ldrexd/strexd (ARMv6+) or emit calls to libgcc, which calls the kernel helpers.
Regards Ken
On Tue, 2011-05-31 at 13:17 +0100, Dave Martin wrote:
I think the difficulty here is that glibc expects either the compiler or libgcc to provide the sync primitives; and while GCC can tie the inlined copy of the primitive to CPUs with the relevant instruction, the libgcc version has no way to express that the code it's relying on requires a minimum kernel version...
It could throw the dependency back on glibc, but then you've got an expensive operation (the libgcc copies are normally implemented as private, per-library, helpers to avoid a PLT call overhead).
I'm not sure what the best solution is here.
R.
On Tue, May 31, 2011 at 03:35:42PM +0100, Richard Earnshaw wrote:
The libgcc 64-bit atomic helpers could do a runtime check on the __kernel_helper_version field in the vectors page before calling the 64-bit cmpxchg helper. This will allow the absence of the helper to be reliably detected on older kernels. Because this is data, it might cause an extra D-TLB miss to accompany any other miss associated with calling the kernel helper (if it exists).
It's an overhead, so this may not be very desirable; however, the overhead will normally not apply if the platform is built for armv7+, since I believe in that case the atomics will usually get inlined -- is that the case?
If the runtime check finds the kernel helper isn't there, there's a question of what to do. If libgcc was not built for v7, the safest approach is probably to spit out a diagnostic message and call abort() or similar, since there's no guarantee of being able to do a 64-bit atomic at all in such situations.
These issues only apply to the 64-bit atomics. For the other atomics, we have a de facto "assume the kernel is new enough" policy, which seems OK in practice due to the fact that the GCC atomics support is rather newer than the kernel helpers themselves in any case.
Cheers ---Dave
On Tue, 31 May 2011, Dave Martin wrote:
Isn't there a way to pull in a global constructor or similar whenever a reference to the 64-bit cmpxchg helper is made, so that the constructor code could simply validate the kuser version number and bail out otherwise?
I know this is certainly the case on x86 that if you statically compile a binary on a modern distro in the hope of being able to execute that binary anywhere, you may get a "kernel too old" error message when trying to run it on older distros. The same thing could be done for the ARM 64-bit cmpxchg helper as well, at startup rather than on every invocation, no?
Nicolas
On Tue, May 31, 2011 at 12:53:35PM -0400, Nicolas Pitre wrote:
Hmm, that should be possible.
I don't know how this interacts with other constructors: really this would need to run at a defined point early in program startup. If it just ends up somewhere in the middle of the list of global constructors, we'd have a problem if some of the other constructors try to use the 64-bit atomics.
Richard, do you know how this could work?
As a principle, it feels like it ought to work.
Cheers ---Dave
On 31 May 2011 15:35, Richard Earnshaw rearnsha@arm.com wrote:
You say 'the libgcc version doesn't know how to specify that the code it's relying on requires a minimal kernel version...'
but why isn't that code just in libgcc; if the libgcc helper gets called and the kernel interface is too old it throws an error.
If the compiler knows it's on 6k or above, it inlines; else it calls libgcc.
If libgcc was built with a 6k compiler, it inlines (in case something explicitly calls the library routine).
If libgcc was built with a <6k compiler, it checks to see if it finds itself on a new enough kernel, else it errors.
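The decision tree above could be sketched as follows (all names illustrative; kernel_helpers stands for the word read from 0xffff0ffc at startup):

```c
/* Sketch of the dispatch policy described above, as pure logic. */
typedef enum {
    INLINE_LDREXD,   /* compiler targets 6k+: inline ldrexd/strexd */
    LIBGCC_LDREXD,   /* libgcc built for 6k+: its copy inlines */
    KERNEL_HELPER,   /* pre-6k libgcc on a new enough kernel */
    TOO_OLD          /* pre-6k libgcc, kernel lacks the helper: error */
} sync8_strategy;

sync8_strategy pick_sync8(int compiler_is_6k, int libgcc_is_6k,
                          int kernel_helpers)
{
    if (compiler_is_6k)
        return INLINE_LDREXD;
    if (libgcc_is_6k)
        return LIBGCC_LDREXD;
    /* Helper number 5 is the proposed 64-bit cmpxchg entry. */
    return kernel_helpers >= 5 ? KERNEL_HELPER : TOO_OLD;
}
```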
eglibc always uses the compiler's sync primitives
(I've not checked whether all of eglibc's atomics match gcc's or if it needs any more).
Dave
On Tue, 2011-05-31 at 16:03 +0100, David Gilbert wrote:
You don't want to repeat the test every time the sync function is called. That's excessive. Further, you want to know when the program starts if the kernel isn't current enough, not a long way into execution.
It might be possible to construct an object file that contained the support, but which also contained a 'constructor' that was run at init time. The 'constructor' would run the kernel version check and terminate the program if it wasn't sufficiently up-to-date. That still leaves the question of how this should be done: printing a message isn't very useful if the program is GUI-based. The constructor would have to be run early to avoid the case of another (user or library) constructor needing the sync primitive.
eglibc always uses the compiler's sync primitives
Which the compiler will either inline, or fall back to the libgcc versions.
(I've not checked whether all of eglibc's atomics match gcc's or if it needs any more).
I presume that (e)glibc doesn't use 64-bit sync or we would have added it before now...
R.
On 25 May 2011 04:45, Nicolas Pitre nicolas.pitre@linaro.org wrote:
<snip>
FWIW, here's what the kernel part might look like, i.e. for compatibility with pre ARMv6k systems (beware, only compile tested):
<snip>
Hi Nicolas, I've just about got a set of gcc backend changes working for the inline case and plan on attacking libgcc next week.
What are your intentions for that code that you sent in this message - do you intend to finish it off and upstream it or were you wanting me to do that as part of this task?
If you're doing it, could you confirm the interface to work to? Also, do you think there should be halfword and byte interfaces as well, given that halfword and byte ldrex implementations aren't available on older ARMs, or would you expect user space to do a bit of mask fiddling with the 32-bit ones?
Dave
On Fri, 10 Jun 2011, David Gilbert wrote:
I'll refine it and push it upstream.
If you're doing it could you confirm the interface to work to.
Here it goes:
 * Reference prototype:
 *
 * int __kernel_cmpxchgd64(const int64_t *oldval,
 *                         const int64_t *newval,
 *                         volatile int64_t *ptr);
 *
 * Input:
 *
 * r0 = pointer to oldval
 * r1 = pointer to newval
 * r2 = pointer to target value
 * lr = return address
 *
 * Output:
 *
 * r0 = returned value (zero or non-zero)
 * C flag = set if r0 == 0, clear if r0 != 0
 *
 * Clobbered:
 *
 * r3, condition flags
 *
 * Definition and user space usage example:
 *
 * typedef int (__kernel_cmpxchg64_t)(const int64_t *oldval,
 *                                    const int64_t *newval,
 *                                    volatile int64_t *ptr);
 * #define __kernel_cmpxchg64 (*(__kernel_cmpxchg64_t *)0xffff0f60)
 *
 * Atomically store *newval in *ptr if *ptr is equal to *oldval for user space.
 * Return zero if *ptr was changed or non-zero if no exchange happened.
 * The C flag is also set if *ptr was changed to allow for assembly
 * optimization in the calling code.
 *
 * Notes:
 *
 * - This routine already includes memory barriers as needed.
 *
 * - Due to the length of some sequences, this spans 2 regular kuser
 *   "slots" so 0xffff0f80 is not used as a valid entry point.
 *
 * - This call is valid only if __kernel_helper_version >= 5.
Of course, as discussed, this would be preferable if the interface in libgcc was in a separate object file so any reference to it would also bring in a global constructor from which the __kernel_helper_version value could be verified in order to prevent usage of this interface on kernels that lack it.
It is indeed better if user space deals with it using the 32-bit interface. I'd prefer adding new kernel helpers only when absolutely required.
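As a sketch of how a libgcc fallback such as __sync_fetch_and_add_8 might be built on top of the interface above. Here cmpxchg64 is a portable stand-in with the same return contract (zero if *ptr was changed); on ARM one would instead call the kuser entry, i.e. #define __kernel_cmpxchg64 (*(__kernel_cmpxchg64_t *)0xffff0f60):

```c
#include <stdint.h>

/* Portable stand-in for the proposed __kernel_cmpxchg64 helper,
   matching its contract: return zero if *ptr was changed. */
static int cmpxchg64(const int64_t *oldval, const int64_t *newval,
                     volatile int64_t *ptr)
{
    return !__sync_bool_compare_and_swap(ptr, *oldval, *newval);
}

/* 64-bit fetch-and-add built from a compare-and-swap retry loop. */
int64_t fetch_and_add64(volatile int64_t *ptr, int64_t inc)
{
    int64_t old, sum;
    do {
        old = *ptr;          /* snapshot the current value */
        sum = old + inc;
    } while (cmpxchg64(&old, &sum, ptr) != 0);  /* retry if it moved */
    return old;              /* return the pre-add value */
}
```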
Nicolas
On 10 June 2011 20:38, Nicolas Pitre nicolas.pitre@linaro.org wrote:
OK, thanks.
OK
<snip>
I was thinking of looking at ifunc for this;but I've not looked yet.
Actually, I've found Julian Brown already implemented that in user space in 2008 (see gcc/config/arm/linux-atomic.c)
Dave
Just to keep people up to date, attached is my current working diff on gcc (against gcc git of a few days ago)
It seems to work for the armv7 case; the testcases pass, but I need to chat with some gcc people to check that the testcases themselves are OK, and I need to do a full native build and test. I'd expect it to need some more clean-up yet, so don't panic over formatting etc.
libgcc fallbacks aren't done yet (but it will try and call libgcc).
Dave
Hi All, I've just submitted the patches for the 64 bit atomic stuff to the gcc-patches list. Richard Henderson has raised the question of why the ARM commpage isn't a full VDSO and, if it was, then it would make the version number check a lot simpler.
What's the history behind this, and how big a job is it?
Dave (on holiday for a week but checking mail intermittently)
On Fri, Jul 1, 2011 at 6:10 PM, David Gilbert david.gilbert@linaro.org wrote:
I think it's there partly for historical reasons, and partly because there has to be a page mapped somewhere for the exception vectors anyway, so we may as well put other snippets of required userspace code there.
Dave (on holiday for a week but checking mail intermittently)
Because existing userspace source and binaries already call directly into the helper functions in the vectors page, we can't easily remove/move it.
We could possibly wrap the vectors page to make it look like a DSO in a forwards-compatible way, but since this has not happened so far it feels like people never saw much benefit.
Can you describe the benefits with regard to this case? (And, out of interest, how do statically-linked programs make use of needed functionality in the VDSO? Browsing the web hasn't revealed any coherent answers to that question for me...)
Cheers ---Dave
On 5 July 2011 15:49, Dave Martin dave.martin@linaro.org wrote:
OK.
Any idea how big a job that is?
Richard's question to my patch is here:
http://old.nabble.com/-Patch-0-3--ARM-64-bit-atomic-operations-tp31974816p31...
My patch has to test an arbitrary address in the commpage in init code to see whether the kernel is new enough to have the new helper that Nicolas just added. Richard's argument is that if it were actually a VDSO I'd just have linked against a symbol, and if the symbol wasn't there I'd have got a fairly normal linker error - I wouldn't have needed any special code. To be honest, I think that would also solve the problem that accessing that wacky address is a pain for things like qemu, since we're going to have to intercept that read.
(I've added Richard to the cc).
Dave
On Fri, 8 Jul 2011, Richard Henderson wrote:
I'm not sure I agree. We're talking about extremely lightweight functions here, in the order of a very few assembly instructions only. Adding a significant overhead relative to their cost is not very appealing. For example, we have this code located at 0xffff0fe0 to retrieve the TLS value. Here's the non-SMP implementation:
	ldr	r0, [pc, #(16 - 8)]
	bx	lr
Of course the location relative to the pc where the TLS value is retrieved is implementation specific and not part of the ABI at all. Yet, some people found the call to this code too much overhead and started fetching the TLS value directly from memory themselves (*). Obviously their program would break if executed on a SMP system because then the TLS value is not stored in memory. But my point is that they were willing to do such hacks to completely avoid the call overhead, and in such cases I don't see adding to it with a full blown VDSO as something positive.
(*) I even considered changing the location of the TLS value in that case to break those abusers and make it clear that this is not the proper interface.
Nicolas
On 07/08/2011 12:33 PM, Nicolas Pitre wrote:
...
My last word on a subject I'm not currently being paid to care about:
I find this attitude to be short-sighted. These exact same arguments were made about a.out vs elf, and all the horrible extra overhead that elf has with its plts and dynamic symbol resolution. And yet, thankfully, a.out has gone the way of the dodo.
I urge you to go ahead and create the VDSO anyway, even if you continue to bypass it with new fixed addresses for new entry points. The VDSO does more than simply hold the code for those functions, it also describes them with the symbol and symbol versioning tables, and the unwind info. This is of enormous value to debugging and other introspection tools.
r~
On Fri, 8 Jul 2011, Richard Henderson wrote:
The fact that people came up with hacks such as prelinking must not be for nothing.
I urge you to go ahead and create the VDSO anyway, even if you continue to bypass it with new fixed addresses for new entry points.
I have no problem with that at all. Keeping the fixed address ABI will at least provide an alternative to those who are always after the last bit of performance because they don't know better than calling get_tls() a couple thousand times per second within the same thread.
Absolutely.
Nicolas
On 07/07/2011 04:21 PM, David Gilbert wrote:
It's slightly tricky to get the memory layout correct, but ought not be *that* hard. The x86_64 port also had legacy absolute addresses to deal with.
There's limited support for dlopen within statically linked programs as well. The userland side can provide a static interface which defers to the kernel implementation.
Ideally this would be done with an IFUNC relocation, but ARM binutils doesn't support that feature yet. Thankfully it is ABI compatible to replace a normal function with an IFUNC at a later date.
r~
On Fri, Jul 08, 2011 at 09:03:51AM -0700, Richard Henderson wrote:
Talking to Will Deacon about this, it sounds like there may be little appetite for VDSO-ifying the vectors page unless there's a real, concrete benefit.
Making the libc startup's job slightly easier probably doesn't count as such a benefit, but if it renders possible something which is otherwise impossible to do from userspace then that would be more compelling.
But you don't dlopen the VDSO, it's just mapped by the kernel on process startup. To find it, on other architectures something in the C library needs to grab the base address passed by the kernel in the aux vector.
If calls into the VDSO are needed, the C library might have to find the VDSO and get its symbols by special means in the case of statically linked programs. I'm not sure how this works in practice for architectures using a VDSO.
If we can do this properly using IFUNC in the future, then it might make sense to go for an interim solution for now, and tidy it up when IFUNC is available.
Can individual IFUNCs be set to resolve at startup?
I'm still not sure how that would work for static executables...
Cheers ---Dave
On 07/08/2011 09:55 AM, Dave Martin wrote:
I don't consider it anything to do with "making libc's job easier", but of providing a standard mechanism for version control of the ABI.
All you're going to do by using a different mechanism is re-invent the wheel.
Of course. Did you think I didn't know that? And so does the dynamic linker, and the dynamic linker stub in libc.a. The point is to get a handle that you can pass into the other functions like dlsym.
See also _dl_vdso_vsym in libc/sysdeps/unix/sysv/linux/dl-vdso.c, and the uses in libc/sysdeps/unix/sysv/linux/*/init-first.c.
Can individual IFUNCs be set to resolve at startup?
No. They're kinda sorta a special type of PLT entry. So they're resolved either when called, or at startup. See also LD_BIND_NOW for the entire program and DT_BIND_NOW (-Wl,-z,now) for individual libraries and executables.
r~
On Fri, Jul 08, 2011 at 11:36:30AM -0700, Richard Henderson wrote:
I guess I misunderstood the point you were making when you referred to being able to call dlopen from statically-linked programs...
The key problem that we have is not how to direct the 64-bit atomics to the correct function, but rather what to do about it when there is no suitable target function at all.
I really think we have to detect that on process startup -- detecting that and destroying the program just when it first tries to do a cmpxchg64 is unacceptable IMHO. (*)
I don't think that a constructor function solves this in a maintainable way, because there is no metadata allowing us to determine when this check should be done relative to the other constructors. Simply putting it first raises the question of what order to do things in if we also have another constructor which must be called first.
IFUNC doesn't solve the problem because either it gets resolved lazily (violating the above principle (*)), or we have to force _all_ symbols to resolve at startup, which may have a significant impact on startup time for large programs.
We could modify the libc startup objects to do the check, but that feels a bit unclean.
That brings us back to VDSO -- am I right in thinking that the symbol versioning checks allow the program to be killed on startup if the VDSO ABI is too old? If so, maybe that's the right approach after all.
This doesn't replace the existing interface, and the few non-libc based programs can still interact directly with the vectors page to do the same check if they need to.
Cheers ---Dave
Dave Martin dave.martin@linaro.org writes:
IFUNCs are never resolved lazily; that's one way in which they differ from PLTs.
(BTW, in response to a comment upthread, ARM does support IFUNCs now.)
Richard
On Mon, Jul 11, 2011 at 10:42:27AM +0100, Richard Sandiford wrote:
Do you know when support was merged in eglibc?
For example, Linaro binutils 2.21.0.20110327-2ubuntu3 appears to support IFUNC, but the natty version of eglibc (2.13-0ubuntu13) does not handle it correctly.
Cheers ---Dave
On 11 July 2011 12:30, Dave Martin dave.martin@linaro.org wrote:
That's why I didn't try and use it; I thought it was best to wait for it to percolate through.
Richard's patch went into glibc-ports on April 26th and looks like it got imported into eglibc-ports on 5th May (svn rev 13608); there was also a fix from Joseph Myers in that area on 21st June.
Dave
On Mon, Jul 11, 2011 at 01:00:08PM +0100, David Gilbert wrote:
Just for context, I had a quick play to get a feel for the feasibility of implementing this directly, without relying either on a VDSO or on IFUNC.
It's possible to produce something which works reasonably well: see the attachment. The result is almost the same as what IFUNC would achieve (although &__kernel_cmpxchg64 leaves something to be desired; macros could potentially fix that).
[see attached]
The various helpers are checked independently at startup if they are used, by registering the resolver functions as constructors as others have suggested. (Note that gcc/libc already fails to do this check for the older kernel helper functions; it's just that you're unlikely to observe a problem on any reasonably new kernel.)
The problem of constructor ordering is solved by pointing each indirect function pointer at the appropriate resolver function initially. Threading is not taken into account because no additional threads can exist during execution of constructors (though if more libraries are later dlopen()'d they may invoke additional constructors in the presence of threads, so this might need fixing).
The main question is the correct way to blow the program up if a needed helper function is not available. Calling exit() is possibly too gentle, whereas calling abort() may be too aggressive. libc may have some correct internal mechanism for doing this, but my knowledge falls short of that.
What to do in the case of recursive lookup failures isn't obvious: currently I just call abort(). By this point the process should be running atexit handlers or destructors, so calling exit() isn't going to help in this situation anyway.
The other question is whether and how the program can be enabled to catch the error and report something intelligible to the user. If there is a hook for catching dynamic link failures, we should maybe use the same. If this problem is not solved for dynamic linking in general, it doesn't make sense to require the kuser functions to solve it either (except that if the problem is eventually solved, we want the fix to work smoothly for both).
Of course, all of these are generic dynamic linking challenges which ld.so likely has some policy or solution for already.
Overall, this isn't a perfect solution, but it doesn't feel catastrophically bad either.
Any thoughts?
Cheers ---Dave
On 12 July 2011 11:43, Dave Martin dave.martin@linaro.org wrote:
Just for context, I had a quick play to get a feel for the feasibility of implementing this directly, without relying either on a VDSO or on IFUNC.
I originally thought about doing something similar to what you've done with the indirection; but eventually convinced myself that it didn't provide anything that the initial constructor check didn't provide.
Does it help address rth's concerns though?
I don't see any point going back and adding checks for those; the code is already there.
I'm not sure threads are a problem; you could have two threads going into the check at the same time, but with any luck they would both make the same decision, rewrite the pointer to the same value, and either call the function or exit. I guess you could end up with two threads trying to print the error and exit at the same time.
I took the write() -> abort() from some other libc code that forced an exit.
You could always just send yourself a SIGKILL.
Isn't this a bit over-the-top for the failure we're trying to catch?
We seem to be growing solutions rather than figuring out which ones satisfy the requirements of both gcc and kernel:
1) my original simple solution (about 5 lines of code)
2) IFUNC
3) Your more general indirected code
4) A VDSO
To be honest I don't see the point in the more general indirected approach; if we want to be more general then I think we should use IFUNC (it would be the 1st use of it, which means we may have to fix some issues but hey that's what we're here for).
There is some stuff that is a bit of a shame that it wasn't more general already; e.g. those slots in the commpage for helper functions - if they were filled with a known marker, you could just check the marker rather than having a separate version, or a faulting instruction or the like.
Also, remember this whole discussion is just to print a message and exit nicely in the case of someone using a currently incredibly rare function on an old kernel!
RTH: From a gcc point of view is anything other than the VDSO acceptable?
Dave
On Tue, Jul 12, 2011 at 01:10:24PM +0100, David Gilbert wrote:
Which ones in particular?
Indeed -- I'm merely observing that this problem is not actually new with cmpxchg64; it's just that it wasn't solved previously.
Hmmm, probably right.
Probably, yes. abort() is almost the same, except that SIGKILL is not catchable.
It's probably up to libc guys to judge what would be correct.
Yes and no. Implementing such a solution just for this case would be very over-the-top. But integrating with a general mechanism for reporting errors if it exists (now or in the future) is very sensible.
I was just exploring the problem space; I appreciate that's not the same thing as choosing a solution.
Does IFUNC rely on support from libc in order to get resolved correctly? My concern is that if so, binaries using it will silently misbehave on older libcs. This seems to be the case at present, though I haven't figured out whether this is caused by libc or the toolchain. It may be that I'm worrying for nothing -- I don't know exactly how IFUNC works.
If it's necessary to do any check at all before concluding that it's safe to call a function, I don't see the difference. Checking a magic number in one place seems no better than checking a version number in another place, AFAICT.
Having a faulting instruction doesn't permit the program to abort cleanly ahead of time (unless you check in advance -- in which case you're really back to a magic number again).
This matters if constructors might do things like screwing up the terminal -- if so, the chance to un-screw things up should be offered to the program before it exits.
Possibly constructors should never do things like that, in which case these concerns don't apply. I don't know what the official line is regarding what kind of things global constructors should and shouldn't do.
I'd say we want to notify the operating environment and/or the user. This may be realised by writing some text to stderr, but that's not useful in the context of GUI environments.
I am not suggesting that we should engineer a special solution to that problem here, so long as we don't defeat potential solutions either. For now, it doesn't look like any of the proposals risk that, though.
Cheers ---Dave
Dave Martin dave.martin@linaro.org writes:
Well, it uses new relocations, so it relies on the dynamic loader implementing those relocations correctly. But older loaders from older libcs will error out if they see that relocation. Programs wouldn't silently misbehave if your loader is too old.
Static executables are responsible for resolving their own ifuncs (in the common start-up code) so they should run on any system.
Richard
On 12 July 2011 13:40, Dave Martin dave.martin@linaro.org wrote:
On Tue, Jul 12, 2011 at 01:10:24PM +0100, David Gilbert wrote:
Does it help address rth's concerns though?
Which ones in particular?
Good question - hence my prompt to rth at the bottom; I know he originally asked why not go to VDSO, so what I don't yet understand is whether he would accept the other alternatives.
Right - but I don't think we should be doing anything new here; we can't be the first people to be trying to detect running on an older kernel and exiting relatively cleanly.
Something like AUXV would have seemed cleaner to me; but that's just preference.
If I understand correctly that's the whole point of abort() - if they need to tidy up then they should be handling SIGABRT.
Yes it is; exiting with a non-zero return value and writing to stderr is perfectly normal - whatever launched it can go and display that text. (Or as I say trapping SIGABRT).
Dave
Sorry, should have included this in my last reply.
Dave Martin dave.martin@linaro.org writes:
Shouldn't a decent GUI environment be able to catch these kinds of errors and log them though?
One problem with early write()s is that we can't do any locale-specific translation. The message is always going to be in English. That's something that the VDSO approach solves for free, because it's then up to libc or the dynamic loader to print a suitable message.
Richard
On Tue, Jul 12, 2011 at 02:25:09PM +0100, Richard Sandiford wrote:
Ideally yes. Which is why calling some hook may be better than writing a message to stderr, because messages written to stderr are rarely usefully machine-readable (even if not locale-translated). GUI toolkits may provide such a hook, but in the case of dynamic linking errors and similar, the GUI toolkit is unlikely to have started up at the time the error occurs.
Anyway, it's out of our scope to solve this here, so long as we're no worse than the de facto behaviour.
Agreed, although locale-translated messages on stderr, while good for humans, tend to be even less machine-readable.
Cheers ---Dave
On Fri, Jul 08, 2011 at 12:21:27AM +0100, David Gilbert wrote:
On 5 July 2011 15:49, Dave Martin dave.martin@linaro.org wrote:
On Fri, Jul 1, 2011 at 6:10 PM, David Gilbert david.gilbert@linaro.org wrote:
[...]
My patch is having to test an arbitrary address in the commpage in init code to see if it's new enough to have the new helper that
Minor niggle: the address is in no way arbitrary. Although there's no corresponding C symbol, it's a well-known absolute address.
Note that pre-existing code already calls into the vectors page. Introducing a VDSO interface will not automatically cause all software to stop using the binary interface.
How much of a problem that is in practice, I don't know.
This doesn't necessarily mean that adding a VDSO interface is bad, but it does mean the existing interface can't easily be changed or removed.
Cheers ---Dave
On Fri, 8 Jul 2011, Dave Martin wrote:
Right, qemu must be intercepting it already. The kernel helpers located at a fixed address in the vector page have been around for more than 6 years.
Agreed. There is nothing preventing the addition of a VDSO, but I've yet to see a convincing argument for spending time adding it.
Nicolas
On 8 July 2011 19:32, Nicolas Pitre nicolas.pitre@linaro.org wrote:
QEMU supports calls into the fixed vector page (it just special cases attempts to execute at addresses >= 0xffff0000 and emits code to do "cause special purpose exception" so we get control back at runtime). What it doesn't support is attempts to do a load from the vector page, because so far nobody has had a need to probe the vector page for what it supports.
-- PMM
On Sat, Jul 09, 2011 at 12:29:01AM +0100, Peter Maydell wrote:
To solve that particular problem, can you just map a page in RAM with the right content for __kernel_helper_version?
The value there can be a static feature of QEMU, indicating which kuser helper functions QEMU knows how to emulate.
User processes shouldn't read anything else from the vectors page, at least for now.
Cheers ---Dave
On 11 July 2011 09:36, Dave Martin dave.martin@linaro.org wrote:
We could, but that depends on the target system letting us do that. (For that matter, if the target system decided to map another page there, accesses to it won't fault, and then we're going to get who knows what.)
Dave
On 11 July 2011 09:42, David Gilbert david.gilbert@linaro.org wrote:
To clarify, that's the system we're running on (usually x86, more often called the "host system"). This is because in linux-user emulation mode QEMU doesn't completely manage the guest address space, but just exposes a contiguous chunk of the host QEMU process' address space to the target. So if the host kernel has mapped something in the area corresponding to where the guest vector page is supposed to be then you're stuck.
(This implementation has other problems, eg https://bugs.launchpad.net/qemu-linaro/+bug/806873 but has the advantage of being fast... In the long term I quite like the idea suggested by rth on qemu-devel of making the linux-user mode use qemu's softmmu so we actually control the guest process address space completely; however that's a fairly substantial bit of rework.)
-- PMM