Hi,
We've been thinking about adding support for the GCC built-in functions for 64-bit atomic memory access and I'd like to know if this is of any interest. Currently the main use of these functions seems to be to implement (SMP-safe) locking mechanisms, where the existing 32-bit memory operations are sufficient. However, there might be code out there that implements a parallel algorithm using 64-bit atomic memory operations.
Currently the GCC ARM backend doesn't provide a pattern to inline the 64-bit __sync_* functions, so the compiler emits __sync_*_8 function calls [1]. libgcc does not provide these symbols via the usual thin wrapper around the kernel helper [2] because the ARM Linux __kernel_cmpxchg supports 32-bit values only. My understanding is that for ARMv7 the GCC backend could be enhanced to inline the __sync_* functions using the LDREXD and STREXD instructions, but for ARMv5 we would still rely on a new kernel helper.
Any ideas/thoughts are appreciated.
[1] https://wiki.linaro.org/WorkingGroups/ToolChain/AtomicMemoryOperations#GCC [2] https://wiki.linaro.org/WorkingGroups/ToolChain/AtomicMemoryOperations#imple...
Regards Ken
On Fri, 2011-05-06 at 17:06 +0200, Ken Werner wrote:
I've been told (though I've not verified it) that the Membase NoSQL database needs 64-bit atomic operations to work effectively (http://www.membase.org/downloadsrc)
R.
On 6 May 2011 16:06, Ken Werner ken.werner@linaro.org wrote:
It's a bit tricky when you want to use the kernel helper for v5t, so we've got to find a way of turning this on only with special knobs and not by default. Think new user space and old kernel and a jump into never-never land. For v7-a you could inline the function using ldrexd / strexd.
cheers Ramana
On Fri, 6 May 2011, Ramana Radhakrishnan wrote:
What is the problem with v5t?
Think new user space and old kernel and a jump into never-never land.
The kernel helpers are "versioned". At 0xffff0ffc you can read the number of helpers currently available. If a program uses a new helper, then at startup this value can be verified to equal or exceed the expected value, and the program can bail out otherwise.
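A sketch of how user space could perform such a check from C. The 0xffff0ffc version word is the documented ARM location; the function names and the guard for non-ARM builds are illustrative only:

```c
#include <stdint.h>

/* On ARM Linux the number of kuser helpers currently available is
   published as a word at 0xffff0ffc in the vectors page.  That address
   is ARM-specific and will fault anywhere else, hence the guard. */
static int32_t kuser_helper_version(void)
{
#if defined(__arm__) && defined(__linux__)
    return *(volatile const int32_t *)0xffff0ffc;
#else
    return 0;   /* stand-in so this sketch builds on other targets */
#endif
}

/* Pure comparison: does the running kernel publish at least the
   required number of helpers? */
int version_ok(int32_t available, int32_t required)
{
    return available >= required;
}

/* A program needing helper number N would call this once at startup
   and bail out instead of jumping into never-never land. */
int kuser_helpers_ok(int32_t required)
{
    return version_ok(kuser_helper_version(), required);
}
```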
Nicolas
On Sat, May 7, 2011 at 12:39 AM, Nicolas Pitre nicolas.pitre@linaro.org wrote:
Hi, when I run the following on my jaunty (for armv5), I get the following
echo "void f(){__sync_synchronize();}" | gcc -O2 -x c -Wall -dA -S - -o -
	.arch armv6
	.eabi_attribute 27, 3
	.fpu vfp
	.eabi_attribute 20, 1
	.eabi_attribute 21, 1
	.eabi_attribute 23, 3
	.eabi_attribute 24, 1
	.eabi_attribute 25, 1
	.eabi_attribute 26, 2
	.eabi_attribute 30, 2
	.eabi_attribute 18, 4
	.file	""
	.text
	.align	2
	.global	f
	.type	f, %function
f:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	@ basic block 2
	bx	lr
	.size	f, .-f
	.ident	"GCC: (Ubuntu 4.4.1-4ubuntu9) 4.4.1"
	.section	.note.GNU-stack,"",%progbits
but I can see with a debugger that the kernel helper functions are reached. Any idea how?
saeed
On Wed, May 11, 2011 at 6:14 PM, saeed bishara saeed.bishara@gmail.com wrote:
There was a compiler bug some time ago which mistakenly generated no code for __sync_synchronize(): https://bugs.launchpad.net/ubuntu/+source/gcc-4.4/+bug/491872
I suspect your Jaunty-era compiler may be affected by this -- you probably need to upgrade it.
If you're observing calls to the helpers, they may be coming from glibc etc. instead of your function.
You _should_ be seeing code something like this:
$ echo 'void f(void) { __sync_synchronize(); }' | arm-linux-gnueabi-gcc -march=armv7-a -O2 -S -o - -x c -
[...]
f:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	dmb	sy
	bx	lr
[...]
(If -march is set high enough, the dmb is explicit and inline.)
$ echo 'void f(void) { __sync_synchronize(); }' | arm-linux-gnueabi-gcc -march=armv5t -O2 -S -o - -x c -
[...]
f:
	push	{r3, lr}
	bl	__sync_synchronize
	@ sp needed for prologue
	pop	{r3, pc}
(-march=armv5t indicates an architecture which doesn't have the barrier instructions, so the kuser helper is needed instead)
Cheers ---Dave
On Fri, May 6, 2011 at 10:39 PM, Nicolas Pitre nicolas.pitre@linaro.org wrote:
To what extent do we think that userspace code is actually checking this?
I may suggest a patch to the documentation text in entry-armv.S to make this requirement clearer, as well as getting rid of the example assembler code (which I consider mis-educative; more importantly, it's also Thumb-incompatible and will likely suffer poor branch prediction on recent CPUs. This code was the source of a TLS bug in bionic, and may have been inappropriately pasted elsewhere too...)
Cheers ---Dave
On Tue, 17 May 2011, Dave Martin wrote:
I think right now none of it does, simply because most of the helpers were added almost all at once. But if future helpers are added then it would be a good idea to check this, though only if the new helper is actually invoked by a given application.
If you have suggestions for improving this then please do so.
What bug?
Nicolas
On Tue, May 17, 2011 at 5:30 PM, Nicolas Pitre nicolas.pitre@linaro.org wrote:
I have a patch which I'll suggest at some point, but it's not high priority.
Actually, to be fair I think I may be mis-remembering here... I can't seem to find the exact bug now.
Cheers ---Dave
On 05/19/2011 12:40 PM, David Rusling wrote:
Is this going to end up in a blueprint? This is the last loose end of SMP / atomic memory operations work and I'd like to see it happen
Hi,
Yep, there is one (kind of a skeleton) in place at: https://blueprints.launchpad.net/linaro-toolchain-misc/+spec/64-bit-sync-pri...
Regards Ken
On 19 May 2011 16:49, Ken Werner ken@linux.vnet.ibm.com wrote:
Which I'll be filling out in the next few days.
Dave
On Tue, May 24, 2011 at 10:33 AM, Michael Casadevall mcasadevall@ubuntu.com wrote:
Hi Michael. The topics for this planning cycle are listed here: https://wiki.linaro.org/Cycles/1111/TechnicalTopics/Toolchain
64 bit sync primitives are medium priority so they will be achieved in the next six months.
A draft of what is in whose queue is at: https://wiki.linaro.org/Cycles/1111/TechnicalTopics/Toolchain/Planning
The primitives are second in Dave's queue so should be started in the next three months.
-- Michael
On Tue, 24 May 2011, Michael Hope wrote:
FWIW, here's what the kernel part might look like, i.e. for compatibility with pre ARMv6k systems (beware, only compile tested):
diff --git a/arch/arm/kernel/entry-armv.S b/arch/arm/kernel/entry-armv.S
index e8d8856..53830a7 100644
--- a/arch/arm/kernel/entry-armv.S
+++ b/arch/arm/kernel/entry-armv.S
@@ -383,7 +383,7 @@ ENDPROC(__pabt_svc)
 	.endm

 	.macro	kuser_cmpxchg_check
-#if __LINUX_ARM_ARCH__ < 6 && !defined(CONFIG_NEEDS_SYSCALL_FOR_CMPXCHG)
+#if !defined(CONFIG_CPU_32v6K) && !defined(CONFIG_NEEDS_SYSCALL_FOR_CMPXCHG)
 #ifndef CONFIG_MMU
 #warning "NPTL on non MMU needs fixing"
 #else
@@ -392,7 +392,7 @@ ENDPROC(__pabt_svc)
 	@ perform a quick test inline since it should be false
 	@ 99.9999% of the time.  The rest is done out of line.
 	cmp	r2, #TASK_SIZE
-	blhs	kuser_cmpxchg_fixup
+	blhs	kuser_cmpxchg64_fixup
 #endif
 #endif
 	.endm
@@ -797,6 +797,139 @@ __kuser_helper_start:
 /*
  * Reference prototype:
  *
+ * int __kernel_cmpxchgd64(int64_t *oldval, int64_t *newval, int64_t *ptr)
+ *
+ * Input:
+ *
+ * r0 = pointer to oldval
+ * r1 = pointer to newval
+ * r2 = pointer to target value
+ * lr = return address
+ *
+ * Output:
+ *
+ * r0 = returned value (zero or non-zero)
+ * C flag = set if r0 == 0, clear if r0 != 0
+ *
+ * Clobbered:
+ *
+ * r3, flags
+ *
+ * Definition and user space usage example:
+ *
+ * typedef int (__kernel_cmpxchg64_t)(const int64_t *oldval,
+ *                                    const int64_t *newval,
+ *                                    volatile int64_t *ptr);
+ * #define __kernel_cmpxchg64 (*(__kernel_cmpxchg64_t *)0xffff0f60)
+ *
+ * Atomically store newval in *ptr if *ptr is equal to oldval for user space.
+ * Return zero if *ptr was changed or non-zero if no exchange happened.
+ * The C flag is also set if *ptr was changed to allow for assembly
+ * optimization in the calling code.
+ *
+ * Notes:
+ *
+ * - This routine already includes memory barriers as needed.
+ *
+ * - Due to the length of some sequences, this spans 2 regular kuser
+ *   "slots" so 0xffff0f80 is not used as a valid entry point.
+ */
+
+__kuser_cmpxchg64:				@ 0xffff0f60
+
+#if defined(CONFIG_NEEDS_SYSCALL_FOR_CMPXCHG)
+
+	/*
+	 * Poor you.  No fast solution possible...
+	 * The kernel itself must perform the operation.
+	 * A special ghost syscall is used for that (see traps.c).
+	 */
+	stmfd	sp!, {r7, lr}
+	ldr	r7, 1f			@ it's 20 bits
+	swi	__ARM_NR_cmpxchg64
+	ldmfd	sp!, {r7, pc}
+1:	.word	__ARM_NR_cmpxchg64
+
+#elif defined(CONFIG_CPU_32v6K)
+
+	stmfd	sp!, {r4, r5, r6, r7}
+	ldrd	r4, r5, [r0]			@ load old val
+	ldrd	r6, r7, [r1]			@ load new val
+	smp_dmb	arm
+1:	ldrexd	r0, r1, [r2]			@ load current val
+	eors	r3, r0, r4			@ compare with oldval (1)
+	eoreqs	r3, r1, r5			@ compare with oldval (2)
+	strexdeq r3, r6, r7, [r2]		@ store newval if eq
+	teqeq	r3, #1				@ success?
+	beq	1b				@ if no then retry
+	smp_dmb	arm
+	rsbs	r0, r3, #0			@ set returned val and C flag
+	ldmfd	sp!, {r4, r5, r6, r7}
+	usr_ret	lr
+
+#elif !defined(CONFIG_SMP)
+
+#ifdef CONFIG_MMU
+
+	/*
+	 * The only thing that can break atomicity in this cmpxchg64
+	 * implementation is either an IRQ or a data abort exception
+	 * causing another process/thread to be scheduled in the middle
+	 * of the critical sequence.  To prevent this, code is added to
+	 * the IRQ and data abort exception handlers to set the pc back
+	 * to the beginning of the critical section if it is found to be
+	 * within that critical section (see kuser_cmpxchg_fixup64).
+	 */
+	stmfd	sp!, {r4, r5, r6, lr}
+	ldmia	r0, {r4, r5}			@ load old val
+	ldmia	r1, {r6, lr}			@ load new val
+1:	ldmia	r2, {r0, r1}			@ load current val
+	eors	r3, r0, r4			@ compare with oldval (1)
+	eoreqs	r3, r1, r5			@ compare with oldval (2)
+2:	stmeqia	r2, {r6, lr}			@ store newval if eq
+	rsbs	r0, r3, #0			@ set return val and C flag
+	ldmfd	sp!, {r4, r5, r6, pc}
+
+	.text
+kuser_cmpxchg64_fixup:
+	@ Called from kuser_cmpxchg_fixup.
+	@ r2 = address of interrupted insn (must be preserved).
+	@ sp = saved regs. r7 and r8 are clobbered.
+	@ 1b = first critical insn, 2b = last critical insn.
+	@ If r2 >= 1b and r2 <= 2b then saved pc_usr is set to 1b.
+	mov	r7, #0xffff0fff
+	sub	r7, r7, #(0xffff0fff - (0xffff0f60 + (1b - __kuser_cmpxchg64)))
+	subs	r8, r2, r7
+	rsbcss	r8, r8, #(2b - 1b)
+	strcs	r7, [sp, #S_PC]
+#if __LINUX_ARM_ARCH__ < 6
+	b	kuser_cmpxchg32_fixup
+#else
+	mov	pc, lr
+#endif
+	.previous
+
+#else
+#warning "NPTL on non MMU needs fixing"
+	mov	r0, #-1
+	adds	r0, r0, #0
+	usr_ret	lr
+#endif
+
+#else
+#error "incoherent kernel configuration"
+#endif
+
+	/* pad to next slot */
+	.rept	(16 - (. - __kuser_cmpxchg64)/4)
+	.word	0
+	.endr
+
+	.align	5
+
+/*
+ * Reference prototype:
+ *
  * void __kernel_memory_barrier(void)
  *
  * Input:
@@ -921,7 +1054,7 @@ __kuser_cmpxchg:	@ 0xffff0fc0
 	usr_ret	lr

 	.text
-kuser_cmpxchg_fixup:
+kuser_cmpxchg32_fixup:
 	@ Called from kuser_cmpxchg_check macro.
 	@ r2 = address of interrupted insn (must be preserved).
 	@ sp = saved regs. r7 and r8 are clobbered.
On Tue, May 24, 2011 at 11:45:04PM -0400, Nicolas Pitre wrote:
* Do not attempt to call this function unless __kernel_helper_version >= 5. *
Can we just have movcs pc,lr here, and put kuser_cmpxchg32_fixup immediately after?
This would allow us to skip the branch, and the initial "mov r7" in the kuser_cmpxchg32_fixup code.
There's a fair amount of duplicated logic from the 32-bit case. Is it worth trying to merge the two?
Since we can reasonably expect there to be no additional cmpxchg helpers in the future, the answer may be "no"...
Cheers ---Dave
On Wed, 25 May 2011, Dave Martin wrote:
Yep. I will queue your other patch prior to this one and make this blurb consistent with the rest.
The 'mov r7' is still needed unless the second instruction uses another target register.
I thought about the possibility of "movcs pc, lr", especially since the .text segments are simply concatenated and therefore the branch is effectively branching to the very next instruction. So there could be a common preamble, then a list of concatenated fixup segments (only two of them now), and finally a postamble which would simply be "mov pc, lr". This would all be made contiguous at link time. However, I'm not sure yet whether it's worth optimizing that far, since this code is far from being hot, and clarity would also suffer.
There's a fair amount of duplicated logic from the 32-bit case. Is it worth trying to merge the two?
The core logic spans 5 instructions. Only 3 of them are actually the same in both cases i.e. those with no references to 1b or 2b, and they're perfectly interlaced in between the others. Trying to make this into common runtime code would result in even bigger code I'm afraid. This could be made into common source code via a macro though.
Thanks for the review.
Nicolas
On Wed, May 25, 2011 at 10:21:43AM -0400, Nicolas Pitre wrote:
Oh, OK. I hadn't had feedback on that yet, so I wasn't sure whether anyone was picking it up. I will repost to lakml, but I was waiting for the merge window to close since this was just a tidy-up rather than something urgent.
This warning is more important for the new helper than for the existing ones (where really we could omit it without much risk).
What I meant is that at the end of the above sequence, r7 = 1b. So we can just fall through into the 32-bit fixup code and add the appropriate small offset to r7 so that it then points to the corresponding 1b label for __kuser_cmpxchg32, without needing to re-derive the address using mov+sub.
It's a pretty minor tweak though.
Also, probably the number of fixups is never going to grow very large.
Fair enough -- a macro might be worth a try _if_ it simplifies things in the source, but I think you're right that merging the actual code probably isn't worth it just to save a few words in the vectors page (which eats up 4K regardless of what we put in it) and a few cycles per fault (which already costs many, many cycles).
Cheers ---Dave
On Thu, 26 May 2011, Dave Martin wrote:
In the normal cases, there is no additional cycles per fault as the inline check remains unchanged, and it goes like this:
	@ Make sure our user space atomic helper is restarted
	@ if it was interrupted in a critical region.  Here we
	@ perform a quick test inline since it should be false
	@ 99.9999% of the time.  The rest is done out of line.
	cmp	r2, #TASK_SIZE
	blhs	kuser_cmpxchg_fixup
In most cases the branch is not taken.
Nicolas
On 25 May 2011 04:45, Nicolas Pitre nicolas.pitre@linaro.org wrote:
FWIW, here's what the kernel part might look like, i.e. for compatibility with pre ARMv6k systems (beware, only compile tested):
OK, so that makes the eglibc part of this pretty easy. For things like fetch_and_add (which I can see membase needs), would you expect an implementation based on this cmpxchg, so that it has a fallback, or one that just uses ldrexd directly, which I assume would be somewhat more efficient?
(Question holds for both eglibc and gcc's __sync_*)
Hmm I wonder whether something like the atomic add primitives from user space need that or not; it depends whether people are using them to build larger data structures or just trying to keep a consistent count somewhere.
Dave
On Wed, May 25, 2011 at 12:58:30PM +0100, David Gilbert wrote:
It depends on the baseline architecture for the build.
An eglibc built for ARMv6 and above would need to call the helper by default, though it could also use ldrexd/strexd if it determines at runtime that these are supported by the CPU.
Similarly, if GCC is building for -march=armv7-a it can inline the atomics directly using ldrex/strex and friends, but for -march=armv6 it will need to call helpers via libgcc.
The GCC __sync_* primitives must mostly be full barriers. This is what the Itanium ABI specifies (this is the spec GCC follows for these).
The kernel user helper itself could omit the barriers, but this would deviate from the existing 32-bit implementation, and also slow down the (common) case where at least one of the barriers really is needed.
_just_ updating a counter doesn't need the barriers. But if it's important that a counter can be atomically updated by multiple threads, the synchronisation of that update against other data structures usually turns out to be important too...
larger data structures or just trying to keep a consistent count somewhere.
GCC's atomic primitives don't address the full/partial barrier distinction. Some other atomics APIs, like Qt's for example, do express different flavours of barrier behaviour and so in principle can be more optimal for cases where it makes a difference.
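Within the __sync family itself there are two documented exceptions to the full-barrier rule: __sync_lock_test_and_set is only an acquire barrier, and __sync_lock_release only a release barrier. A minimal spinlock sketch using them:

```c
/* __sync_lock_test_and_set is an atomic exchange with acquire
   semantics; __sync_lock_release stores 0 with release semantics.
   These are the two non-full-barrier members of GCC's __sync family. */

static volatile int lock;

void spin_lock(void)
{
    /* Spin while the previous value was 1 (lock already held). */
    while (__sync_lock_test_and_set(&lock, 1))
        ;
}

void spin_unlock(void)
{
    /* Release the lock: store 0 with release semantics. */
    __sync_lock_release(&lock);
}
```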
Cheers ---Dave
On 05/25/2011 03:17 PM, Dave Martin wrote:
I would have thought that the libc does not decide this directly but just calls the GCC __sync_* routines (if built with a GCC that supports them). Then GCC decides whether to inline them using ldrexd/strexd (ARMv6+) or emit calls to libgcc, which calls the kernel helpers.
Regards Ken
On Tue, 2011-05-31 at 13:17 +0100, Dave Martin wrote:
I think the difficulty here is that glibc expects either the compiler or libgcc to provide the sync primitives; and while GCC can tie the inlined copy of the primitive to CPUs with the relevant instruction, the libgcc version has no way to express that the code it's relying on requires a minimum kernel version...
It could throw the dependency back on glibc, but then you've got an expensive operation (the libgcc copies are normally implemented as private, per-library, helpers to avoid a PLT call overhead).
I'm not sure what the best solution is here.
R.
On Tue, May 31, 2011 at 03:35:42PM +0100, Richard Earnshaw wrote:
The libgcc 64-bit atomic helpers could do a runtime check on the __kernel_helper_version field in the vectors page before calling the 64-bit cmpxchg helper. This will allow the absence of the helper to be reliably detected on older kernels. Because this is data, it might cause an extra D-TLB miss to accompany any other miss associated with calling the kernel helper (if it exists).
It's an overhead, so this may not be very desirable; however, the overhead will normally not apply if the platform is built for armv7+, since I believe in that case the atomics will usually get inlined -- is that the case?
If the runtime check finds the kernel helper isn't there, there's a question of what to do. If libgcc was not built for v7, the safest approach is probably to spit out a diagnostic message and call abort() or similar, since there's no guarantee of being able to do a 64-bit atomic at all in such situations.
These issues only apply to the 64-bit atomics. For the other atomics, we have a de facto "assume the kernel is new enough" policy, which seems OK in practice due to the fact that the GCC atomics support is rather newer than the kernel helpers themselves in any case.
Cheers ---Dave
On Tue, 31 May 2011, Dave Martin wrote:
Isn't there a way to pull in a global constructor or similar whenever a reference to the 64-bit cmpxchg helper is made, so that the constructor code could simply validate the kuser version number and bail out otherwise?
I know this is certainly the case on x86 that if you statically compile a binary on a modern distro in the hope of being able to execute that binary anywhere, you may get a "kernel too old" error message when trying to run it on older distros. The same thing could be done for the ARM 64-bit cmpxchg helper as well, at startup rather than on every invocation, no?
Nicolas
On Tue, May 31, 2011 at 12:53:35PM -0400, Nicolas Pitre wrote:
Hmm, that should be possible.
I don't know how this interacts with other constructors: really this would need to run at a defined point early in program startup. If it just ends up somewhere in the middle of the list of global constructors, we'd have a problem if some of the other constructors try to use the 64-bit atomics.
Richard, do you know how this could work?
As a principle, it feels like it ought to work.
Cheers ---Dave
On 31 May 2011 15:35, Richard Earnshaw rearnsha@arm.com wrote:
You say 'the libgcc version doesn't know how to specify that the code it's relying on requires a minimal kernel version...'
but why isn't that code just in libgcc; if the libgcc helper gets called and the kernel interface is too old it throws an error.
If the compiler knows it's on 6k or above, it inlines; else it calls libgcc.
If libgcc was built with a 6k compiler, it inlines (in case something explicitly calls the library routine).
If libgcc was built with a <6k compiler, it checks to see if it finds itself on a new enough kernel, else it errors.
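The decision tree above could be sketched as follows (all names illustrative; kernel_helpers stands for the word read from 0xffff0ffc at startup):

```c
/* Sketch of the dispatch policy described above, as pure logic. */
typedef enum {
    INLINE_LDREXD,   /* compiler targets 6k+: inline ldrexd/strexd */
    LIBGCC_LDREXD,   /* libgcc built for 6k+: its copy inlines */
    KERNEL_HELPER,   /* pre-6k libgcc on a new enough kernel */
    TOO_OLD          /* pre-6k libgcc, kernel lacks the helper: error */
} sync8_strategy;

sync8_strategy pick_sync8(int compiler_is_6k, int libgcc_is_6k,
                          int kernel_helpers)
{
    if (compiler_is_6k)
        return INLINE_LDREXD;
    if (libgcc_is_6k)
        return LIBGCC_LDREXD;
    /* Helper number 5 is the proposed 64-bit cmpxchg entry. */
    return kernel_helpers >= 5 ? KERNEL_HELPER : TOO_OLD;
}
```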
eglibc always uses the compiler's sync primitives
(I've not checked whether all of eglibc's atomics match gcc's or if it needs any more).
Dave
On Tue, 2011-05-31 at 16:03 +0100, David Gilbert wrote:
You don't want to repeat the test every time the sync function is called. That's excessive. Further, you want to know when the program starts if the kernel isn't current enough, not a long way into execution.
It might be possible to construct an object file that contained the support, but which also contained a 'constructor' that was run at init time. The 'constructor' would run the kernel version check and terminate the program if it wasn't sufficiently up-to-date. That still leaves the question of how this should be done: printing a message isn't very useful if the program is GUI-based. The constructor would have to be run early to avoid the case of another (user or library) constructor needing the sync primitive.
eglibc always uses the compiler's sync primitives
Which the compiler will either inline, or fall back to the libgcc versions.
(I've not checked whether all of eglibc's atomics match gcc's or if it needs any more).
I presume that (e)glibc doesn't use 64-bit sync or we would have added it before now...
R.
On 25 May 2011 04:45, Nicolas Pitre nicolas.pitre@linaro.org wrote:
<snip>
FWIW, here's what the kernel part might look like, i.e. for compatibility with pre ARMv6k systems (beware, only compile tested):
<snip>
Hi Nicolas, I've just about got a set of gcc backend changes working for the inline case and plan on attacking libgcc next week.
What are your intentions for that code that you sent in this message - do you intend to finish it off and upstream it or were you wanting me to do that as part of this task?
If you're doing it, could you confirm the interface to work to? Also, do you think there should be halfword and byte interfaces as well, given that halfword and byte ldrex implementations aren't available on older ARMs, or would you expect user space to do a bit of mask fiddling with the 32-bit ones?
Dave
On Fri, 10 Jun 2011, David Gilbert wrote:
I'll refine it and push it upstream.
If you're doing it could you confirm the interface to work to.
Here it goes:
 * Reference prototype:
 *
 * int __kernel_cmpxchgd64(const int64_t *oldval,
 *                         const int64_t *newval,
 *                         volatile int64_t *ptr);
 *
 * Input:
 *
 * r0 = pointer to oldval
 * r1 = pointer to newval
 * r2 = pointer to target value
 * lr = return address
 *
 * Output:
 *
 * r0 = returned value (zero or non-zero)
 * C flag = set if r0 == 0, clear if r0 != 0
 *
 * Clobbered:
 *
 * r3, condition flags
 *
 * Definition and user space usage example:
 *
 * typedef int (__kernel_cmpxchg64_t)(const int64_t *oldval,
 *                                    const int64_t *newval,
 *                                    volatile int64_t *ptr);
 * #define __kernel_cmpxchg64 (*(__kernel_cmpxchg64_t *)0xffff0f60)
 *
 * Atomically store *newval in *ptr if *ptr is equal to *oldval for user space.
 * Return zero if *ptr was changed or non-zero if no exchange happened.
 * The C flag is also set if *ptr was changed to allow for assembly
 * optimization in the calling code.
 *
 * Notes:
 *
 * - This routine already includes memory barriers as needed.
 *
 * - Due to the length of some sequences, this spans 2 regular kuser
 *   "slots" so 0xffff0f80 is not used as a valid entry point.
 *
 * - This call is valid only if __kernel_helper_version >= 5.
Of course, as discussed, this would be preferable if the interface in libgcc was in a separate object file so any reference to it would also bring in a global constructor from which the __kernel_helper_version value could be verified in order to prevent usage of this interface on kernels that lack it.
It is indeed better if user space deals with it using the 32-bit interface. I'd prefer adding new kernel helpers only when absolutely required.
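As a sketch of how a libgcc fallback such as __sync_fetch_and_add_8 might be built on top of the interface above. Here cmpxchg64 is a portable stand-in with the same return contract (zero if *ptr was changed); on ARM one would instead call the kuser entry, i.e. #define __kernel_cmpxchg64 (*(__kernel_cmpxchg64_t *)0xffff0f60):

```c
#include <stdint.h>

/* Portable stand-in for the proposed __kernel_cmpxchg64 helper,
   matching its contract: return zero if *ptr was changed. */
static int cmpxchg64(const int64_t *oldval, const int64_t *newval,
                     volatile int64_t *ptr)
{
    return !__sync_bool_compare_and_swap(ptr, *oldval, *newval);
}

/* 64-bit fetch-and-add built from a compare-and-swap retry loop. */
int64_t fetch_and_add64(volatile int64_t *ptr, int64_t inc)
{
    int64_t old, sum;
    do {
        old = *ptr;          /* snapshot the current value */
        sum = old + inc;
    } while (cmpxchg64(&old, &sum, ptr) != 0);  /* retry if it moved */
    return old;              /* return the pre-add value */
}
```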
Nicolas
On 10 June 2011 20:38, Nicolas Pitre nicolas.pitre@linaro.org wrote:
OK, thanks.
OK
<snip>
I was thinking of looking at ifunc for this;but I've not looked yet.
Actually, I've found Julian Brown already implemented that in user space in 2008 (see gcc/config/arm/linux-atomic.c)
Dave
Just to keep people up to date, attached is my current working diff on gcc (against gcc git of a few days ago)
It seems to work for the armv7 case; the testcases pass, but I need to chat with some gcc people to check that the testcases themselves are OK, and I need to do a full native build and test. I'd expect it to need some more clean-up yet, so don't panic over formatting etc.
libgcc fallbacks aren't done yet (but it will try and call libgcc).
Dave
Hi All, I've just submitted the patches for the 64 bit atomic stuff to the gcc-patches list. Richard Henderson has raised the question of why the ARM commpage isn't a full VDSO and, if it was, then it would make the version number check a lot simpler.
What's the history behind this, and how big a job is it?
Dave (on holiday for a week but checking mail intermittently)
On Fri, Jul 1, 2011 at 6:10 PM, David Gilbert david.gilbert@linaro.org wrote:
I think it's there partly for historical reasons, and partly because there has to be a page mapped somewhere for the exception vectors anyway, so we may as well put other snippets of required userspace code there.
Dave (on holiday for a week but checking mail intermittently)
Because existing userspace source and binaries already call directly into the helper functions in the vectors page, we can't easily remove/move it.
We could possibly wrap the vectors page to make it look like a DSO in a forwards-compatible way, but since this has not happened so far it feels like people never saw much benefit.
Can you describe the benefits with regard to this case? (And, out of interest, how do statically-linked programs make use of needed functionality in the VDSO? Browsing the web hasn't revealed any coherent answers to that question for me...)
Cheers ---Dave
On 5 July 2011 15:49, Dave Martin dave.martin@linaro.org wrote:
OK.
Any idea how big a job that is?
Richard's question to my patch is here:
http://old.nabble.com/-Patch-0-3--ARM-64-bit-atomic-operations-tp31974816p31...
My patch has to test an arbitrary address in the commpage in init code to see whether the kernel is new enough to have the new helper that Nicolas just added. Richard's argument is that if it were actually a VDSO I'd just have linked against a symbol, and if the symbol wasn't there I'd have got a fairly normal linker error - I wouldn't have needed any special code. To be honest, I think that would also solve the problem that accessing that wacky address is a pain for things like qemu, since we're going to have to intercept that read.
(I've added Richard to the cc).
Dave
On Fri, 8 Jul 2011, Richard Henderson wrote:
I'm not sure I agree. We're talking about extremely lightweight functions here, in the order of a very few assembly instructions only. Adding a significant overhead relative to their cost is not very appealing. For example, we have this code located at 0xffff0fe0 to retrieve the TLS value. Here's the non-SMP implementation:
	ldr	r0, [pc, #(16 - 8)]
	bx	lr
Of course the location relative to the pc where the TLS value is retrieved is implementation specific and not part of the ABI at all. Yet, some people found the call to this code too much overhead and started fetching the TLS value directly from memory themselves (*). Obviously their program would break if executed on a SMP system because then the TLS value is not stored in memory. But my point is that they were willing to do such hacks to completely avoid the call overhead, and in such cases I don't see adding to it with a full blown VDSO as something positive.
(*) I even considered changing the location of the TLS value in that case to break those abusers and make it clear that this is not the proper interface.
Nicolas
On 07/08/2011 12:33 PM, Nicolas Pitre wrote:
...
My last word on a subject I'm not currently being paid to care about:
I find this attitude to be short-sighted. These exact same arguments were made about a.out vs elf, and all the horrible extra overhead that elf has with its plts and dynamic symbol resolution. And yet, thankfully, a.out has gone the way of the dodo.
I urge you to go ahead and create the VDSO anyway, even if you continue to bypass it with new fixed addresses for new entry points. The VDSO does more than simply hold the code for those functions, it also describes them with the symbol and symbol versioning tables, and the unwind info. This is of enormous value to debugging and other introspection tools.
r~
On Fri, 8 Jul 2011, Richard Henderson wrote:
The fact that people came up with hacks such as prelinking must not be for nothing.
I urge you to go ahead and create the VDSO anyway, even if you continue to bypass it with new fixed addresses for new entry points.
I have no problem with that at all. Keeping the fixed address ABI will at least provide an alternative to those who are always after the last bit of performance because they don't know better than calling get_tls() a couple thousand times per second within the same thread.
Absolutely.
Nicolas
On 07/07/2011 04:21 PM, David Gilbert wrote:
It's slightly tricky to get the memory layout correct, but ought not be *that* hard. The x86_64 port also had legacy absolute addresses to deal with.
There's limited support for dlopen within statically linked programs as well. The userland side can provide a static interface which defers to the kernel implementation.
Ideally this would be done with an IFUNC relocation, but ARM binutils doesn't support that feature yet. Thankfully it is ABI compatible to replace a normal function with an IFUNC at a later date.
r~
On Fri, Jul 08, 2011 at 09:03:51AM -0700, Richard Henderson wrote:
Talking to Will Deacon about this, it sounds like there may be little appetite for VDSO-ifying the vectors page unless there's a real, concrete benefit.
Making the libc startup's job slightly easier probably doesn't count as such a benefit, but if it renders possible something which is otherwise impossible to do from userspace then that would be more compelling.
But you don't dlopen the VDSO, it's just mapped by the kernel on process startup. To find it, on other architectures something in the C library needs to grab the base address passed by the kernel in the aux vector.
If calls into the VDSO are needed, the C library might have to find the VDSO and get its symbols by special means in the case of statically linked programs. I'm not sure how this works in practice for architectures using a VDSO.
If we can do this properly using IFUNC in the future, then it might make sense to go for an interim solution for now, and tidy it up when IFUNC is available.
Can individual IFUNCs be set to resolve at startup?
I'm still not sure how that would work for static executables...
Cheers ---Dave
On 07/08/2011 09:55 AM, Dave Martin wrote:
I don't consider it anything to do with "making libc's job easier", but of providing a standard mechanism for version control of the ABI.
All you're going to do by using a different mechanism is re-invent the wheel.
Of course. Did you think I didn't know that? And so does the dynamic linker, and the dynamic linker stub in libc.a. The point is to get a handle that you can pass into the other functions like dlsym.
See also _dl_vdso_vsym in libc/sysdeps/unix/sysv/linux/dl-vdso.c, and the uses in libc/sysdeps/unix/sysv/linux/*/init-first.c.
Can individual IFUNCs be set to resolve at startup?
No. They're kinda sorta a special type of PLT entry. So they're resolved either when called, or at startup. See also LD_BIND_NOW for the entire program and DT_BIND_NOW (-Wl,-z,now) for individual libraries and executables.
r~
On Fri, Jul 08, 2011 at 11:36:30AM -0700, Richard Henderson wrote:
I guess I misunderstood the point you were making when you referred to being able to call dlopen from statically-linked programs...
The key problem that we have is not how to direct the 64-bit atomics to the correct function, but rather what to do about it when there is no suitable target function at all.
I really think we have to detect that on process startup -- detecting that and destroying the program just when it first tries to do a cmpxchg64 is unacceptable IMHO. (*)
I don't think that a constructor function solves this in a maintainable way, because there is no metadata allowing us to determine when this check should be done relative to the other constructors. Simply putting it first raises the question of what order to do things in if we also have another constructor which must be called first.
IFUNC doesn't solve the problem because either it gets resolved lazily (violating the above principle (*)), or we have to force _all_ symbols to resolve at startup, which may have a significant impact on startup time for large programs.
We could modify the libc startup objects to do the check, but that feels a bit unclean.
That brings us back to VDSO -- am I right in thinking that the symbol versioning checks allow the program to be killed on startup if the VDSO ABI is too old? If so, maybe that's the right approach after all.
This doesn't replace the existing interface, and the few non-libc based programs can still interact directly with the vectors page to do the same check if they need to.
Cheers ---Dave
Dave Martin dave.martin@linaro.org writes:
IFUNCs are never resolved lazily; that's one way in which they differ from PLTs.
(BTW, in response to a comment upthread, ARM does support IFUNCs now.)
Richard
On Mon, Jul 11, 2011 at 10:42:27AM +0100, Richard Sandiford wrote:
Do you know when support was merged in eglibc?
For example, Linaro binutils 2.21.0.20110327-2ubuntu3 appears to support IFUNC, but the natty version of eglibc (2.13-0ubuntu13) does not handle it correctly.
Cheers ---Dave
On 11 July 2011 12:30, Dave Martin dave.martin@linaro.org wrote:
That's why I didn't try and use it; I thought it was best to wait for it to percolate through.
Richard's patch went into glibc-ports on April 26th and looks like it got imported into eglibc-ports on 5th May (svn rev 13608); there was also a fix from Joseph Myers in that area on 21st June.
Dave
On Mon, Jul 11, 2011 at 01:00:08PM +0100, David Gilbert wrote:
Just for context, I had a quick play to get a feel for the feasibility of implementing this directly, without relying either on a VDSO or on IFUNC.
It's possible to produce something which works reasonably well: see the attachment. The result is almost the same as what IFUNC would achieve (although &__kernel_cmpxchg64 leaves something to be desired; macros could potentially fix that).
[see attached]
The various helpers are checked independently at startup if they are used, by registering the resolver functions as constructors as others have suggested. (Note that gcc/libc already fails to do this check for the older kernel helper functions; it's just that you're unlikely to observe a problem on any reasonably new kernel.)
The problem of constructor ordering is solved by pointing each indirect function pointer at the appropriate resolver function initially. Threading is not taken into account because no additional threads can exist during execution of constructors (though if more libraries are later dlopen()'d they may invoke additional constructors in the presence of threads, so this might need fixing).
The main question is the correct way to blow the program up if a needed helper function is not available. Calling exit() is possibly too gentle, whereas calling abort() may be too aggressive. libc may have some correct internal mechanism for doing this, but my knowledge falls short of that.
What to do in the case of recursive lookup failures isn't obvious: currently I just call abort(). By this point the process should be running atexit handlers or destructors, so calling exit() isn't going to help in this situation anyway.
The other question is whether and how the program can be enabled to catch the error and report something intelligible to the user. If there is a hook for catching dynamic link failures, we should maybe use the same. If this problem is not solved for dynamic linking in general, it doesn't make sense to require the kuser functions to solve it either (except that if the problem is eventually solved, we want the fix to work smoothly for both).
Of course, all of these are generic dynamic linking challenges which ld.so likely has some policy or solution for already.
Overall, this isn't a perfect solution, but it doesn't feel catastrophically bad either.
Any thoughts?
Cheers ---Dave
On 12 July 2011 11:43, Dave Martin dave.martin@linaro.org wrote:
Just for context, I had a quick play to get a feel for the feasibility of implementing this directly, without relying either on a VDSO or on IFUNC.
I originally thought about doing something similar to what you've done with the indirection; but eventually convinced myself that it didn't provide anything that the initial constructor check didn't provide.
Does it help address rth's concerns though?
I don't see any point going back and adding checks for those; the code is already there.
I'm not sure threads are a problem; you could have two threads going into the check at the same time, but with any luck they would both make the same decision, rewrite the pointer to the same value, and either call the function or exit. I guess you could end up with two threads trying to print the error and exit at the same time.
I took the write() -> abort() from some other libc code that forced an exit.
You could always just send yourself a SIGKILL.
Isn't this a bit over-the-top for the failure we're trying to catch?
We seem to be growing solutions rather than figuring out which ones satisfy the requirements of both gcc and kernel:
1) my original simple solution (about 5 lines of code)
2) IFUNC
3) Your more general indirected code
4) A VDSO
To be honest I don't see the point in the more general indirected approach; if we want to be more general then I think we should use IFUNC (it would be the 1st use of it, which means we may have to fix some issues but hey that's what we're here for).
There is some stuff that is a bit of a shame that it wasn't more general already; e.g. those slots in the commpage for helper functions - if they were filled with a known marker, you could just check the marker rather than having a separate version, or a faulting instruction or the like.
Also, remember this whole discussion is just to print a message and exit nicely in the case of someone using a currently incredibly rare function on an old kernel!
RTH: From a gcc point of view is anything other than the VDSO acceptable?
Dave
On Tue, Jul 12, 2011 at 01:10:24PM +0100, David Gilbert wrote:
Which ones in particular?
Indeed -- I'm merely observing that this problem is not actually new with cmpxchg64; it's just that it wasn't solved previously.
Hmmm, probably right.
Probably, yes. abort() is almost the same, except that SIGKILL is not catchable.
It's probably up to libc guys to judge what would be correct.
Yes and no. Implementing such a solution just for this case would be very over-the-top. But integrating with a general mechanism for reporting errors if it exists (now or in the future) is very sensible.
I was just exploring the problem space; I appreciate that's not the same thing as choosing a solution.
Does IFUNC rely on support from libc in order to get resolved correctly? My concern is that if so, binaries using it will silently misbehave on older libcs. This seems to be the case at present, though I haven't figured out whether this is caused by libc or the toolchain. It may be that I'm worrying for nothing -- I don't know exactly how IFUNC works.
If it's necessary to do any check at all before concluding that it's safe to call a function, I don't see the difference. Checking a magic number in one place seems no better than checking a version number in another place, AFAICT.
Having a faulting instruction doesn't permit the program to abort cleanly ahead of time (unless you check in advance -- in which case you're really back to a magic number again).
This matters if constructors might do things like screwing up the terminal -- if so, the chance to un-screw things up should be offered to the program before it exits.
Possibly constructors should never do things like that, in which case these concerns don't apply. I don't know what the official line is regarding what kind of things global constructors should and shouldn't do.
I'd say we want to notify the operating environment and/or the user. This may be realised by writing some text to stderr, but that's not useful in the context of GUI environments.
I am not suggesting that we should engineer a special solution to that problem here, so long as we don't defeat potential solutions either. For now, it doesn't look like any of the proposals risk that, though.
Cheers ---Dave
Dave Martin dave.martin@linaro.org writes:
Well, it uses new relocations, so it relies on the dynamic loader implementing those relocations correctly. But older loaders from older libcs will error out if they see that relocation. Programs wouldn't silently misbehave if your loader is too old.
Static executables are responsible for resolving their own ifuncs (in the common start-up code) so they should run on any system.
Richard
On 12 July 2011 13:40, Dave Martin dave.martin@linaro.org wrote:
On Tue, Jul 12, 2011 at 01:10:24PM +0100, David Gilbert wrote:
Does it help address rth's concerns though?
Which ones in particular?
Good question - hence my prompt to rth at the bottom; I know he originally asked why not go to VDSO, so what I don't yet understand is whether he would accept the other alternatives.
Right - but I don't think we should be doing anything new here; we can't be the first people to be trying to detect running on an older kernel and exiting relatively cleanly.
Something like AUXV would have seemed cleaner to me; but that's just preference.
If I understand correctly that's the whole point of abort() - if they need to tidy up then they should be handling SIGABRT.
Yes it is; exiting with a non-zero return value and writing to stderr is perfectly normal - whatever launched it can go and display that text. (Or as I say trapping SIGABRT).
Dave
Sorry, should have included this in my last reply.
Dave Martin dave.martin@linaro.org writes:
Shouldn't a decent GUI environment be able to catch these kinds of errors and log them though?
One problem with early write()s is that we can't do any locale-specific translation. The message is always going to be in English. That's something that the VDSO approach solves for free, because it's then up to libc or the dynamic loader to print a suitable message.
Richard
On Tue, Jul 12, 2011 at 02:25:09PM +0100, Richard Sandiford wrote:
Ideally yes. Which is why calling some hook may be better than writing a message to stderr, because messages written to stderr are rarely usefully machine-readable (even if not locale-translated). GUI toolkits may provide such a hook, but in the case of dynamic linking errors and similar, the GUI toolkit is unlikely to have started up at the time the error occurs.
Anyway, it's out of our scope to solve this here, so long as we're no worse than the de facto behaviour.
Agreed, although locale-translated messages on stderr, while good for humans, tend to be even less machine-readable.
Cheers ---Dave
On Fri, Jul 08, 2011 at 12:21:27AM +0100, David Gilbert wrote:
On 5 July 2011 15:49, Dave Martin dave.martin@linaro.org wrote:
On Fri, Jul 1, 2011 at 6:10 PM, David Gilbert david.gilbert@linaro.org wrote:
[...]
My patch is having to test an arbitrary address in the commpage in init code to see if it's new enough to have the new helper that
Minor niggle: the address is in no way arbitrary. Although there's no corresponding C symbol, it's a well-known absolute address.
Note that pre-existing code already calls into the vectors page. Introducing a VDSO interface will not automatically cause all software to stop using the binary interface.
How much of a problem that is in practice, I don't know.
This doesn't necessarily mean that adding a VDSO interface is bad, but it does mean the existing interface can't easily be changed or removed.
Cheers ---Dave
On Fri, 8 Jul 2011, Dave Martin wrote:
Right, qemu must be intercepting it already. The kernel helpers located at a fixed address in the vector page have been around for more than 6 years.
Agreed. There is nothing preventing the addition of a VDSO, but I've yet to see a convincing argument for spending time adding it.
Nicolas
On 8 July 2011 19:32, Nicolas Pitre nicolas.pitre@linaro.org wrote:
QEMU supports calls into the fixed vector page (it just special cases attempts to execute at addresses >= 0xffff0000 and emits code to do "cause special purpose exception" so we get control back at runtime). What it doesn't support is attempts to do a load from the vector page, because so far nobody has had a need to probe the vector page for what it supports.
-- PMM
On Sat, Jul 09, 2011 at 12:29:01AM +0100, Peter Maydell wrote:
To solve that particular problem, can you just map a page in RAM with the right content for __kernel_helper_version?
The value there can be a static feature of QEMU, indicating which kuser helper functions QEMU knows how to emulate.
User processes shouldn't read anything else from the vectors page, at least for now.
Cheers ---Dave
On 11 July 2011 09:36, Dave Martin dave.martin@linaro.org wrote:
We could, but that depends on the target system letting us do that. (For that matter, if the target system decided to map another page there, accesses to it won't fault, and then we're going to get who knows what.)
Dave
On 11 July 2011 09:42, David Gilbert david.gilbert@linaro.org wrote:
To clarify, that's the system we're running on (usually x86, more often called the "host system"). This is because in linux-user emulation mode QEMU doesn't completely manage the guest address space, but just exposes a contiguous chunk of the host QEMU process' address space to the target. So if the host kernel has mapped something in the area corresponding to where the guest vector page is supposed to be then you're stuck.
(This implementation has other problems, eg https://bugs.launchpad.net/qemu-linaro/+bug/806873 but has the advantage of being fast... In the long term I quite like the idea suggested by rth on qemu-devel of making the linux-user mode use qemu's softmmu so we actually control the guest process address space completely; however that's a fairly substantial bit of rework.)
-- PMM