On 18 April 2011 22:35, Michael Hope michael.hope@linaro.org wrote:
Hi there. There's a problem with the thread local storage register in the Tegra 2 which is exposed by Ubuntu switching to GCC 4.5. The fault is quite serious and causes many applications to segfault. There are more details in LP: #739374.
The problem is Tegra 2 specific and is caused by bit 20 of the CP15 thread pointer register always reading as zero. With GCC 4.4, access to the thread pointer always goes through the GLIBC helper function '__aeabi_read_tp' which calls into the kernel which then reads and returns CP15. Either GLIBC or the kernel[1] itself swaps bit 20 and bit 0 which works around the problem.
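To make the bit swap concrete, here is a minimal sketch in C. It is not the actual GLIBC or kernel code: the function names are made up for illustration, the faulty register is only modelled in software, and it assumes the thread pointer is at least 2-byte aligned so that bit 0 is always zero and can safely stand in for bit 20.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper name; not taken from GLIBC or the kernel. */
static inline uint32_t swap_bits_0_and_20(uint32_t value)
{
    /* differ is 1 only when bit 0 and bit 20 hold different values. */
    uint32_t differ = ((value >> 20) ^ value) & 1u;

    /* Flipping both bits when they differ is the same as swapping them. */
    return value ^ (differ | (differ << 20));
}

/* Software model of the fault: bit 20 always reads back as zero. */
static inline uint32_t faulty_register_read(uint32_t stored)
{
    return stored & ~(UINT32_C(1) << 20);
}

/* Swap before storing, swap again after reading. */
static inline uint32_t round_trip(uint32_t tp)
{
    /* Assumption: bit 0 of a real thread pointer is always clear. */
    assert((tp & 1u) == 0);

    uint32_t stored = swap_bits_0_and_20(tp);       /* bit 20 is now zero */
    uint32_t seen   = faulty_register_read(stored); /* so nothing is lost */
    return swap_bits_0_and_20(seen);
}
```

The point of the sketch is that after the swap the value held in the register always has bit 20 clear, so the stuck bit never discards information, and a second swap on the read side recovers the original pointer.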
Actually, in the Android version of this workaround both glibc and the kernel do the swapping (and presumably apps are compiled with gcc options that don't allow the inline cp15 access). Since Ubuntu's libc doesn't have the workaround, we can deduce that maverick glibc must use the kernel-provided code in the commpage. (I imagine most people with hardware that has this erratum, i.e. AC100s, are still using the Android kernel, or else have a custom kernel which includes the workaround.) Natty glibc presumably doesn't use the commpage code, since things break if you upgrade to it.
Change GCC to swap bits 20 and 0 as well. This is a hack and requires rebuilding the archive. The performance impact should be small.
I believe this is unlikely to find any favour with gcc upstream :-) Also this will be an incompatible workaround, because any program compiled with this gcc hack won't work on a system without the corresponding kernel/libc change, and vice-versa. I think that makes it a non-starter.
Change GCC to always call the helper function. The helper function can detect the processor and call into the kernel on Tegra devices, or return CP15 directly on others. IFUNC could be used to reduce the overhead. The archive would have to be rebuilt. Worse performance than above, but still better than 4.4.
NB that this isn't a "call into the kernel" in the "system call" sense -- the register read and bit shuffle can all be done in user space, so the overhead is simply that of calling the commpage routine rather than doing it inline.
This is a better fallback if we can't do the third thing:
Change GLIBC to allocate thread local storage on a 2 MB boundary. Bit 20 would always be zero. The thread pointer is a base address so the thread could still have more than 2 MB of thread local data. GLIBC would have to be rebuilt and this limits the maximum number of threads. No runtime performance hit.
This is by far the best approach if we can make it work. (I know I was the one who said it would limit the maximum number of threads but on reflection I'm not sure I was right. I think I was confusing it with 2MB minimum stack sizes, which really do limit the number of threads.)
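As a rough illustration of why 2 MB alignment sidesteps the fault, here is a small sketch in C. It is only arithmetic on example addresses, not GLIBC's allocator; the names are hypothetical, and it assumes the value written to the thread pointer register is itself the 2 MB-aligned address.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant names, for illustration only. */
#define TLS_ALIGN (UINT32_C(2) << 20)   /* 2 MB = 0x200000 */
#define BIT_20    (UINT32_C(1) << 20)

/* Round addr up to the next multiple of align (align must be a power of two).
 * Addresses within align of the top of the 32-bit space would wrap; a real
 * allocator would have to reject those. */
static inline uint32_t align_up(uint32_t addr, uint32_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}

/* Nonzero if the faulty register could hold addr without losing a bit. */
static inline int bit20_is_clear(uint32_t addr)
{
    return (addr & BIT_20) == 0;
}
```

Any multiple of 0x200000 has bits 0 to 20 all clear, so a thread pointer aligned this way reads back unchanged even when bit 20 is stuck at zero, and no swap is needed on either side.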
-- PMM