Hi Andrew,
On 27 June 2016 at 19:32, Pinski, Andrew Andrew.Pinski@cavium.com wrote:
Up to 64 bits, the calls to the libatomic routines are inlined, and the ARMv8.1 CAS and load-operate instructions are used when the application is built for the ARMv8.1 architecture. For 128 bits, a call to the library is made, and it uses the same LL/SC implementation with or without LSE support, as the CAS and load-operate instructions don't support this data size.
I don't have ARMv8.1 hardware, so I did the analysis on the generated assembly. Do you have a use case on your side where an ifunc version would be useful? I'm not aware of an algorithm that can effectively replace the LL/SC implementation with a shorter CAS-based one; do you have any pointers? Maybe CASP can be used in some cases; I'll investigate it.
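For reference, the kind of runtime selection an ifunc version would perform can be sketched as below. This is only an illustration, not libatomic code: `cpu_has_lse`, `fetch_add_generic`, `fetch_add_lse` and `select_fetch_add` are hypothetical names, a plain function pointer stands in for a real ifunc resolver, and both variants have the same body here (in a real library the LSE one would be compiled with LSE enabled). The hardware check assumes AArch64 Linux, where the `HWCAP_ATOMICS` bit of `AT_HWCAP` reports LSE support; on other platforms the sketch simply falls back to the generic variant.

```c
#include <stdint.h>

#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>                 /* getauxval, AT_HWCAP */
#ifndef HWCAP_ATOMICS
#define HWCAP_ATOMICS (1UL << 8)      /* LSE bit in AT_HWCAP on AArch64 Linux */
#endif
/* Returns nonzero when the kernel reports LSE atomics support. */
static int cpu_has_lse(void)
{
    return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
}
#else
/* Assumption: no LSE detection outside AArch64 Linux in this sketch. */
static int cpu_has_lse(void)
{
    return 0;
}
#endif

typedef uint64_t (*fetch_add_fn)(uint64_t *, uint64_t);

/* Hypothetical generic variant (LL/SC when built for plain ARMv8.0). */
static uint64_t fetch_add_generic(uint64_t *p, uint64_t v)
{
    return __atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
}

/* Hypothetical LSE variant; same body here, it would be built with LSE
   enabled in a real multi-version library. */
static uint64_t fetch_add_lse(uint64_t *p, uint64_t v)
{
    return __atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
}

/* Plays the role of an ifunc resolver: picks one variant at runtime. */
fetch_add_fn select_fetch_add(void)
{
    return cpu_has_lse() ? fetch_add_lse : fetch_add_generic;
}
```

A caller would invoke `select_fetch_add()` once and then call through the returned pointer; with a real ifunc, the dynamic loader performs that selection when the symbol is resolved.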
Thanks Yvan
I'm curious about what workloads / benchmarks you considered for this activity. The traditional SPEC benchmarks don't really trigger anything in libatomic, so where do we see improvements, if any?
regards Ramana
________________________________________
From: linaro-toolchain linaro-toolchain-bounces@lists.linaro.org on behalf of Yvan Roux yvan.roux@linaro.org
Sent: 27 June 2016 20:04:52
To: Pinski, Andrew
Cc: Linaro Toolchain Mailman List
Subject: Re: [ACTIVITY] Week 25
_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/linaro-toolchain
Hi Ramana,
On 29 June 2016 at 17:03, Ramana Radhakrishnan Ramana.Radhakrishnan@arm.com wrote:
I'm curious about what workloads / benchmarks you considered for this activity. The traditional SPEC benchmarks don't really trigger anything in libatomic, so where do we see improvements, if any?
First, let me clarify the purpose of this task, which was to evaluate and implement ARMv8.1 support in libatomic, not to evaluate the performance of the ARMv8.1 architecture. Sorry if that wasn't clear in this short weekly format.
Given this objective, I didn't consider benchmarking for this activity. My plan was to:
1. Verify the support of the new ARMv8.1 atomic instructions in the __atomic builtins.
2. Familiarize myself with the libatomic code base and build system, and check that the builtins are used.
3. Enable and implement the ifunc version of the lib if needed.
My observations and conclusions are:
1. The __atomic builtins already fully support the new atomic instructions, and generate cas, swp and ld<op> as needed on data types up to 8 bytes.
2. libatomic uses the atomic builtins properly, so building the lib for the ARMv8.1 architecture or enabling multilib on AArch64 generates a libatomic which contains the expected code.
3. I don't see any benefit in implementing an ifunc version of the lib that decides at runtime whether to use the LSE version: the operations up to 8 bytes are expanded inline at compile time, and the 16-byte versions are the same with or without LSE support. Maybe I'm missing a use case or lack some background on libatomic usage here, and I'd be happy to get your input.
Regarding the 16-byte version, as I said, I recently saw that LSE contains a CASP instruction, which might be used to implement a 128-bit compare-exchange builtin. But if I understand the discussion in this bugzilla correctly, it might be better to wait for a new version of the architecture that contains a proper 128-bit instruction.
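To make the first two points concrete, here is a minimal sketch of 8-byte operations written with the GCC `__atomic` builtins. The function names are hypothetical and only serve the illustration. When built for AArch64 with LSE enabled (for example `-march=armv8.1-a`), I would expect these to be expanded inline to variants of ld<op> (ldadd), swp and cas respectively, and to LL/SC loops when built for plain ARMv8.0; I have only checked this kind of output on generated assembly, not on hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Atomically add `delta` and return the previous value
   (load-operate family, e.g. ldadd with LSE). */
uint64_t counter_fetch_add(uint64_t *counter, uint64_t delta)
{
    return __atomic_fetch_add(counter, delta, __ATOMIC_SEQ_CST);
}

/* Atomically store `value` and return the previous contents
   (swp with LSE). */
uint64_t slot_exchange(uint64_t *slot, uint64_t value)
{
    return __atomic_exchange_n(slot, value, __ATOMIC_SEQ_CST);
}

/* Strong compare-and-swap (cas with LSE). On failure, `*expected`
   is updated with the value currently stored in `*slot`. */
bool slot_compare_exchange(uint64_t *slot, uint64_t *expected,
                           uint64_t desired)
{
    return __atomic_compare_exchange_n(slot, expected, desired,
                                       false, /* strong variant */
                                       __ATOMIC_SEQ_CST,
                                       __ATOMIC_SEQ_CST);
}
```

Compiling this file with `-O2 -S`, once with and once without LSE enabled, and comparing the assembly is the same kind of check as the analysis described above.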
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70814
Thanks Yvan