Hello there,
Coming back from Asia, I've been putting a lot of thought into how we can make sure we spend our engineering cycles on the work that is most valuable to the current Linaro members, and part of that means reassessing assumptions we've carried since our foundation.
The first point I want to raise is Thumb-2, the alternative ISA described by ARM like this:
For performance optimised code Thumb-2 technology uses 31 percent less memory to reduce system cost, while providing up to 38 percent higher performance than existing high density code, which can be used to prolong battery-life or to enrich the product feature set. Thumb-2 technology is featured in the processor, and in all ARMv7 architecture-based processors.
From the beginning we've set Thumb-2 in our standard configuration across platforms, but outside of Ubuntu I think we're unique in that. So the questions I have are:
- Do we know how much better Thumb-2 actually is, in practice? It's easy for us to confirm this on Android; what do the numbers and feel of the system tell us?
- What are the downsides to using Thumb-2 in general? Do we have anecdotes or threads that talk about bad experiences or blockers in the transition?
- If it's so great, how could we lead a wide-ranging transition to Thumb-2 becoming the standard ISA for modern v7 applications, including Android, Yocto and anything else relevant that runs on a Cortex-A?
Thanks,
Hi Kiko,
These are all excellent questions and I for one would be keen to see a position on this.
How would it be captured and the output published?
Ta
Sent from yet another ARM powered device
This isn't exactly overflowing with up-to-date numbers, but...
http://elinux.org/images/8/8a/Experiment_with_Linux_and_ARM_Thumb-2_ISA.pdf
Slides 14 and 15 say that across EEMBC, Thumb-2 gives 98% of the performance of 32-bit ARM instructions (presumably performance-optimised) and binaries are 26% smaller (I didn't catch which binary or binaries that referred to). These are numbers from 2007, benchmarked on an ARM11, and I assume built with ARMCC.
Slide 17 tells us that with GCC 4.1, Thumb-2 shrinks the kernel by 29% and common libraries by 20%.
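For anyone wanting to reproduce that kind of comparison with a current toolchain, a minimal sketch is to build the same file once in ARM mode and once as Thumb-2 and compare the section sizes (the arm-linux-gnueabi- prefix below is an assumption; any ARM GCC cross toolchain should do):

/* thumb_size_demo.c - build the same code as ARM and as Thumb-2 and
 * compare text size.  Assumed commands:
 *
 *   arm-linux-gnueabi-gcc -O2 -march=armv7-a -marm   -c thumb_size_demo.c -o arm.o
 *   arm-linux-gnueabi-gcc -O2 -march=armv7-a -mthumb -c thumb_size_demo.c -o thumb.o
 *   size arm.o thumb.o       # compare the "text" column of the two objects
 */
#include <stddef.h>

/* A small, ordinary function so the object files contain some code. */
int checksum(const unsigned char *buf, size_t len)
{
    size_t i;
    int sum = 0;

    for (i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}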
None of these slides tell us whether Thumb-2 improves multitasking by a noticeable amount by reducing cache thrashing (as far as I know, none of the benchmarks used are multi-threaded). My guess is that this would help with user interaction and with smoothing out multimedia playback, but not with server tasks, which I wouldn't expect to be noticeably affected by task-switching speed.
The slides also don't tell us about running a typical set of applications in limited RAM and how much benefit Thumb-2 gives in that situation. My old Android phone has 256MB of RAM and is, as far as I can tell, performance-limited by its lack of RAM (rooting it and deleting everything I didn't use helped quite a bit). It may be interesting to run Android in 256MB vs 512MB and see how much difference the ISA choice makes.
Something that isn't mentioned at all is power draw. If an SoC can switch off RAM because it isn't in use, it can save power. If code is smaller, there is less bus activity to load instructions, again saving power.
On the subject of ISAs, is any work being done with Jazelle RCT to accelerate VMs? Searching around, most discussion of Jazelle with Android is about DBX (direct bytecode execution), which (as I understand it) isn't much use to Dalvik, but RCT may be, since it is designed to accelerate JIT-generated code.
James
On Fri, Oct 21, 2011 at 7:48 AM, James Tunnicliffe james.tunnicliffe@linaro.org wrote:
> Slides 14 and 15 say that across EEMBC, Thumb-2 gives 98% of the performance of 32-bit ARM instructions (presumably performance-optimised) and binaries are 26% smaller. These are numbers from 2007, benchmarked on an ARM11, and I assume built with ARMCC.
I just ran EEMBC with gcc-linaro-4.6-2011.10 using -mfpu=neon -O3 -mtune=cortex-a9 and got similar numbers. Five of the 32 tests ran faster with Thumb-2, which is nice. I'll send the results privately as I'm not sure we can share them.
EEMBC embeds the test data in the executable, so it's hard to tell the change in text size. After de-duplicating, the average on-disk package size was 88% of the ARM version. Ignoring the ones that are likely to have embedded test data, the average text size was 82% of ARM mode.
These days we're not really short of RAM so, as Mans says, improvements in startup time, cache footprint, or on-disk size might be a win.
-- Michael
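For per-test timing comparisons like these, a crude harness that runs the same routine in an ARM-mode binary and a Thumb-2 binary is usually enough to see deltas in the 1-10% range; a minimal sketch, with the flags mirroring those above and the workload being a stand-in rather than an actual EEMBC kernel:

/* time_kernel.c - compare an ARM-mode build against a Thumb-2 build of
 * the same code.  Assumed commands:
 *
 *   arm-linux-gnueabi-gcc -O3 -mtune=cortex-a9 -mfpu=neon -marm   time_kernel.c -o bench-arm   -lrt
 *   arm-linux-gnueabi-gcc -O3 -mtune=cortex-a9 -mfpu=neon -mthumb time_kernel.c -o bench-thumb -lrt
 *   ./bench-arm ; ./bench-thumb
 */
#include <stdio.h>
#include <string.h>
#include <time.h>

#define N (1 << 20)
static unsigned char buf[N];

/* Stand-in workload; substitute whatever hot loop is being compared. */
static unsigned work(void)
{
    unsigned sum = 0;
    int i;

    for (i = 0; i < N; i++)
        sum = sum * 31 + buf[i];
    return sum;
}

int main(void)
{
    struct timespec t0, t1;
    unsigned result = 0;
    double secs;
    int rep;

    memset(buf, 0xa5, sizeof(buf));
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (rep = 0; rep < 100; rep++)
        result += work();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("result=%u time=%.3fs\n", result, secs);
    return 0;
}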
On 20 October 2011 23:07, Michael Hope michael.hope@linaro.org wrote:
> I just ran EEMBC with gcc-linaro-4.6-2011.10 using -mfpu=neon -O3 -mtune=cortex-a9 and got similar numbers. Five of the 32 tests ran faster with Thumb-2, which is nice. I'll send the results privately as I'm not sure we can share them.
How much faster? What about the ones that didn't run faster? I also don't think EEMBC is representative of real-world apps.
> EEMBC embeds the test data in the executable, so it's hard to tell the change in text size.
The 'size' command?
> After de-duplicating, the average on-disk package size was 88% of the ARM version. Ignoring the ones that are likely to have embedded test data, the average text size was 82% of ARM mode.
Same as Libav then.
> These days we're not really short of RAM so, as Mans says, improvements in startup time, cache footprint, or on-disk size might be a win.
Startup time is the only real win I can see. The rest are either non-existent or irrelevant.
On Fri, Oct 21, 2011 at 11:16 AM, Mans Rullgard mans.rullgard@linaro.org wrote:
> How much faster? What about the ones that didn't run faster?
More than 10%. I can't share raw numbers in public as our EEMBC license doesn't allow it. I've sent them to the linaro-toolchain-benchmarks list.
> I also don't think EEMBC is representative of real-world apps.
Agreed. It's an embedded benchmark. SPEC would be interesting.
>> EEMBC embeds the test data in the executable, so it's hard to tell the change in text size.
> The 'size' command?
It turns the images and such into C arrays so they appear in the text segment. Hmm, I wonder if they get split into .rodata?
-- Michael
On 20 October 2011 23:22, Michael Hope michael.hope@linaro.org wrote:
> More than 10%. I can't share raw numbers in public as our EEMBC license doesn't allow it. I've sent them to the linaro-toolchain-benchmarks list.
Yes, I saw it.
> Agreed. It's an embedded benchmark. SPEC would be interesting.
I don't think it's representative of anything, embedded or otherwise.
> It turns the images and such into C arrays so they appear in the text segment. Hmm, I wonder if they get split into .rodata?
Arrays end up in .data or .rodata, never .text.
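A quick way to settle that sort of question is to inspect the section headers of a small test object directly; a minimal sketch (with GCC the const array below normally lands in .rodata and the writable one in .data, and note that, as far as I recall, GNU size in its default Berkeley format folds read-only data into the "text" column, which may be where the confusion comes from):

/* sections_demo.c - check which sections C arrays end up in.
 * Assumed commands:
 *
 *   arm-linux-gnueabi-gcc -O2 -c sections_demo.c -o sections_demo.o
 *   objdump -h sections_demo.o   # per-section sizes: .text, .rodata, .data
 *   size -A sections_demo.o      # SysV format: one line per section
 *   size    sections_demo.o      # Berkeley format: read-only data counted as "text"
 */
const unsigned char embedded_image[4] = { 0x42, 0x42, 0x42, 0x42 };  /* -> .rodata */
unsigned char       scratch_table[4]  = { 1, 2, 3, 4 };              /* -> .data   */

int lookup(int i)                                                    /* -> .text   */
{
    return embedded_image[i & 3] + scratch_table[i & 3];
}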
On 20 October 2011 23:22, Michael Hope michael.hope@linaro.org wrote:
>> How much faster? What about the ones that didn't run faster?
> More than 10%.
I suspect that is more likely due to some gcc quirk than to any intrinsic (dis)advantage of Thumb-2. Someone should analyse the code and find out what went wrong in the slow version. Perhaps something for the toolchain team :)
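One cheap way to start that analysis is to disassemble the two builds of the affected test and diff the hot function; a minimal sketch of the kind of comparison meant (the function below is a placeholder for whichever benchmark kernel regressed):

/* isa_diff_demo.c - compare the code GCC emits for the same function in
 * ARM mode and in Thumb-2.  Assumed commands:
 *
 *   arm-linux-gnueabi-gcc -O3 -mtune=cortex-a9 -marm   -c isa_diff_demo.c -o arm.o
 *   arm-linux-gnueabi-gcc -O3 -mtune=cortex-a9 -mthumb -c isa_diff_demo.c -o thumb.o
 *   objdump -d arm.o   > arm.dis
 *   objdump -d thumb.o > thumb.dis
 *   diff -u arm.dis thumb.dis    # look for extra spills, longer dependency chains, etc.
 */

/* Placeholder kernel; substitute the code from the slow test. */
int dot16(const short *a, const short *b, int n)
{
    int i, acc = 0;

    for (i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}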
On 20 October 2011 18:27, Christian Robottom Reis kiko@linaro.org wrote:
> - Do we know how much better Thumb-2 actually is, in practice? It's easy for us to confirm this on Android; what do the numbers and feel of the system tell us?
I did some tests comparing Libav built for ARM and for Thumb-2. The Thumb-2 build has 18% smaller code size than the ARM build. Data size is of course unchanged. The overall size reduction of text+data+bss is 10%. Benchmarking on a Cortex-A9 (Panda), the Thumb-2 build is 1-3% slower in most of my test cases, with only one test being faster, by 1%. In these tests the hand-written assembly code was enabled (it can be built as either ARM or Thumb-2).
This is of course highly specialised code so the results are not generally applicable. Nevertheless, I would expect similar results from other compute-intensive applications.
> - What are the downsides to using Thumb-2 in general? Do we have anecdotes or threads that talk about bad experiences or blockers in the transition?
The r1pX versions of Cortex-A8 had a few Thumb-2 related errata that caused a bit of grief until they were properly understood and worked around. Some of the workarounds required on these core revisions have a negative impact on performance (extra invalidations of some branch prediction buffers etc).
> - If it's so great, how could we lead a wide-ranging transition to Thumb-2 becoming the standard ISA for modern v7 applications, including Android, Yocto and anything else relevant that runs on a Cortex-A?
The space savings provided by Thumb-2 only matter if the available memory (either RAM or non-volatile storage) is nearly fully utilised, which is not typically the case on the type of systems we are focusing on (Android and desktop distributions), where I strongly doubt code size is the major contributor to memory usage.
One real benefit not mentioned in the quoted blurb is possibly reduced startup time for applications when they are loaded from disk/flash. This could make a case for building things started during system boot as Thumb-2 in order to speed up the boot process.
The promised speed gains from Thumb-2 are only possible as a result of better I-cache utilisation due to reduced code size. As seen in the Libav case, a 20% reduction in code size is realistic. For this to give a significant speed boost, the execution pattern would have to be such that shrinking the instruction working set by 20% allows it to fit within a typical 16-32k I-cache, thus reducing thrashing. For instruction working sets outside this fairly narrow range, switching to Thumb-2 would have little impact on performance. I have no numbers to go by here, but my feeling is that the number of realistic workloads that would benefit is fairly small.
In light of these observations, I do not think it is appropriate to push for either instruction set to be applied system-wide. Instead, each application/library should be built with whichever gives it the best performance. There is no problem mixing the instruction sets.
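As a concrete illustration of that last point, the two instruction sets can be freely mixed within one program on ARMv7/EABI, with the toolchain handling ARM/Thumb interworking at call boundaries; a minimal sketch, with the file names and toolchain prefix assumed:

/* Mixing ARM and Thumb-2 objects in one program.  Assumed build:
 *
 *   arm-linux-gnueabi-gcc -O3 -march=armv7-a -marm   -c hot.c  -o hot.o
 *   arm-linux-gnueabi-gcc -O2 -march=armv7-a -mthumb -c main.c -o main.o
 *   arm-linux-gnueabi-gcc hot.o main.o -o mixed
 */

/* --- hot.c: performance-sensitive routine, built in ARM mode --- */
int scale_sum(const int *v, int n)
{
    int i, s = 0;

    for (i = 0; i < n; i++)
        s += v[i] * 3;
    return s;
}

/* --- main.c: the rest of the program, built as Thumb-2 --- */
#include <stdio.h>

int scale_sum(const int *v, int n);   /* provided by hot.o */

int main(void)
{
    int data[4] = { 1, 2, 3, 4 };

    /* ARM <-> Thumb call; the compiler/linker use BLX or a veneer as needed. */
    printf("%d\n", scale_sum(data, 4));
    return 0;
}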