Following on from yesterday's call about what it would take to enable
SMS by default: one of the problems I was seeing with the SMS+IV patch
was that we ended up with excessive moves. E.g. a loop such as:
    void
    foo (int *__restrict a, int n)
    {
      int i;

      for (i = 0; i < n; i += 2)
        a[i] = a[i] * a[i + 1];
    }
would end up being scheduled with an ii of 3, which means that in the
ideal case, each loop iteration would take 3 cycles. However, we then
added ~8 register moves to the loop in order to satisfy dependencies.
Obviously those 8 moves add considerably to the iteration time.
I played around with a heuristic to see whether there were enough
free slots in the original schedule to accommodate the moves.
That avoided the problem, but it was a hack: the moves weren't
actually scheduled in those slots. (In current trunk, the moves
generated for an instruction are inserted immediately before that
instruction.)
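The slot-counting heuristic described above can be sketched roughly as follows. This is only an illustration, not the code from the patch: the names count_free_slots and moves_fit, the rows_used array and the issue_width parameter are all made up for the example, and the kernel is modelled as ii rows with a fixed number of issue slots each.

```c
#include <stdbool.h>

/* Hypothetical model of the kernel: rows_used[r] is the number of
   instructions already placed in row r (0 <= r < ii), and issue_width
   is the assumed number of issue slots per row.  */
int
count_free_slots (const int *rows_used, int ii, int issue_width)
{
  int free_slots = 0;

  for (int r = 0; r < ii; r++)
    if (rows_used[r] < issue_width)
      free_slots += issue_width - rows_used[r];
  return free_slots;
}

/* Accept the schedule only if the moves could fit somewhere.  This
   only counts slots; it never places a move in a particular slot.  */
bool
moves_fit (const int *rows_used, int ii, int issue_width, int moves)
{
  return moves <= count_free_slots (rows_used, ii, issue_width);
}
```

With an ii of 3, an assumed issue width of 2 and rows holding 2, 2 and 1 instructions, there is one free slot, so 8 moves would be rejected. The check says nothing about whether the free slots are in the right place relative to the instructions that need the moves, which is why it is only a hack.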
I mentioned this to Revital, who told me that Mustafa Hagog had
tried a more complete approach that really did schedule the moves.
That patch was quite old, so I ended up reimplementing the same kind
of idea in a slightly different way. (The main functional changes
from Mustafa's version were to schedule from the end of the window
rather than the start, and to use a cyclic window. E.g. moves for
an instruction in row 0 column 0 should be scheduled starting at
row ii-1 downwards.)
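To make the cyclic window concrete, here is a minimal sketch of the order in which rows would be tried for a move. It is an assumption-laden illustration rather than the patch itself: cyclic_candidate_rows is a hypothetical name, and the instruction's own row is skipped, which matches the column 0 case mentioned above but not necessarily other columns.

```c
/* Fill ROWS with the kernel rows in which a move feeding an instruction
   in row INSN_ROW would be tried, in order of preference.  The search
   starts at the row just before INSN_ROW and walks backwards, wrapping
   from row 0 to row II - 1.  Returns the number of rows written.
   Assumption: the instruction's own row is skipped, as for an
   instruction in column 0.  */
int
cyclic_candidate_rows (int insn_row, int ii, int *rows)
{
  int n = 0;

  for (int step = 1; step < ii; step++)
    rows[n++] = ((insn_row - step) % ii + ii) % ii;
  return n;
}
```

For an instruction in row 0 with an ii of 4, the rows are tried in the order 3, 2, 1; for an instruction in row 2 the order is 1, 0, 3.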
The effect on my flawed libav microbenchmarks was much greater
than I imagined. I used the options:
-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
-fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
The "before" code was from trunk, the "after" code was trunk + the
register scheduling patch alone (not the IV patch). Only the tests
that have different "before" and "after" code are run. The results were:
a3dec
before: 500000 runs take 4.68384s
after: 500000 runs take 4.61395s
speedup: x1.02
aes
before: 500000 runs take 20.0523s
after: 500000 runs take 16.9722s
speedup: x1.18
avs
before: 1000000 runs take 15.4698s
after: 1000000 runs take 2.23676s
speedup: x6.92
dxa
before: 2000000 runs take 18.5848s
after: 2000000 runs take 4.40607s
speedup: x4.22
mjpegenc
before: 500000 runs take 28.6987s
after: 500000 runs take 7.31342s
speedup: x3.92
resample
before: 1000000 runs take 10.418s
after: 1000000 runs take 1.91016s
speedup: x5.45
rgb2rgb-rgb24tobgr16
before: 1000000 runs take 1.60513s
after: 1000000 runs take 1.15643s
speedup: x1.39
rgb2rgb-yv12touyvy
before: 1500000 runs take 3.50122s
after: 1500000 runs take 3.49887s
speedup: x1
twinvq
before: 500000 runs take 0.452423s
after: 500000 runs take 0.452454s
speedup: x1
Taking resample as an example: before the patch we had an ii of 27,
stage count of 6, and 12 vector moves. Vector moves can't be dual
issued, and there was only one free slot, so even in theory, this loop
takes 27 + 12 - 1 = 38 cycles. Unfortunately, there were so many new
registers that we spilled quite a few.
After the patch we have an ii of 28, a stage count of 3, and no moves,
so in theory, one iteration should take 28 cycles. We also don't spill.
So I think the difference really is genuine. (The large difference
in moves between ii=27 and ii=28 is because in the ii=27 schedule,
a lot of A--(T,N,0)-->B (intra-cycle true) dependencies were scheduled
with time(B) == time(A) + ii + 1.)
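The arithmetic in the resample example can be written down as a small sketch. Both functions are simplified models with hypothetical names, not what the scheduler actually computes: estimated_cycles assumes each move that does not fit in a free slot costs one cycle, and reg_moves_needed assumes one copy per full ii that a value stays live, ignoring any adjustment for loop-carried distance or for instructions that share a row.

```c
/* Simplified cost model (an assumption, not a pipeline simulation):
   every move that does not fit in a free issue slot costs one extra
   cycle on top of the initiation interval.  */
int
estimated_cycles (int ii, int moves, int free_slots)
{
  int extra = moves > free_slots ? moves - free_slots : 0;

  return ii + extra;
}

/* Simplified count of the copies needed for an intra-iteration true
   dependence A -> B: one copy for every full II that the value must
   stay live.  Distance and same-row adjustments are ignored.  */
int
reg_moves_needed (int time_a, int time_b, int ii)
{
  return (time_b - time_a) / ii;
}
```

Under this model, estimated_cycles (27, 12, 1) gives 38 and estimated_cycles (28, 0, 0) gives 28, matching the figures above. A dependence scheduled with time(B) == time(A) + ii + 1 at ii=27 needs one move, while any dependence whose two instructions are less than ii cycles apart needs none.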
I also saw benefits in one test in a "real" benchmark, which I can't
post here.
Richard
Hello,
Following today's performance call
(https://wiki.linaro.org/WorkingGroups/ToolChain/Meetings/2011-08-23)
here are some points raised regarding the steps towards enabling SMS by default:
* Benchmarks testing:
-- Running benchmarks such as EEMBC and SPEC2006 with SMS enabled is
crucial to expose loops where SMS degrades performance. Those
loops need to be analysed to construct a cost model.
-- SMS increases code size by introducing prologue and epilogue to the
loop kernel. This should also be measured.
-- Measure increase in compile time: on native or cross build?
Currently SMS fails to bootstrap trunk on an ARM machine. This should
also be taken into account when considering enabling it by default.
Should it be turned on with -O2 or -O3?
SMS flags to use for testing:
-O3 -fmodulo-sched-allow-regmoves -fmodulo-sched
-funsafe-loop-optimizations -fno-auto-inc-dec
Thanks,
Revital
Hi
Some time ago we agreed that not everyone here uses the Ubuntu distribution
and decided to provide a so-called 'generic linux' cross toolchain.
Recently I managed to get it done and now need brave testers to tell me
whether it works or not.
Get it here: http://people.linaro.org/~hrw/generic-linux/ (64bit only)
The files needed are toolchain-11.07.tar.xz and the init.sh script. Unpack
the tarball from / so that /opt/linaro/11.07/ is populated, and put init.sh
anywhere you want (it will be integrated into the tarball later).
How to use:
$ source init.sh
This adds the cross toolchain to PATH and also sets LD_LIBRARY_PATH to
two directories:
- one with binutils libraries
- a second with all extra libraries which may be needed
Feel free to experiment with the second dir by removing files from there and
checking whether the system-provided libs are fine too.
So far I have checked this toolchain under a few distributions:
- Ubuntu 10.04 'lucid' LTS
- Ubuntu 11.04 'natty'
- Fedora 14
- OpenSUSE 11.4
- CentOS 5.6
It failed only under CentOS (which was expected due to its age).
How did I check? So far, compilation of 'gpm' and 'zlib' has been tested.
== GCC ==
=== Progress ===
* Continue to look at the test failure with -mvectorize-with-neon-quad.
Should be able to commit the backend workaround on Monday.
* Having some problems getting my panda board working reliably. I'm
not sure if it's the temperature or what, but when it gets hot in the
office, as it did on Tuesday, keeping it working reliably is hard. The
board locks up and then crashes quite often.
* Looked at VFP moves again for some more time.
* Committed tbh range change.
* Committed fixes for PR50022
=== Plans ===
* Finish off VFP moves patch.
* Look at BRANCH_COST results.
* Breakdown the T2 performance blueprints into smaller blueprints.
* Backport tbh range changes to Linaro 4.6
* Test the intrinsics patch once with some more intrinsics tests and
then merge it in to Linaro gcc 4.6
Meetings:
* 1-1s
* TCWG calls
Absences.
* 29th Aug - Sept. 2 - Holiday booked and approved.
* 31st Oct - 4th Nov - Linaro Summit Orlando - Travel booked - hotel
to be booked.
Hi all,
I'm having real trouble here :(
I just can't seem to get bzr to work! I've tried to branch
gcc-linaro/4.6 again and again, and it just won't. My other machine
refuses to do the merge from lp:gcc/4.6, presumably because the bzr on
there is too old.
I'm stuck. Can anybody else do the merge from upstream?
I'm going to keep trying.
Andrew
* GCC
Completed merging GCC 4.5 from FSF to Linaro.
Spun release tarballs for Linaro GCC 4.5 and 4.6. Uploaded them to
Michael's server, and kicked off the test builds remotely.
Submitted expenses for Linaro Connect.
Finally (!) committed my widening multiplies patches to FSF. :)
Continued trying to figure out what's wrong with my thumb2 constants
patches. I think I have identified a possible flaw, but I'm having
trouble reproducing the problem as I have been unable to pin down a
specific constant/expression combination that makes it through all the
other optimizations intact, and triggers the problem. I've not run out
of ideas yet though ...
* Other
On leave all day Wednesday.
Prepared for the big CodeSourcery to Mentor switch-over by moving all my
work-in-progress data over to the new servers.
== String routines ==
* Working through updating my eglibc patch for memchr, I think I'm
nearly there - took way too long to persuade an eglibc make check to
work cross (can't get native build happy).
== QEMU ==
* Sent a new version of my QEMU patch for the atomic helpers to Peter.
* Tested the Android beagle image on a real beagle - it fails in
pretty much the same way as the QEMU run.
== Other ==
* Had a brief look at bug 825711 - scribus ftbfs on ARM - this is Qt
being built to define qreal as float on ARM when it's double on most
other platforms; scribus has a qreal variable and something it has
defined as a double, and then passes both to a template that requires
two arguments of the same type; not really sure which one is to blame
here!
I'm on holiday next week.
Dave
== GDB ==
* Created and published Linaro GDB 7.3 2011-08 release.
* Analyzed --with-sysroot=remote: testsuite failures,
and opened bug LP #829595.
* Reviewed Yao's latest Thumb-2 displaced stepping patch.
== Schedule ==
* I'll be on vacation 08/23 through 08/31.
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
Hi,
* continued to work on getting libunwind support for remote unwinding
upstream
* reworked some of the code to address concerns from the ml
* now upstream!
* made smaller fixes to have another libunwind testcase passing
* interfaced with the Linaro Android group to solve an issue where a
compile was failing when using -O3
* turned out that the Linaro GCC vectorizes a loop by generating some
neon instructions
* unfortunately the gas of the 2.20.1 binutils (that is currently
used by the Linaro Android toolchain) doesn't properly understand the
alignment restrictions of the generated asm code and throws an error
* this has been fixed upstream and using a gas from recent binutils
fixes the issue
* Bernhard is already working on getting newer binutils in their
Android toolchain build system
* continued the work to get libunwind building on Linaro Android
* wrote an Android.mk and got an initial libunwind.so built (ugly
hacks involved)
* next step is to modify debuggerd to make use of libunwind.so
Note: Next week I'll be on vacation.
Regards
Ken
== This week ==
* Looked at LP #823711. Turned out to be a problem with symbol
visibility in libgcc.a. Tested a fix that was accepted and applied
upstream. Will backport to upstream release branches, so we should
be able to pull the fix in that way.
* Backported the fix for BZ PR49987 to Linaro 4.6 and 4.5.
* Looked at the regrename bug that Ramana reported on gcc@.
* Looked at why libav wasn't being vectorised. Discussed with Ira.
I think we now have a Plan.
* Submitted address writeback scheduling patches upstream.
* Submitted and applied some tweaks to the rtx cost interface upstream.
* Spent a while trying to figure out what the targetm.rtx_costs
API actually is, and how rtx_cost should use it to evaluate the
cost of a SET. Discussed on gcc@.
* Found that ARM was giving SETs a base cost of 4 instructions.
Benchmarked the cost of "fixing" this. It generally seemed positive.
* Wrote a couple of other rtx cost patches.
== Next week ==
* Backport fix for #823711 to upstream branches.
* Hopefully finish off rtx costs stuff.
* Unless there's a clear outcome from the gcc@ discussion, I think
I'll abandon my idea of using insn_rtx_cost in the new auto inc/dec
patch, and simply sum the cost of every SET. Should be a small change.
Richard
* Started running EEMBC on Panda. Got three errors in the automotive test at
this point.
* Started documenting necessary steps for my start-up task:
https://wiki.linaro.org/Internal/ToolChain/Benchmarks/First%20time%20notes
* Upgraded the Snowball board to the latest version (V3). Created a
corresponding test image for Snowball (Linaro 11.06). There is a problem
with the serial console freezing after a couple of minutes without any
error, not sure if it is a complete crash or just the serial output. The
people I have talked to so far have not experienced the same problem. I will
set up the networking for the board and see where ssh gets me.
Best Regards
Åsa
Hi,
- change of default vector size for auto-vectorization on NEON -
submitted and approved
- continued working on vectorization of widening shifts
- looked into SLP vectorization for libav
- two vacation days
I'll be on vacation on Aug 22-30.
Ira
I put a build harness around libav and gathered some profiling data. See:
bzr branch lp:~linaro-toolchain-dev/+junk/libav-suite
It includes a Makefile that builds a C only, h.264 only decoder and
two Creative Commons licensed videos to use as input.
README.rst has the basic commands for running ffmpeg and initial perf
results showing the hot functions. Dave, 20 % of the time is spent in
memcpy() so you might want to have a look.
The vectoriser has no effect. GCC 4.5 is ~17 % faster than 4.6. I'll
look into extracting and harnessing the functions themselves later
this week.
-- Michael
Hi,
is the Linaro toolchain (esp. gcc) useful on x86/x86_64, or is an
attempt to use the Linaro toolchain with such a target just asking for
trouble?
(No, I'm not secretly an Intel spy ;) Just trying to have some fun
with my desktop machine ;) )
ttyl
bero
The Linaro Toolchain Working Group is pleased to announce the release
of Linaro GDB 7.3.
Linaro GDB 7.3 2011.08 is the first release in the 7.3 series. Based
off the latest GDB 7.3, it includes a number of ARM-focused bug fixes.
This release includes all bug fixes from the latest Linaro GDB 7.2
release that were not already included in FSF GDB 7.3.
In addition, this release fixes:
* LP: #804401 [remote testsuite] Thread support
* LP: #804387 [remote testsuite] Shared library test problems
* LP: #804392 [remote testsuite] Rebuilt executables not copied
* LP: #804396 [remote testsuite] Spurious failures
The source tarball is available at:
https://launchpad.net/gdb-linaro/+milestone/7.3-2011.08
More information on Linaro GDB is available at:
https://launchpad.net/gdb-linaro
The Linaro Toolchain Working Group is pleased to announce the release
of Linaro QEMU 2011.08.
Linaro QEMU 2011.08 is the latest monthly release of qemu-linaro. Based
off upstream (trunk) QEMU, it includes a number of ARM-focused bug fixes
and enhancements.
This month's release is primarily minor improvements:
- Fixes LP:816791: ARMv6 cp15 barrier instructions now work
in linux-user mode as well as system mode
- Support for ARM1176JZF-S core has been added (thanks to
Jamie Iles <jamie(a)jamieiles.com>)
- Add workaround for kernel bug LP:727781 (which has resurfaced
in 3.0) to suppress warnings about bad-width omap i2c accesses
Plus of course new upstream fixes and improvements.
Performance:
When running qemu in system mode with an SD card image we have
determined that performance is best when the image is in writeback
caching mode. This significantly increases the performance of the SD
card (by factors of 10 or more). An example command line option is:
-drive if=sd,cache=writeback,file=my-sd-card.img
Note that cache=writeback may result in data not being written to
disk if the host system powers down unexpectedly (guest crashes
or powerdowns are not a problem).
Known issues:
- The beagle and beaglexm models still do not support USB networking
- There may be some problems with running multithreaded programs in
linux-user mode (LP:823902)
The source tarball is available at:
https://launchpad.net/qemu-linaro/+milestone/2011.08
Binary builds of this qemu-linaro release are being prepared and
will be available shortly for users of Ubuntu. Packages will be in
the linaro-maintainers tools ppa:
https://launchpad.net/~linaro-maintainers/+archive/tools/
More information on Linaro QEMU is available at:
https://launchpad.net/qemu-linaro
The Linaro Toolchain Working Group is pleased to announce the 2011.08
release of both Linaro GCC 4.6 and Linaro GCC 4.5.
Linaro GCC 4.6 2011.08 is the sixth release in the 4.6 series. Based
off the latest GCC 4.6.1+svn177703, it focuses on fixing bugs found
during the Android integration and in SMS. This is a quiet release
due to Linaro Connect.
Interesting changes include:
* Updates to 4.6.1+r177703
Fixes:
* LP: #736007 ICE immed_double_const at emit-rtl.c
* LP: #809768 ICE when compiling bionic's libm
* LP: #815777 Inconsistent packaging between tarball and root
directory names
Linaro GCC 4.5 2011.08 is the thirteenth release in the 4.5
series. Based off the latest GCC 4.5.3+svn177552, the release is
focused on maintenance.
Interesting changes in 4.5 include:
* Updates to 4.5.3+r177552
* Now builds for PowerPC
Fixes:
* LP: #736007 ICE immed_double_const at emit-rtl.c
* LP: #809768 ICE when compiling bionic's libm
* LP: #815435 ICE: insn does not satisfy its constraints
The source tarballs are available from:
https://launchpad.net/gcc-linaro/+milestone/4.6-2011.08
https://launchpad.net/gcc-linaro/+milestone/4.5-2011.08
Downloads are available from the Linaro GCC page on Launchpad:
https://launchpad.net/gcc-linaro
Mailing list: http://lists.linaro.org/mailman/listinfo/linaro-toolchain
Bugs: https://bugs.launchpad.net/gcc-linaro/
Questions? https://ask.linaro.org/
Interested in commercial support? inquire at support(a)linaro.org
-- Michael
On Thu, Aug 18, 2011 at 4:16 AM, Richard Earnshaw
<Richard.Earnshaw(a)arm.com> wrote:
> I was just browsing libgmp this afternoon and noticed that it really
> could do with an overhaul to support recent ARM chips.
>
> The ARM code seems to have been written for StrongARM; which is now
> almost obsolete (for example, it loads from a cache line it is about to
> write to in order to pre-allocate the line in the cache).
>
> It doesn't support v4T interworking.
>
> It doesn't make any use of v5 or later instructions.
>
> There is some Thumb(1) code, but again it has no support for
> interworking, is pretty poor and limited in scope.
>
> I'm not sure overall how useful this is to gcc performance; the library
> is needed to build GCC, but I think it's mostly there to support libmpfr.
>
> Nevertheless, there are other apps out there that make use of this
> stuff, including some crypto code, IIRC.
I looked at using gmp as a benchmark some time ago. The assembly
version is twice as fast as the C version already, which is nice. I
assume NEON would be a big improvement as well.
I had a quick poke through the dependencies in Ubuntu and came up with
the following popular packages that use libgmp or libmpfr:
* guile
* python-crypto
* ghc (Haskell)
* maxima
* darcs
Nothing earth shattering but probably worthwhile. I've registered:
https://blueprints.launchpad.net/linaro-toolchain-misc/+spec/improve-libgmp
so that we don't lose it.
-- Michael
Hi there. The 2011.08 release has been spun and is testing up well.
The 4.5 and 4.6 branches are now open so feel free to commit any
approved patches.
-- Michael
> . Would you be interested in adding a Firefox-based benchmark? As a large
> application it is a good testbed for LTO, FDO and other aggressive
> optimizations.
Sorry about the delayed response. I did notice your mail last week but
I was busy with our conference and then the first couple of days this
week have just disappeared with some internal training.
I would be interested in hearing how you get on with LTO and FDO on
ARM. Listening to Honza talking at the GCC unconference in London
about the memory usage for full LTO with trunk I did wonder what would
happen if we tried it on the ARM target to see what we got, but I
never managed to get around to trying anything there :) . We did look
at getting FDO working with Linaro GCC last cycle but there are still
a couple of issues with PGO in Linaro GCC 4.5.
With respect to LTO, the one problem we have currently is that the
Neon intrinsics aren't streamed out and streamed back in. So you might
have a few issues if your code uses arm_neon.h.
https://bugs.launchpad.net/gcc-linaro/+bug/823548 is an example of
this problem. This was fixed upstream and we probably just need to
backport that into our 4.6 tree. I've tried a backport this morning
and I think I have this right finally.
If you could do a build and a firefox benchmark run in about 30-60
minutes by all means please do let us know how you get on and what you
find. We've been steadily trying to improve the performance of the ARM
toolchain and the biggest improvements you'll notice will be with the
vectorizer but there will be other small improvements that you'll
notice in other general areas of code generation. We would be
interested in feedback about what can be done and to add to our queue
of things to look at and improve for the ARM port of GCC.
With respect to the images, Kiko's probably answered that bit.
cheers
Ramana
* GCC
Continued tracking down problems in my various broken patches. Fixed one
bug, investigated two more. Re-submitted the widening multiplies for
testing, and this time it returned with no problems. Yay, I can now
check it in next week.
Merged from upstream GCC 4.5. The launchpad import bug still exists
(although should not for much longer) so I had to ask on #launchpad to
get the imports done. Submitted the merged branch for testing.
Tried to merge GCC 4.6 similarly, but failed. Bzr just refused to play
ball, which was very frustrating. Michael Hope has now done the merge
instead.
* Other
On leave Wednesday and Friday.
* libquantum - running the SMSed version on an ARM machine did not show
significant improvement. Discussed it with Richard Sandiford.
Apparently in the SMS phase the instructions are in DI mode because
the loop contains 64-bit operations, whereas they are later
generated as 32-bit operations. This makes SMS less accurate and I'm
now looking into a version which disables DI mode operations.
* Started to look at the potential of SMS on libav. Initial runs of
Richard's microbenchmarks with SMS show some regressions as well as
improvements that I'm looking at.
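As a rough illustration of the DI mode point in the libquantum item above: a 64-bit addition that SMS sees as a single operation ends up as two dependent 32-bit operations on a 32-bit target. The sketch below spells that out in C. The function name add64_via_32 is made up for the example, and the ADDS/ADC comments describe the usual ARM instruction pair rather than the output of any particular compiler version.

```c
#include <stdint.h>

/* Add two 64-bit values using only 32-bit operations, the way a
   32-bit target has to once a DI mode operation is split.  The high
   word depends on the carry out of the low word, so the two halves
   cannot be scheduled independently.  */
void
add64_via_32 (uint32_t a_lo, uint32_t a_hi, uint32_t b_lo, uint32_t b_hi,
              uint32_t *r_lo, uint32_t *r_hi)
{
  uint32_t lo = a_lo + b_lo;      /* low word add (typically ADDS)  */
  uint32_t carry = lo < a_lo;     /* carry out of the low word  */

  *r_lo = lo;
  *r_hi = a_hi + b_hi + carry;    /* high word add with carry (typically ADC)  */
}
```

A schedule built around one DI mode instruction therefore underestimates both the number of issue slots used and the dependencies between them, which is one way the SMS estimate can become less accurate.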