Hand-crafted a simple strchr and compared it with libc's:
https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialStrchr
Interestingly, it's significantly faster than libc's on A9s, but on
A8s it's slower for large sizes. I've not really looked into why yet; my
implementation is just the absolute simplest Thumb-2 version.
Did some ltrace profiling to see what typical strchr and strlen sizes were,
and was a bit surprised by some of the typical behaviours: lots of cases
where strchr is used in loops to see if another string contains any one of
a set of characters (effectively an open-coded strpbrk), a few cases of
strchr being called with NULL strings, and the corner case in the spec that
allows you to call strchr with '\0' as the character to search for.
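Roughly the C equivalent of the "absolute simplest" version (a sketch, not
the actual Thumb-2 source); note how the '\0' corner case falls out of it
naturally:

    /* Minimal byte-at-a-time strchr sketch.  Searching for '\0' must
       return a pointer to the terminator, not NULL.  */
    char *simple_strchr (const char *s, int c)
    {
      while (*s != (char) c)
        {
          if (*s == '\0')
            return 0;        /* hit the terminator without a match */
          s++;
        }
      return (char *) s;     /* also correct when c == '\0' */
    }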
Trying some other benchmarks (pybench spends very little time in
libc; package builds of simple packages seem to have a more interesting
mix of libc use).
Sorting out some of the red tape for contributing.
Dave
It's a bit of a newbie question, but I've been wondering if you can
intermix hard-float VFPv3-D16 code with VFPv3-D32 code. You can, for the
following reasons.
According to the ABI:
* d0-d15 are used for floating-point parameters, no matter whether you
are D16 or D32
* d0-d15 are not preserved across function calls
* d16-d31 must be preserved across function calls
The scenarios are:
A D32 function calls a D16 function:
* The first 16 (!) parameters are passed in d0-d15
* Any remaining are passed on the stack
* The D16 function doesn't know about d16-d31, doesn't use them, and
hence preserves them
A D16 function calls a D32 function:
* The first 16 parameters are passed in d0-d15
* Any remaining are passed on the stack
* The D32 function preserves any of the d16-d31 registers that it
uses. Redundant, but fine.
A D32 function (A) calls a D16 function (B) which calls a D32 function (C):
* Parameters are OK, as above
* B doesn't use d16-d31 and hence preserves them
* C preserves any of the d16-d31 registers that it uses, which preserves
them from A's point of view
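A minimal sketch of mixing the two (hypothetical file names; assume a.c is
built with -mfpu=vfpv3 and b.c with -mfpu=vfpv3-d16, both -mfloat-abi=hard):

    /* b.c -- VFPv3-D16 code: only ever touches d0-d15.  */
    double scale (double x)
    {
      return x * 2.0;        /* argument and result live in d0 */
    }

    /* a.c -- VFPv3-D32 code: the compiler may keep values live in
       d16-d31 across the call, because they are callee-saved and the
       D16 callee cannot clobber registers it doesn't know about.  */
    extern double scale (double);

    double caller (double x, double y)
    {
      return scale (x) + x * y;
    }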
-- Michael
(short week: only three days)
RAG:
Red:
Amber:
Green: qemu: initial pull req sent; vfp-in-sighandlers patchset sent
Milestones:
| Milestone                    | Planned    | Estimate   | Actual     |
| finish virtio-system         | 2010-08-27 | postponed  |            |
| get valgrind into linaro PPA | 2010-09-15 | 2010-09-28 | 2010-09-28 |
| complete a qemu-maemo update | 2010-09-24 | 2010-09-22 | 2010-09-22 |
| finish testing PCI patches   | 2010-10-01 | 2010-10-22 | 2010-10-18 |
Progress:
* qemu: final polish on a patchset for saving/restoring VFP
and iWMMXT registers across linux-user mode signal handlers;
patch series sent to mailing list
* qemu: sent a pull request for a small set of ARM fixes
(make SMC undef; fix PXHxx; fix saturating add/sub; fix VCVT)
* reviewed arm semihosting SYS_GET_CMDLINE patch v2
* I now have enough qemu patches in flight that I'm tracking them
at https://wiki.linaro.org/PeterMaydell/QemuPatchStatus
(simple manual list for now, hopefully will be sufficient)
Meetings: toolchain, pdsw-tools
Plans
- qemu consolidation
Absences: (complete to end of 2010)
Thu/Fri 25-26 Nov; Fri 17 Dec - Tue 4 Jan inclusive.
(Dallas Linaro sprint 9-15 Jan.)
For the record, the thing I half-remembered on the call was:
http://gcc.gnu.org/ml/gcc-patches/2009-08/msg00697.html
and:
http://gcc.gnu.org/ml/gcc-patches/2009-09/msg02112.html
The problem is that all __sync operations besides __sync_lock_test_and_set
and __sync_lock_release are defined to be full barriers. Using something
like __sync_val_compare_and_swap for __arch_compare_and_exchange_val_*_acq
and __arch_compare_and_exchange_val_*_rel may on some architectures be too
heavyweight, since those macros only need acquire/after and release/before
barriers. See in particular:
http://gcc.gnu.org/ml/gcc-patches/2009-08/msg00928.html
from the first thread, where the feeling was that the future wasn't
these __sync builtins, but the new C and C++ atomic memory support.
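To make the cost concrete, a sketch: the __sync form implies a full barrier
on both sides, even when the caller only needs acquire semantics:

    /* On ARMv7 this expands to roughly dmb; ldrex/strex loop; dmb --
       a full barrier on each side -- even though an "acquire" CAS only
       needs the trailing barrier.  */
    int cas_acq (int *p, int oldv, int newv)
    {
      return __sync_val_compare_and_swap (p, oldv, newv);
    }

    /* The C/C++ atomics work mentioned above would let you request
       exactly the ordering required, along the lines of:
         atomic_compare_exchange_strong_explicit (p, &oldv, newv,
                                                  memory_order_acquire,
                                                  memory_order_relaxed);  */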
Probably already known, sorry. I just wasn't sure that trying to
convert everyone (not just ARM) to __sync_* was necessarily going
to go down well.
Richard
== Last Week ==
* Reached the point with understanding libunwind where I can begin
writing patches for parsing unwind information out of .ARM.exidx and
.ARM.extab ELF sections.
== This Week ==
* Begin writing support for ARM-specific unwind information to libunwind.
--
Zach Welch
CodeSourcery
zwelch(a)codesourcery.com
(650) 331-3385 x743
== Linaro GCC ==
* Continued looking at big-endian/quad-vector patch: attempted to
figure out the proper semantics for vec_extract in big endian mode
(about 1 day). Put on hold temporarily to work on lp675347, QT failing
to build due to constraint failure in inline asm statements used for
atomic operations: found the patch which introduced the failure, and
suggested a workaround to the OP. Came up with a plausible-looking
patch, and started testing it, after spending some time trying to
figure out why ARM Linux mainline doesn't build at present. Patch sent
upstream.
Hi Richard,
As per the discussion at this morning's call, I've reread the TRM and I
agree with you about LSLS being the same speed as TST (1 cycle).
However, as we agreed, UXTB does look like 2 cycles versus AND's 1 cycle.
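For anyone following along, the transformation in question turns a bit test
like this (a sketch) into a shift that moves the tested bit into the sign
flag, trading a 32-bit encoding for a 16-bit one:

    extern void do_work (void);

    void maybe (unsigned int x)
    {
      if (x & 0x800)         /* test bit 11 */
        do_work ();
    }

    /* Two ways the compiler can test the bit:
         tst.w  r0, #2048     @ 32-bit encoding; Z set from x & 0x800
         beq    .Lskip
       versus
         lsls   r3, r0, #20   @ 16-bit encoding; bit 11 lands in N
         bpl    .Lskip                                                */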
On the space v perf theme, one thing that would be interesting to know is
whether there are any icache/issue stage limitations;
i.e. if I have a stream of 32-bit Thumb-2 instructions that are all listed
as 1 cycle and are all in i-cache, can they be fetched
and issued fast enough, or is there a performance advantage to short
instructions?
Dave
LP:663939 - Thumb2 constants
* Continued testing, found a few bugs. Tidied a few bits up.
* Wrote some new testcases to go with the patch.
LP:618684 - ICE
* Began looking at this one. So far I can't reproduce it. I have a
debuggable native toolchain building, but it's been delayed by hardware
issues.
In the course of testing I discovered that the ARM FSF config wasn't
testing the right thing, so I've begun work on a new, more appropriate
FSF build/test config for Linaro work.
Also found that the SD card rootfs in my IGEPv2 board was corrupted. I've
restored it from backup, and now it's working once more.
== Linaro and upstream GCC ==
* Linaro launchpad issues:
- LP #672833, x64-64 varargs regression: after testing pushed bzr branch
for merging.
- LP #634738, inefficient low bit extraction: some discussion with Yao.
- LP #618684, ICE when building ziproxy: looked into it and quickly found
it is no longer reproducible on Linaro 4.5 trunk.
* Worked on some GCC bugzilla PRs:
- PR44557, ICE in Thumb-1 secondary reload: this should be fixed by a
change of the scratch operand constraint of "reload_inhi" from "r" to
"l". Interesting to note that this was from the
merged-arm-thumb-backend-branch merge, from about 10 years ago.
- PR46508: libffi fails to build due to VFP asm instructions; it seems to
need a '.fpu vfp' directive. Probably missed earlier because my toolchain
was configured with --with-fpu=vfp.
- PR45416: 4.6 code generation regression on ARM, after expand from SSA
changes. Looking at this currently.
== This week ==
* Look at Linaro issues with higher priority.
* Continue working on GCC PRs.
== Linaro GCC ==
* Merge ldm/stm patch to Linaro 4.5 tree.
Found two regressions in pass ce3 at the last minute before proposing the
merge request. Reverted one of the ldm/stm patches relating to ifcvt.
Completed the testcase in the branch.
* Try Richard E.'s "TST to LSLS transformation" patch on cortex-a9 with
FFMPEG. No speed improvements.
* Various Linaro GCC Bug fixing.
** LP:634738
Following the fix for GCC PR40697, created a new patch which emits extzv
or shifts rather than loading constants in some cases. Tested on FSF GCC
trunk with no regressions. However, found a regression by inspection in
pr44999.c, in which ubfx (4 bytes) is generated rather than uxth
(2 bytes); uxth is produced by the combiner from ashift and lshiftrt
(see the sketch below). While reading arm.c, I found that constant
handling in Thumb-2 could be improved to some extent.
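(For reference, the uxth-vs-ubfx case - a sketch of the kind of code in
question:)

    /* The combiner turns the ashift + lshiftrt pair here into a
       zero-extension of the low halfword ...  */
    unsigned int low16 (unsigned int x)
    {
      return (x << 16) >> 16;
    }
    /* ... which wants to be the 2-byte  uxth r0, r0  rather than the
       4-byte  ubfx r0, r0, #0, #16  currently produced.  */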
** LP:633243
Re-implemented the regrename improvement as Eric B. suggested on
gcc-patches. Spent some time understanding the GCC APIs related to hard
registers. Tested on x86_64-linux. No regressions.
** LP:638935
Updated my tree to FSF trunk, and found that the RTL sequence for the
fldm/fstm peephole disappears due to the fix for PR45722. Extended
arm-ldmstm.ml to support VFP. The peephole and RTL patterns for VFP are
done; will revise arm.c:{load,store}_multiple_sequence to accept VFP
data.
Fixed a bug in the ldm/stm peephole when the starting offset is negative.
== This week ==
* LP:634738: Figure out how uxth is produced by combiner.
* LP:633243: Test it on ARM.
* LP:638935: Revise {load,store}_multiple_sequence to accept vfp data.
--
Yao (齐尧)
Re my recent email "Upstream GCC feature freeze", I think we're agreed
that we need to create a branch that tracks GCC 4.6 development, but has
our own performance improvements included. The question is where to host it?
Option 1: Launchpad/bzr
Pros:
* We need no permission to do it
* The branch will naturally evolve into our 4.6 release series in time.
* The 3-way merge works well (if slowly)
* We can include patches that we have no intention of posting upstream
ever
* Our patch tracker will Just Work.
* Merge requests will be available.
Cons:
* Bzr ;)
* It's hidden away from the view of most GCC developers
Option 2: GCC SVN branch
Pros:
* We can work in the open, submitting patches via gcc-patches, as usual
* The final merge to GCC trunk (come stage 1) will be eased, a little
Cons:
* We can't really apply anything we want just for ourselves
* we may end up maintaining an LP branch shadowing the svn branch
* When we do want to do 4.6 in LP, we'll have to backport all our
patches from 4.7, and this may no longer be straightforward.
* Write permissions not clear.
* Although I think you can just go ahead and do it?
OK, so I'm sure I've missed some big ones. Please discuss! ;)
I think the big question here is, when will we start wanting to make
(unstable/experimental) Linaro GCC 4.6 releases? If we want to do it
early, then we'll have no choice but to have an LP branch to release from.
Andrew
Like everyone in the Toolchain WG, I will share my activities from last week:
1. cross compilers for archive
- discussed with doko about dropping update-alternatives use
- wrote gcc-defaults-armel-cross 1.4 which does proper symlinks for cross
compilers
- wrote gcc-4.5-armel-cross 1.41 which removes update-alternatives support
- wrote gcc-4.4-armel-cross 1.37 which removes update-alternatives support
- wrote armel-cross-toolchain-base 1.53 which has all updates which I had
- sent all of them to Steve for review
Status of changes:
- default version of armel cross compiler will be 4.5 like it is in Natty
- both 4.4 and 4.5 will be provided as it is for native
- any traces of update-alternatives use should be removed
Needs to be done:
- adding conflicts on older cross compilers to gcc-defaults-armel-cross
Order of upload to archive:
- armel-cross-toolchain-base
- gcc-4.5-armel-cross
- gcc-4.4-armel-cross
- gcc-defaults-armel-cross
2. Checked whether a few old bugs still apply:
- Bugs #646729, #637454, #671455 are done with armel-cross-toolchain-base 1.52
(landed in maverick-proposed)
Regards,
--
JID: hrw(a)jabber.org
Website: http://marcin.juszkiewicz.com.pl/
LinkedIn: http://www.linkedin.com/in/marcinjuszkiewicz
Short week.
Finally got an external hard drive for my Beagle - it makes natively
building things sanely possible.
Got eglibc cross-built (thanks to Wookey for pointing me in the right
direction with the magic incantation dpkg-buildpackage -aarmel
--target=binary) and rebuilding it is now easy. I have a version with the
Neon variant of my memset built into it - it doesn't seem to make a
noticeable difference to my ghostscript benchmark though.
Pandas aren't likely to turn up until mid-December; arranging to borrow
an A9 is turning out to be difficult, but it looks like we should be able
to get access to the one in the London datacentre - although it has a
disc problem at the moment.
I did manage to get a colleague to try my tests on his own Toshiba AC-10
(Tegra-2 - no Neon); the graphs had approximately the same shape as my
previous Panda tests. Memchr looked pretty good on there.
Also trying to look at the sign-off I need for various libc access.
Dave
I mainly worked on the atomic memory operations blueprint/item:
* posted an updated patch for #643171 on the libc-ports ml after running the
glibc testsuite natively on the vexpress
* continued to learn about the ARM instructions involved :)
* started to write some gcc testcases that scan the asm output of the __sync
builtins (mainly to detect differences between the gcc versions - not sure how
useful those tests would be for upstream as the sequences may easily change;
a sketch of the idea follows below)
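The sketch mentioned above: a hypothetical dejagnu-style test, assuming the
ARMv7 barriers come out as dmb (the exact pattern and count are precisely
what tends to change between versions):

    /* { dg-do compile } */
    /* { dg-options "-O2" } */

    int counter;

    void bump (void)
    {
      __sync_fetch_and_add (&counter, 1);
    }

    /* Expect a barrier on each side of the ldrex/strex loop.  */
    /* { dg-final { scan-assembler-times "dmb" 2 } } */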
Ken
RAG:
Red:
Amber:
Green:
Milestones:
| Milestone                    | Planned    | Estimate   | Actual     |
| finish virtio-system         | 2010-08-27 | postponed  |            |
| get valgrind into linaro PPA | 2010-09-15 | 2010-09-28 | 2010-09-28 |
| complete a qemu-maemo update | 2010-09-24 | 2010-09-22 | 2010-09-22 |
| finish testing PCI patches   | 2010-10-01 | 2010-10-22 | 2010-10-18 |
Progress:
* Most of this week spent at the Meego conference in Dublin.
This seemed to be a rather apps-developer centric conf,
with not much of interest on the low-level side. There were
a few useful talks/conversations, though.
* Intel were giving away Atom-based netbooks to all attendees;
that's a lot of developers who are going to be testing and
optimising their apps for Atom devices rather than ARM...
* qemu: looked at https://bugs.launchpad.net/bugs/668799 ;
we don't seem to be taking the right lock before we manipulate
the graph of translation blocks. I have a fix which stops the
reported segfault, but the code has a number of "XXX not thread
safe" and "FIXME: not SMP safe" comments and generally doesn't
seem to have a coherent locking design :-(
* qemu: sent some minor patches upstream:
+ enable iwmmxt coprocessors in user mode
+ remove some unused functions from target-arm and target-sparc
+ fix a failure to build bug in a makefile
* qemu: some review of a patch to fix semihosting SYS_GET_CMDLINE
Plans
- qemu consolidation
- post-toolchain-review, sort out some milestones for
this report
Absences: (complete to end of 2010)
Thu/Fri 25-26 Nov; Fri 17 Dec - Tue 4 Jan inclusive.
(Dallas Linaro sprint 9-15 Jan.)
== This week ==
Started looking at STT_GNU_IFUNC support in BFD. There were a couple
of janitorial changes I needed to make in order to prepare elf32-arm.c
for the main patch. I tested those separately and submitted them upstream:
http://sourceware.org/ml/binutils/2010-11/msg00330.html
http://sourceware.org/ml/binutils/2010-11/msg00331.html
I've now finished a prototype implementation of the STT_GNU_IFUNC
support itself. It wasn't as mechanical as I'd originally assumed,
which was nice.
Tests that I've run by hand seem to be doing the right thing.
I've now started writing tests for the testsuite (meaning:
I've completed 1 test so far).
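For anyone unfamiliar with the feature, a minimal sketch of what it looks
like to a C programmer (have_fast_variant() is a made-up probe; the
attribute spelling is the generic GNU one):

    /* Two implementations with the same signature ...  */
    static int add_generic (int a, int b) { return a + b; }
    static int add_fast    (int a, int b) { return a + b; }

    static int have_fast_variant (void) { return 0; }   /* made-up probe */

    /* The resolver runs at load time: instead of a normal relocation,
       the linker emits an "irelative" one telling the loader to call
       the resolver and patch in whichever pointer it returns.  */
    static int (*resolve_add (void)) (int, int)
    {
      return have_fast_variant () ? add_fast : add_generic;
    }

    int add (int, int) __attribute__ ((ifunc ("resolve_add")));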
== Next week ==
* Add more tests, including Thumb coverage.
* Start on the libc changes.
Richard
Doing an allmodconfig build on the kernel, I get the following:
  CC      arch/arm/kernel/asm-offsets.s
In file included from /home/rob/proj/git/linux-2.6-dt/include/linux/kernel.h:12,
                 from /home/rob/proj/git/linux-2.6-dt/include/linux/sched.h:54,
                 from /home/rob/proj/git/linux-2.6-dt/arch/arm/kernel/asm-offsets.c:13:
/usr/lib/gcc/arm-linux-gnueabi/4.4.5/include/stdarg.h:40: internal compiler error: Segmentation fault
It occurs with the Maverick 4.4 and 4.5 and CodeSourcery 2009Q1 cross
toolchains. It's confirmed by CodeSourcery here:
http://www.codesourcery.com/archives/arm-gnu/msg03719.html
What's the status on this issue? I didn't see anything in Linaro gcc
bugs that looks related.
Rob
The STT_GNU_IFUNC blueprint:
https://wiki.linaro.org/WorkingGroups/ToolChain/Specs/Binutils-STT_GNU_IFUNC
says "the ARM EABI will be updated to support STT_GNU_IFUNC's requirements".
I suppose the most obvious thing that needs to be defined is the relocation
number for R_ARM_IRELATIVE. What's the best way of handling that?
The main options seem to be:
1. Reserve a relocation number with ARM first (129?).
2. Go ahead and implement it without having the EABI updated.
See whether the results are good before deciding whether
to bless it in the EABI.
3. Since STT_GNU_IFUNC is GNU-specific, treat R_ARM_IRELATIVE
as GNU-specific too, and pinch one of the R_ARM_PRIVATE relocs.
I'm pretty sure (3)'s not the way to go, but I was aiming for
completeness. :-)
Richard
Hi,
On 17 November 2010 05:35, Michael Hope <michael.hope(a)linaro.org> wrote:
> 1. How easy is it to frequently merge in SVN? It used to be terrible
> as you had to manually track the merges. These days can you do a 'svn
> merge trunk' and have it just work?
I asked Mike Meissner to answer this question. Mike is very experienced in
GCC and GCC SVN branch management. I am attaching his reply.
Ira
I sent this recently to ppc64-toolchain(a)linux.ibm.com on how to use
svnmerge to manage branches:
This script (also ~meissner/meissner/bin.sh/svnmerge) is what I use to
update svn directories such as ibm-gcc-4_5-branch. I think I originally
got it from Ben E., and it may be in the contrib directory.
Typically, the way I start a branch such as my normal power7-meissner
branch is to do the following:
$ export TRUNK="svn+ssh://@gcc.gnu.org/svn/gcc/trunk"
$ export BNAME="power7-meissner"
$ export BRANCH="svn+ssh://@gcc.gnu.org/svn/gcc/branches/ibm/$BNAME"
$ export SRC="$HOME/fsf-src"
$ svn delete -m"delete old branch" $BRANCH
$ svn copy -m"Clone new branch" $TRUNK $BRANCH
$ cd $SRC
$ svn co $BRANCH
$ cd $BNAME
$ svnmerge init
$ svn update # this is sometimes needed
$ svn commit -m'Create svnmerge init info'
$ export REV="xxxx" # substitute the subversion revision id for xxxx
$ echo "power7-meissner branch, based on $REV." > gcc/REVISION
$ touch gcc/ChangeLog.power7
$ <edit gcc/ChangeLog.power7 to create initial contents>
$ svn add gcc/ChangeLog.power7 gcc/REVISION
$ svn commit -m'Add REVISION to branch'
In particular, creating gcc/REVISION allows you to tell what subversion
revision the source is based on. You can find the same information via:
$ svn propget svnmerge-integrated
but it is a lot easier if you have a compiler tree to just run gcc -v.
After you do a propget, you will need to do an svn update.
In this case, I use gcc/ChangeLog.power7 to hold the ChangeLog entries
local to the branch. That way I can see a summary of the changes without
polluting the normal ChangeLog files.
To do merges, you need to make sure that all local changes are checked into
the branch. Then do:
$ cd $SRC/$BNAME
$ svnmerge merge
$ <edit gcc/REVISION and ChangeLog.power7 to indicate merge>
$ <test merged files, if satisified, check them in>
$ export REV="xxxx" # substitute the subversion revision id for xxxx
$ svn update # just in case
$ svn commit -m"Update to subversion id $REV"
Now, to create a patch file, first make sure the files are checked in:
$ cd $SRC/$BNAME
$ export PATCHFILE="$HOME/patches/mypatch.patch01"
$ <make ChangeLog entries in $PATCHFILE>
$ svn diff --old $TRUNK --new . -r $REV >> $PATCHFILE
$ <delete ChangeLog.power7, REVISION, property changes from $PATCHFILE>
$ submit patch
To see if there are changes to be merged in:
$ svnmerge avail
For example, on the ibm-gcc-4_5-branch, changes 164657-166510 were
available to be merged in when I originally wrote this message on the 9th
of November; Peter has subsequently updated the merge.
I put the following in ~/.subversion/config to provide my own diff command:
### Set diff-cmd to the absolute path of your 'diff' program.
### This will override the compile-time default, which is to use
### Subversion's internal diff implementation.
diff-cmd = /home/meissner/bin.sh/svndiff
Every so often, I find svnmerge misses things, for example when deleting
directories. It is helpful to do a diff against mainline every so often to
make sure you are not missing newly created files, keeping files that
should be gone, or just missing a change.
I'll include svndiff for the smarter svndiff command and mrm-changelog.el
that looks for the ChangeLog.<name> files I use in different branches. Feel
free to contact me to clarify some stuff.
--
Michael Meissner, IBM
5 Technology Place Drive, M/S 2757, Westford, MA 01886-3141, USA
meissner(a)linux.vnet.ibm.com fax +1 (978) 399-6899
(See attached file: svnmerge)(See attached file: svndiff)(See attached
file: mrm-changelog.el)
Hi there,
There's a recording of this morning's public plan review available on
the wiki at:
https://wiki.linaro.org/Releases/1105/PublicPlanReview
Also included is a copy of the slides and supporting documents. Might
be interesting for those who missed it.
-- Michael
A heads up. I'd like to have a brainstorming session on potential
Thumb-2 performance improvements in GCC. Think about what you'd like
in such a session and what preparation should be done, and we can
discuss the discussion (heh) on Monday.
-- Michael
Hi there,
I noticed that there's a QEMU users forum at:
http://adt.cs.upb.de/quf/
and that the abstract submission phase is still open, and closes
November 28th. It would be great to see some participation there and
help identify other key people interested in using and improving QEMU.
--
Christian Robottom Reis | [+55] 16 9112 6430 | http://launchpad.net/~kiko
Linaro Engineering VP | [ +1] 612 216 4935 | http://async.com.br/~kiko
Zach Welch --
== Last Week ==
* Continued working on libunwind support. Trying to figure out why my
signal frame detection doesn't work as expected.
* Kept pace with the ltrace tree, testing recent patches on ARM.
== This Week ==
* Continue to work on libunwind signal frame detection.
Julian Brown --
== Linaro GCC ==
* Looked at issue #663198 (double-precision register expected) -- which
was already fixed on the linaro branch, but the bug was reported against
a version just prior to that -- and #667490, which involved a possible
problem with the NEON "load 0.0" patch. Experimented for a while with the
latter, but could not find anything wrong. Followed up upstream, and
requested a stand-alone test case.
* Worked on a proper solution to the VMOVN-in-big-endian-mode problem,
discovering in the process that several other quadword-register
operations were similarly broken. WIP patch sent to linaro-toolchain for
discussion, but it needs a little more work before it can be applied.
Peter Maydell --
Progress:
* qemu: more cleanup of signal handler VFP patchset;
I think I just need to add iwmmx support and it's good
* qemu: VCVT: found yet another bug, did final patchset
cleanup: submitted to upstream list [8 patch series]
* qemu: submitted a trivial patch to fix a problem where
__get_user/_put_user macros had an unnecessary local var
which could clash with a var being used by the macro user
* set up a tree on git.linaro.org which we can use for
a branch to make pull requests for ARM qemu fixes
* did a rough estimate of time to do an Eagle qemu model
(6 months + testing/bug fixing time)
Issues:
* lost some time to a problem where Linux VMs stopped being
able to talk to the LDAP server; however I have a workaround
and IT are investigating
Meetings:
* toolchain, toolchain standup, pdsw-tools, PD doughnuts
Plans
- attend Meego conference in Dublin (Nov 15-18 inc travel)
http://conference2010.meego.com/
- start on qemu consolidation by upstreaming various ARMv7
correctness fixes
Andrew Stubbs --
== GCC 4.5 ==
* Continued working on LP:663939.
* I still have not worked out how best to fix the constant
propagation problem that has been thwarting my optimization patch;
however, I think I understand it better now.
* I have started on adding replicated pattern support to the
constant splitting. Initial results were good, but I discovered that I
had to rearrange the code somewhat to get the cost estimation and
negative/inverted constant support working correctly. So far, I have
it successfully using 16-bit replication pattern constants for
set/add/subtract operations (see the sketch after this list). Other
operations appear broken at the moment, but it's almost certainly just
a few tweaks required.
* TODO: Add support for 32-bit replicated pattern constants. Adjust
some of the other two-instruction constant generation techniques to
let them fall through to this new code, where it would be beneficial.
* Pushed the latest set of GCC patches into Linaro GCC 4.5.
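For anyone unfamiliar with the replicated patterns mentioned above: Thumb-2's
modified immediate encodings can replicate a byte across both halfwords or
all four bytes of a word, so constants of that shape need no movw/movt pair
or literal-pool load. A sketch (constant chosen arbitrarily):

    unsigned int f (unsigned int x)
    {
      /* 0x00ab00ab matches the replicated pattern 0x00XY00XY, so this
         can be a single  add r0, r0, #0x00ab00ab  in Thumb-2.  */
      return x + 0x00ab00ab;
    }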
Chung-Lin Tang --
== Linaro GCC ==
* Linaro #672833: one batch of my backports of Bernd's postreload
patches exposed some varargs regressions on x86-64 and was reverted by
Michael. Tested the compiler and found the problem was fixed on mainline
rev. 162384. Backporting this revision plus the postreload patches
fixed the regressions; x86-64 bootstrap also verified okay. There is
however another PR45027 fix that was needed on trunk, but it needs a bit
more clarification whether it is needed on a 4.5 compiler.
* Linaro #641397, CS issue #6753: bitfield optimization. Patch tested
without regressions and posted for CS internal review; should soon be
pushed for Linaro merge.
* Started looking further at some GCC DF, IRA internals.
== This week ==
* Look at more Linaro issues.
* Maybe start looking at some GCC bugzilla PRs.
* There is a local ARM technical event in Hsinchu on Thursday, might
go and look around.
Yao Qi --
== Linaro GCC ==
* Mainline patch backport to Linaro 4.5.
** Patch "Fix an if statement in arm_rtx_costs_1". Verified on Linaro
4.5. 0.1% smaller on size, and 0.2% faster on speed. Merged to Linaro
4.5 by Andrew S.
** Tried Nathan F.'s ifcvt-cond-move patch on Cortex-A8 with -O2/-O3. No
improvements in speed or size for EEMBC.
** Bernd's ldm/stm patch. Analyzed the reason for the regression on
Linaro 4.5. Thought something was wrong in the IRA RTL dump, and spent
some time understanding the IRA RTL dump log. Thanks to Chung-Lin, I
realized that the IRA dump is correct, and looked back at the ARM RTL
patterns for ldm/stm. Compared the RTL patterns in 4.5 and 4.6 and found
some differences. Regenerated ldmstm.md for Linaro 4.5 after updating
arm-ldmstm.ml a little. The regressions go away!
No speed improvement, but code is 0.2% smaller on EEMBC; I'd still
prefer to merge it to Linaro 4.5.
(OCaml is an interesting language, but not easy to learn, or to read in
vim.)
* Some discussion on Linaro development process.
* My regrename improvement patch (re. LP:633243). Communicated with
Eric Botcazou back and forth, but the current patch is still too
target-dependent for him, as a middle-end maintainer. It still needs
some improvements.
* Built FSF GCC trunk. The CLooG version requirement in configure is
wrong; reverted configure to the previous version to get past the
version check during GCC configure.
Hi there. Could everyone in the toolchain working group start sending
their activity reports to this list please? Put [ACTIVITY] at the
start of the subject line so that they can be filtered.
Ta,
-- Michael
Hi there. Attached are the status reports from the Toolchain WG
members for last week.
-- Michael
Ken Werner --
Hi Michael,
* got access to the internal wiki/calendar/email :)
* continued to setup the borrowed vexpress board
* upgraded to the Linaro 10.11 release
* encountered various issues until I found that the /etc/hosts is empty
(#674090)
* learned that the SD card issue is a known problem (#632798)
* the network interface sometimes dies if stressed (Matt was able to
reproduce this)
* the disabled CONFIG_SWAP is being tracked as #672656
* sometimes the entire system hangs (when under heavy load?)
* David noticed that /proc/cpuinfo lacks neon support (but his string
benchmark/testcase ran fine)
* wondering why the kernel reports only about 800 BogoMIPS while it's around
2k on the panda board
* started to work on the atomic memory operations item
* identified the relevant GCC patches
* still looking for a good way to verify the GCC support
* posted a patch on the glibc-ports ml with regard to #643171
David A Gilbert --
I managed to get to try Ken Werner's Versatile Express board with an A9MP
tile; the shape of the graphs matches that from the Panda, but the raw
performance is down by a factor of about 3 - I'm guessing it's clocked
lower for some reason.
It confirms however that the Neon behaviour I was seeing with memset is
not Panda/OMAP4-specific; no one has replied to my post to
linaro-toolchain. It's a difficult situation in that my fastest memset on
the Beagle is with Neon, and my fastest on the A9 is without Neon - what
would you select on?
I've just finished writing memchr tests and my first crack at a faster
version; I realised I could use the same trick that I had used for strlen
and it works nicely - it seems to be about 50% faster than the libc
version. I've not tested against any other versions yet.
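Assuming the strlen trick referred to is the classic word-at-a-time
zero-byte test, a sketch of the idea:

    /* Nonzero iff x contains a zero byte; the lowest zero byte is
       flagged reliably, which is all a strlen/memchr scan needs.
       For memchr, first XOR each word with the search byte replicated
       into every byte, so matching bytes become zero.  */
    static unsigned int has_zero_byte (unsigned int x)
    {
      return (x - 0x01010101u) & ~x & 0x80808080u;
    }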
Paul McKenney hasn't replied yet about the OSSC stuff, but apparently
he's out travelling and back next week, so I'll catch him then.
I tried preloading my faster memset into ghostscript, but found it was
blatantly ignoring it - I think the memset is being called from somewhere
inside libc. I managed to get xdeb to cross-build me a libc but haven't
yet got my changes into it.
My order for a USB hard drive for my Beagle seems to have been delayed by
the supplier; I'm pushing this but it's starting to be a bit of a pain.
Richard Sandiford --
== Last Week ==
* Pinged my GAS fix for Thumb PLT branches to locally-defined symbols.
Committed it to binutils trunk and 2.21 branch after approval. This
fixes the libgcc.so build failure that I was seeing with GOLD.
* Worked on a patch to fix GOLD's handling of non-function references
to weak undefined symbols. This ended up touching every backend
(i386, x86_64, ARM, Power and SPARC) and was quite invasive, so it
took a while in the end. Committed to binutils trunk after approval.
* Ran more tests, both with -marm and -mthumb. I'm getting identical
GCC test results (including gfortran and objc) for GOLD and BFD ld, so
I think we're at the stage where GOLD is a viable replacement for the
BFD linker.
== Next Week ==
* I'll start looking at the IFUNC support.
* I'll take another look at launchpad bug 665598.
Ira Rosen --
Here is this week's report:
1. BeagleBoard installed, now "playing" with it
2. Continued to work on auto-detection of vector size
3. Looked into mixed vector sizes
4. Learning about vld and vst instructions
It looks like I won't be able to participate in Wed calls, since I am alone
with the kids on Wednesday evenings.
Hi all,
I've hit a probable assembler bug trying to build a Thumb-2 kernel:
Trying to assemble the attached file, I get:
arch/arm/kernel/relocate_kernel.S: Assembler messages:
arch/arm/kernel/relocate_kernel.S:10: Error: invalid offset, value too big (0xFFFFFFFFFFFFFFFC)
arch/arm/kernel/relocate_kernel.S:11: Error: invalid offset, value too big (0xFFFFFFFFFFFFFFFC)
arch/arm/kernel/relocate_kernel.S:58: Error: invalid offset, value too big (0xFFFFFFFFFFFFFFFC)
arch/arm/kernel/relocate_kernel.S:59: Error: invalid offset, value too big (0xFFFFFFFFFFFFFFFC)
The code appears correct and reasonable, except that there should be a
.align directive before the data words at the end of the file (though
adding this doesn't fix the error).
Assembling as ARM (i.e., without -mthumb), or deleting the .globl lines
associated with the affected target symbols, makes the problem go away.
I believe this may already be tracked by CodeSourcery as issue #8775 (?)
Has anyone hit this issue before? Is it fixed upstream?
Any help much appreciated.
Cheers
---Dave
Hi,
I've been looking at some basic libc routine optimisation and have a
curious problem with memset; I wondered if anyone can offer some insights.
Some graphs and links to code are on
https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemset
I've written a simple memset in both with-Neon and without-Neon varieties
and tested them on a Beagle (C4) and a Panda board. I'm finding that the
Neon version is a bit faster than the non-Neon version on the Beagle, but
a LOT slower on the Panda - and I'd like to understand why it's slower
than the non-Neon version. I'm guessing it's some form of cache
interaction.
The graphs on that page are all generated by timing a loop that repeatedly
memsets the same area of memory; the X axis is the size of the memset.
Prior to the test loop the area is read into cache (I came to the
conclusion the A8 doesn't write-allocate?). There are two variants of the
graphs - an absolute set, in MB/s on the Y axis, and a relative set (below
the absolute) plotted relative to the performance of the libc routines.
(The ones below those pairs are just older versions.)
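The harness is essentially this shape (a sketch, not the actual code from
the bzr branch; buffer size and fill value are made up):

    #include <string.h>
    #include <sys/time.h>

    static char buf[1 << 20];

    /* Report MB/s for repeated memsets of one area; the graphs sweep
       'size' along the X axis.  */
    static double memset_mb_per_s (size_t size, int iters)
    {
      struct timeval t0, t1;
      volatile char sink = 0;
      double secs;
      size_t i;
      int n;

      for (i = 0; i < size; i++)    /* pre-read the area into cache */
        sink ^= buf[i];

      gettimeofday (&t0, NULL);
      for (n = 0; n < iters; n++)
        memset (buf, 0x5a, size);
      gettimeofday (&t1, NULL);

      secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
      return (double) size * iters / secs / 1e6;
    }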
If you look at the top-left graph on that page you can see that on the
Beagle (left) my Neon routine beats my Thumb routine by a bit (both
beating libc). If you look at the top right you see the Panda performance,
with my Thumb code being the fastest and generally following libc, but the
Neon code (red line) topping out at about 2.5 GB/s, which is substantially
below the peak of the libc and ARM code.
The core loop of the Neon code (see the bzr link for the full thing) is:
4:
subs r4,r4,#32
vst2.8 {d0,d1,d2,d3}, [ r3:256 ]!
bne 4b
while the core of the non-Neon version is:
4:
subs r4,r4,#16
stmia r3!,{r1,r5,r6,r7}
bne 4b
I've also tried vst1 and vstm in the Neon loop and it still won't match
the non-Neon version.
All suggestions welcome; I'd also appreciate it if anyone can suggest
which particular limit it's hitting - does anyone have figures for the
theoretical bus, L1 and L2 write bandwidths for a Panda (and Beagle)?
Thanks in advance,
Dave
Hi there. I've uploaded a draft of the slides and notes for next week's
public review at:
http://bazaar.launchpad.net/~linaro-toolchain-wg/+junk/publicreview1105/fil…
'Toolchain Public Review 11.05.odp' is a set of slides I'll talk to.
The first 15-20 minutes will go through these to describe our focus
and goals and how they tie together the blueprints and priorities.
The rest of the session will go through the current blueprints and
priorities. See:
Toolchain Blueprints (short).pdf
for the summary version and:
Toolchain Blueprints (long).pdf
for the long version. The long version is interesting if you can't
find a particular tool or technology. It may be small enough to be
called out as a single work item.
These are only a draft, but I realised I haven't shared the plans with
the rest of the group very well and Monday's meeting won't be the
best.
I'm on holiday tomorrow but feel free to send me any comments,
-- Michael
Hi,
I started to look into mixed vector sizes (in the same loop). My main
reason for this was to allow widening and narrowing instructions, which
have different vector sizes for src and dest, to work properly. My example
was widen_mult (int = short * short); I thought its implementation was not
optimal, but now that I have a working GCC mainline for ARM, I see that it
works just fine.
short ub[], uc[];
int c[];

for (i = 0; i < n; i++)
  c[i] = ub[i] * uc[i];
is compiled as:
.L11:
add r1, r1, #1
vldmia r4!, {d18-d19}
cmp r5, r1
vldmia ip!, {d16-d17}
vmull.s16 q10, d18, d16
vstr d20, [r3, #-32]
vstr d21, [r3, #-24]
vmull.s16 q8, d19, d17
vstr d16, [r3, #-16]
vstr d17, [r3, #-8]
add r3, r3, #32
bhi .L11
which looks good to me at least from the vmull point of view.
Does anyone have an example where mixed vector size instructions are not
used properly?
Another reason for mixed sizes could be cases where only part of the loop
can be vectorized with the wider vectors. I don't know how common this is.
Are there any other reasons to implement mixed vector sizes? I understand
that this can be a useful feature, I am just not sure it's the most
important one.
Thanks,
Ira
I've been going through the ChangeLog for the release and am having
trouble justifying some of the changes brought in. In particular:
* -fstrict-volatile-bitfields, which is more appropriate for bare
metal/kernel code
* Cortex-M4 support
* C locale support in libstdc++-v3
The march/mcpu clean up is OK but marginal.
Our focus is time-based performance on the Cortex-A series, with an
implied priority of applications over kernel/bare metal. This is a very
narrow view, but every non-performance line of code we bring in can also
bring in a bug.
Any thoughts? For those who are looking at using our toolchain, is
earlier access to other toolchain improvements interesting?
-- Michael
Hi all,
As you may or may not know, upstream GCC has now entered 'stage 3' of
its development cycle. This will last until spring.
This means that they are only accepting bug fixes and documentation
improvements. New features and any performance improvements must wait
until GCC 4.6 branches, prior to release, and GCC 4.7 development opens.
During this process, our usual preferred work flow (upstream first) will
not work, so we'll have to do something else.
Here's my proposal:
* Create a new Launchpad branch for GCC 4.6.
* Synchronize this branch with upstream regularly
* once per week, perhaps.
* Try to get upstream approval for all new patches in the usual way
* on the understanding that they won't be applied until stage 1
* bug fixes are unaffected and may commit as usual.
* Commit all pending patches to our own 4.6 branch
* and backport them to our 4.5 branch, of course.
* Usual "no test regressions" policy applies to our own patches
* but beware regressions from merges from upstream.
* we may want to track the clean 4.6 test results for comparison
This is little different to what we do with the 4.5 release branch now.
Thoughts?
Andrew
The Linaro Toolchain Working Group is pleased to announce the latest
release of Linaro GCC 4.5.
Linaro GCC 4.5 is the fourth release in the 4.5 series. Based off the
latest GCC 4.5.1+svn164911, it includes many ARM-focused performance
improvements and bug fixes.
Interesting changes include:
* Various NEON related fixes
* Performance improvements
* A clean up of some of the testsuite test cases
* An updated version of the __sync multicore primitives
* Improvements in data packing when optimising for size
* C locale support in libstdc++-v3
This release adds the new option -fstrict-volatile-bitfields and
enables it by default on ARM. See doc/invoke.texi for more
information.
The source tarball is available from:
https://launchpad.net/gcc-linaro/+milestone/4.5-2010.11-0
Downloads are available from the Linaro GCC page on Launchpad:
https://launchpad.net/gcc-linaro
Note that there were no changes to the 4.4 series.
-- Michael
The Linaro Toolchain Working Group is pleased to announce the release
of Linaro GDB 7.2.
Linaro GDB 7.2 2010.11-0 is the second release in the 7.2 series.
Based off the latest GDB 7.2, it includes a number of ARM-focused bug
fixes and enhancements.
This release concentrates on the GDB test suite and tidies up a number
of failures.
The source tarball is available at:
https://launchpad.net/gdb-linaro/+milestone/7.2-2010.11-0
More information on Linaro GDB is available at:
https://launchpad.net/gdb-linaro
-- Michael
Hi,
It looks like it's enough to implement targetm.vectorize.
autovectorize_vector_sizes for NEON in order to enable initial
auto-detection of vector size. With the attached patch and
-mvectorize-with-neon-quad flag, the vectorizer first tries to vectorize
for 128 bit, and if this fails, it tries to vectorize for 64 bit. For
example, in the attached testcase number of iterations is too small for 128
bit (first 2 iterations have to be peeled in order to align the array
accesses), but is sufficient for 64 bit (the accesses are aligned here).
I'd appreciate your comments on the patch, and I also have a few questions:
1. Why is the default vector size 64?
2. What is the right place for NEON vectorization tests? I found NEON
tests with intrinsics in gcc.target/arm; is that the right place?
3. According to gcc.dg/vect/vect.exp, the only flag used for NEON (in
addition to target-independent flags) is -ffast-math. Is that enough?
Thanks,
Ira
ChangeLog:
* config/arm/arm.c (arm_autovectorize_vector_sizes): New
function.
(TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES): Define.
Index: config/arm/arm.c
===================================================================
--- config/arm/arm.c (revision 166032)
+++ config/arm/arm.c (working copy)
@@ -246,6 +246,7 @@ static bool arm_builtin_support_vector_misalignmen
                                                      const_tree type,
                                                      int misalignment,
                                                      bool is_packed);
+static unsigned int arm_autovectorize_vector_sizes (void);
 /* Table of machine attributes.  */
@@ -391,6 +392,9 @@ static const struct default_options arm_option_opt
 #define TARGET_VECTOR_MODE_SUPPORTED_P arm_vector_mode_supported_p
 #undef TARGET_VECTORIZE_PREFERRED_SIMD_MODE
 #define TARGET_VECTORIZE_PREFERRED_SIMD_MODE arm_preferred_simd_mode
+#undef TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES
+#define TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES \
+  arm_autovectorize_vector_sizes
 #undef TARGET_MACHINE_DEPENDENT_REORG
 #define TARGET_MACHINE_DEPENDENT_REORG arm_reorg
@@ -23223,6 +23227,12 @@ arm_expand_sync (enum machine_mode mode,
     }
 }
+static unsigned int
+arm_autovectorize_vector_sizes (void)
+{
+  return TARGET_NEON_VECTORIZE_QUAD ? 16 | 8 : 0;
+}
+
 static bool
 arm_vector_alignment_reachable (const_tree type, bool is_packed)
 {
test:

#define N 5
unsigned int ub[N+2] = {1,1,6,39,12,18,14};
unsigned int uc[N+2] = {2,3,4,11,6,7,1};

void main1 ()
{
  int i;
  unsigned int udiff = 2;
  unsigned int umax = 10;

  for (i = 0; i < N; i++)
    {
      /* Summation.  */
      udiff += (ub[i+2] - uc[i]);
      /* Maximum.  */
      umax = umax < uc[i+2] ? uc[i+2] : umax;
    }
}
Hi there. Just a reminder that today's call is the first at the new
time of 0900 UTC which is 9am in the UK, 10am in Germany, 11am in
Israel, and 5pm in China.
I've updated the meetings page at:
https://wiki.linaro.org/WorkingGroups/ToolChain/Meetings
with the new details.
-- Michael
Hi,
I am backporting some patches from FSF mainline which may improve Linaro
4.5 GCC's speed on Thumb-2.
The first one is by Richard E., "Improve optimization to transform TST
into LSLS":
http://gcc.gnu.org/ml/gcc-patches/2010-06/msg02518.html
After applying it to the Linaro 4.5 tree, the EEMBC speed numbers go down,
while code size is reduced to some extent. The code difference looks like
this:
6801 ldr r1, [r0, #0]
f831 3013 ldrh.w r3, [r1, r3, lsl #1]
-f413 6f00 tst.w r3, #2048 ; 0x800
-f43f af41 beq.w cc <t_run_test+0xcc>
+0518 lsls r0, r3, #20
+f57f af44 bpl.w cc <t_run_test+0xcc>
4610 mov r0, r2
After reading the Cortex-A8 TRM, I can't find exact timing cycles for
lsls. With Chung-Lin's help, we suspect that lsls should be slower than
tst, but we have no evidence to prove it. If anyone is familiar with the
ARM microarchitecture, help is welcome. If our assumption is correct, we
could change this patch to an optimization applied only when optimizing
for size.
The second patch is Bernd's "Fix an if statement in arm_rtx_costs_1":
http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02096.html
After applying this patch, the EEMBC benchmark numbers are unchanged.
Shall we merge this patch into the Linaro 4.5 tree? I am inclined to merge
it, but if you have concerns about this patch, let us discuss them here.
--
Yao Qi
CodeSourcery
yao(a)codesourcery.com
(650) 331-3385 x739
Hi there. I plan to change the Toolchain WG meetings due to daylight
savings and to better cover the US.
The Monday meeting will be at 0900 UTC which is 9 am in the UK, 10 am
in central Europe, and 5 pm in Beijing.
The standup calls will be merged into one at 1800 UTC on Wednesday
which is 6 pm in the UK, 7 pm in central Europe, and 1 pm on the US
East Coast. I don't expect China to call in as it's a quite
unreasonable time.
I'll update the invites and wiki page to reflect this. We'll start
the new times next week, so Monday the 8th will be the first meeting
at the new time.
-- Michael
The goal and plan of the investigation were described in [1].
In the plan, this task is divided into three parts: 1) patch backport,
2) regression fixes, and 3) exploration and study of other ARM compilers.
This report follows the same structure.
1. Patch backport.
8 patches are listed in [1]. Backporting them to the Linaro 4.5 tree
should improve speed.
Action/Recommendation: Backport them if speed improves. These are patches
that I think *should* improve speed, but a "performance surprise" is not
impossible.
2. Regression fix.
So far (up to r99399), Linaro GCC 4.5 is slower than FSF GCC 4.5.0 on
some EEMBC benchmarks. The performance regression was introduced by four
commits - r99324, r99330, r99369 and r99380; see details in [2].
Action/Recommendation: Figure out why the speed regression was
introduced, and try to fix it.
My two cents here concern how to avoid speed regressions in general. I do
believe that regressions are sometimes unavoidable, but it is better if
we can track them and keep them manageable.
3. Exploration and study of other ARM compilers.
In this part, I didn't find any possible Thumb-2-specific improvements.
However, loop optimization and instruction scheduling should be improved
on ARM. (This statement may be true of all ports, or even all compilers.)
Some tickets have been opened for this part:
LP:660644 Missed optimization opportunities
LP:662692 Inner loop in autcor00 can be optimized better
LP:656957 LP:645267 Improve code generation on switch statement
LP:663793 Tune Swing Modulo Scheduling or Selective Scheduling for ARM
LP:656373 Try -fsched-pressure for ARM
I have to admit that instruction scheduling is quite hard, but if we can
do something here, that would be great. I've put it in the
"performance-inside-gcc" session at UDS; let's talk about it a little
there next week.
During this investigation, I also found that LTO or "whole-program
optimization" would be useful for some EEMBC benchmarks. (I didn't run
LTO/WPO at all; I got this from reading the benchmark sources.)
[1] Plan of CS304: Thumb2 tuning investigation.
http://lists.linaro.org/pipermail/linaro-toolchain/2010-October/000300.html
[2] https://wiki.linaro.org/YaoQi/Sandbox/Thumb2SpeedOptimization
--
Yao Qi
CodeSourcery
yao(a)codesourcery.com
(650) 331-3385 x739
Hi there. I've updated the list of potential Summit sessions based on
yesterday's call. Could people please check the Sessions table on
https://wiki.linaro.org/WorkingGroups/ToolChain/Meetings/2010-10-18
and flesh out the agenda for sessions that have your name against them.
The agenda should be five to ten discussion points, preferably of
things that are not well understood and could use input from the
group.
There's a good discussion on what to expect at a Summit here:
http://oubiwann.blogspot.com/2010/10/q-the-ubuntu-developer-summits.html
You can check the already-approved sessions here:
http://summit.ubuntu.com/uds-n/
Feel free to join in to any other sessions you might find interesting.
There will be quite a few people with diverse backgrounds there,
including ~80 people from Linaro, ~400 from Ubuntu, ~200 from the
community, and ~200 remote. The overlap between Toolchain and Ubuntu
interests might not be great, so I'll make sure a common work room is
available for idle time.
-- Michael
Some Linaro developers work with ARM devices older than the ARMv7-A
architecture. Other people experiment with the hard-float ABI. Each of
them has to rebuild the toolchain for their own use, and that means
playing with the components to get them to build properly.
But no more - I made some patches, and armel-cross-toolchain-base since
version 1.53, plus newer source packages for gcc-4.[45]-armel-cross, now
support a "debian/flavour" file which allows setting some flags related
to the toolchain build.
So far supported things are:
- ARM architecture
- float ABI
- FPU mode
- Thumb mode
This feature is not merged into regular Ubuntu packages yet as this is work in
progress which needs to be cleaned first.
http://people.linaro.org/~hrw/armel-cross-toolchain/ has all source packages
needed.
Regards,
--
JID: hrw(a)jabber.org
Website: http://marcin.juszkiewicz.com.pl/
LinkedIn: http://www.linkedin.com/in/marcinjuszkiewicz
I meant to send this to the "external" Linaro toolchain mailing list,
not the internal CS one. Apologies to those who receive it twice!
In a follow-up message, Joseph Myers pointed out a post he'd written
previously on the same subject:
http://gcc.gnu.org/ml/gcc-patches/2010-06/msg00409.html
In further followups (at the risk of misrepresenting Joseph & Paul
Brook's opinions!), there seemed to be general agreement that a scheme
like the one outlined below, with "permuting" loads/stores and some way
of handling multiple in-register layouts for vectors, will be a necessary
addition to the vectorizer going forward.
Julian
Begin forwarded message:
Date: Thu, 7 Oct 2010 16:45:17 +0100
From: Julian Brown <julian(a)codesourcery.com>
To: Ira Rosen <IRAR(a)il.ibm.com>
Cc: Tejas Belagod <Tejas.Belagod(a)arm.com>, Linaro List
<gnu-linaro-tools(a)codesourcery.com> Subject: [gnu-linaro-tools] NEON
vectorization: use of specialized load/store instructions
Hi,
We're having some system issues, so I thought I'd take the chance to
write down some things I've been thinking about re: utilising the NEON
load/store instructions more effectively. I've also attempted to
summarize the problems with big-endian mode. All unverified as of yet,
so please take with a pinch of salt :-). Comments appreciated. It's
been a while since I last thought about some of this stuff...
Cheers,
Julian
Use of specialized load instructions
====================================
To provide good support for NEON's element and structure load/store
instructions, GCC lacks support for a couple of key features:
1. A good way of representing a set of two, three or four vector
registers (either D- or Q-sized), possibly with non-unit stride.
2. A generalised mapping between memory locations and lane numbers.
To start with point 1: currently the element and structure load/store
instructions are only supported via intrinsics. These are specified to
load and store as if going via an array embedded in a union, i.e.:
typedef struct int8x8x2_t
{
  int8x8_t val[2];
} int8x8x2_t;

__extension__ static __inline int8x8x2_t __attribute__ ((__always_inline__))
vld2_s8 (const int8_t * __a)
{
  union { int8x8x2_t __i; __builtin_neon_ti __o; } __rv;
  __rv.__o = __builtin_neon_vld2v8qi ((const __builtin_neon_qi *) __a);
  return __rv.__i;
}
Even for a trivial test program, e.g.:
#include <arm_neon.h>

int foo (int8_t *x)
{
  int8x8x2_t result = vld2_s8 (x);
  return vget_lane_s8 (result.val[0], 1);
}
We will generate code like so:
sub sp, sp, #32
vld2.8 {d16-d17}, [r0]
mov r3, sp
vstmia sp, {d16-d17}
add ip, sp, #16
ldmia r3, {r0, r1, r2, r3}
stmia ip, {r0, r1, r2, r3}
fldd d16, [sp, #16]
vmov.s8 r0, d16[1]
add sp, sp, #32
bx lr
I.e., rather than being used directly, the registers loaded by vld2
will always be spilled to the stack then reloaded. This obviously
reduces the usefulness of these intrinsics by a large factor. With some
planning, it'd be good to find a powerful enough solution to this
problem so that the same representation for multiple registers can be
used by the autovectorizer as well as the intrinsic-handling code.
(One difficulty is that the "foo.val[X]" interface should still be
available to user code. There's probably no need for "val" to literally
be an array, though other representations would require front-end
changes).
Assuming it's hard for the register allocator to deal with
highly-constrained situations like requiring four consecutive
registers, one (ugly) possibility might be to run a pass before
register allocation, looking for "big" multi-register vectors and
pre-allocating them to hard registers. Even using a fixed allocation of
a single set of registers (e.g. make it so that all multi-reg
loads/stores larger than a Q register must use d0-d7, or whatever)
would probably give better code than what we produce at present, in
most cases.
Now, point 2. To start with, an aside: AIUI, there is currently an
assumption in the vectoriser code that increasing element numbers in
vector registers correspond to increasing addresses when those
registers are loaded from and stored to memory (as if the vector was a
short array, or alternatively as if a union of the vector register and
an array of element-types had the same numberings for lanes and array
indices corresponding to the same elements). Unfortunately that is only
true for NEON in little-endian mode: in big-endian mode, the story is
more complicated, for reasons I will try to explain.
To remain compliant with the soft-float variant of the ARM EABI, we
must pass vector register arguments in ARM registers (or the stack),
not vector registers. This means that we must be very careful with the
ordering of elements for values passed to functions. Consider the
trivial function:
int __attribute__((noinline)) qux (int16x8_t x)
{
  x = vaddq_s16 (x, x);
  return vgetq_lane_s16 (x, 1);
}
This is compiled by GCC to the following (slightly unimpressively):
vmov d18, r1, r0 @ v8hi
vmov d19, r3, r2
vmov d20, r1, r0 @ v8hi
vmov d21, r3, r2
vadd.i16 q8, q9, q10
vmov.s16 r0, d16[1]
bx lr
Which may then be called like, e.g.:
ldmia sp, {r0-r3}
blx qux
So: notice that we're careful that when vector values are transferred
from NEON registers to core registers, the same result will be
transferred to/from memory when we use ldm/stm (core registers) or
vldm/vstm (vector registers) -- i.e. we might use "vldm rX, {d18-d19}",
storing d18 and d19 in consecutive increasing addresses, or "ldmia rX,
{r0-r3}", again with consecutive registers in increasing memory
locations, and we get the same outcome. The fact that we can use the
multiple-register loads/stores is also important for spilling/reloading
between vector and core registers, which inevitably happens
occasionally.
Notice also that when we call the above function like so:
typedef union {
  int16x8_t quadvec;
  int16_t half[8];
} u;

int foo (int8_t *x)
{
  u bar;
  int i;

  for (i = 0; i < 8; i++)
    bar.half[i] = i;

  qux (bar.quadvec);
}
The value returned from "qux" is NOT 2 (1+1), as it would be if we were
accessing the value at index 1 in the superimposed array in the union
"u". The vgetq_lane_s16 call still interprets the array as if it had
been loaded in little-endian element order. But we don't get the result
we would have if the vector had been interpreted in purely big-endian
order either (i.e. 12, from 6+6)! In fact, from the perspective of the
element numbering used by vgetq_lane_s16, the vector elements we see
for each of the (equal) operands of the "vadd" instruction in the qux
function are:
               equiv. core register
  lane number  (at function entry)    value
  -----------  --------------------   -----
  [0]          high part of r1        3
  [1]          low part of r1         2
  [2]          high part of r0        1
  [3]          low part of r0         0
  [4]          high part of r3        7
  [5]          low part of r3         6
  [6]          high part of r2        5
  [7]          low part of r2         4
So the value returned will be 2+2 = 4.
Now, coming back to the vectorizer. Current practice means that
increasing element numbers should correspond to increasing memory
locations: i.e., that "array ordering" is in effect, just as in the
call to vgetq_lane_s16 in the above example. This leads to an anomaly:
it means that when the vectorizer asks for a particular element, it
will generally get a different one. Most of the time we get away with
this, since the vectorizer mostly deals with "opaque" vectors which are
operated on element-wise: i.e. we only deal with data at the
granularity of whole vectors, so it doesn't matter which order the
elements are in. The ARM implementations of reduction operations
fortuitously calculate the results across all elements simultaneously,
so when one of those elements is extracted, we still get the right
answer.
One notable exception to this though is the movmisalign<mode> patterns:
these are implemented using the vld1 and vst1 instructions, which load
elements in "array" order (increasing elements from increasing memory
locations), even in big-endian mode. Since vectors loaded using those
instructions are "incompatible" with the above scheme, such misaligned
accesses are simply disabled in big-endian mode.
Of course, generally, sticking with the current non-solution in
big-endian mode is not sustainable (and is probably already broken in
various cases). So it might be worth thinking about whether supporting
big-endian mode properly, as well as handling the more complex load and
store element/structure instructions, can be done using some
generalised solution.
I'm thinking (without having much idea about how feasible such an idea
is) of something along the lines of a function (in the mathematical
sense) attached to each vector value manipulated by the vectorizer, to
map that value's element numberings to and from memory offsets. So then
the quad-word vector of 16-bit elements discussed above would look
like, in big-endian mode:
foo, {6, 4, 2, 0, 14, 12, 10, 8}
Whereas in little-endian mode (or in big-endian mode, for vectors
loaded using vld1), it would look like:
foo, {0, 2, 4, 6, 8, 10, 12, 14}
And then, perhaps more interestingly, a vector loaded using e.g. a
"multiple 3-element structures" load,
vld3.16 {d1, d2, d3}, [rN]
Might look like (in either endianness, assuming we can represent a
vector of such size in our hypothetical scheme):
foo, {0, 6, 12, 18, 2, 8, 14, 20, 4, 10, 16, 22}
Though it's not clear that such a scheme would be powerful enough to
represent the whole range of element/structure loads/stores available
(you'd probably need to be able to specify skipped or don't-care
elements to do that, at least).
First of all, the goal of this work is to investigate speed improvements
for Linaro GCC 4.5. The output of this work is a list of all possible
recommendations/actions to improve speed on Linaro 4.5. Comments on this
plan are welcome.
So far, we can improve speed in three ways:
1. Backport patches from FSF GCC 4.6. Note that we don't want to
backport the whole of 4.6.
2. Benchmark against FSF GCC 4.5.0, and fix any performance regressions
in Linaro GCC 4.5. The output is the reason for each performance
regression, or, going further, recommendations on how to fix it.
3. Study the code generated by other ARM compilers, and give
recommendations on how to improve GCC to do a better job.
I'll describe these three ways in detail in the following sections.
- Backport patches from FSF GCC 4.6
I went through the gcc-patches archive and selected several patches that
are helpful for code improvement.
1. ifcvt optimization. Target-independent.
http://gcc.gnu.org/ml/gcc-patches/2010-04/msg00832.html
2. Redundant register move for sign extension. Thumb-2.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43137
3. PR 45335 Use ldrd and strd to access two consecutive words.
Not yet approved.
http://gcc.gnu.org/ml/gcc-patches/2010-09/msg00059.html
4. Fix an if statement in arm_rtx_costs_1.
http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02096.html
5. Reduce code duplication for Thumb2 move patterns
http://gcc.gnu.org/ml/gcc-patches/2010-07/msg00624.html
6. ARM ldm/stm peepholes
http://gcc.gnu.org/ml/gcc-patches/2010-07/msg00512.html
7. PR44999 Replace "and r0, r0, #255" with uxtb in thumb2
http://gcc.gnu.org/ml/gcc-patches/2010-07/msg01700.html
8. Improve optimization to transform TST into LSLS
http://gcc.gnu.org/ml/gcc-patches/2010-06/msg02518.html
9. Fix bswap patterns for ARM / Thumb and Thumb2.
http://gcc.gnu.org/ml/gcc-patches/2010-01/msg01238.html
- Fix speed regression
I found speed regressions on EEMBC with Linaro 4.5 compared with FSF GCC
4.5.0, and I'll investigate why they happen in these cases. Here is a
table of the speed regressions between FSF GCC 4.5.0 and Linaro GCC 4.5
(revno: 99398):
benchmark          O2     O3
puwmod01          -5.5   -3.5
bitmnp01          -7.9   -0.7
routelookup       -6.4   -8.2
conven00data_1    -7.2   -5.8
conven00data_2    -8.1   -7.3
conven00data_3    -6.6   -5.5
viterb00data_1    -1.7   +5.9
viterb00data_2    -4.3   +2.6
viterb00data_3    -2.3   +1.8
viterb00data_4    -5.3   -0.3
- Study the code generated by other ARM compilers.
In this part, I'll study the binaries generated by other ARM compilers,
and try to teach GCC to be smart enough to do the same thing. This piece
of work is quite open-ended, and it is hard to estimate how much output
we will get.
--
Yao Qi
CodeSourcery
yao(a)codesourcery.com
(650) 331-3385 x739
People here might want to have a look at this bug:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45979
Note that the heap randomization feature added to the kernel was part of
a Linaro security blueprint.
Nicolas