Hi there. Some of the tr-* blueprints had work items in them and this
was interfering with the tools that the PM guys use. I've created new
engineering blueprints, pulled the work items across into them, and
added the new engineering blueprint as a dependency of the old TR.
Sorry for the blueprint spam. In most cases the new blueprint has the
same name and subject as the TR one, such as the TR:
https://blueprints.launchpad.net/linaro/+spec/tr-toolchain-4.5-in-distros
which is backed by the engineering blueprint:
https://blueprints.launchpad.net/gcc-linaro/+spec/4.5-in-distros
-- Michael
Hi Richard,
Recapping on this earlier conversation:
http://lists.linaro.org/pipermail/linaro-toolchain/2010-July/000030.htmlhttp://lists.linaro.org/pipermail/linaro-toolchain/2010-July/000035.html
Is it worth another attempt to make a case to upstream for supporting
passing -mimplicit-it=thumb by default to gas?
According to my understanding of this issue, my argument would go as follows:
* gcc currently estimates the size of asm blocks, rather than
determining the size accurately.
* gcc cannot guarantee the right answer for asm block size when asm
blocks contain directives etc., however use of directives in asm
blocks is widespread
* gcc cannot guarantee the right answer for asm block size in
Thumb-2. gcc conservatively overestimates the size by assuming that
each statement of the asm block expands to 4 bytes.
* All of Ubuntu lucid and maverick has been built with
-mimplicit-it=thumb passwd by default, with no known build or runtime
failures arising from this (so size issues aside, we have confidence
that the resulting code generation is sound)
* -mimplicit-it=thumb -mthumb makes the asm block size estimation
unsafe: the asm block can exceed the estimated size even in the
absence of directives, which may lead to fixup range errors during
assembly.
* Following the principles already established for Thumb-2 in
general the estimation can be made safe (or, as safe as the
established Thumb-2 behaviour) by raising the assumed maximum
statement expansion size for asm blocks to 6 bytes, since
-mimplicit-it will add as most a single (16-bit) IT instruction to
each statement.
* The vast majority of all asm blocks are small (< 20 instructions,
say), so the overall overestimate in sizes will generally be modest
for any given compilation unit.
* -mimplicit-it is already _required_ by the Linux kernel and
possible other projects.
...so...
* With -mimplicit-it=thumb and a 6-byte asm block statement
expansion size estimate, we have toolchain behaviour which is as
reliable, and as correct, as it is in upstream at present.
* Layout of data in the compiler output will be more optimal in some
cases, and less optimal in other cases, compared with the the current
Thumb-2 behaviour, due to differing asm block size estimates. The
exact behaviour will depend on the distribution of conditional
instructions within asm blocks.
* Taken over a whole compilation unit, the total code size
overestimate (and therefore the impact on object layout) will normally
be modest, due to the small typical size of asm blocks.
* Behaviour for -marm will not be impacted at all.
If gcc currently estimated asm block code size accurately, then I
could understand upstream's objection; but as it stands it seems to me
we wouldn't be making anything worse in practice with the proposed
change; and there is no compatibility impact (other than positive
impact).
Of course, I may have some wrong assumptions here, or there may be
some background I'm not aware of...
Comments?
Cheers
---Dave
== Linaro GCC ==
* Finish testing for big-endian/quad-word patch on mainline, and
send upstream. Not yet reviewed by an ARM maintainer, but Joseph
suggested tweaking DejaGnu's target-supports to better reflect the new
capabilities of the vectorizer in big-endian mode. I've not looked into
that yet.
* Started looking at improving element/structure load/store intrinsics.
Made it so that the structs used for loads/stores are created in the
backend so that the types can be used directly by the builtins, but
discovered that the front-end/middle-end would not play along with that
plan as they are. Thought about ways to fix that.
* Some time spent on other CodeSourcery stuff.
The Linaro Toolchain Working Group is pleased to announce the release
of both Linaro GCC 4.4 and Linaro GCC 4.5.
Linaro GCC 4.5 is the fifth release in the 4.5 series. Based off the
latest GCC 4.5.1+svn167157, it includes many ARM-focused performance
improvements and bug fixes.
Linaro GCC 4.4 is the fifth release in the 4.4 series. Based off the
latest GCC 4.4.5, it is a maintenance release that fixes one problem
found through use.
Interesting changes include:
* A new performance focused extension elimination pass
* Speed and size improvements when loading constants
* Performance improvements on compound conditionals
* A range of correctness improvements
The source tarballs are available from:
https://launchpad.net/gcc-linaro/+milestone/4.5-2010.12-0
and
https://launchpad.net/gcc-linaro/+milestone/4.4-2010.12-0
Downloads are available from the Linaro GCC page on Launchpad:
https://launchpad.net/gcc-linaro
No changes have been committed to Linaro GDB 7.2 this month.
-- Michael
* Linaro GCC 4.4/4.5
Merged the latest CS patches and Linaro merge requests into Linaro GCC
(4.4 and 4.5). Ran regression tests. Yao's patch failed so I backed it
out, and made the release tarballs. Uploaded the releases to Michael
Hope for release.
lp:686381: luatex fails to build with gcc-4.5
Fired off a test build to reproduce the problem. Will come back to this
next week.
* GCC 4.6/4.7
Posted my various queued patches up to gcc-patches(a)gcc.gnu.org for review.
Looked at the state of the GCC 4.6 upstream build. There are currently
two problems:
1. libquadmath must be disabled in a cross-build for the bootstrap phases.
2. libstdc++ doesn't build. There is a patch for it on the mail list,
but it's not applied yet.
Once gcc 4.6 builds cleanly, I shall update the Launchpad 4.6 branch,
and declare that the baseline for our development. We'll then have
somewhere to commit and track patches awaiting GCC stage 1 development.
* Other
Caught up with email following my holiday.
Yet again, my IGEPv2 board suffered a corrupt file system. I've now
upgraded the kernel and configured it to use an NFS root. The board is
now somewhat less mobile, but should work more reliably.
Continued organizing the a brain-storming session for GCC optimization
improvements.
Organised flights and hotel for both the Linaro Sprint and CodeSourcery
annual meeting in January.
== Last Week ==
* Spent time tracking down a couple of regressions that appeared in the
new ltrace release-candidate tree. Submitted a bunch of patches to fix
the issues that were discovered during that process; most have been applied.
* Finished writing fairly generic code for handling ARM-specific unwind
tables, from lookup through decoding and dispatch. It uses a few
definitions specific to libunwind, but those probably could be
eliminated with more work.
== This Week ==
* Integrate new ARM-specific bits into libunwind framework.
* Rewrite part of a portability patch for ltrace and hope that those
changes reflect the very last effort that will be required for that
particular task.
--
Zach Welch
CodeSourcery
zwelch(a)codesourcery.com
(650) 331-3385 x743
On Fri, Dec 3, 2010 at 9:06 AM, Yao Qi <yao.qi(a)linaro.org> wrote:
> Hi, Kernel WG,
> Can recent kernel handle NEON registers in corefiles?
>
> Seems we've had plan for this in "Ensure full NEON debug support" in
> https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Specs/BSPInvestig…
> Any progress on this piece of work? We want to handle NEON registers in
> corefiles from GDB, which required kernel dump them in corefile first.
Hmmm, actually that bullet may have ended up in the wrong place ...
since it's not a BSP-specific feature.
Anyway, looking at the kernel code, it looks like the VFP/NEON state
is not dumped into the core file. If it makes you feel better, the
state of the obsolete FPE extension registers is dumped, if used :/
My guess is that it shouldn't be hard to dump the VFP/NEON state, but
GDB and the kernel need to agree on the format.
Rather that trying to hack the existing register dump format in a
compatible way, I suggest it's simplest if the kernel creates an extra
section in the dump containing something like:
.long format_version /* reserved for future expansion - must be 0 */
.long FPSID
.long FPSCR
.long MVFR0 /* or 0 if not present in the hardware */
.long MVFR1 /* or 0 if not present in the hardware */
.long d0
.long d1
/* ... d2-d14 ... */
.long d15
If 32 D-registers in the hardware [
.long d16
.long d17
/* ... d18-d30 ... */
.long d31
]
I believe we don't need any extra flags to indicate whether the MVFRx
fields are valid, since 0 in these registers indicates the
VFPv2/legacy behaviour anyway. Note that some VFPv2 implementations
(such as ARM1176) do provide these registers, and where the hardware
has them, the kernel can fill them in when doing the coredump.
We _should_ be prepared to ignore these fields (or interpret them
differently) if a vendor-specific VFP subarchitecture is specified (by
(FPSID & 0x4000) == 0x4000)
The number of D-registers can be deduced from the FPSID and MVFRx
registers, so we don't need to record it explicitly.
When MVFRx are not present, there are 16 D-registers.
When MVFRx are present, and (MVFR0 & 0xF) >= 2, there are 32 (or more)
D-registers
This is just a sketch -- the ARM ARM is the authoritative reference on
the meanings of these bitfields.
Any views on this?
Cheers
---Dave
>
> _______________________________________________
> linaro-dev mailing list
> linaro-dev(a)lists.linaro.org
> http://lists.linaro.org/mailman/listinfo/linaro-dev
>
== GCC ==
* Tracked down root cause of GCC mainline bootstrap failure
on ARM (PR 46040 - "__DTOR_LIST__ undeclared").
== Miscellaneous ==
* Set up IGEP v2 board (w/ local disk, network, ...) as
native GCC / GDB build environment.
* Gave talk on Linaro at the "2010 Linux Community Event" at
Siemens Munich (w/ Arnd Bergmann).
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
--
Dr. Ulrich Weigand | Phone: +49-7031/16-3727
STSM, GNU compiler and toolchain for Linux on System z and Cell/B.E.
IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Martin Jetter | Geschäftsführung: Dirk
Wittkopp
Sitz der Gesellschaft: Böblingen | Registergericht: Amtsgericht
Stuttgart, HRB 243294
You have been invited to the following event.
Title: TWG GCC Optimization
Discuss ideas for improving GCC optimization for ARM. Open to anybody who
wants to contribute.
https://wiki.linaro.org/AndrewStubbs/Sandbox/GCCoptimizations
International: +44 1452 567 588
UK: 0844 493 3801
Brazil: 08008912092
China: 108007121533
India: 0008001006354
Taiwan: 00801126472
United States: 18666161738
Conference code 2634417169#
When: Wed 2010-12-15 9am – 10am London
Where: Conference call code 263 441 7169
Calendar: linaro-toolchain(a)lists.linaro.org
Who:
* andrew.stubbs(a)linaro.org - creator
* Ken Werner - optional
* stevenb.gcc(a)gmail.com - optional
* Michael Hope - optional
* David Gilbert - optional
* Peter Maydell - optional
* ulrich.weigand(a)de.ibm.com - optional
* Yao Qi - optional
* Julian Brown - optional
* paul(a)codesourcery.com - optional
* Ira Rosen - optional
* richard.earnshaw(a)arm.com - optional
* Chung-Lin Tang - optional
* mark(a)codesourcery.com - optional
* linaro-toolchain(a)lists.linaro.org - optional
* Richard Sandiford - optional
* Marcin Juszkiewicz - optional
* zwelch(a)codesourcery.com - optional
Your attendance is optional.
Event details:
https://www.google.com/calendar/event?action=VIEW&eid=aXJlc2Z2bWZmZzQ5cGQ4b…
Invitation from Google Calendar: https://www.google.com/calendar/
You are receiving this courtesy email at the account
linaro-toolchain(a)lists.linaro.org because you are an attendee of this event.
To stop receiving future notifications for this event, decline this event.
Alternatively you can sign up for a Google account at
https://www.google.com/calendar/ and control your notification settings for
your entire calendar.
== GCC related ==
* PR44557, Thumb-1 ICE: found two small needed corrections in the ARM
backend to fix this. Sent patch upstream.
* LP:641397/PR46888: bitfield insert optimization. Posted patch along
with Andrew Stubbs' CSE patch upstream. The two-patch situation seemed
to stir some discussion :) It seems both patches were deemed okay,
though it would be better if there was some testcase that passed through
unhandled by Andrew's CSE patch, but processed by my combine fix later
in the pass pipeline. Both patches queued at GCC bugzilla page, to be
handled in the next stage1.
* LP:687406/PR46865, -save-temps creating different code. Analyzed
problem, though slow to send fix upstream. Had some discussion on the
list on how the fix should be like.
* PR46667, section type conflicts. Tested and sent mail to gcc-patches.
Jan Hubicka picked it up and pinged for an approval again. Hope this get
resolved soon.
* PR45416, ARM code regression. ARM considerations are mostly okay, main
issue remaining is how to solve the x86 regression. The current expand
code does a full DImode shift just to obtain a single bit, might be
point of improvement to solve this.
* LP:685534, ftbfs with gcc-linaro 4.5 on amd64. Found to be another
erroneous inline asm case. Fixed and updated on LP.
== This week ==
* More upstream and Linaro GCC issues.
* Start dealing with January travel.
* Think more about larger (Linaro) GCC optimization projects.
== Linaro GDB ==
* LP:616000 Handle -fstack-protector prologue code
Understand how frame affect expression validation in GDB. Improve i386
prologue parsing to handle 'and/add' sequence. Revise i386 prologue
parsing for stack protector. Patch is not submitted since still lack of
i386 prologue knowledge, and not very confident on that patch.
Understand prologue-value used in ARM prologue parsing and relationship
of symbol and frame in GDB. Make ARM prologue parsing understands stack
protector code by identify the code sequence. Improve the patch by
supporting ARM mode and ARMv5T, in order to make this patch accepted by
upstreams.
* LP:615972 gdb.base/gcore.exp failure.
The failure is about "corefile restored general registers", which is not
related to NEON register support in corefile. It is caused by
inconsistent register types and names between tdesc and arm. The cause
of this failure is found, but upstreams reviewer doesn't agree on one of
my fix. As he suggested, arm-core.xml is modified to add "type=XXX",
but get some errors when regenerate arm-with-iwmmxt.c. Filed GDB PR 12308.
== GCC ==
* register rename improvements (LP:633243)
Finally, got both middle-end part and ARM part approved. Committed to
upstreams mainline. Some benchmarks in EEMBC shows 0.1%~0.2% code size
reduction.
== This Week ==
* Ping my GDB patches.
* Fix GDB PR 12308, which blocks my fix for LP:615972.
* Backport my approved patches to Linaro GDB if any. Fix other GDB bugs.
* Pass one GCC patch in my queue to review, if I have extra time.
== Vacation ==
Take vacation on Dec 30th and 31st. Travel to ChengDu, and back on 3rd
Jan (It is public holiday in China). Back to work on 4th Jan.
--
Yao (齐尧)
== This week ==
* Away Monday and Tuesday.
* Very little the rest of the week due to other IBM commitments.
I've just finished the main part of that work, so all being well,
it should only need a bit of nannying next week. Most of the week
should be Linaro.
* Started trying to reproduce #641126, but realised that I'd need
to set myself up for general Ubuntu cross package building first.
Started to look at what's involved.
== Next week ==
* Get stuck into #641126.
* More STT_GNU_IFUNC and vectors.
Richard
RAG:
Red:
Amber:
Green:
Milestones:
| Planned | Estimate | Actual |
finish virtio-system | 2010-08-27 | postponed | |
get valgrind into linaro PPA | 2010-09-15 | 2010-09-28 | 2010-09-28 |
complete a qemu-maemo update | 2010-09-24 | 2010-09-22 | 2010-09-22 |
finish testing PCI patches | 2010-10-01 | 2010-10-22 | 2010-10-18 |
Progress:
* merge-correctness-fixes:
** I have sent out an updated ARM fixes pull request, all
of whose components have been Reviewed-by: Nathan Froyd;
I expect this to be merged shortly.
** vqshl(reg) patch posted to list; I have an update which
also addresses vqshl{,u}(imm) which I'll send out as v2
once the first part has been reviewed.
** reviewed and retransmitted Wolfgang's semihosting commandline
patches since he is having trouble sending unmangled mail to
the list
** went through the monster qemu-maemo commit "Lots of ARM TCG changes"
http://meego.gitorious.org/qemu-maemo/qemu/commit/3f17d4e1cb
identifying what fixes it includes
** started looking at a VRSQRTS patch. This uncovered a number
of qemu issues: NaN propagation is wrong, flush-to-zero handling
is only flushing output denormals, not input denormals, and we
don't handle the Neon "standard FPSCR value" but always use the
real FPSCR. I have some preliminary patches for at least some
of this, but since they affect a number of the same bits of
code that are touched by existing not-yet-committed patches I'm
waiting for those to be committed first.
* verify-emulation:
** wrote a README for risu and made it public at
http://git.linaro.org/gitweb?p=people/pmaydell/risu.git;a=summary
* maintain-beagle-models:
** the ubuntu maverick netbook image doesn't boot on qemu because
it uses the OMAP NAND prefetch/DMA, which isn't modelled
https://bugs.launchpad.net/qemu-maemo/+bug/645311
I've started on this and am perhaps halfway through (basic
prefetch code implemented, but DMA and debugging still to go)
* other:
** took part in an OBS mini-sprint where we were walked through
how the OBS buildsystem works and can be used to do test rebuilds
of Meego with new versions of the toolchain.
Meetings: toolchain, pdsw-tools
Plans:
* finish omap NAND prefetch engine work
* make sure ARM changes get committed to qemu...
Absences: (complete to end of 2010)
Fri 17 Dec - Tue 4 Jan inclusive.
2011: Dallas Linaro sprint 9-15 Jan. Holiday 22 Apr - 2 May.
Hi,
* created custom kernel deb packages from the linaro-linux tree in order to
* test the various ftrace tracers and profilers available on ARM
* results at: https://wiki.linaro.org/KenWerner/Sandbox/ftrace
* started to look into crash (kexec, kdump) but wasn't able to generate a
kernel dump yet
Regards
Ken
Dear All
Our team in Samsung collected some performance metrics for the following 3 GCC cross compilers
1.. Gentoo Complier(part of Chrome OS Build Environment)
2.. GCC 4.4.1 (Code Sourcery).
3.. Linaro (gcc-linaro-4.5-2010.11-1)
Flags used to Build Linaro Tool chain
used Michael Hope Script .Just modified "GCCFLAGS = --with-mode=thumb --with-arch=armv7-a --with-float=softfp --with-fpu=neon --with-fpu=vfpv3-d16"
a.. Using the above three tool chains we compiled the kernel of Chrome OS and did Coremark Performance test.(With same optimisation flag mentioned in the attachment)
b.. Test Environment for all the three are the same.
My Questions
1.. Is there any build options that I am missing while I am building the Cross Compiler?
2.. Else is this performance degradation is a know issue and is the tool chain group working on it?.(If so whom to contact?)
Any Pointers from you would be of great help to me.
If you need any further details also do ping me
Regards
Prashanth S
Hi All,
As we discussed on Monday, I think it might be helpful to get a number
of knowledgeable people together on a call to discuss GCC optimization
opportunities.
So, I'd like to get some idea of who would like to attend, and we'll try
to find a slot we can all make. I'm on vacation next week, so I expect
it'll be in two or three week's time.
Before we get there, I'd like to have a list of ideas to discuss. Partly
so that we don't forget anything, and partly so that people can have a
think about them before the day.
I'm really looking for bigger picture stuff, rather than individual poor
code generation bugs.
So here's a few to kick off:
* Costs tuning.
- GCC 4.6 has a new costs model, but are we making full use of it?
- What about optimizing for size?
- Do the optimizers take any notice? [1]
* Instruction set coverage.
- Are there any ARM/Thumb2 instructions that we are not taking
advantage of? [2]
- Do we test that we use the instructions we do have? [3]
* Constant pools
- it might be a very handy space optimization to have small
functions share one constant pool, but the way the passes work one
function at a time makes this hard. (LP:625233)
* NEON
- There's already a lot of work going on here, and I don't want it
to hog all our time, but it might be worth touching on.
What else? I'm not the most experienced person with GCC internals, and
I'm relatively new to the ARM specific parts of those, so somebody else
must be able to come up with something far more exciting!
So, please, get brain-storming!
Andrew
[1] We discovered recently that combine is happy to take two insns and
combine them into a pattern that matches a splitter that then explodes
into three insns (partly due to being no longer able to generate
pseudo-registers).
[2] For example, I just wrote a patch to add addw and subw support (not
yet submitted).
[3] LP:643479 is an example of a case where we don't.
Mostly more working with libffi; swapping some ideas back and forwards with
Marcus Shawcroft and it looks like we have
a good way forward.
Got an armhf chroot going, libffi built.
Got a testcase failing as expected.
Trying to look at other processors ABIs to understand why varargs works for
anyone else.
Cut through one layer of red tape; can now do the next level of comparison
in the string routine work.
Started looking at SPEC; hit problems with network stability on VExpress
(turns out to be bug 673820)
long long weekend; short weeks=2;
Back in on Tuesday.
Dave
Hi,
Those of you use silverbell may be glad to know it's back up.
Be a little careful, if you shovel large amounts of stuff over it's network
the network tends to disappear.
(Not sure if this is hardware or driver)
Dave
Hi,
As mentioned on the standup, I just got an armhf chroot going, thanks to
markos for pointing me at using multistrap
I put the following in a armhfmultistrap.conf and did
multistrap -f armhfmultistrap.conf
Once that's done, chroot in and then do
dpkg --configure -a
it's pretty sparse in there, but it's enough to get going.
Dave
==============================================
[General]
arch=armhf
directory=/discs/more/armhf
cleanup=true
noauth=true
unpack=true
explicitsuite=false
aptsources=unstable unreleased
bootstrap=unstable unreleased
[unstable]
packages=
source=http://ftp.de.debian.org/debian-ports/
keyring=debian-archive-keyring
suite=unstable
omitdebsrc=true
[unreleased]
packages=
source=http://ftp.de.debian.org/debian-ports/
keyring=debian-archive-keyring
suite=unreleased
omitdebsrc=true
Hi. As part of my work on qemu I've written a simplistic random instruction
sequence generator and test harness. To quote the README:
risu is a tool intended to assist in testing the implementation of
models of the ARM architecture such as qemu and valgrind. In particular
it restricts itself to considering the parts of the architecture
visible from Linux userspace, so it can be used to test programs
which only implement userspace, like valgrind and qemu's linux-user
mode.
I don't particularly expect this tool to be of much general interest outside
people developing either qemu or valgrind or similar models, but I have
in any case made it publicly available now:
http://git.linaro.org/gitweb?p=people/pmaydell/risu.git;a=tree
-- PMM
Hi all,
I'd be interested in people's views on the following idea-- feel free
to ignore if it doesn't interest you.
For power-management purposes, it's useful to be able to turn off
functional blocks on the SoC.
For on-SoC peripherals, this can be managed through the driver
framework in the kernel, but for functional blocks of the CPU itself
which are used by instruction set extensions, such as NEON or other
media accelerators, it would be interesting if processes could adapt
to these units appearing and disappearing at runtime. This would mean
that user processes would need to select dynamically between different
implementations of accelerated functionality at runtime.
This allows for more active power management of such functional
blocks: if the CPU is not fully loaded, you can turn them off -- the
kernel can spot when there is significant idle time and do this. If
the CPU becomes fully loaded, applications which have soft-realtime
constraints can notice this and switch to their accelerated code
(which will cause the kernel to switch the functional unit(s) on).
Or, the kernel can react to increasing CPU load by speculatively turn
it on instead. This is analogous to the behaviour of other power
governors in the system. Non-aware applications will still work
seamlessly -- these may simply run accelerated code if the hardware
supports it, causing the kernel to turn the affected functional
block(s) on.
In order for this to work, some dynamic status information would need
to be visible to each user process, and polled each time a function
with a dynamically switchable choice of implementations gets called.
You probably don't need to worry about race conditions either-- if the
process accidentally tries to use a turned-off feature, you will take
a fault which gives the kernel the chance to turn the feature back on.
Generally, this should be a rare occurrence.
The dynamic feature status information should ideally be per-CPU
global, though we could have a separate copy per thread, at the cost
of more memory. It can't be system-global, since different CPUs may
have a different set of functional blocks active at any one time --
for this reason, the information can't be stored in an existing
mapping such as the vectors page. Conversely, existing mechanisms
such sysfs probably involve too much overhead to be polled every time
you call copy_pixmap() or whatever.
Alternatively, each thread could register a userspace buffer (a single
word is probably adequate) into which the CPU pokes the hardware
status flags each time it returns to userspace, if the hardware status
has changed or if the thread has been migrated.
Either of the above approaches could be prototyped as an mmap'able
driver, though this may not be the best approach in the long run.
Does anyone have a view on whether this is a worthwhile idea, or what
the best approach would be?
Cheers
---Dave
== Linaro GCC ==
* Worked on quad-word/big-endian fixes patch. Sent off a version
on Tuesday which worked OK, but which made some awkward changes to the
middle-end. Tried to re-think those parts, but without much luck: came
to the conclusion that spending more time trying to fix
element-ordering-dependent operations on quad-word vectors in
big-endian mode was probably not worth the effort (since we plan to be
changing things in that area anyway). Wrote a much-simplified patch
which simply disables those patterns, and ported it to mainline.
* Then, spent some time trying to set up big-endian testing with a
mainline build, since the lack of such an option is partly why we got
into this mess to start with. My current plan (as well as testing the
above patch) is to create an upstreamable patch to easily enable
big-endian (Linux) multilibs, in the hope that it'll generally make
big-endian testing easier. (Of course people will still need test
harness configurations which will allow running big & little-endian
code, which most won't have.)
* Also, ping lp675347 (volatile bitfields vs. QT atomics), and do some
some extra checks suggested by DJ Delorie, which seemed to work out
fine. Backported patch for lp629671 to Linaro 4.4 branch, and ran tests
(also fine).
* Continued discussion of internal representations for fancy vector
loads/stores in GIMPLE/RTL on linaro-toolchain.
== Last Week ==
* Continued implementing support for ARM unwind tables in libunwind.
* Sent patches upstream to improve binutils's readelf, adding support
for all remaining unwind table instructions (i.e. VFP/NEON and WMMX).
When used on ARMv7a, provides meaningful output for previously
'unsupported' opcodes that get used in some libraries (e.g. glibc).
== This Week ==
* Continue working on libunwind.
--
Zach Welch
CodeSourcery
zwelch(a)codesourcery.com
(650) 331-3385 x743
== GDB ==
* Posted updated implementation of #661253 (Improve
backtrace by using ARM exception tables) to gdb-patches,
which includes several changes requested by reviewers
* Posted updated patch to further improve backtrace
(in the absence of debug info) to gdb-patches
* Commented on a couple of GDB LP bugs
== Miscellaneous ==
* Started setting up IGEP v2 board
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
--
Dr. Ulrich Weigand | Phone: +49-7031/16-3727
STSM, GNU compiler and toolchain for Linux on System z and Cell/B.E.
IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Martin Jetter | Geschäftsführung: Dirk
Wittkopp
Sitz der Gesellschaft: Böblingen | Registergericht: Amtsgericht
Stuttgart, HRB 243294
Since it came up in the toolchain meeting this morning, some links to
issues people are having doing ARM scratchbox-style builds because of
generic linux-user issues:
http://bugs.meego.com/show_bug.cgi?id=10529 # linux-user's mmap
implementation isn't very smart
https://bugs.launchpad.net/qemu/+bug/668799 # qemu locking issues
which can cause build failures (sometimes)
-- PMM
https://blueprints.launchpad.net/ubuntu/+spec/other-linaro-n-cross-compilers:
- wrote patches for creating backports PPA
- each component [1] generates versioned -source binary package
(eglibc-2.12.1-source etc)
- a-c-t-b [2] got "PPA" boolean variable in rules to have one source package
for archive and for backports
- I built a-c-t-b with all components backports from natty in lucid pbuilder
Bugs:
- 684625 - libc6 is compiled for armv5 instead of armv7a
- confirmed, wrote fix, will sent for review and merge
- 683832 - gcc fails to cross compile Qt
- confirmed in maverick for cross gcc 4.4/4.5
- need to check with fixed (bug 684625) toolchain
- FTFBS of armel-cross-toolchain-base 1.53/natty
- issue is lack of LTO plugin built in gcc/stage2
- have first patches for it, need to test
1. component = eglibc, gcc-4.4/4.5, binutils, linux
2. a-c-t-b = armel-cross-toolchain-base
Regards,
--
JID: hrw(a)jabber.org
Website: http://marcin.juszkiewicz.com.pl/
LinkedIn: http://www.linkedin.com/in/marcinjuszkiewicz
== GCC related ==
* PR44557, Thumb-1 ICE: originally thought a fix of constraint will
work, however after simplifying the testcase, received another ICE in
postreload, due to a load of IP, which is not permitted in Thumb-1.
Looking at some reload internals as part of fixing this.
* PR45416, ARM code generation regression. First fix from last week hit
an assert FAIL in the alias-oracle due to ARRAY_REFs not being handled
there. Also, further found some expand code quality regressions due to
this change. Turned to a more conservative fix by adding the related TER
substitution to expr.c:do_store_flag(), which produced more focused
results. However, 32-bit x86 slightly regressed in the same flag storing
code (did not use the 'testl' insn after the change). Still WIP.
* PR46667, submitted a section type conflict bug fix upstream, see
http://gcc.gnu.org/ml/gcc-patches/2010-12/msg00137.html , which
supposedly fixes the upstream ARM-Linux C++ build. Jan Hubicka later
gave another fix, so still in discussion.
* PR45886, this PR is call for backporting the __ARM_PCS* preprocessor
symbols to gcc-4_5-branch. Submitted a mail to ask for approval, no
response.
== libffi VFP hard-float ==
* PR46508, libffi VFP assembly error. I missed this earlier due to using
a compiler configured with --with-fpu=vfp. Submitted assembly fix to add
the needed FPU directives. Committed to upstream trunk.
== This week ==
* Hope to wrap up the above in-progress PRs, as well as continue to look
at other PRs of interest.
* LP #685534 popped up on Sunday, and manifests on upstream trunk too.
Add this to queue.
* Think about GCC performance opportunities (Linaro)
== Linaro GCC ==
* Reproduce regression of my ldm/stm backport on 4.5, which is caused by
the other two merged patches in ifcvt.c. Fix them. Propose merge
request again. Learn how to sync/merge changes from one branch to the
other branch.
* Fix VFP_D0_D7 handling in predicate vfp_register_operand. Approved
and committed upstreams.
* Test new regrename improvement patch on x86_64, and measure
effectiveness of it on ARM. Code size of bash-3.2 is reduced 0.2% with
option "-march=armv7a -mthumb -O2 -frename-registers". Eric B. is
almost OK with this patch except some wording in comments.
== Linaro GDB ==
As discussed in UDS, I'll move to GDB work for gdb correctness.
http://ex.seabright.co.nz/helpers/planner#tr-toolchain-gdb-correctness
In this month, I'll focus on GDB testsuite failures fixing.
* Analyze LP:615978, failures in gdb.base/annota3.exp.
Signal is not delivered to child while software single-stepping. The
same as LP:649121.
* Fix failure in gdb.xml/tdesc-regs.exp. LP:685494
It is caused by a target triplet matching error, when target is set to
"armv7l-linux-gnueabi". Target triplet matching in test cases should be
changed. Patch is being reviewed in upstreams.
* LP:616000 failures caused by -fstack-protector.
Homework to understand frame-related code in GDB. Got some big picture
of usage of some key data structures inside GDB on frame. Compared with
prologue with and without stack-protector, find some difference there.
Still no clue on how to educate GDB to identify whether stack-protector
is turned on or off.
* Fix one failure in printcmds.exp. LP:685702
This test case on 7.2 branch is a little bit out of date, compared with
GDB trunk. Backport one patch on trunk to 7.2 branch can fix this
problem. Backport patch is being reviewed in upstreams.
* Neon registers in kernel dump file. LP:615972
Ask Linaro kernel WG to see how to move forward on this. Discussion is
still ongoing.
== This Week ==
* Report the rest of GDB testsuite failures.
* Pick up some of them, and fix.
* Pass gcc patches in my queue one by one to gcc-patches to review.
--
Yao (齐尧)
RAG:
Red:
Amber:
Green:
Milestones:
| Planned | Estimate | Actual |
finish virtio-system | 2010-08-27 | postponed | |
get valgrind into linaro PPA | 2010-09-15 | 2010-09-28 | 2010-09-28 |
complete a qemu-maemo update | 2010-09-24 | 2010-09-22 | 2010-09-22 |
finish testing PCI patches | 2010-10-01 | 2010-10-22 | 2010-10-18 |
Progress:
* merge-correctness-fixes:
** Nathan Froyd (CodeSourcery) has reviewed a lot of
my ARM patches. Most were OK, one or two needed tweaking
We seem to have come to agreement on how best to treat
the API between qemu and the softfloat library, and I
have a V2 patchset ready to mail as soon as Nathan has
commented on the final patch.
** posted a patch to rename a very misleading _is_nan()
function
** identified list of correctness patches in meego and
samsung trees and issues noted within ARM
** qemu: posted patch to remove an unused function
** started looking at the first patch in the meego tree,
which fixes VQSHL. I have already discovered a bug in
this insn not covered by the meego patch...
Meetings: toolchain, PD update, ARM 20th birthday party
Plans:
- qemu consolidation
Issues:
* Locking in qemu is definitely insufficient, especially
(but not exclusively) when running multi-threaded
programs in linux-user mode.
https://bugs.launchpad.net/qemu/+bug/668799
has an example problem and some discussion; I'm hoping
some other qemu developers have an opinion, but the
nicest approach IMHO would involve fairly invasive
changes to how qemu implements interrupting a cpu
which is executing TCG code.
Not sure where this should sit in the priority list.
Things of note:
- there has been some discussion of broadening the "KVM Forum"
conference to include other virtualisation related topics including
Xen and also the TCG aspects of Qemu. Still all up in the air but
possibly colocated with LinuxCon in Vancouver in August. See:
http://www.linux-kvm.org/page/KVM_Forum_2011
Absences: (complete to end of 2010)
Fri 17 Dec - Tue 4 Jan inclusive.
2011: Dallas Linaro sprint 9-15 Jan. Holiday 22 Apr - 2 May.
== This week ==
* Looked at a generic bug in GAS's handling of ifuncs. Sent a patch upstream:
http://sourceware.org/ml/binutils/2010-11/msg00495.html
Alan quite reasonably wanted me to test on a variety of targets. For want
of anything better, I wrote a script to test Alan's list of 118 targets.
Tests went OK, patch committed upstream.
* Wrote more IFUNC tests. Found another problem (as yet unresolved).
* Looked at vector stuff, but nothing tangible yet.
(I also had to spend some time on other IBM things, sorry.)
== Next week ==
* Away Monday and Tuesday.
* More STT_GNU_IFUNC and vectors.
Richard
* Benchmarking of simple package builds with various string routine
versions; not finding enough difference in the noise to make any large
conclusions
* Looking at the string routine behaviour with perf to see where the time
is going
- getting hit by the Linaro kernels on silverbell missing Perf
enablement in the config
- Useful amount of time does seem to be spent outside the main 'fast
aligned' chunks of code
- pushing/popping registers does seem to be pretty expensive
* Started looking at libffi and hard float
- Started writing a spec
https://wiki.linaro.org/WorkingGroups/ToolChain/Specs/LibFFI-variadic
- It's going to need an API change to libffi, although the change
shouldn't break any existing code on existing platforms where they work.
* Helping with the image testing
Dave
Hi,
* got llvm+clang working on ARM:
https://wiki.linaro.org/KenWerner/Sandbox/HowToBuildToolchainComponents#llv…
* checked whether llvm inlines the __sync_* builtins on ARM or not:
https://wiki.linaro.org/WorkingGroups/ToolChain/AtomicMemoryOperations#LLVM
* developed a patch for #681138 (tested with current gcc-linaro)
* spent some time for bootstrapping the GCC trunk in order to test and post
that patch on the ml but wasn't successful
(finally ran into the issues discussed at #659713)
* did some verification work on #674090
* preparing to work on the "investigate current developer tools" item
Regards
Ken
Hi there. Currently you can't use NEON instructions in inline
assembly if the compiler is set to -mfpu=vfp such as Ubuntu's
-mfpu=vfpv3-d16. Trying code like this:
int main()
{
asm("veor d1, d2, d3");
return 0;
}
gives an error message like:
test.s: Assembler messages:
test.s:29: Error: selected processor does not support Thumb mode `veor d1,d2,d3'
The problem is that -mfpu=vfpv3-d16 has two jobs: it tells the
compiler what instructions to use, and also tells the assembler what
instructions are valid. We might want the compiler to use the VFP for
compatibility or power reasons, but still be able to use NEON
instructions in inline assembler without passing extra flags.
Inserting ".fpu neon" to the start of the inline assembly fixes the
problem. Is this valid? Are assembly files with multiple .fpu
statements allowed? Passing '-Wa,-mfpu=neon' to GCC doesn't work as
gas seems to ignore the second -mfpu.
What's the best way to handle this? Some options are:
* Add '.fpu neon' directives to the start of any inline assembly
* Separate out the features, so you can specify the capabilities with
one option and restrict the compiler to a subset with another.
Something like '-mfpu=neon -mfpu-tune=vfpv3-d16'
* Relax the assembler so that any instructions are accepted. We'd
lose some checking of GCC's output though.
-- Michael
- Continued looking into NEON special loads and stores.
- Benchmarks: concentrated on EEMBC Telecom:
- autcor gets vectorized
- viterbi, besides strided data accesses, needs to sink conditional
stores to allow if-conversion and make the main loop vectorizable.
Since the potential here is 4x, I think it's worthwhile to work on
this.
- conven, fbital also have control-flow issue, but much more
complicated than viterbi
- fft has a problem with loop count, I would like to investigate
this a bit more
- diffmeasure doesn't seem to have vectorization potential
- Fixed GCC PR 46663 on trunk, testing the fix for 4.3, 4.4, 4.5.
Hi,
Here's a work-in-progress patch which fixes many execution failures
seen in big-endian mode when -mvectorize-with-neon-quad is in effect
(which is soon to be the default, if we stick to the current plan).
But, it's pretty hairy, and I'm not at all convinced it's not working
"for the wrong reason" in a couple of places.
I'm mainly posting to gauge opinions on what we should do in big-endian
mode. This patch works with the assumption that quad-word vectors in
big-endian mode are in "vldm" order (i.e. with constituent double-words
in little-endian order: see previous discussions). But, that's pretty
confusing, leads to less than optimal code, and is bound to cause more
problems in the future. So I'm not sure how much effort to expend on
making it work right, given that we might be throwing that vector
ordering away in the future (at least in some cases: see below).
The "problem" patterns are as follows.
* Full-vector shifts: these don't work with big-endian vldm-order quad
vectors. For now, I've disabled them, although they could
potentially be implemented using vtbl (at some cost).
* Widening moves (unpacks) & widening multiplies: when widening from
D-reg to Q-reg size, we must swap double-words in the result (I've
done this with vext). This seems to work fine, but what "hi" and "lo"
refer to is rather muddled (in my head!). Also they should be
expanders instead of emitting multiple assembler insns.
* Narrowing moves: implemented by "open-coded" permute & vmovn (for 2x
D-reg -> D-reg), or 2x vmovn and vrev64.32 for Q-regs (as
suggested by Paul). These seem to work fine.
* Reduction operations: when reducing Q-reg values, GCC currently
tries to extract the result from the "wrong half" of the reduced
vector. The fix in the attached patch is rather dubious, but seems
to work (I'd like to understand why better).
We can sort those bits out, but the question is, do we want to go that
route? Vectors are used in three quite distinct ways by GCC:
1. By the vectorizer.
2. By the NEON intrinsics.
3. By the "generic vector" support.
For the first of these, I think we can get away with changing the
vectorizer to use explicit "array" loads and stores (i.e. vldN/vstN), so
that vector registers will hold elements in memory order -- so, all the
contortions in the attached patch will be unnecessary. ABI issues are
irrelevant, since vectors are "invisible" at the source code layer
generally, including at ABI boundaries.
For the second, intrinsics, we should do exactly what the user
requests: so, vectors are essentially treated as opaque objects. This
isn't a problem as such, but might mean that instruction patterns
written using "canonical" RTL for the vectorizer can't be shared with
intrinsics when the order of elements matters. (I'm not sure how many
patterns this would refer to at present; possibly none.)
The third case would continue to use "vldm" ordering, so if users
inadvertantly write code such as:
res = vaddq_u32 (*foo, bar);
instead of writing an explicit vld* intrinsic (for the load of *foo),
the result might be different from what they expect. It'd be nice to
diagnose such code as erroneous, but that's another issue.
The important observation is that vectors from case 1 and from cases 2/3
never interact: it's quite safe for them to use different element
orderings, without extensive changes to GCC infrastructure (i.e.,
multiple internal representations). I don't think I quite realised this
previously.
So, anyway, back to the patch in question. The choices are, I think:
1. Apply as-is (after I've ironed out the wrinkles), and then remove
the "ugly" bits at a later point when vectorizer "array load/store"
support is implemented.
2. Apply a version which simply disables all the troublesome
patterns until the same support appears.
Apologies if I'm retreading old ground ;-).
(The CANNOT_CHANGE_MODE_CLASS fragment is necessary to generate good
code for the quad-word vec_pack_trunc_<mode> pattern. It would
eventually be applied as a separate patch.)
Thoughts?
Julian
ChangeLog
gcc/
* config/arm/arm.h (CANNOT_CHANGE_MODE_CLASS): Allow changing mode
of vector registers.
* config/arm/neon.md (vec_shr_<mode>, vec_shl_<mode>): Disable in
big-endian mode.
(reduc_splus_<mode>, reduc_smin_<mode>, reduc_smax_<mode>)
(reduc_umin_<mode>, reduc_umax_<mode>)
(neon_vec_unpack<US>_lo_<mode>, neon_vec_unpack<US>_hi_<mode>)
(neon_vec_<US>mult_lo_<mode>, neon_vec_<US>mult_hi_<mode>)
(vec_pack_trunc_<mode>, neon_vec_pack_trunc_<mode>): Handle
big-endian mode for quad-word vectors.
Hi there. I've had a few questions recently about how to build a
cross compiler, so I took a stab at writing the steps down in a
Makefile. See:
https://code.launchpad.net/~michaelh1/+junk/cross-build
Hopefully it's easy to follow. It uses a binary sysroot and gives you
vanilla binutils 2.20 and Linaro GCC 2010.11 in a good enough way that
you can cross-compile for Maverick. The script is minimal and trades
readability for flexibility.
Note that Marcin's cross compiler packages or the Embedian toolchains
are a better way to go, but if you want to see the steps involved
check out the script.
Marcin or Matthias, would you mind reviewing it?
-- Michael
Hi,
- the struggle with the board took a lot of time
- continued to investigate special loads/stores
- looked for benchmarks:
EEMBC Consumer filters rgbcmy and rgbyiq should be vectorizable
once vld3, vst3/4 are supported
EEMBC Telecom viterbi is supposed to give 4x on NEON once
vectorized (according to
http://www.jp.arm.com/event/pdf/forum2007/t1-5.pdf slide 29). My old
version of viterbi is not vectorizable because of if-conversion
problems. I'd be really happy to check the new version (it is supposed
to be slightly different).
Looking into other EEMBC benchmarks.
FFMPEG http://www.ffmpeg.org/ (got this from Rony Nandy from
User Platforms). It contains hand-vectorized code for NEON.
Investigating.
I am probably taking a day off on Sunday.
Ira
This wiki page came up during the toolchain call:
https://wiki.linaro.org/Internal/People/KenWerner/AtomicMemoryOperations/
It gives the code generated for __sync_val_compare_and_swap
as including a push {r4} / pop {r4} pair because it uses too many
temporaries to fit them all in callee-saves registers. I think you
can tweak it a bit to get rid of that:
# int __sync_val_compare_and_swap (int *mem, int old, int new);
# if the current value of *mem is old, then write new into *mem
# r0: mem, r1 old, r2 new
mov r3, r0 # move r0 into r3
dmb sy # full memory barrier
.LSYT7:
ldrex r0, [r3] # load (exclusive) from memory pointed to
by r3 into r0
cmp r0, r1 # compare contents of r0 (mem) with r1
(old) -> updates the condition flag
bne .LSYB7 # branch to LSYB7 if mem != old
# This strex trashes the r0 we just loaded, but since we didn't take
# the branch we know that r0 == r1
strex r0, r2, [r3] # store r2 (new) into memory pointed to
by r3 (mem)
# r0 contains 0 if the store was
successful, otherwise 1
teq r0, #0 # compares contents of r0 with zero ->
updates the condition flag
bne .LSYT7 # branch to LSYT7 if r0 != 0 (if the
store wasn't successful)
# Move the value that was in memory into the right register to return it
mov r0, r1
dmb sy # full memory barrier
.LSYB7:
bx lr # return
I think you can do a similar trick with __sync_fetch_and_add
(although you have to use a subtract to regenerate r0 from
r1 and r2).
On the other hand I just looked at the gcc code that does this
and it's not simply dumping canned sequences out to the
assembler, so maybe it's not worth the effort just to drop a
stack push/pop.
-- PMM
== Linaro GCC ==
* Finished testing patch for lp675347 (QT inline-asm atomics), and
send upstream for comments (no response yet). Suggested reverting a
patch (which enabled -fstrict-volatile-bitfields by default on ARM)
locally for Ubuntu in the bug log.
* Continued working on NEON quad-word vectors/big-endian patch. This
turned out to be slightly fiddlier than I expected: I think I now have
semantics which make sense, though my patch requires (a) slight
middle-end changes, and (b) workarounds for unexpected combiner
behaviour re: subregs & sign/zero-extend ops. I will send a new version
of the patch to linaro-toolchain fairly soon for comments.
Hello,
I have a question about cbnz/cbz thumb-2 instruction implementation in
thumb2.md file:
I have an example where we jump to a label which appears before the branch;
for example:
L4
...
cmp r3, 0
bne .L4
It seems that cbnz instruction should be applied in this loop; replacing
cmp+bne; however, cbnz fails to be applied as diff = ADDRESS (L4) - ADDRESS
(bne .L4) is negative and according to thumb2_cbz in thumb2.md it should
be 2<=diff<=128 (please see snippet below taken from thumb2_cbz).
So I want to double check if the current implementation of thumb2_cbnz
in thumb2.md needs to be changed to enable it.
The following is from thumb2_cbnz in thumb2.md:
[(set (attr "length")
(if_then_else
(and (ge (minus (match_dup 1) (pc)) (const_int 2))
(le (minus (match_dup 1) (pc)) (const_int 128))
(eq (symbol_ref ("which_alternative")) (const_int 0)))
(const_int 2)
(const_int 8)))]
Thanks,
Revital
== This week ==
* More ARM testing of binutils support for STT_GNU_IFUNC.
* Implemented the GLIBC support for STT_GNU_IFUNC. Simple ARM testcases
seem to run correctly.
* Ran the GLIBC testsuite -- which includes some ifunc coverage --
but haven't analysed the results yet.
* Started looking at Thumb for STT_GNU_IFUNC. The problem is that
BFD internally represents Thumb symbols with an even value and
a special st_type (STT_ARM_TFUNC); this is also the old, pre-EABI
external representation. We need something different for STT_GNU_IFUNC.
* Tried making BFD represent Thumb symbols as odd-value functions
internally. I got it to "work", but I wasn't really happy with
the results.
* Looked at alternatives, and in the end decided that it would be
better to have extra internal-only information in Elf_Internal_Sym.
This "works" too, and seems cleaner to me. Sent an RFC upstream:
http://sourceware.org/ml/binutils/2010-11/msg00475.html
* Started writing some Thumb tests for STT_GNU_IFUNC.
* Investigated #618684. Turned out to be something that Bernd
had already fixed upstream. Tested a backport.
== Next week ==
* More IFUNC tests (ARM and Thumb, EGLIBC and binutils).
Richard
== Linaro and upstream GCC ==
* LP #674146, dpkg segfaults during debootstrap on natty armel: analyzed
and found this should be a case of PR44768, backported mainline revision
161947 to fix this.
* LP #641379, bitfields poorly optimized. Discussed some with Andrew Stubbs.
* GCC bugzilla PR45416: Code generation regression on ARM. Been looking
at this regression, that started from the expand from SSA changes since
4.5-experimental. The problem seems to be TER not properly being
substituted during expand (compared to prior "convert to GENERIC then
expand"). I now have a patch for this, which fixes the PR's testcase,
but when testing current upstream trunk, hit an assert fail ICE on
several testcases in the alias-oracle; it does however, test without
regressions on a 4.5 based compiler. I am still looking at the upstream
failures.
== This week ==
* Continue with GCC issues and PRs.
* Think about GCC performance opportunities (Linaro)
== Last Week ==
* Started writing libunwind support for ARM-specific unwinding, but
realized that the native ARM toolchain may be causing problems. Spent time
trying to isolate the issues, but haven't found the culprit yet.
* Began to consider the possibility of developing an ARM-specific
unwinding library which would integrate into libunwind (and be reusable
elsewhere). Basically, the ARM.ex{idx,tbl} sections are unique to ARM.
* Looked at other applications where unwinding functionality already
exists to see how ARM unwinding is done (e.g. GDB).
Could/should that functionality be replaced with calls to libunwind (or
to the aforementioned ARM-specific helper library)?
== This Week ==
* Continue to implement ARM-specific unwinding in libunwind.
--
Zach Welch
CodeSourcery
zwelch(a)codesourcery.com
(650) 331-3385 x743
== Linaro GCC ==
* LP:634738: Firstly, fix this in combiner. Try the other approach
(without changes to arm.md) suggested by Andrew S, to fix
arm_gen_constant in some cases to generate lsl/lsr ranther loading
constant. Some piece of code in arm.c was written in 1998, hard to
understand with few comments. During this, find some lsl/lsr can be
replaced by ubfx. Use gen_extzv_t2 when arm_arch_thumb2 is true to
transform lsl + lsr to ubfx. Two patches are ready.
* LP:633243: Got build failures on FSF trunk for
arm-none-linux-gnueabi. Test patch on FSF trunk 2010-10-21. No regression.
* LP:638935: predicate "vfp_register_operand" should return true for
VFP_D0_D7_REGS registers. Fixed.
predicates {store,load}_multiple_operation assumes mode is SImode, and
size of data is 4. Fix them to accept multiple VFP operations. Write
three new test cases for stm/fldm/fstm pattern. Test patch on FSF trunk
2010-10-21. No regression.
* SMS on thumb2. Discussed with Revital Eres back and forth on doloop
pattern for thumb2. doloop pattern is not recognized so far on thumb2.
Revital has a fix to thumb2_cbz pattern. After this fix, doloop
pattern should be recognized.
* Ping ARM fix PR45701 in gcc-patches for the fifth time. Still no reply.
== This Week ==
* Look at regressions of ldm/stm backport on Linaro GCC 4.5.
* Internal review of patch to LP:638935
* Try SMS on thumb2 for EEMBC, if Revital's thumb2_cbz pattern fix is
accepted by upstreams.
--
Yao (齐尧)
Hi there. Our Versatile Express has been installed in the data centre
and is available for use. See:
https://wiki.linaro.org/WorkingGroups/ToolChain/Hardware
for the details. If you're a member of the Toolchain WG then you
should already have an account.
Dave is currently using this machine for benchmarking. Until we get
more hardware, please use IRC or email to manage access.
-- Michael
Reviewed Yao's patch for AND optimization. Some back on forth on the
best way to tackle this problem.
LP:663939 - thumb2 constant loading
- backported my patches to GCC 4.5
- awaiting review
LP:595479 - .eh_frame broken.
- Discovered this problem had been fixed (with Thomas' patch) since
August, and has also been fixed upstream, albeit with an alternative
patch. Nothing to do here.
LP:641379 - bitfields poorly optimized.
- analysed the problem. The code in cse.c that is supposed to fix this
does not recognise the case.
- created a patch and tested it for both GCC 4.6 and 4.5.
- awaiting review
LP:674146 - dpkg segfault.
- started looking at this, but Chung-Lin took it first.
While trying to reproduce lp:674146, I discovered that my IGEPv2 had a
corrupted rootfs, again. I only fixed it last week, so I looked into it
more deeply. It seems the SD card has developed at least one bad block.
Reformatted, scanned and reinstalled the files from backup. I think the
problem was caused by the daily apt package download (it was always
those files that were corrupt), so I've disabled that. I've also
disabled access-time-stamps. If it happens again I will have to consider
using a different underlying filesystem format.
LP:643479 / CS Issue:8610 - Multiply and accumulate optimization
- created patches for both issues.
- both were machine description subtleties.
- backported the patches to 4.5
- the patches apply and work fine, but ...
- found an extra problem with redundant moves
- awaiting review
GCC 4.6
- Created a new Launchpad series and branch to track GCC 4.6 development.
- Set up the CS internal build config.
- Tried to build the latest checkout and failed
- glibc problem still unfixed - Jie has reported it now.
- libquadmath build fails
Merged FSF GCC trunk (pre-4.5.2) into Linaro GCC 4.5 tree.
Merged the outstanding Launchpad merge requests into GCC 4.5. The
testing showed regressions, so I backed out most of the merges and did
them in smaller batches. Chung-Lin and Richard's patches passed the
testing, so that leaves Yao's as the problem patch. I didn't get time to
test this assertion this week.
----------------------------------
Next week: Vacation.
Andrew Stubbs wrote:
> * Instruction set coverage.
> - Are there any ARM/Thumb2 instructions that we are not taking
> advantage of? [2]
> - Do we test that we use the instructions we do have? [3]
There is no general frame work to test instruction set coverage. The only
way to find out, really, is to create some test cases where you expect
the compiler to produce a certain insn.
Is there a list of all ARM/Thumb2 instructions and the ones implemented in
the GCC ARM machine descriptions?
> * Constant pools
> - it might be a very handy space optimization to have small
> functions share one constant pool, but the way the passes work one
> function at a time makes this hard. (LP:625233)
There are also passes working on the entire program, or a partition.
Isn't it more a question of how to group and process functions that are
candidates for sharing a constant pool with a neighbor?
Are there algorithms for this kind of pool sharing in the academic
or ARM-specific literature?
Other suggestions for the discussion:
* Better use of conditional execution.
- No idea how much this really helps for ARM, but there are bug
reports about missed opportunities from time to time, so...
- How to model conditional execution before register allocation?
- How exploit opportunities better in GCC (ifcvt is inadequate
and too late in the pipeline).
- Also look at LLVM here, it appears to have a better cost model
for if-conversion than GCC (taking into account a target-dependent
branch misprediction
penalty, for example).
* Basic block re-ordering for speed/size.
- The existing basic block reordering pass in GCC implements only
a reordering strategy for speed.
- The pass does not run at all for functions optimized for size.
* Comparing ARM cost models and param settings to x86_64
- Compare, for some set of functions/benchmarks, the results of
estimate_num_insns, estimate_operator_cost, and
estimate_move_cost, between ARM and x86_64. Rationalize or fix
any significant differences. See whether heuristics based on
these functions require tuning for ARM.
- Go through params.def and see if there are further ARM tuning
opportunities. There are more than 100 DEFPARAMs and many of
them guide heuristics but have only been tuned on x86_64.
(There is set_default_param_value, but most backends do not
change the defaults.)
Hoping this is helpful,
Ciao!
Steven
David G. requested a few packages installed on silverbell today (the
quad-A9 VE porter machine we host in the datacenter). We got dchroots
instead:
----- Forwarded message from LaMont Jones via RT <rt(a)admin.canonical.com> -----
Date: Fri, 26 Nov 2010 21:02:22 +0000
Subject: [rt.admin.canonical.com #42662] Simple package installs on silverbell
On Fri Nov 26 17:08:28 2010, kiko(a)canonical.com wrote:
> Hi there,
>
> Could we get installed on silverbell:
>
> build-essential
> debhelper
> fakeroot
>
> And could we get deb-src's added to sources.list and do an apt-get
> update to allow us to apt-get source certain packages? This is for
> simple compilation benchmarks. Thanks!
Dchroot environments have been created on silverbell for both maverick
and natty. Within the chroot, you can sudo apt-get install to install
packages. apt-get update/dist-upgrade and installs that cause package
removal will require a GSA to do them.
To build for maverick: dchroot -c maverick (and then build however you want...)
lamont
----- End forwarded message -----
--
Christian Robottom Reis | [+55] 16 9112 6430 | http://launchpad.net/~kiko
Linaro Engineering VP | [ +1] 612 216 4935 | http://async.com.br/~kiko
Hi,
* the ARM __sync_* glibc-ports patch was accepted upstream
* posted proposal for consolidating sync primitives but stdatomic seems to be
the future
* used my small gcc testsuite patch to verify __sync_* support of the gcc-
linaro
* created:
https://wiki.linaro.org/WorkingGroups/ToolChain/AtomicMemoryOperations
* looked into GOMP support on ARM:
- #pragma omp atomic results in proper asm code (dmb, ldrex, strex, dmb)
- #pragma omp flush results in a DMB instruction
- #pragma omp barrier results to a call to GOMP_barrier (I'm not sure if
this is the desired behavior)
* started to look into #681138
Regards
Ken
Hand crafted a simple strchr and comparing it with Libc:
https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialStrchr
It's interesting it's significantly faster than libc's on A9's, but on
A8's it's slower for large sizes. I've not really looked why yet; my
implementation is just the absolute simplest thumb-2 version.
Did some ltrace profiling to see what typical strchr and strlen sizes were,
and got a bit surprised at some of the typical behaviours
(Lots of cases where strchr is being used in loops to see if another string
contains anyone of a set of characters, a few cases
of strchr being called with Null strings, and the corner case in the spec
that allows you to call strchr with \0 as the character
to search for).
Trying some other benchmarks (pybench spends very little time in
libc,package builds of simple packages seem to have a more interesting
mix of libc use).
Sorting out some of the red tape for contributing.
Dave