Hi there. I've updated the list of potential Summit sessions based on
yesterday's call. Could people please check the Sessions table on
https://wiki.linaro.org/WorkingGroups/ToolChain/Meetings/2010-10-18
and flesh out the agenda for sessions that have your name against them.
The agenda should be five to ten discussion points, preferably of
things that are not well understood and could use input from the
group.
There's a good discussion on what to expect at a Summit here:
http://oubiwann.blogspot.com/2010/10/q-the-ubuntu-developer-summits.html
You can check the already-approved sessions here:
http://summit.ubuntu.com/uds-n/
Feel free to join in to any other sessions you might find interesting.
There will be quite a few people with diverse backgrounds there,
including ~80 people from Linaro, ~400 from Ubuntu, ~200 from the
community, and ~200 remote. The overlap between Toolchain and Ubuntu
interests might not be great, so I'll make sure a common work room is
available for idle time.
-- Michael
Some Linaro developers work with ARM devices older than the ARMv7-A
architecture. Others experiment with the hard-float ABI. Each of them has
to rebuild the toolchain for their own use, which means tweaking components
until they build properly.
That is no longer necessary: I wrote some patches, and armel-cross-toolchain-base
(since version 1.53) plus the newer source packages for gcc-4.[45]-armel-cross
now support a "debian/flavour" file, which allows setting some flags related
to the toolchain build.
So far supported things are:
- ARM architecture
- float ABI
- FPU mode
- Thumb mode
This feature has not been merged into the regular Ubuntu packages yet, as it
is work in progress that needs to be cleaned up first.
http://people.linaro.org/~hrw/armel-cross-toolchain/ has all source packages
needed.
Regards,
--
JID: hrw(a)jabber.org
Website: http://marcin.juszkiewicz.com.pl/
LinkedIn: http://www.linkedin.com/in/marcinjuszkiewicz
I meant to send this to the "external" Linaro toolchain mailing list,
not the internal CS one. Apologies to those who receive it twice!
In a follow-up message, Joseph Myers pointed out a post he'd written
previously on the same subject:
http://gcc.gnu.org/ml/gcc-patches/2010-06/msg00409.html
In further followups (at the risk of misrepresenting Joseph & Paul
Brook's opinions!), there seemed to be general agreement that a scheme
something like that outlined below, with "permuting" loads/stores and
some way of handling multiple in-register layouts for vectors seems
like it will be a necessary addition to the vectorizer, going forward.
Julian
Begin forwarded message:
Date: Thu, 7 Oct 2010 16:45:17 +0100
From: Julian Brown <julian(a)codesourcery.com>
To: Ira Rosen <IRAR(a)il.ibm.com>
Cc: Tejas Belagod <Tejas.Belagod(a)arm.com>, Linaro List
<gnu-linaro-tools(a)codesourcery.com>
Subject: [gnu-linaro-tools] NEON vectorization: use of specialized
load/store instructions
Hi,
We're having some system issues, so I thought I'd take the chance to
write down some things I've been thinking about re: utilising the NEON
load/store instructions more effectively. I've also attempted to
summarize the problems with big-endian mode. All unverified as of yet,
so please take with a pinch of salt :-). Comments appreciated. It's
been a while since I last thought about some of this stuff...
Cheers,
Julian
Use of specialized load instructions
====================================
To provide good support for NEON's element and structure load/store
instructions, GCC lacks support for a couple of key features:
1. A good way of representing a set of two, three or four vector
registers (either D- or Q-sized), possibly with non-unit stride.
2. A generalised mapping between memory locations and lane numbers.
To start with point 1: currently the element and structure load/store
instructions are only supported via intrinsics. These are specified to
load and store as if going via an array embedded in a union, i.e.:
typedef struct int8x8x2_t
{
  int8x8_t val[2];
} int8x8x2_t;

__extension__ static __inline int8x8x2_t __attribute__ ((__always_inline__))
vld2_s8 (const int8_t * __a)
{
  union { int8x8x2_t __i; __builtin_neon_ti __o; } __rv;
  __rv.__o = __builtin_neon_vld2v8qi ((const __builtin_neon_qi *) __a);
  return __rv.__i;
}
Even for a trivial test program, e.g.:
#include <arm_neon.h>

int foo (int8_t *x)
{
  int8x8x2_t result = vld2_s8 (x);
  return vget_lane_s8 (result.val[0], 1);
}
We will generate code like so:
sub sp, sp, #32
vld2.8 {d16-d17}, [r0]
mov r3, sp
vstmia sp, {d16-d17}
add ip, sp, #16
ldmia r3, {r0, r1, r2, r3}
stmia ip, {r0, r1, r2, r3}
fldd d16, [sp, #16]
vmov.s8 r0, d16[1]
add sp, sp, #32
bx lr
I.e., rather than being used directly, the registers loaded by vld2
will always be spilled to the stack then reloaded. This obviously
reduces the usefulness of these intrinsics by a large factor. With some
planning, it'd be good to find a powerful enough solution to this
problem so that the same representation for multiple registers can be
used by the autovectorizer as well as the intrinsic-handling code.
(One difficulty is that the "foo.val[X]" interface should still be
available to user code. There's probably no need for "val" to literally
be an array, though other representations would require front-end
changes).
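For reference, here is a minimal plain-C model of what the vld2.8 load in the test program is meant to deliver, with no stack round trip. The names model_int8x8x2, model_vld2_s8 and model_foo are made up for this sketch and are not part of arm_neon.h; the model only captures the de-interleaving, not register allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain-C stand-in for int8x8x2_t: two 8-lane vectors of int8_t. */
typedef struct {
    int8_t val[2][8];
} model_int8x8x2;

/* Model of vld2.8 {dN, dN+1}, [src]: reads 16 bytes and de-interleaves
   them, even-indexed bytes to val[0], odd-indexed bytes to val[1]. */
static model_int8x8x2 model_vld2_s8(const int8_t *src)
{
    model_int8x8x2 r;
    assert(src != NULL);
    for (int lane = 0; lane < 8; lane++) {
        r.val[0][lane] = src[2 * lane];
        r.val[1][lane] = src[2 * lane + 1];
    }
    return r;
}

/* Same shape as the test function above: lane 1 of the first vector. */
static int model_foo(const int8_t *x)
{
    model_int8x8x2 result = model_vld2_s8(x);
    return result.val[0][1];
}
```

In this model, model_foo (x) is simply x[2], which is all the generated code ideally needs to compute.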
Assuming it's hard for the register allocator to deal with
highly-constrained situations like requiring four consecutive
registers, one (ugly) possibility might be to run a pass before
register allocation, looking for "big" multi-register vectors and
pre-allocating them to hard registers. Even using a fixed allocation of
a single set of registers (e.g. make it so that all multi-reg
loads/stores larger than a Q register must use d0-d7, or whatever)
would probably give better code than what we produce at present, in
most cases.
Now, point 2. To start with, an aside: AIUI, there is currently an
assumption in the vectoriser code that increasing element numbers in
vector registers correspond to increasing addresses when those
registers are loaded from and stored to memory (as if the vector was a
short array, or alternatively as if a union of the vector register and
an array of element-types had the same numberings for lanes and array
indices corresponding to the same elements). Unfortunately that is only
true for NEON in little-endian mode: in big-endian mode, the story is
more complicated, for reasons I will try to explain.
To remain compliant with the soft-float variant of the ARM EABI, we
must pass vector register arguments in ARM registers (or the stack),
not vector registers. This means that we must be very careful with the
ordering of elements for values passed to functions. Consider the
trivial function:
int __attribute__((noinline)) qux (int16x8_t x)
{
  x = vaddq_s16 (x, x);
  return vgetq_lane_s16 (x, 1);
}
This is compiled by GCC to the following (slightly unimpressively):
vmov d18, r1, r0 @ v8hi
vmov d19, r3, r2
vmov d20, r1, r0 @ v8hi
vmov d21, r3, r2
vadd.i16 q8, q9, q10
vmov.s16 r0, d16[1]
bx lr
Which may then be called like, e.g.:
ldmia sp, {r0-r3}
blx qux
So: notice that we're careful that when vector values are transferred
from NEON registers to core registers, the same result will be
transferred to/from memory when we use ldm/stm (core registers) or
vldm/vstm (vector registers) -- i.e. we might use "vldm rX, {d18-d19}",
storing d18 and d19 in consecutive increasing addresses, or "ldmia rX,
{r0-r3}", again with consecutive registers in increasing memory
locations, and we get the same outcome. The fact that we can use the
multiple-register loads/stores is also important for spilling/reloading
between vector and core registers, which inevitably happens
occasionally.
Notice also that when we call the above function like so:
typedef union {
  int16x8_t quadvec;
  int16_t half[8];
} u;

int foo (int8_t *x)
{
  u bar;
  int i;
  for (i = 0; i < 8; i++)
    bar.half[i] = i;
  return qux (bar.quadvec);  /* pass the result of qux back to the caller */
}
The value returned from "qux" is NOT 2 (1+1), as it would be if we were
accessing the value at index 1 in the superimposed array in the union
"u". The vgetq_lane_s16 call still interprets the array as if it had
been loaded in little-endian element order. But we don't get the result
we would have if the vector had been interpreted in purely big-endian
order either (i.e. 12, 6+6)! In fact from the perspective of the
element numbering used by vgetq_lane_s16, the vector elements we see
for each of the (equal) operands of the "vadd" instruction in the qux
function are:
lane number   equiv. core register   value
              (at function entry)
-----------   --------------------   -----
[0]           high part of r1        3
[1]           low part of r1         2
[2]           high part of r0        1
[3]           low part of r0         0
[4]           high part of r3        7
[5]           low part of r3         6
[6]           high part of r2        5
[7]           low part of r2         4
So the value returned will be 2+2, 4.
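The table can be checked with a small host-side model. This is only a sketch of the lane numbering described above, not of the hardware: be_lane_for_index_s16 and model_qux_be are hypothetical names, and the "index XOR 3" rule is just a compact way of writing the reversal within each 64-bit half shown in the table.

```c
#include <assert.h>
#include <stdint.h>

/* Lane, as numbered by vgetq_lane_s16, that holds array element `idx`
   of the union in big-endian mode, per the table above: each 64-bit
   half is reversed, which for four 16-bit lanes is idx ^ 3. */
static unsigned be_lane_for_index_s16(unsigned idx)
{
    assert(idx < 8);
    return idx ^ 3u;
}

/* Model of calling qux on half[i] = values[i] in big-endian mode. */
static int model_qux_be(const int16_t values[8])
{
    int16_t lanes[8];
    for (unsigned i = 0; i < 8; i++)
        lanes[be_lane_for_index_s16(i)] = values[i];
    for (unsigned l = 0; l < 8; l++)
        lanes[l] = (int16_t)(lanes[l] + lanes[l]);  /* vaddq_s16 (x, x) */
    return lanes[1];                                 /* vgetq_lane_s16 (x, 1) */
}
```

With values 0..7 the model returns 4, matching the result above rather than the 2 that array ordering would give.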
Now, coming back to the vectorizer. Current practice means that
increasing element numbers should correspond to increasing memory
locations: i.e., that "array ordering" is in effect, just as in the
call to vgetq_lane_s16 in the above example. This leads to an anomaly:
it means that when the vectorizer asks for a particular element, it
will generally get a different one. Most of the time we get away with
this, since the vectorizer mostly deals with "opaque" vectors which are
operated on element-wise: i.e. we only deal with data at the
granularity of whole vectors, so it doesn't matter which order the
elements are in. The ARM implementations of reduction operations
fortuitously calculate the results across all elements simultaneously,
so when one of those elements is extracted, we still get the right
answer.
One notable exception to this though is the movmisalign<mode> patterns:
these are implemented using the vld1 and vst1 instructions, which load
elements in "array" order (increasing elements from increasing memory
locations), even in big-endian mode. Since vectors loaded using those
instructions are "incompatible" with the above scheme, such misaligned
accesses are simply disabled in big-endian mode.
Of course, generally, sticking with the current non-solution in
big-endian mode is not sustainable (and is probably already broken in
various cases). So it might be worth thinking about whether supporting
big-endian mode properly, as well as handling the more complex load and
store element/structure instructions, can be done using some
generalised solution.
I'm thinking (without having much idea about how feasible such an idea
is) of something along the lines of a function (in the mathematical
sense) attached to each vector value manipulated by the vectorizer, to
map that value's element numberings to and from memory offsets. So then
the quad-word vector of 16-bit elements discussed above would look
like, in big-endian mode:
foo, {6, 4, 2, 0, 14, 12, 10, 8}
Whereas in little-endian mode (or in big-endian mode, for vectors
loaded using vld1), it would look like:
foo, {0, 2, 4, 6, 8, 10, 12, 14}
And then, perhaps more interestingly, a vector loaded using e.g. a
"multiple 3-element structures" load,
vld3.16 {d1, d2, d3}, [rN]
Might look like (in either endianness, assuming we can represent a
vector of such size in our hypothetical scheme):
foo, {0, 6, 12, 18, 2, 8, 14, 20, 4, 10, 16, 22}
Though it's not clear that such a scheme would be powerful enough to
represent the whole range of element/structure loads/stores available
(you'd probably need to be able to specify skipped or don't-care
elements to do that, at least).
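To make the idea concrete, the three maps above can be generated by two small functions. This is a sketch only: q_vector_offsets and vldn_offsets are hypothetical helpers, not anything in GCC, and extending the big-endian rule from 16-bit elements to other element sizes is an assumption on my part.

```c
#include <assert.h>
#include <stddef.h>

/* Fill map[lane] with the byte offset of each lane of one 128-bit
   vector. Little-endian (or vld1) gives array order; big-endian
   reverses the lanes within each 64-bit half. */
static void q_vector_offsets(size_t elem_size, int big_endian, size_t map[])
{
    assert(elem_size == 1 || elem_size == 2 || elem_size == 4 || elem_size == 8);
    size_t per_d = 8 / elem_size;          /* lanes per D register */
    for (size_t lane = 0; lane < 2 * per_d; lane++) {
        size_t d = lane / per_d, in_d = lane % per_d;
        size_t mem_index = big_endian ? d * per_d + (per_d - 1 - in_d) : lane;
        map[lane] = mem_index * elem_size;
    }
}

/* Offsets for a vldN-style structure load into nregs D registers:
   register r takes member r of each consecutive structure. */
static void vldn_offsets(size_t elem_size, size_t nregs, size_t map[])
{
    assert(nregs >= 2 && nregs <= 4);
    assert(elem_size == 1 || elem_size == 2 || elem_size == 4);
    size_t per_d = 8 / elem_size;
    for (size_t r = 0; r < nregs; r++)
        for (size_t lane = 0; lane < per_d; lane++)
            map[r * per_d + lane] = (lane * nregs + r) * elem_size;
}
```

Calling q_vector_offsets (2, 1, map) and vldn_offsets (2, 3, map) reproduces the big-endian and vld3.16 maps listed above. Skipped or don't-care elements are not representable in this form.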
The goal of this work is to investigate speed improvements in Linaro
GCC 4.5. The output will be a list of all possible recommendations and
actions to improve speed on Linaro GCC 4.5. Comments on this plan are
welcome.
So far, we can improve speed in three ways:
1. Backport patches from FSF GCC 4.6. Note that we don't want to
backport the whole of 4.6.
2. Benchmark against FSF GCC 4.5.0 and fix any performance regressions
found in Linaro GCC 4.5. The output is the cause of each regression
and, where possible, recommendations on how to fix it.
3. Study the code generated by other ARM compilers, and give
recommendations on how to improve GCC to do a better job.
I'll describe these three ways in detail in the following sections.
- Backport patches from FSF GCC 4.6
I went through the gcc-patches archive and selected several patches that
should help improve the generated code.
1. ifcvt optimization. Target independent.
http://gcc.gnu.org/ml/gcc-patches/2010-04/msg00832.html
2. Redundant register move for sign extending. Thumb2.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43137
3. PR 45335 Use ldrd and strd to access two consecutive words.
Not yet approved.
http://gcc.gnu.org/ml/gcc-patches/2010-09/msg00059.html
4. Fix an if statement in arm_rtx_costs_1.
http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02096.html
5. Reduce code duplication for Thumb2 move patterns
http://gcc.gnu.org/ml/gcc-patches/2010-07/msg00624.html
6. ARM ldm/stm peepholes
http://gcc.gnu.org/ml/gcc-patches/2010-07/msg00512.html
7. PR44999 Replace "and r0, r0, #255" with uxtb in thumb2
http://gcc.gnu.org/ml/gcc-patches/2010-07/msg01700.html
8. Improve optimization to transform TST into LSLS
http://gcc.gnu.org/ml/gcc-patches/2010-06/msg02518.html
9. Fix bswap patterns for ARM / Thumb and Thumb2.
http://gcc.gnu.org/ml/gcc-patches/2010-01/msg01238.html
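For orientation, here are small C functions of the kind that items 3, 7 and 9 are about. These are illustrative source shapes I wrote for this note, not test cases taken from the patches, and whether a given compiler version actually emits ldrd/strd, uxtb or a byte-reverse instruction for them is not something this sketch verifies.

```c
#include <assert.h>
#include <stdint.h>

/* Item 7: masking to the low byte, the source form of "and r0, r0, #255". */
static uint32_t low_byte(uint32_t x)
{
    return x & 255u;
}

/* Item 9: a 32-bit byte swap written with shifts and masks. */
static uint32_t swap_bytes(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Item 3: two adjacent words read and written together, the kind of
   access a ldrd/strd pair could cover. */
struct pair { uint32_t lo, hi; };

static void copy_pair(struct pair *dst, const struct pair *src)
{
    assert(dst && src);
    dst->lo = src->lo;
    dst->hi = src->hi;
}
```

Compiling such functions with -S before and after a backport is one simple way to see whether the patch changes the generated code.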
- Fix speed regression
I found speed regressions on EEMBC with Linaro GCC 4.5 compared with FSF GCC
4.5.0, and I'll investigate why they happen in these cases.
The table below shows the speed regressions of Linaro GCC 4.5
(revno:99398) relative to FSF GCC 4.5.0:
benchmark         O2     O3
puwmod01         -5.5   -3.5
bitmnp01         -7.9   -0.7
routelookup      -6.4   -8.2
conven00data_1   -7.2   -5.8
conven00data_2   -8.1   -7.3
conven00data_3   -6.6   -5.5
viterb00data_1   -1.7   +5.9
viterb00data_2   -4.3   +2.6
viterb00data_3   -2.3   +1.8
viterb00data_4   -5.3   -0.3
- Study the code generated by other ARM compilers.
In this part, I'll study the binaries generated by other ARM compilers
and try to teach GCC to do the same things. This piece of work is quite
open-ended, and it is hard to estimate how much we could get out of it.
--
Yao Qi
CodeSourcery
yao(a)codesourcery.com
(650) 331-3385 x739
People here might want to have a look at this bug:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45979
Note that the heap randomization feature added to the kernel was part of
a Linaro security blueprint.
Nicolas
The Linaro Toolchain Working Group is pleased to announce the 2010.10
consolidation release including Linaro GCC 4.4, Linaro GCC 4.5, and
the first version of Linaro GDB 7.2.
Linaro GDB 7.2 2010.10-0 is the first release in the 7.2 series. Based
off the latest GDB 7.2, it includes a number of ARM-focused bug fixes
and enhancements.
Interesting changes include:
* Backtraces in Thumb-2 code are significantly improved
* Much better prologue and epilogue parsing
* Improved software watchpoint support
* Many test suite tidy-ups
Linaro GCC 4.5 is the third release in the 4.5 series. Based off the
latest GCC 4.5.1+svn, it includes many ARM-focused performance
improvements and bug fixes.
Linaro GCC 4.4 is the fourth release in the 4.4 series. Based off the
latest GCC 4.4.5, it fixes many of the issues found during building
Ubuntu over the last few months.
Interesting changes include:
* Linaro GCC 4.4 is now based off FSF GCC 4.4.5
* Cortex A8 and Cortex A9 scheduler NEON improvements
* Better code generation for constant addresses with inline assembly
* Better code for copying small constant strings
* Various correctness improvements
Downloads are available from the Linaro GCC and GDB pages on Launchpad:
https://launchpad.net/gcc-linaro
https://launchpad.net/gdb-linaro
-- Michael
Hi all,
I was wondering whether anyone knows of an ARM DCC (debug
communications channel) device driver.
The idea is to run gdbserver on /dev/dcc such that application
debugging does not hog a serial/ethernet port.
I'd modify OpenOCD to forward the DCC onto a TCP/IP port
to connect GDB to the gdbserver.
--
Øyvind Harboe
US toll free 1-866-980-3434 / International +47 51 63 25 00
http://www.zylin.com/zy1000.html
ARM7 ARM9 ARM11 XScale Cortex
JTAG debugger and flash programmer
(cc'ed to linaro-toolchain, bcc'ed to others who may be interested)
I'm considering adding a new Linaro Toolchain meeting to cover people
in the North/South American timezones. We've got quite a few people
in that area who are interested in the toolchain but can't make the
current 0900 UTC calls.
How about a weekly half-hour call on Wednesdays at 1800 UTC? Once
daylight savings drops out on the 7th of November, this would be 1000
Sacramento/PST, 1200 Houston/CST, and a reasonable evening time for
those in Europe who wish to join.
This will be a technical call and can cover topics such as status
updates, release plans, reported problems, and any input from
toolchain users.
Please send me an email if you are interested,
-- Michael
Hi Marcin. Would you consider passing
--enable-poison-system-directories to the cross compiler configure?
This makes the '-Wpoison-system-directories' option available which
warns you if the cross compiler picks up a library or header file from
/usr instead of the cross-build environment.
I'm talking with someone who's looking at using the Linaro compiler
and had a strange error due to picking up the host crtn.o. Having
this warning would have tracked down the problem faster.
-- Michael
I believe that the libgcc.a in our toolchain contains Thumb-2 code. I
verified this by doing objdump on libgcc.a and I see combinations of
16 and 32 bit instructions. So does that mean that the toolchain is
only usable for ARM versions that support Thumb-2?
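For anyone repeating this check, here is a rough sketch of how 16-bit and 32-bit Thumb encodings can be told apart: a halfword whose top five bits are 0b11101, 0b11110 or 0b11111 starts a 32-bit instruction. The function names are made up for this sketch. Note that BL/BLX were already encoded as a halfword pair with these prefixes before Thumb-2, so a count of wide instructions alone does not prove that Thumb-2 code is present; the wide instructions would need to be something other than BL/BLX.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if `hw` is the first halfword of a 32-bit Thumb encoding:
   bits [15:11] are 0b11101, 0b11110 or 0b11111. */
static int thumb_is_32bit_prefix(uint16_t hw)
{
    return (hw >> 11) >= 0x1d;
}

/* Count 16-bit and 32-bit instructions in a run of Thumb halfwords.
   A 32-bit prefix with no following halfword (truncated input) is
   counted as narrow. */
static void thumb_count_widths(const uint16_t *code, unsigned n,
                               unsigned *narrow, unsigned *wide)
{
    assert(code && narrow && wide);
    *narrow = *wide = 0;
    for (unsigned i = 0; i < n; ) {
        if (thumb_is_32bit_prefix(code[i]) && i + 1 < n) {
            (*wide)++;
            i += 2;
        } else {
            (*narrow)++;
            i += 1;
        }
    }
}
```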
Thanks,
John
As discussed in the meeting yesterday, CodeSourcery has a few MinGW
patches that I had not merged into Linaro GCC.
I have now investigated these patches, and I'm fairly happy that most
are not necessary for Linaro. They're mainly about interworking with Cygwin.
The one exception is this one:
http://gcc.gnu.org/ml/gcc-patches/2010-04/msg01214.html
(and even that is primarily a GDB issue).
Andrew
I made a patch for ltrace that adds support for Thumb-2. There's not
much to it, but it allows me to trace applications built for Cortex-A8.
Without it, users will experience this bug:
https://bugs.launchpad.net/ubuntu/+source/ltrace/+bug/639796
Unfortunately, it appears that the upstream tree is not well-maintained.
I posted the patch to the project's mailing list, but others' patches
have been ignored for many months. However, my post prompted another
contributor to offer to maintain the package.
I also posted this patch as the proposed solution for the above LP bug,
which should allow Linaro to benefit from the work without worrying
about upstream. In fact, a new version of the package appears to have
been released that includes my patch (0.5.3-2ubuntu6). Please give this
updated package a whirl and let me know if there is more work to be done.
Thoughts? Unless I hear feedback from others, I will assume that this
tool now works for Cortex-A[89] and move on to other tasks.
--
Zach Welch
CodeSourcery
zwelch(a)codesourcery.com
(650) 331-3385 x743
(this is for current Toolchain WG members. Sorry if I got anyone
else's hopes up)
We'll soon be coming into some decent dual-core Cortex-A9 boards that
have 1 GB of RAM and a good set of USB ports. I've asked for four of
them with hard drives to go into the data centre for general use.
Would anyone also like one for their desk? Note that you're generally
better off using a data centre board as it's one less thing to
maintain.
-- Michael
Hi
I finally built armel cross compiler packages for Ubuntu 10.04 'Lucid' LTS.
They are available in an unsigned APT repository:
deb http://people.canonical.com/~hrw/ubuntu-lucid-armel-cross-compilers/ ./
They are built from Maverick packages:
- binutils-source
- eglibc-source
- gcc-4.4-source
- gcc-4.5-source
- linux-source-2.6.35
- armel-cross-toolchain-base
- gcc-4.4-armel-cross
- gcc-4.5-armel-cross
So they do not give exactly the same versions as the compilers used in 10.04;
please keep this in mind when doing cross builds.
Regards,
--
JID: hrw(a)jabber.org
Website: http://marcin.juszkiewicz.com.pl/
LinkedIn: http://www.linkedin.com/in/marcinjuszkiewicz
Hi folks
apparently some tool calls "strip" instead of "$triplet-strip" when
cross-building; this is something we shall fix, but it is apparently
corrupting the binaries in some cases:
https://bugs.launchpad.net/ubuntu/+source/binutils/+bug/615765
It seems the ELF architecture isn't set properly, or so I'm told.
Which component is to blame here? Are we looking at a binutils or a
gcc bug for not being able to set or read enough data that the
architecture mismatch isn't detected? What could we do about it?
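As background for the question, this is a sketch of how a tool could read the machine field of an ELF header before touching a file. It is not a description of what strip actually does, and the bug may well lie elsewhere; elf_machine and the ELF_EM_* names are made up here, while the field offset and the machine numbers come from the ELF specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { ELF_EM_386 = 3, ELF_EM_ARM = 40, ELF_EM_X86_64 = 62 };

/* Return e_machine from the first bytes of an ELF file, or -1 if the
   buffer is too short or does not start with the ELF magic. */
static int elf_machine(const unsigned char *hdr, size_t len)
{
    assert(hdr != NULL);
    if (len < 20 || hdr[0] != 0x7f || hdr[1] != 'E' ||
        hdr[2] != 'L' || hdr[3] != 'F')
        return -1;
    /* e_ident[EI_DATA] (byte 5): 1 = little-endian, 2 = big-endian. */
    if (hdr[5] == 1)
        return hdr[18] | (hdr[19] << 8);
    if (hdr[5] == 2)
        return (hdr[18] << 8) | hdr[19];
    return -1;
}
```

A host tool that sees ELF_EM_ARM where it expects the host machine could refuse to modify the file instead of rewriting it.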
Thanks!
--
Loïc Minier
The Linaro Toolchain Working Group is pleased to announce
the availability of a "developer preview" of Valgrind
which includes the support for ARM and Thumb which has
recently been added by the Valgrind developers.
Our aim with this preview release is to advertise
Valgrind's improved ARM support and encourage people
to try it out and find bugs before the official 3.6.0
release. Please report bugs via upstream's BTS:
http://valgrind.org/support/bug_reports.html
or you can ask on linaro-toolchain(a)lists.linaro.org
if you have any problems.
This release is a snapshot of upstream subversion; it
should generally work but you may encounter bugs, especially
if you run it on hand-optimised assembly that uses obscure
instructions.
New (upstream) features in this snapshot include:
* Greatly improved support for ARM
* Support for the Thumb instruction set
* Support for NEON and VFPv3 instructions
Known issues:
* callgrind has difficulty identifying ARM function
call and return so may not produce useful results
Downloads are available from the Linaro Overlay PPA:
https://launchpad.net/~linaro-maintainers/+archive/overlay
...so if you're running Linaro on an ARM system you
should be able to just install it with
'apt-get install valgrind'.
-- Peter Maydell
To All Ye Linaro Toolchain Folk, (and OpenOCD developers too)
After a week of reading specifications and code, I am ready to start
doing some serious hacking on OpenOCD. The following outlines my present
plans and expectations, with the caveat that time can change everything.
Last week, I started testing my BeagleBoard with OpenOCD, so I have
begun trying to validate and improve the Cortex-A8 support. Indeed, I
have already committed a minor patch that fixed a bug in the trunk
caused by new command syntax required to distinguish physical memory
addresses from virtual ones. That bug had been preventing the BeagleBoard
support from working for several months, so this seems to show that
nobody has been using (or even testing) the latest code with that board.
It seems that much of the debug architecture can be shared between the
Cortex-A8 and Cortex-A9 cores, so features added and bugs fixed for A8 should help me
implement A9 faster. Indeed, A9 support may be more a matter of
refactoring the existing code than developing new code. In this respect,
the lists of tasks for A8 and A9 may end up proceeding in parallel.
Cortex-A8:
1) Add missing topology detection for determining location of AHB-AP
(for system memory access), APB-AP (for DAP and other CoreSight
components), and register address range for accessing the DAP.
2) Fix Halt After Reset functionality (using vector catch magic).
3) Expose missing VFP3/NEON registers (only when present).
4) Fix various memory and resource leaks.
Cortex-A9:
1) Basic bring-up to successful attachment with debugger.
2) Develop board scripts for common evaluation boards.
3) Work on advanced features:
- download and run algorithms out of memory,
- breakpoints/watchpoints,
- tracing and performance monitoring,
4) Ensure SMP support works out-of-the-box.
Finally, it would be good to produce a new release when all of these
changes have made it into the tree. Due to various factors, the project
has not achieved a regular release schedule, but these features would
help to justify the effort from the community.
P.S. I have cc'd the openocd-development list in the hope of generating
useful feedback, but it requires subscribing to post (last I checked).
Sorry for the bad netiquette.
--
Zach Welch
CodeSourcery
zwelch(a)codesourcery.com
(650) 331-3385 x743
Hello,
I've now checked the Linaro branding changes in to the gdb-linaro Bazaar
repository.
I've created a Wiki page describing the Linaro GDB release process based on
that repository:
http://wiki.linaro.org/WorkingGroups/ToolChain/GDBReleaseProcess
(modeled after Andrew's GCCReleaseProcess page)
Review and comments are welcome!
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
--
Dr. Ulrich Weigand | Phone: +49-7031/16-3727
STSM, GNU compiler and toolchain for Linux on System z and Cell/B.E.
IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Martin Jetter | Geschäftsführung: Dirk
Wittkopp
Sitz der Gesellschaft: Böblingen | Registergericht: Amtsgericht
Stuttgart, HRB 243294
Hi,
In case this is useful in its current (unfinished!) form: here are some
notes I made whilst looking at a couple of the items listed for CS308
here:
https://wiki.linaro.org/Internal/Contractors/CodeSourcery
Namely:
* automatic vector size selection (it's currently selected by command
line switch)
* also consider ARMv6 SIMD vectors (see CS309)
* mixed size vectors (using the most appropriate size in each case)
* ensure that all gcc vectorizer pattern names are implemented in the
machine description (those that can be).
I've not even started on looking at:
* loops with more than two basic blocks (caused by if statements
(anything else?))
* use of specialized load instructions
* Conversely, perhaps identify NEON capabilities not covered by GCC
patterns, and add them to gcc (e.g. vld2/vld3/vld4 insns)
* any other missed opportunities (identify common idioms and teach the
compiler to deal with them)
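As an illustration of the "more than two basic blocks" item, here is the kind of loop in question, next to the same computation written as a select. Both functions are examples made up for this note rather than cases from the study, and whether the vectorizer handles either form is exactly what remains to be investigated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Loop body contains a branch, so it spans several basic blocks. */
static void clamp_branchy(int16_t *a, size_t n, int16_t limit)
{
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (a[i] > limit)
            a[i] = limit;
    }
}

/* The same computation written as a select, a form that can be
   lowered without a branch (e.g. as an element-wise minimum). */
static void clamp_select(int16_t *a, size_t n, int16_t limit)
{
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        a[i] = (a[i] > limit) ? limit : a[i];
}
```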
I'm not likely to have time to restart work on the vectorization study
for at least a couple of days, because of other CodeSourcery work. But
perhaps the attached will still be useful in the meantime.
Do you (Ira) have access to the ARM ISA docs detailing the NEON
instructions?
Cheers,
Julian
While trying out the u-boot-next branch I found a problem. First some
explanation. On most platforms, U-Boot is linked at the address where it
will first start running. For example, when using NOR flash, U-Boot
will be linked to an address in flash. Very early in the boot
process, U-Boot copies itself to the top of RAM and jumps there.
This relocation has worked for years on powerpc and other arches. The
-next tree adds this for arm and it almost works.
The part that does not work is that some veneer routines do not get fixed up.
Here is an example. A routine called i2c_init calls __aeabi_idiv.
Here is the disassembly:
...
288: e59f0148 ldr r0, [pc, #328] ; 3d8 <i2c_init+0x1a4>
28c: e1a01083 lsl r1, r3, #1
290: ebfffffe bl 0 <__aeabi_idiv>
294: e2507006 subs r7, r0, #6
298: 4a000001 bmi 2a4 <i2c_init+0x70>
Later after this .o is linked with everything else and libgcc that morphs to:
8000b384: e59f0148 ldr r0, [pc, #328] ; 8000b4d4 <_end+0xfff97c98>
8000b388: e1a01083 lsl r1, r3, #1
8000b38c: eb00aa43 bl 80035ca0 <____aeabi_idiv_veneer>
8000b390: e2507006 subs r7, r0, #6
8000b394: 4a000001 bmi 8000b3a0 <i2c_init+0x70>
and the veneer version is at the end of text with other veneers:
80035ca0 <____aeabi_idiv_veneer>:
80035ca0: e51ff004 ldr pc, [pc, #-4] ; 80035ca4 <_end+0xfffc2468>
80035ca4: 80035999 .word 0x80035999
80035ca8 <____aeabi_llsl_veneer>:
80035ca8: e51ff004 ldr pc, [pc, #-4] ; 80035cac <_end+0xfffc2470>
80035cac: 80035c7d .word 0x80035c7d
80035cb0 <____aeabi_lasr_veneer>:
80035cb0: e51ff004 ldr pc, [pc, #-4] ; 80035cb4 <_end+0xfffc2478>
80035cb4: 80035c61 .word 0x80035c61
80035cb8 <____aeabi_llsr_veneer>:
80035cb8: e51ff004 ldr pc, [pc, #-4] ; 80035cbc <_end+0xfffc2480>
80035cbc: 80035c49 .word 0x80035c49
80035cc0 <____aeabi_uidivmod_veneer>:
80035cc0: e51ff004 ldr pc, [pc, #-4] ; 80035cc4 <_end+0xfffc2488>
80035cc4: 8003597d .word 0x8003597d
80035cc8 <____aeabi_uidiv_veneer>:
80035cc8: e51ff004 ldr pc, [pc, #-4] ; 80035ccc <_end+0xfffc2490>
80035ccc: 80035721 .word 0x80035721
80035cd0 <____aeabi_idivmod_veneer>:
80035cd0: e51ff004 ldr pc, [pc, #-4] ; 80035cd4 <_end+0xfffc2498>
80035cd4: 80035c2d .word 0x80035c2d
then if we look at 80035998 we see some thumb code.
80035998 <__aeabi_idiv>:
80035998: 2900 cmp r1, #0
8003599a: f000 813e beq.w 80035c1a <.divsi3_nodiv0+0x27c>
When u-boot copies itself to ram it relocates the jump tables it knows
about and could relocate the addresses in the veneer routines if it
knew about them.
There are at least three possible ways to fix these:
1) u-boot has its own private libgcc and if I use it the problem goes away.
2) is there an option for the toolchain to use an arm libgcc instead of thumb?
3) is there a way to find the veneers at runtime and fix them up?
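To make option 3 concrete, here is a rough sketch of a runtime fix-up that scans a region for the veneer pattern shown above ("ldr pc, [pc, #-4]", encoded as 0xe51ff004, followed by a literal word) and adds the relocation offset to the literal. fixup_veneers is a hypothetical helper, not existing U-Boot code. It assumes the veneers sit in a known word-aligned region (for example the end of the text section) and that no ordinary data in that region happens to match the pattern, which would need checking on a real image.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VENEER_LDR_PC 0xe51ff004u   /* ldr pc, [pc, #-4] */

/* Scan [words, words + count) for veneers and add `offset` to each
   literal target. Returns the number of veneers patched. */
static size_t fixup_veneers(uint32_t *words, size_t count, uint32_t offset)
{
    size_t patched = 0;
    assert(words != NULL || count == 0);
    for (size_t i = 0; i + 1 < count; i++) {
        if (words[i] == VENEER_LDR_PC) {
            words[i + 1] += offset;  /* an even offset keeps the Thumb bit (bit 0) */
            patched++;
            i++;                     /* skip the literal just patched */
        }
    }
    return patched;
}
```

The relocation code would call this once on the copied image, passing the difference between the RAM address and the link address as the offset.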
All input welcome.
Thanks,
John
Hello Michael,
I'm looking into "branding" changes needed for a Linaro GDB release. So
far I've made the following changes:
- Set default PKGVERSION to "Linaro GDB" instead of "GDB"
- Set default BUGURL to "http://bugs.launchpad.net/gdb-linaro/" instead of
"http://www.gnu.org/software/gdb/bugs/"
- Set version number according to Linaro version scheme
- Update release script to generate tarballs/directories named
"gdb-linaro-$VERSION" instead of "gdb-$VERSION".
As a result, the default GDB startup output now reads:
GNU gdb (Linaro GDB) 7.2-2010.10-0
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>.
Do you agree that this is the way we should go? Have I overlooked
anything?
Unless there are objections, I'm planning to check these changes in later
this week.
As a related question, the generated files in a standard GDB 7.2 release
seem to have been built on a relatively old system (RHEL 4 ?), which is
visible through the versions of tools like bison, flex, texinfo, and
gettext used to build those files. When building our Linaro GDB release
tarballs, should we:
- just use the tools as installed on a recent build system (say, Ubuntu
Lucid), or
- attempt to rebuild the release with the exact same set of tools used for
the GDB 7.2 release?
The second option has the advantage of reducing the amount of changes, e.g.
those visible in a full diff of the release tarballs. However, it has the
disadvantage that reconstructing that exact set of tools (including Red
Hat patches, it seems) is somewhat difficult, and can in addition lead to
somewhat outdated results ...
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand