Hi there. I've looked further into the intermittent
gcc/testsuite/g++.dg/cdce3.C test failures. Taking Ira's
vectoriser-only fix-pr51301-4.6 branch and comparing it with its
predecessor r106845:
* cdce3.o itself is identical across compilers
* Fault occurs in a parallel test run as part of the normal auto build
* Fault occurs every time
* Fault occurs with a manual 'make check-gcc RUNTESTFLAGS="dg.exp=cdce*"'
* Fault doesn't occur when building from the command line
* Fault doesn't occur after updating binutils
I'm suspicious of the linker. The auto builders are Natty based and
come with ld 2.21.0.20110327. Updating them to Oneiric's
2.21.53.20110810 clears the problem.
I've saved the build trees. I see no reason not to commit
~ramana/gcc-linaro/fix-lp-900426 and ~irar/gcc-linaro/fix-pr51301-4.6.
-- Michael
== GDB ==
* Ongoing work on remote support for "info proc" and core file
generation. Implemented an initial version of the latest solution,
which accesses arbitrary files on the remote side.
== GCC ==
* Started familiarizing myself with the current status of the various
performance patches in progress, in preparation for taking
on GCC performance work next year.
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
--
Dr. Ulrich Weigand | Phone: +49-7031/16-3727
STSM, GNU compiler and toolchain for Linux on System z and Cell/B.E.
IBM Deutschland Research & Development GmbH
Chairman of the Supervisory Board: Martin Jetter | Management: Dirk
Wittkopp
Registered office: Böblingen | Court of registration: Amtsgericht
Stuttgart, HRB 243294
Summary
* make check-gcc on windows.
* crosstool-ng patches
Details:
1. Two patches for crosstool-ng:
* Fix the compile error when CT_USE_SYSROOT is not "y". With this
fix, we can configure crosstool-ng to remove the symbolic link for the
Windows build.
* Add scripts to build manual for newlib.
2. make check-gcc on Windows:
* Wrap gcc/g++ for windows test. testglue.c should be compiled with
gcc not g++.
* Enhance scripts to convert path using "cygpath -w"
3. Analyze and root cause the pseudo new failed cases on windows.
* gcc fail cases (gcc.dg/cpp/assert3.c, gcc.dg/cpp/include7.c and
gcc.dg/cpp/trad/assert3.c)
Root cause: the double quotes around options are removed by the test
scripts. For example, gcc.log shows "-Aabc = jkl" quoted in the
"Executing on host" line:
Executing on host: …/cpp/assert3.c -A abc=def -A abc(ghi)
"-Aabc = jkl" …
But in the spawn line the quotes are gone:
spawn …/cpp/assert3.c -A abc=def -A abc(ghi) -Aabc = jkl
* g++ fail cases (dwarf2/lineno-simple1.C, dwarf2/pr44641.C and
dwarf2/pr46527.C)
The assembly generated by the Windows g++ and the Linux g++ is the
same except for the path strings, and all of these tests PASS on Linux.
It seems the scripts cannot match the expected string on Windows.
* Tests on Windows are not stable. Each run has random failures
(which pass when retested separately).
Plans:
* Create Makefile for embedded toolchain in linaro crosstool-ng.
Best regards!
-Zhenqiang
Hi,
I was just wondering if anyone knows of any current or future dependencies with the Linaro toolchain (2011.11) and the Linaro release of GDB and the Linux Kernel?
Is it considered safe to use the toolchain with the upstream releases of GDB and the Kernel, assuming that the versions of each are suitably compatible?
Or are there potential dependencies on work that has been done in the toolchain? For example, new instructions supported in the compiler/assembler, or enhancements to the kernel/runtime library on the Linaro branch that would depend on them being in sync.
Thanks,
Dave.
Hi,
OpenEmbedded-Core:
* the CSL 2011.03 recipe works with localization support disabled
* got the OE-Core sato image to build (~250 source packages)
* also built the Qt4 demo image (~100 source packages) to stress the
C++ part of the toolchain
* both are booting using qemu and seem to work just fine.
* all of the Linaro members approved the request to contribute to
OpenEmbedded
* started to post patches onto the mailing list
* briefly looked at the Linaro binary toolchain
* the most recent one is dynamically linked
* while the old one has old binutils (.21) that cause issues with
--gc-sections
* Currently the build is using the OE qemuarm machine configuration that
uses a Yocto kernel and targets armv5te. This is something I'd like to
look at too.
Regards
Ken
Continued work on 64-bit shifts.
- Improved 64-bit shifts without NEON (should benefit all cases).
- Fixed bugs in constant shift code.
- Rewrote 64-bit neon patch to take advantage of the new non-neon
code, in the fall-back case.
- Tidied up the code, in general.
- Rewrote SImode shift amount patch for new neon patches.
The code produced now seems pretty good, but it still seems to choose
which mode to use slightly haphazardly. The next step is to figure that
out and benchmark the results.
Had a few more attempts at getting the LAVA system to do something
useful for me. I'm getting closer, but keep hitting new problems. Some
of them my fault, and some are bugs in the system. Paul Larson has been
very kindly helping me out and swatting the bugs.
Didn't get much further with benchmarking for the generic tuning. This
has been put on the back-burner while I work on the Neon shifts, and my
test runs on my IGEPv2 A8 board have all been interrupted by power cuts
or rendered useless by my forgetting to kill background tasks (such as
Xorg).
---- Next weeks
On vacation December 19th - January 3rd (returning January 4th).
== General ==
* Tidying things up and updating my list of statuses
== String routines ==
* Adding strchr and strlen to eglibc; tests running at the moment.
Dave
[short week, three days]
RAG:
Red:
Amber:
Green:
Current Milestones:
|| || Planned || Estimate || Actual ||
||upstream-omap3-cleanup || 2011-11-10 || 2011-12-15 || 2011-12-12 ||
||cp15-rework || 2012-01-06 || 2012-01-17 || ||
||initial-a15-system-model || 2012-01-27 || 2012-01-27 || ||
||qemu-kvm-getting-started || 2012-03-04?|| 2012-03-04?|| ||
(for blueprint definitions: https://wiki.linaro.org/PeterMaydell/QemuKVM)
Historical Milestones:
||add-omap3-networking || 2011-10-13 || 2011-10-13 || 2011-10-13 ||
||a15-systemmode-planning || 2011-10-13 || 2011-10-13 || 2011-09-22 ||
||a15-usermode-support || 2011-11-10 || 2011-11-10 || 2011-10-27 ||
== cp15-rework ==
* estimate pushed back a bit because I've ended up doing this in
parallel with the other blueprints. Also exynos4210 review has
taken some time.
== upstream-omap3-cleanup ==
* split up the last handful of patches in the stack which were doing
several things at once
* this blueprint is now complete, meaning that the next stage of omap3
upstreaming will be cleaning up individual subsets of functionality
to send upstream. This is all backburner level priority, though.
== other ==
* reviewed most of Samsung's exynos4210 model
* completed a conflict-heavy rebase of qemu-linaro (the result of
MemoryRegion conversions for omap devices landing upstream)
* LP:903239 : added linux-user support for some missing xattr syscalls
that were causing build problems for apparmor
-- PMM
Hi,
After learning how to control MEM_ALIGN and, therefore, alignment
hints from the vectorizer, I was able to generate 64-bit hints (with
the help of Ramana's patches). I saw a 16% improvement on a benchmark
with stack variables, for which we now force alignment to 64 bits and
create alignment hints, instead of using peeling.
Ira
Hi,
* Finished an across-compilers report for benchmarks over the latest in FSF
and Linaro series. Will start storing results in the
linaro-toolchain-benchmarks bzr repository.
* Looking closer at eembc results, especially regressions between gcc-4.4
and gcc-4.6. Did runs with gcc-linaro-4.4 with -fno-unroll-loops. Will
continue analyzing and try to present the results clearly.
* Reviewed Michael's geomean implementation.
* I will be on Christmas holiday w52 and w01, will be back 9/1.
/Regards
Åsa
Hi there. Could the toolchain team please have a look through the
current GCC blueprints and update them? You can see a list and states
at:
http://apus.seabright.co.nz/helpers/backlog
and for gcc-linaro only at:
http://apus.seabright.co.nz/helpers/backlog/project/gcc-linaro
Please check for any that:
* are on your short-term todo list but aren't against your name
* have been started but are stuck in the backlog or todo
* are finished but not marked as such
* are blocked
* are duplicates or too undefined
* or are obsolete
I'm especially interested in:
* "slp-supported-ops"
* "sms-register-scheduling"
* "better-block-operations"
* "libraries-for-backlog"
* "backport-conditional-execution"
* "improve-peeling"
* "64-bit-sync-primitives"
* "neon-strided-load-extract"
If you've finished a significant amount of work on one blueprint then
let me know. We can split that work out and push the rest back into
the backlog.
Also, let me know if you're blocked on final benchmarking. We can now
easily benchmark a merge request and see the difference.
-- Michael
Continued work on 64-bit neon operations. The negdi2 seems to be more
difficult than previously thought - vneg won't do it, and there's no way
to encode either "0-reg" or "not(reg)+1", so I'm shelving that idea for
the moment, and moving on to one_compldi2_neon, which ought to be
straightforward.
Did the entire Linaro GCC release process, in the absence of Michael
Hope, from source to announcement. The process didn't go as smoothly as
I'd have liked, but I got through it, mostly. Hopefully Michael won't be
travelling next time ...
Tried to figure out how to do 64-bit shifts using a QImode shift amount.
This promised to eliminate the unnecessary zero-extends, but it doesn't
work because neither iwmmxt nor neon registers are permitted to hold
QImode values (presumably changing this would have consequences
elsewhere?). Annoyingly, it's also not possible to put SImode values in
(most) neon registers, so I'm not sure quite how to optimize the values.
More investigation required.
Hi!
This week was spent doing internal ST-E work, but related to the Linaro
tcwg so I will give a short summary anyway.
I have taken the Linaro toolchain (prebuilt by the Android working group)
and used it in our internal Android build.
There were several build errors, as expected when going from gcc-4.4.3,
which is the default compiler in Android (Gingerbread) to gcc-4.6.2. Many
errors were solved with patches from the Linaro Android distribution.
Did some benchmarking related to web browsing:
ARMBBench (load and rendering of web pages) - gave me 4-6% improvement with
the Linaro toolchain.
SunSpider and BrowserMark (JavaScript) - gave me ~6% overall regression with
the Linaro toolchain. However, when zooming in to individual test cases -
SunSpider consists of ~25 tests in 9 categories - the results are really
scattered. A few tests account for most of the regression. I am trying to
narrow things down to understand which parts of the code in v8 (the
JavaScript engine) cause the slowdown.
Best regards
Åsa
Continue working on the patch to estimate register pressure on SMS:
Addressing the comments received from Richard and Ayal.
Testing the patch on libav micro benchmarks.
Summary
* "make check-gcc" for linux gcc, cygwin gcc and native windows gcc.
Details:
1. "make check-gcc" on linux.
* One more failed case (gcc.dg/visibility-d) for the toolchain
generated from crosstool-ng based on embedded toolchain code base. But
logs show the .s files are the same.
2. "make check-gcc" on windows.
* Dir format issue:
Native Windows programs require drive-letter paths such as c:, d:,
etc., but Cygwin presents them as /cygdrive/c, /cygdrive/d. A
wrapper is needed to convert them.
* qemu output in cygwin (Qemu-0.15.1-windows-Medium.zip from
http://lassauge.free.fr/qemu/)
qemu cannot print the result (e.g. "*** EXIT 0") to the screen. A
wrapper is needed to handle it.
* "make check-gcc" for cygwin toolchain (build from scratch in cygwin).
You can run make check just as on Linux.
* "make check-gcc" for pre-installed binary toolchain (installed as
native windows programs)
a. configure gcc from the source package. (Only the config* files and
Makefile are needed to make sure "make check" works)
b. reset the TEST_GCC_EXEC_PREFIX (site.exp) to the correct dir
(INSTALL DIR) with the right format.
c. wrap gcc/xgcc to use the pre-installed gcc and change the dir format.
d. handle /usr/share/dejagnu/testglue.c (cp it to current test dir
or convert it to windows path)
Plan:
* Handle g++ test on windows.
* Work out a formal document or wiki page on how to "make check-gcc" on windows.
* Test and analyze the failed cases.
Best regards!
-Zhenqiang
PS:
1) qemu-system-arm.exe sample
#!/bin/sh
dir=`dirname $0`
run ()
{
    # Change /cygdrive/e to e:
    para=`echo $* | sed -e 's/\/cygdrive\/e/e\:/'`
    # arm.exe is the real qemu-system-arm.exe
    # output to stdout.txt or stderror.txt.
    $dir/arm.exe $para | tee
    # output to screen
    cat $dir/stdout.txt
}
run $*
2) xgcc.exe sample
#!/bin/sh
run ()
{
    # Change /cygdrive/e to e:
    para=`echo $* | sed -e 's/\/cygdrive\/e/e\:/'`
    # Use a local copy of testglue.c rather than /usr/share/dejagnu/testglue.c
    para=`echo $para | sed -e 's/\/usr\/share\/dejagnu\/testglue.c/testglue.c/g'`
    # run the test with preinstalled binary toolchain
    #TBD: handle g++
    arm-none-eabi-gcc.exe $para
}
run $*
3) TEST_GCC_EXEC_PREFIX in site.exp sample
# Toolchain is installed at e:/Dec/RC3.
set TEST_GCC_EXEC_PREFIX "e:/Dec/RC3/lib/gcc/"
Hi,
* I've been debugging various errors and warnings that I encountered
with the binary CSL 2011.03 toolchain
* Fleshed out my recipe for the external toolchain; now get a working
core-image-minimal that boots fine within qemu
* Debugged why cmake based recipes (like libproxy) are having trouble
when compiling with an external toolchain
* Currently the libc is provided by the sysroot of the external
toolchain. This might not be ideal and as time permits I'd like to find
a way to get eglibc built instead.
Regards
Ken
Task                                Planned      Estimated    Actual
Historical
~~~~~~~~~~
Connect 2011.q4 preparation         28/10/2011   28/10/2011   28/10/2011
Linaro Tasks
~~~~~~~~~~~~
Fully investigate the O3
performance regressions             31/01/2012
Neon backend experiments with
alignment hints and addressing
mode work                           09/12/2011   14/12/2011
Investigate partial-partial PRE
and regression with bitmnp01        18/12/2011
Writeup on the optimizations
enabled with PGO                    31/12/2011
RAG :
RED : None
AMBER:
=== Progress ===
* The Android guys found a bug with the vcvt.f64.s32 instruction
coming out after my patch and I found a few assembler issues as well
during this process which are now fixed upstream.
* Backported the A15 patches into Linaro 4.6
* Assisted as needed with the release, which really wasn't too much
work for me other than the revert.
* Backported one part of the partial-partial PRE patch. Still looking into it.
* Did some analysis of the di-layout.c test failure;
RichardS has now fixed it in the middle-end.
* Wrote a patch to replace all vector mode aligned vldm / vstm with
equivalent vld1.64 and vst1.64 to allow more alignment hints to come
out of the compiler. Still not fully happy with it but it's looking
much better than the original hack.
=== Plans ===
* Continue looking at partial-partial PRE and try and understand it further.
* Flush out these neon patches that I'm accruing with the addressing
modes and see where we get to with alignment hints and vld1.64's.
* Look at movw's / movt's vs constant pools.
* Submit my PGO patch.
Absences.
* Dec 19 - 31st Dec - Tentatively booked
* Feb 6-10 : Linaro Connect Q1.12.
* Feb 11- 15 : Holiday.
== QEMU ==
* Wrote a fix for bug 883133 (code buffer/libc conflict); spent some
time testing it because I wasn't sure whether the crash I was seeing
after that was my fix not being complete or actually bug 893208.
* Got it to boot with -cpu 486; without that it's triple faulting in
a divide just after a load of time stamp reads, which makes me
suspicious that 893208 is a timer problem.
* (It also fails when used with vnc graphics, but works in SDL and
curses, but I'll leave that bug for another time).
== String routines ==
* With one more tweak to my memchr, it finally made it into eglibc.
Dave
Hi,
I received this question from an ARM FAE:
Does the 4.5.2 version support A15 optimization? Or would
you recommend using the latest 4.6 versions?
Thanks for any response I could forward back to him.
Best regards,
Matt
== This week ==
* Got the -fsched-pressure code into a state where it's almost
presentable. Found a few more things to tweak on the way.
Fixed some FIXMEs, notably to honour MAX_SCHED_READY_INSNS.
* More testing on ARM. Tried to get some SPEC2000 results
as well as the usual EEMBC & DENbench, but I'm not sure
how noisy the SPEC ones are.
* More testing on powerpc. Decided that this really isn't a good target
to test on for 4.7 because of the poor choice of pressure classes.
SPEC CPU2006 INT results are reasonable-to-good, but the FP ones
suffer from the fact that we think there are twice as many registers
available for normal FP than there actually are. I'd like to fix this,
but all pressure-estimation bits of GCC suffer from the same problem,
and it's hard to justify as part of Linaro, because it doesn't
apply to ARM.
* Fixed upstream PR 50873 (ICE for NEON misaligned moves). Thanks to
Ramana for the heads-up and analysis.
* Retested and posted the patch for PR 48941 upstream (poor code generated
by the vzip*() and vuzp*() arm_neon.h functions).
Richard
== GDB ==
* Created and published Linaro GDB 7.3-2011.12 release.
* Updated Linaro GDB 7.3 to GDB 7.3.1 code base.
* Implemented support for single-stepping atomic operation
code sequences for ARM (and Thumb) (LP #892008). Checked
in to mainline and Linaro GDB.
* Ongoing work on remote support for "info proc" and core file
generation. Currently yet another solution for the remote
interface has been brought up in mailing list discussions
(support accessing arbitrary files on the remote side, not
just /proc). I'm working on prototyping this suggestion.
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
Hi Michael,
I have finally managed to complete the release process. It wasn't quite
as smooth as I would have liked, but we seem to have got there!
Notes:
- Ramana's VCVT patch caused an Android problem. This was reverted right
before the release.
- The initial release spin and test went without a hitch.
- There was an additional test failure in the GCC testsuite, but this
turns out to be because the snapshot date "20121201" happens to contain
the string "120". Interestingly, this will also be true for most of 2012.
- The ubutest runs seem to have some problems: all of the glibc and
python builds have failed with a message about libgcc. Since this has
hit both 4.5 and 4.6 simultaneously I'm assuming it's environmental and
not caused by a new toolchain bug. The rest of the compilation appears fine.
- The benchmarking seems fine on A9, but I couldn't find results for the
others, although the scheduler lists the jobs.
- The upload to Launchpad was somewhat problematic. Uploading 4.5 took
two attempts. Uploading 4.6 failed about 6 times (at 20 minutes or so
each) before I tried from another machine with a faster uplink - that
went first time.
Andrew
The Linaro Toolchain Working Group is pleased to announce the 2011.12
release of both Linaro GCC 4.6 and Linaro GCC 4.5.
Linaro GCC 4.6 2011.12 is the tenth release in the 4.6 series. Based
off the latest GCC 4.6.2+svn181866, it contains a range of vectoriser
performance improvements and general bug fixes.
Interesting changes include:
* Updates to 4.6.2+svn181866
* Generic tuning support for Big-endian platforms.
* SLP support for operations with arbitrary numbers of operands.
* SLP support for conditions.
* Pattern recognition support in basic-block SLP.
* Enhancements to mixed-size condition pattern recognition.
* Support for 64bit __sync* primitives on ARM.
* Unaligned block-move support for ARMv7.
* Added Cortex-A15 integer pipeline tuning.
Linaro GCC 4.5 2011.12 is the sixteenth release in the 4.5
series. Based off the latest GCC 4.5.3+svn181877, this is a
maintenance focused release.
Interesting changes in 4.5 include:
* Updates to 4.5.3+svn181877
The source tarballs are available from:
https://launchpad.net/gcc-linaro/+milestone/4.6-2011.12
https://launchpad.net/gcc-linaro/+milestone/4.5-2011.12
Downloads are available from the Linaro GCC page on Launchpad:
https://launchpad.net/gcc-linaro
More information on the features and issues are available from the
release page:
https://launchpad.net/gcc-linaro/4.6/4.6-2011.12
https://launchpad.net/gcc-linaro/4.5/4.5-2011.12
Mailing list: http://lists.linaro.org/mailman/listinfo/linaro-toolchain
Bugs: https://bugs.launchpad.net/gcc-linaro/
Questions? https://ask.linaro.org/
Interested in commercial support? inquire at support(a)linaro.org
Hi,
- fixed PR 51285
- continued looking at the alignment issue, ran Michael's script with
different options, tested Ramana's preliminary patch for vld1/vst1,
and my "don't peel for low loop bounds" patch
Ira
The Linaro Toolchain Working Group is pleased to announce the
release of Linaro QEMU 2011.12.
Linaro QEMU 2011.12 is the latest monthly release of
qemu-linaro. Based off upstream (trunk) QEMU, it includes a
number of ARM-focused bug fixes and enhancements.
New in this month's release:
- There are no Linaro-specific changes of note in this release
- This release is based on the upstream QEMU 1.0 release.
(Note that future qemu-linaro releases will continue to track
upstream trunk; the release dates for upstream and our
release just happened to be conveniently aligned in this case.)
Known issues:
- Graphics do not work for OMAP3 based models (beagle, overo)
with 11.10 Linaro images.
- This release of qemu-linaro is known not to work on ARM hosts.
(See bugs #883133, #883136)
The source tarball is available at:
https://launchpad.net/qemu-linaro/+milestone/2011.12
More information on Linaro QEMU is available at:
https://launchpad.net/qemu-linaro
The Linaro Toolchain Working Group is pleased to announce the release of
Linaro GDB 7.3.
Linaro GDB 7.3 2011.12 is the fourth release in the 7.3 series. Based off
the latest GDB 7.3.1, it includes a number of ARM-focused bug fixes and
enhancements.
This release contains:
* Update to GDB 7.3.1 code base
* Support single-stepping atomic operations (LDREX/STREX sequences)
The source tarball is available at:
https://launchpad.net/gdb-linaro/+milestone/7.3-2011.12
More information on Linaro GDB is available at:
https://launchpad.net/gdb-linaro
I had a play with the vectoriser to see how peeling, unrolling, and
alignment affected the performance of simple memory bound loops.
The short story is:
* For fixed length loops, don't peel
* Performance is the same for 8 byte aligned arrays and up
* Performance is very similar for unaligned arrays
* vld1 is as fast as vldmia
* vld1 with specified alignment is much faster than vld1
The loop is the rather ugly and artificial::
void op(struct aints * __restrict out, const struct aints * __restrict in)
{
for (int i = 0; i < COUNT; i++)
{
out->v[i] = (in->v[i] * 173) | in->v[i];
}
}
where `struct aints` is an aligned structure. I couldn't figure out how
to use an aligned typedef of ints without still introducing a runtime
check. I assume I was running into some type of runtime alias
checking.
This compiled into::
vmov.i32 q10, #173
add r3, r0, #5
0:
vldmia r1!, {d16-d17}
vmul.i32 q9, q8, q10
vorr q8, q9, q8
vstmia r0!, {d16-d17}
cmp r0, r3
bne 0b
I then lied to the compiler by changing the actual alignment at
runtime. See:
http://people.linaro.org/~michaelh/incoming/runtime-offset.png
The performance didn't change for actual alignments of 8,
16, or 32 bytes.
I then converted the loop into one using vld1 and fed it smaller
alignments. See:
http://people.linaro.org/~michaelh/incoming/small-offsets.png
The throughput falls into two camps: one of alignments
1, 2, or 4 and one of 8, 16, 32. The throughput is very similar for
both camps but has some strange dropoffs at 24 words, around 48 words,
and around 96 words. The terminal throughput at 300 words and above
is within 0.5 %.
I then converted the vld1 and vst1 to specify an alignment of 64
bits. See:
http://people.linaro.org/~michaelh/incoming/set-alignment.png
This improved the throughput in all cases and in cases for more than 50
words by 14 %. This graph also shows the overhead of the runtime
peeling check. The blue line is the vectoriser version which is
slower to pick up due to the greater per-call overhead.
I then went back to the vectoriser and changed the alignment of the
struct to cause peeling to turn on and off. See:
http://people.linaro.org/~michaelh/incoming/unroll.png
At 200 words, the version without peeling is 2.9 % faster. This is
partly due to a fixed count loop turning into a runtime count due to
unknown alignment.
This run also showed the effect of loop unrolling. The loop seems to
be unrolled for loops of <= 64 words and drops off in performance past
around 8 words. When the unrolling finally drops out, performance
increases by 101 %.
Raw results and the test cases are available in
lp:~linaro-toolchain-dev/linaro-toolchain-benchmarks/private-runs
A graph of all results is at:
http://people.linaro.org/~michaelh/incoming/everything.png
The usual caveats apply: this test was all in L1, only on the A9, and
very artificial.
-- Michael
> On Mon, Dec 5, 2011 at 1:40 AM, Tom Gall <tom.gall(a)linaro.org> wrote:
> > I probably know the answer to this already but ...
> >
> > For shared libs one can define and use something like:
> >
> > void __attribute__ ((constructor)) my_init(void);
> > void __attribute__ ((destructor)) my_fini(void);
> >
> > Which of course allows your lib to run code just after the library is
> > loaded and just before the library is going to be unloaded. This helps
> > keep out cruft such as the following out of your design:
> >
> > PleaseCallThisLibraryFunctionFirstOrThereWillBeAnErrorWhichYouWillHitCausingYouToPostToTheMailingListAskingTheSameQuestionThatHasBeenAsked1000sOfTimes();
> >
> > Yeah .. you know the function. I don't like it either.
> >
> > Unfortunately this doesn't work when people link in the .a from your
> > lib. Libs like libjpeg-turbo in theory should never ever need to be
> > linked in that fashion but consider the browsers who link to the
> > universe instead of using system shared libs.
On Mon, Dec 05, 2011 at 04:19:11PM +0800, Kito Cheng wrote:
> Here is a tricky way around this problem: put the constructor
> and destructor in a source file that contains a function callers of
> your library must use, which forces the linker to pull your
> constructor and destructor out of the archive.
>
> However, if this solution does not work for your situation, you can
> apply the attached patch to the build script to enable
> LOCAL_WHOLE_STATIC_LIBRARIES for executables,
>
> After patch you can just add a line in your Android.mk :
>
> LOCAL_WHOLE_STATIC_LIBRARIES += libfoo
>
> The main disadvantage of this approach is that you must always link
> libfoo via LOCAL_WHOLE_STATIC_LIBRARIES... and this patch has not been
> sent to linaro and aosp yet.
[...]
Part of the problem here is that .a libraries lack the dependency and
linkage metadata that shared libraries have.
-2)
Put up with the need to call an explicit initialisation function
for the library. A lot of commonly-used libraries require an
initialisation call, and I'm not sure it causes that much of a
problem in practice...
-1)
Put a C++ wrapper around just enough of your library such that your
constructor/destructor code is recognised as a needed static
constructor/destructor by the toolchain.
I can't think of a very nice way of doing this, so I won't elaborate
on it...
It's also not really a solution, since you still need to pull in a
dummy static object from somewhere in order to cause the constructor
and destructor to get called.
0)
libtool or similar may help solve this problem, but I don't know much
about this -- also, for solving the problem, that approach only works
if users of your library link via libtool.
1)
One hacky approach is to rename your library to libmylib-real.a, and
then replace libmylib.a with a linker script which pulls in the
needed constructor as well as the real library:
libmylib.a:
EXTERN(__mylib_constructor)
INPUT(/path/to/libmylib-real.a)
This works, providing that __mylib_constructor is external (normally,
you would be able to make the constructor function static, but it needs
to be externally visible in order to be pulled in this way).
2)
Another way of doing a similar thing is to mark __mylib_constructor
as undefined in all the objects that make up the library.
Unfortunately, there seems to be no obvious way of doing that: the
assembler generates undefined symbol references automatically for
unresolved references at assembly time. There's no way to force
the existence of an undefined symbol without an actual reference to
it. objcopy/elfedit don't seem to support adding such a symbol
either. It would be simple to write a tool to add the undefined
symbol reference (such tools may exist already), but binutils doesn't
seem to provide this for you. The plausible-looking -u option to
gcc doesn't do anything unless doing a link.
One other way of doing it without a special tool is to insert a bogus
relocation into the text section of each object with an assembler
.reloc directive specifying relocation type R_<arch>_NONE.
There isn't really a portable way to do that, though. The name of
the relocation changes per-arch, and some arches have other quirks
(on ARM for example, .reloc cannot refer to the current location,
but seems instead to need to refer to a defined symbol which is non-zero
distance away from the location counter).
One advantage to this approach is that your .a file looks just
like any other .a file. Also, you can include that dependency
in only those objects which really require the library to be
initialised (normally, this is not a huge benefit though, since
probably most of your objects _do_ require the library to be
initialised).
A disadvantage (other than portability problems) is that, like (1),
the constructor symbol must be external (not static)... so it
pollutes the symbol table and isn't protected against people calling
it directly.
You can create a dummy symbol instead of referring to the constructor
symbol directly though -- this solves the second problem.
3)
Finally, you can split your constructor/destructor code out into a
separate .o file (say mylib-ctors.o), and use the linker script
trick for (1) to forcibly include this object when linking:
libmylib.a:
INPUT(/path/to/mylib-ctors.o /path/to/mylib-real.a)
This avoids some of the disadvantages of the other approaches,
but you still end up with a strange-looking library which is really
a linker script.
This is closer to how the C library traditionally solves the problem
(i.e., the crt*.o stuff). libc.so also tends to be a linker script,
which deals with the fact that some parts of libc must be statically
linked from a separate library when linking to -lc.
Obviously, approaches (1)..(3) all suffer from arch or toolchain
portability problems (or both). (The GNU/GCC __constructor__ thing
is obviously a portability problem in itself, if you're minded to
care about it.)
Cheers
---Dave
* Linaro GCC
Continued work on 64-bit shift / extend / etc. in NEON. I have posted an
RFC to the gcc-patches list in the hope of getting some feedback on how
best to fix this. No response yet. Hopefully some of the Linaro guys are
at least looking at it ...
Merged FSF GCC 4.5 and 4.6 into the Linaro GCC release branches prior to
the release next week.
Set more benchmarking work running in my ongoing investigation into
generic tuning.
Did a dry run of the extra release testing Michael normally does. It
failed. Michael says he's fixed it now, but I know how to do my bit, so
fingers crossed.
* Other
Experienced some IT/connectivity outages within Mentor. Resolved now.
=== Progress ===
* Off sick on Monday
* Systematic testing duty - few AArch64 issues.
* Linaro patch review duty.
* Tested my vcvt fixed point patch and close to committing.
* Worked for some time on movw / movt for symbol references rather than
constant pools. While this gives nice benefits, it's a code size hog
and needs further investigation.
* PGO patch being tested finally and should go back up for review.
=== Plans ===
* Release week next week.
* Start looking at partial_partial PRE.
* Finish committing my backlog of patches.
Absences.
* Dec 19 - 31st Dec - Tentatively booked
* Feb 6-10 : Linaro Connect Q1.12
Summary:
* Patch linaro crosstool-ng.
* Windows install package
Details:
* Patch linaro crosstool-ng:
* Back port upstream patches.
* Check-in the zlib/libiconv/expat/ncurses related patches to linaro branch.
* Create reference windows install package for linaro toolchain from
installjammer. The install process works well on Win7.
Plans:
* Investigate test on Windows.
Best regards!
-Zhenqiang
Hi,
OpenEmbedded:
* started on creating recipes to compile the "core-image-minimal"
using an external prebuilt toolchain (csl arm-2011.03)
* there are still a lot of warnings at the do_package/do_package_qa task
* the good news is that the build process finishes and kernel plus root
file system image gets created
* the bad news is that the rootfs lacks some important libs like libc
and therefore won't run under qemu-system-arm
(since init, busybox, etc. are dynamically linked)
* currently a 3-line hack on oe-core is required to be able to
override a task of the generic glibc recipe; all other files could go
into a separate layer
Linaro Android:
* had a quick look into the EABI attribute tag issue
Regards
Ken
== String routines ==
* Sent updated memchr to the eglibc list
== 64 bit atomics ==
* Ran a set of timing consistency tests that a colleague had sent me
while I was off; Panda passed those, so time
doesn't appear to be going backwards or anything, so that's not the
problem with membase.
* Pushed the code into linaro-gcc.
== QEMU ==
* Tested Peter's prerelease - all good.
* Started looking at the issues for running in TCG mode on ARM
== Other ==
* Read through the ARMv8 instruction docs that landed on arm.com;
quite interesting. Note that multiple-instruction
IT blocks are listed as deprecated for 32-bit mode on v8
(they will still work, but the core can be put in a mode that faults
on them, to make it easy to find the uses).
* Some debugging of Panda odd timing issue with Paul McKenney.
Dave
RAG:
Red:
Amber:
Green:
Current Milestones:
|| || Planned || Estimate || Actual ||
||upstream-omap3-cleanup || 2011-11-10 || 2011-12-15 || ||
||cp15-rework || 2012-01-06 || 2012-01-06 || ||
||initial-a15-system-model || 2012-01-27 || 2012-01-27 || ||
||qemu-kvm-getting-started || 2012-03-04?|| 2012-03-04?|| ||
(for blueprint definitions: https://wiki.linaro.org/PeterMaydell/QemuKVM)
Historical Milestones:
||add-omap3-networking || 2011-10-13 || 2011-10-13 || 2011-10-13 ||
||a15-systemmode-planning || 2011-10-13 || 2011-10-13 || 2011-09-22 ||
||a15-usermode-support || 2011-11-10 || 2011-11-10 || 2011-10-27 ||
== qemu-kvm-getting-started ==
* now reasonably set up to run KVM under Fast Model; howto is here:
https://wiki.linaro.org/PeterMaydell/A15OnFastModels
* rebased kvm patches into qemu-linaro
* fixed bug where we weren't passing cpu number to kvm properly
when delivering an interrupt
* sent some minor patches to upstream qemu that will be needed for
kvm (eg configure script tweaks)
== initial-a15-system-model ==
* started on cleaning up a9/11mpcore private peripheral implementation;
now mostly done and looking much better as a base for a15
== other ==
* preparation for qemu-linaro release (rolled tarball, tested)
* submitted patch to fix buffer overrun in GIC model
* discussion: linux-user mode race conditions, and in particular
how we should handle signals that arrive during syscall emulation
* upstream patch review: imx31 round 3
-- PMM
== GDB ==
* Completed new set of patches to support both "info proc" and
core file generation across the remote protocol, and posted
them to the mailing list for review.
* Tested GDB trunk in preparation for 7.4 release branch point
on multiple platforms; analyzed and fixed a couple of problems,
some also present on ARM in remote testing. Patches checked
in to mainline.
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
== This week ==
* More on -fsched-pressure. Testing on POWER7 showed a degenerate case
that I'd failed to handle well. Fixed that. Saw that part of the
problem on POWER7 was that IRA was using a combination of GENERAL_REGS
and CR_REGS as a single pressure class, so there appeared to be 39
registers available for storing integers. Fixed (or worked around) that.
Tweaked a few other things too. The only denbench result that I
wasn't happy with was RSA, where both forms of -fsched-pressure are
significantly worse than -fno-sched-pressure. Tracked down the cause
of that. We had a block BB1:
A: (set (reg:DI X) Y)
B: (clobber (reg:DI Z))
C: (set (subreg:SI (reg:DI Z) 0) (... X ...))
D: (set (subreg:SI (reg:DI Z) 4) (...))
where B makes sure that Z is treated as dead before C. Interblock
motion causes B to be scheduled in an earlier block, but none of
the other instructions can be. This means that, when we schedule BB1,
it still contains A, C and D, and Z now appears to be live on entry to
the block. C therefore appears to reduce register pressure, because
it contains the last use of X, and appears to leave Z's liveness
unaffected. In reality it should be treated as increasing register
pressure by 1 (-1 for the death of X, +2 for the birth of Z).
I "fixed" this by moving C's dependencies to B, a bit like we do
for scheduling groups (although none of the other handling of
scheduling groups should apply). This made a big difference,
so that the new code is a win on RSA.
There's still one SPEC2006 degradation on POWER7 that I want
to look at.
* Caught up on a lot of mail. gcc-patches backlog has gone down
from ~4900 when I got back to ~500.
* Briefly looked at x86's drap support, to see what would be needed
for ARM. Didn't look for long though: the overhead seems excessive
for optional alignment, and the agreement seemed to be that 128-bit
alignment wouldn't really make much of a difference anyway.
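The register pressure accounting for instruction C in the RSA case above can be restated as a toy calculation. This is only an illustration of the arithmetic in the text, not code from the scheduler; the struct and function names are invented, and the -1 and +2 figures are taken directly from the description above.

```c
#include <assert.h>

/* Toy model, not GCC code: how many registers an instruction frees and
   how many it starts using. */
struct insn_effect {
    int regs_dying;  /* registers whose last use is here */
    int regs_born;   /* registers that first become live here */
};

/* Net change in register pressure across the instruction. */
static int pressure_delta(struct insn_effect e)
{
    assert(e.regs_dying >= 0 && e.regs_born >= 0);
    return e.regs_born - e.regs_dying;
}

/* C as the scheduler sees it once clobber B has moved to an earlier
   block: Z looks live on entry, so only the death of X is counted. */
int delta_c_as_seen(void)
{
    struct insn_effect e = { .regs_dying = 1, .regs_born = 0 };
    return pressure_delta(e);
}

/* C as it should be treated: X dies (-1) and Z is born (+2), using the
   figures given in the text. */
int delta_c_intended(void)
{
    struct insn_effect e = { .regs_dying = 1, .regs_born = 2 };
    return pressure_delta(e);
}
```

The two results differ in sign, which is why the scheduler was keen to issue C early when it should have been reluctant to.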
Richard
Hi!
* Continued with running eembc, coremark, denbench and spec2k on the ursas
with the latest of the Linaro and FSF series. The variants used were
o3-neon and o3-neon-novect. Something went wrong with the variants the
first time, so I had to rerun the tests once.
Discussed the draft report with Michael; next week I will share it with
the rest of the team.
* Reran the SPEC2K runs with the "train" and "ref" data sets. I did -O2 and
-O3 runs on a panda with the two data sets. Asked for a sanity check of the
numbers.
* Prepared and held a presentation about the tcwg internally.
* Will be tied up with internal work for most of w49.
Best regards
Åsa
Hi,
- Ran eon with gcc 4.7: there are many more loops similar to the one
in lp#831094 that get vectorized (due to some data ref analysis
improvement), so the impact of disabling peeling for such loops (i.e.
loops with low loop bound) is even bigger than for 4.6, and
vectorization improves the performance by 2.5%.
I prefer to understand the peeling/alignment situation better and not
just commit this patch (and I spent some time trying to do that).
- Fixed PR 51301 - a bug in the over-promotion pattern. Proposed for merge
to gcc-linaro-4.6.
- Merged the last SLP patch to gcc-linaro-4.6.
Ira
This email is just a quick summary of what we (Linaro) are
planning in the way of QEMU work to support KVM on ARM Cortex-A15.
The idea is to let people know what's coming up, find out if we've
forgotten anything, and avoid people duplicating work unnecessarily.
Most of this is based on a useful session at the recent 'ARM server
mini-summit' in Orlando (UDS/Linaro Connect) at the beginning of
this month.
The work we're currently proposing to do falls into three parts:
* refactor QEMU's cp15 register handling
At the moment QEMU handles cp15 accesses by calling out to a single
helper function which is an enormous set of nested switch statements
to handle the different coprocessor registers. Access permissions are
checked separately at translate time. This design makes specifying
board-dependent or cpu-dependent registers somewhat painful; it's also
easy for the access permission checks to be out of sync. There is no
support for banked cp15 registers either (needed for trustzone and
virtualisation). We need a better design which lets a board or core
register handler routines for cp15 registers. This will make the code
cleaner and more maintainable as a base for new features.
This isn't strictly a requirement for KVM, but we're going to want
KVM to be able to hand off cp15 accesses to QEMU, and I don't think
that's going to be maintainable or reliable without this refactoring.
(https://blueprints.launchpad.net/qemu-linaro/+spec/cp15-rework)
* A15 system model
Basically a QEMU model of a Versatile-Express with a Cortex-A15
minus the virtualization and LPAE extensions. This needs the
A15 private peripherals (just the GIC in the right place in
the memory map, really; generic timer not required) and the
new memory map version of the vexpress board model, plus some
new cp15 registers. (Bill Carson has already done some patches
in this area but they need a little rework and may have minor
missing pieces.)
https://blueprints.launchpad.net/qemu-linaro/+spec/initial-a15-system-model
* miscellaneous integration work
We're aiming for a reasonable working prototype of A15 guest on
an A15 Fast Model host here; we need to fix at least some of
the bugs which currently mean upstream QEMU doesn't work on ARM hosts,
sort out which kernel and qemu trees we are developing from, and
get things running in our validation lab's continuous integration
setup.
https://blueprints.launchpad.net/qemu-linaro/+spec/qemu-kvm-getting-started
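To make the cp15 refactoring in the first item more concrete, here is a rough sketch of what "a board or core registers handler routines" could look like, as opposed to one large nested switch. It is not QEMU code and not a proposed API: all type and function names are hypothetical, the table is a fixed-size array for brevity, and the value returned by the example handler is a placeholder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A cp15 register is selected by the (CRn, opc1, CRm, opc2) fields
   of the MRC/MCR instruction. */
typedef struct cp15_key {
    uint8_t crn, opc1, crm, opc2;
} cp15_key;

typedef uint32_t (*cp15_read_fn)(void *opaque);

/* One table entry: the access permission sits next to the handler,
   so the two cannot drift out of sync. */
typedef struct cp15_reg {
    cp15_key key;
    int min_privilege;      /* 0 = user, 1 = privileged (hypothetical scale) */
    cp15_read_fn read;
    void *opaque;
} cp15_reg;

#define CP15_MAX_REGS 64
static cp15_reg cp15_table[CP15_MAX_REGS];
static size_t cp15_count;

/* Called by a board or core model to add one register. */
int cp15_register(cp15_reg reg)
{
    assert(reg.read != NULL);
    if (cp15_count == CP15_MAX_REGS)
        return -1;
    cp15_table[cp15_count++] = reg;
    return 0;
}

/* Look up and read a register; -1 means undefined or not permitted. */
int cp15_read(cp15_key k, int privilege, uint32_t *out)
{
    for (size_t i = 0; i < cp15_count; i++) {
        const cp15_reg *r = &cp15_table[i];
        if (r->key.crn == k.crn && r->key.opc1 == k.opc1 &&
            r->key.crm == k.crm && r->key.opc2 == k.opc2) {
            if (privilege < r->min_privilege)
                return -1;
            *out = r->read(r->opaque);
            return 0;
        }
    }
    return -1;
}

/* Example core-specific handler returning a placeholder value. */
static uint32_t read_example_id(void *opaque)
{
    (void)opaque;
    return 0x12345678u;
}
```

A core model would call cp15_register once per register at init time, and banked registers could be handled by registering different opaque pointers per bank. Write handlers are omitted to keep the sketch short.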
Also on the radar is a fourth piece of work:
* QEMU virtio-mmio support
This is adding support for the 'mmio' virtio transport, which will
allow virtio support in a versatile-express model. We're going to
need this at some point but the current thought is that we want
to do the above listed more important bits of work first...
(The exception would probably be if it turned out that this was
sufficiently useful for making early KVM development easier)
https://blueprints.launchpad.net/qemu-linaro/+spec/add-amba-virtio-support
So, questions:
(1) did we forget something important?
(2) is anybody else already planning to do any of this (or would
like to start)? if so we should coordinate...
(3) is there anything that the kernel folk need/want earlier
rather than later?
thanks
-- PMM
Hi,
Now that upstream trunk is in stage3 and we have a few patches that
won't really make it upstream until stage1 is reopened, is it
worthwhile having a new status in the merge requests that moves them
into a to_upstream status? The other option is to have a common
spreadsheet that we keep updating with links to merge requests that
need to be upstreamed.
Thoughts?
Ramana
PS - Any clue on what's happening with the branch diff bug that's been
open in Launchpad forever now?
Hi,
* Worked on peeling problem in eon (#831094). Wrote a patch that
checks if the number of vector iterations is going to be more than 2,
and disables peeling otherwise. With this patch I see about 1.5%
regression with vectorization (and about 7% without it).
* I am thinking of extending the patch for an unknown number of iterations
by creating a run-time check. The threshold could be set by a param.
Another option could be doing it through the cost model, but it's
hard to evaluate costs when misalignments are unknown (and, I think,
the cost model handles known misalignment properly).
* Disabling peeling for low loop bounds also helps with one of the EEMBC
benchmarks, for which vectorization with double-words is more
beneficial than with quad-words. It turns out that we are able to
force the alignment for double-words (and, therefore, avoid peeling),
because we check that the required alignment (64 in this case) is less
than or equal to BIGGEST_ALIGNMENT, where
arm.h:#define BIGGEST_ALIGNMENT (ARM_DOUBLEWORD_ALIGN ? DOUBLEWORD_ALIGNMENT : 32)
and
arm.h:#define DOUBLEWORD_ALIGNMENT 64
So, we can never force alignment for 128 bits on ARM. I wonder if
that's a real limitation.
* Proposed three SLP patches to gcc-linaro, and merged two of them.
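The two checks discussed in the items above can be sketched as follows. This is a simplified stand-in, not the vectorizer code: ARM_DOUBLEWORD_ALIGN is a target flag in GCC and is modelled here as a plain parameter, the function names are invented, and computing the number of vector iterations as scalar iterations divided by the vectorization factor is an assumption. Only the constants 64 and 32 and the threshold of 2 come from the text.

```c
#include <assert.h>
#include <stdbool.h>

/* Value quoted from arm.h in the text above. */
#define DOUBLEWORD_ALIGNMENT 64

/* Stand-in for BIGGEST_ALIGNMENT; the flag is a parameter here. */
static unsigned biggest_alignment(bool doubleword_align)
{
    return doubleword_align ? DOUBLEWORD_ALIGNMENT : 32;
}

/* Alignment (in bits) can be forced only if it does not exceed
   BIGGEST_ALIGNMENT. */
bool can_force_alignment(unsigned required_bits, bool doubleword_align)
{
    assert(required_bits != 0);
    return required_bits <= biggest_alignment(doubleword_align);
}

/* Peeling heuristic: peel only when more than 2 vector iterations are
   expected (iteration count computation is an assumption). */
bool should_peel(unsigned long scalar_iterations, unsigned vector_factor)
{
    assert(vector_factor != 0);
    return scalar_iterations / vector_factor > 2;
}
```

With these definitions, can_force_alignment(64, true) holds while can_force_alignment(128, true) does not, which is the limitation on 128-bit alignment noted above.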
Ira
Addressing the comments received from Richard and Ayal regarding the
patch to estimate register pressure.
Testing the patch on eembc and libav micro benchmarks.
Looking at the regressions seen with SMS.
== GDB ==
* Ongoing work on support for cross-platform core file generation.
Posted a new design proposal to the mailing list to include not
only "info proc mappings", but *all* "info proc" commands. This
would involve a remote protocol command to read arbitrary proc
files, instead of a specific command to retrieve the memory map.
* Investigated Launchpad bug:
#891970 msp430-gdb segmentation fault with target remote
== GCC ==
* Patch review week.
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
Worked on adding support for 64-bit NEON integer shifts. I have this
working now, although I'm still not very happy about how the register
allocator chooses which mode to use - it prefers core-registers if the
values start or end in core-regs, even though moving the values to NEON
registers might be more efficient (general 64-bit shifts in core
registers require several instructions). I've also had to mark the CC
register clobbered in all cases, even though it only gets clobbered in
some of them, which might be necessary, but isn't very satisfactory.
The NEON shifts work showed that 32->64 bit extends could be done better
also. This hasn't been a great problem up to now, but the shift amount
(in particular) is typically a 32-bit value and yet needs to be
zero-extended to 64-bit for NEON's purposes. Right now, GCC prefers to
extend the value in core-registers, and then copy it to NEON. This
works, but burns another core-register - a scarce commodity - so I think
it would be better to copy it first, and then extend it after. NEON has
instructions for this, so I'm investigating how to get the compiler to
do it (this is all strictly post-combine, so the usual options are out,
and the register allocator has to be allowed to do it the old way in the
case where core-regs really are the best option, so it's tricky).
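To illustrate why a general 64-bit shift costs several operations in 32-bit core registers, here is a C sketch of a left shift done on two 32-bit halves. It is not the sequence GCC emits for ARM, just the same computation written out by hand; the function name and interface are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Shift the 64-bit value (hi:lo) left by n bits, 0 <= n < 64, using
   only 32-bit operations, roughly as a core-register sequence must. */
void shl64_by_halves(uint32_t lo, uint32_t hi, unsigned n,
                     uint32_t *out_lo, uint32_t *out_hi)
{
    assert(n < 64);
    if (n == 0) {
        /* Separate case: a 32-bit shift by 32 would be undefined in C. */
        *out_lo = lo;
        *out_hi = hi;
    } else if (n < 32) {
        /* Bits leaving the top of lo enter the bottom of hi. */
        *out_hi = (hi << n) | (lo >> (32 - n));
        *out_lo = lo << n;
    } else {
        /* Everything in hi is shifted out; lo moves up into hi. */
        *out_hi = lo << (n - 32);
        *out_lo = 0;
    }
}
```

Even in the common n < 32 case this is two shifts, a reverse shift and an OR, plus the handling of the shift amount, whereas a single 64-bit NEON shift instruction covers it once the values are in NEON registers.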