Merged both GCC 4.5 and 4.6 from FSF to Linaro. Matthias requested that
I avoid a particular upstream 4.6 commit, so I selected the revision
before that as the merge point. The problem was then fixed upstream, and
another fix was desirable, so I've redone the merge from the branch head.
Another widening-multiplies bug was reported to me (I logged it as
pr50318/lp843775), so I've fixed that and committed the fix upstream,
and filed a merge request on Launchpad.
Finished fixing the bugs in my thumb2 constants optimizations, and
backported the new patches to Linaro GCC 4.6. Pushed the updated stuff
to Launchpad for testing.
Richard Sandiford found a flaw in my patch for pr50193/lp836401, so I've
done another version of that and posted it upstream. Ramana didn't like
that version. I've started again trying to fix it a different way, but I
don't have it working just yet.
Continued work on my new constant reuse patch. I have it detecting many
constant expressions, and calculating the values for some of them. Once
it does that sufficiently well, the next step is to track what constants
are available where, and then I'll be in a position to find optimization
opportunities. At the moment, 'sufficiently well' could just mean
MOVW/MOVT pairs, as those are the most common.
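For readers unfamiliar with MOVW/MOVT pairs, here is a minimal sketch of how a 32-bit constant splits into the two 16-bit immediates. The struct and function names are made up for illustration and are not taken from the patch; the code only models the architectural effect of the two instructions.

```c
#include <stdint.h>

/* Halves of a 32-bit constant as a MOVW/MOVT pair would load them. */
struct movw_movt {
    uint16_t low;   /* MOVW immediate: sets bits 15:0, clears bits 31:16 */
    uint16_t high;  /* MOVT immediate: sets bits 31:16, keeps bits 15:0 */
};

/* Split a constant into the two immediates (hypothetical helper). */
static struct movw_movt split_constant(uint32_t value)
{
    struct movw_movt pair;
    pair.low = (uint16_t)(value & 0xffffu);
    pair.high = (uint16_t)(value >> 16);
    return pair;
}

/* Model the register contents after executing MOVW followed by MOVT. */
static uint32_t apply_movw_movt(struct movw_movt pair)
{
    uint32_t reg = pair.low;                              /* MOVW */
    reg = (reg & 0xffffu) | ((uint32_t)pair.high << 16);  /* MOVT */
    return reg;
}
```

Recognising such a pair means recombining the two immediates into the single value the register ends up holding, which is what apply_movw_movt models.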
Tried to get the CS Panda Boards up and running again after the move. No
success. Ricardo is on the case. I'm still using the boards located at
Canonical.
Andrew
Continue looking at Richard's micro benchmarks taken from libav w.r.t
SMS and experiment with different patches that Richard wrote to
improve code generation.
Submitted SMS related patch for minor misc fixes
http://gcc.gnu.org/ml/gcc-patches/2011-09/msg00551.html
Trying to understand why the new version of the patch to support
instructions with REG_INC_NOTE in SMS causes a bootstrap failure. Will
email the ml about it.
(http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01216.html)
* Looked at LP bug 736661. Sent a patch upstream. Some positive feedback,
but it hasn't been approved or rejected yet.
* Looked at the old R_ARM_THM_CALL linker bug, after Matthias prepared
a self-contained testcase (thanks). Attached a patch to the bug.
Will submit upstream next week if testing goes OK.
* Looked at why the backport of lp823708-4.5 retriggered the same
bootstrap failure that Chung-Lin's patch did. Haven't been able
to reproduce yet.
* Looked at making the neon_vget_high/low patterns use ordinary
subreg moves. Found that this triggered a fair few latent bugs
in the rtl optimisers. Tried to fix those. This gave some nice
improvements in some of the libav loops.
* Added h264 loops to the libav microbenchmarks.
* Blueprints.
* Upstream patch review.
Richard
* Completed First-time wiki page, at least for now. Expecting to add more
information as I go.
* Running SPEC2K on the Snowball board. The tests are failing because I run
out of memory. This is due to too little RAM available in the default
kernel configuration. (Official HW pack.) I had a go with creating a swap
file on the SD-card. The tests are then running, but results are slow (which
makes sense). Will try to make more memory available with changed
uboot-option, or with a fresh kernel.
* Discussing with Michael about benchmark candidates that will add a web
browsing perspective to the benchmarks we have. I suggest Sunspider and V8
benchmark suite for the JavaScript aspect, and EEMBC Browsing Bench and
perhaps ARMBBench for the load and render aspect. As for the imaging aspect
we have DENBench and ConsumerBench.
Best Regards
Åsa
== String routines ==
* Trying to understand my strlen behaviour that Michael identified
- Found lots of ways of making the faster case slower, but none of making
the slower case faster!
- Perf not being available on Panda (bug 702999/843628) made it
difficult to dig down
* Fixing standards corner cases for strchr/memchr
- input match needs to be truncated to char (fixes bug 842258 & 791274)
* Tidying up formatting for cortex-strings release
* Looking at eglibc integration again
- getting confused by what has to happen in config.sub and how other
users of it cope with triplets like armv7 even though it's not in
config.sub
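To illustrate the strchr/memchr corner case mentioned above: the C standard has memchr take the byte to find as an int and convert it to unsigned char before comparing, so a caller passing 0x141 must still match a 0x41 byte. A minimal reference sketch follows; ref_memchr is a hypothetical name and this is not the cortex-strings code.

```c
#include <stddef.h>

/* Reference behaviour only; not the optimised implementation. */
static void *ref_memchr(const void *s, int c, size_t n)
{
    const unsigned char *p = s;
    unsigned char wanted = (unsigned char)c;  /* truncate the int argument */

    for (; n != 0; n--, p++) {
        if (*p == wanted)
            return (void *)p;
    }
    return NULL;
}
```

An assembly version that replicates the search byte across a word needs the same truncation first, otherwise the upper bits of the argument leak into the comparison pattern.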
== QEMU ==
* Testing Peter's QEMU release
- All good
- Lost a few hours due to the broken version of l-i-f-ui in Oneiric
- PPA version works OK
* A little bit of perf profiling
== Other ==
* Managed to get hold of a nice fast build machine
== GDB ==
* Worked on hardware watchpoint support for gdbserver.
== GCC ==
* Analyzed root cause of three more ICEs when building Linux
kernel with mainline GCC (reported by Arnd):
PR target/50305: Inline asm reload failure when building Linux kernel
PR middle-end/50307: SSA checking ICE when building Linux kernel
PR tree-optimization/50318: ICE optimizing widening
multiply-and-accumulate
* Implemented proposed fix for PR target/50305 and posted for review.
== Misc ==
* Installed updated FPGA bitfiles on my Versatile Express and verified
that network stability issues (LP #673820) are now fixed.
* Booked Linaro Connect Q4.11 travel.
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
--
Dr. Ulrich Weigand | Phone: +49-7031/16-3727
STSM, GNU compiler and toolchain for Linux on System z and Cell/B.E.
IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Martin Jetter | Geschäftsführung: Dirk
Wittkopp
Sitz der Gesellschaft: Böblingen | Registergericht: Amtsgericht
Stuttgart, HRB 243294
RAG:
Red:
Amber:
Green: overrunning OMAP3 upstreaming work (mostly) replanned
Current Milestones:
|| || Planned || Estimate || Actual ||
||qemu-linaro 2011-09 || 2011-09-15 || 2011-09-15 || ||
Historical Milestones:
||qemu-linaro 2011-04 || 2011-04-21 || 2011-04-21 || 2011-04-21 ||
||qemu-linaro 2011-05 || 2011-05-19 || 2011-05-19 || n/a ||
||close out 1105 blueprints || 2011-05-28 || 2011-05-28 || 2011-05-19 ||
||complete 1111 planning || 2011-05-28 || 2011-05-28 || 2011-05-27 ||
||qemu-linaro-2011-06 || 2011-06-16 || 2011-06-16 || 2011-06-16 ||
||qemu-linaro-2011-07 || 2011-07-21 || 2011-07-21 || 2011-07-21 ||
||qemu-linaro 2011-08 || 2011-08-18 || 2011-08-18 || 2011-08-18 ||
== upstream-omap3-patches ==
* more reshuffling of patches and dropping of unnecessary changes (eg
code reformatting)
* we're going to divide this blueprint into four, each of which has a
reasonably clearly defined submilestone and an estimated 3
engineering weeks of work in it
* in order to not have work on this completely push out other items on
the schedule, we're going to limit work done on this to 2 or 3 days
each week
* still todo: actually split the blueprint, set dates for
submilestones, check that other blueprints fit reasonably in the
other half-week
== linaro-qemu-11.11 ==
* built a pre-release tarball and tested it -- looks good for next
week's release
* investigated whether we can reinstate the firmware blobs in our
releases (bringing us back into line with upstream) -- should be
possible but need to go through the license approval process since
some are GPLv3
== a15-system-mode-planning ==
* starting to see some (gentle) pressure for A15 support
* thinking about what we should do here; my current opinion is that
QEMU should implement an "A15 without virtualization or LPAE" -- we
have Linux kernels that will boot on this, and it is essentially
what an A15-on-A15 hw virt guest would see. The device work will be
needed for KVM anyway. Need to write this up.
== other ==
* submitted TSC licensing request to add the firmware blobs back into
our qemu-linaro tarballs, in line with how upstream do releases
* I need to track better how much time I'm spending on things like code
review on qemu-devel, minor bug fixing and other things that aren't
blueprints
* all holiday to the end of the year now booked (see below)
Current qemu patch status is tracked here:
https://wiki.linaro.org/PeterMaydell/QemuPatchStatus
Absences (to end of year):
Sep 19, Sep 29-Oct 07, Oct 17, Nov 21, Dec 15-Jan 03: leave
Oct 30-Nov 04: Linaro Connect Q4.11
Hi,
* I've been setting up a new system because the old laptop died
* finished the initial port of libunwind for Android on ARM
* changed debuggerd to make use of libunwind to unwind the stack of
crashing applications
* it works and the output looks great :)
* I plan to document these things in the wiki by next week
Regards
Ken
Just as an FYI, I've added these loops to the libav microbenchmarks
avg-h264-chroma-mc8-8.txt
avg-pixels8-8.txt
ff-h264-idct-add-8-8.txt
ff-put-pixels8x16-8.txt
h264-loop-filter-luma-8.txt
idct-internal-8.txt
put-h264-chroma-mc8-8.txt
put-h264-qpel8-h-lowpass-8.txt
put-h264-qpel8-hv-lowpass-8.txt
put-h264-qpel8-v-lowpass-8.txt
based on Michael's h264 profile. These loops:
decode_residual
ff_h264_decode_mb_cavlc
fill_decode_caches
aren't really the kind of thing that the microbenchmark is designed for;
running the whole h264 benchmark is probably a better test. Some of the
functions in the profile just consist of two copies of a simpler loop,
one after the other, so for those I just used the simpler loop.
Usual microbenchmark caveats apply.
Richard
Hi,
* merged vector over-promotion patch to linaro-gcc-4.6
* committed upstream the change of the default vector size for NEON
* continued working on widening shifts
Ira
Hi Dave. I've been hacking away and have checked in a couple of
benchmarking and plotting scripts to lp:cortex-strings. The current
results are at:
http://people.linaro.org/~michaelh/incoming/strings-performance/
All are done on an A9. The results are very incomplete due to how
long things take to run. I'll leave ursa3 doing these over the
weekend which should flesh this out for the other routines.
Your new memcpy() is looking good as well - as fast as GLIBC.
-- Michael
While out benchmarking today, I ran across code similar to this:
int *a;
int *b;
int *c;
const int ad[320];
const int bd[320];
const int cd[320];
void fill()
{
  for (int i = 0; i < 320; i++)
    {
      a[i] = ad[i];
      b[i] = bd[i];
      c[i] = cd[i];
    }
}
I was surprised and happy to see the vectoriser kick in for the copy.
The inner loop looks like:
add r5, r3, ip
adds r4, r3, r7
vldmia r2!, {d16-d17}
vldmia r1!, {d18-d19}
adds r0, r3, r6
vst1.32 {q9}, [r5]
vst1.32 {q8}, [r4]
vldmia r3, {d16-d17}
adds r3, r3, #16
cmp r3, r8
vst1.32 {q8}, [r0]
bne .L3
so r3 is the loop variable and {ip,r7} are the offsets from r3 to the
destination pointers. Adding a __restrict doesn't change the code.
Richard, will your auto-inc/dec changes combine the final vldmia r3,
add r3 into a vldmia r3! ?
Changing the int *a into in-file arrays like int a[320] gives:
vldmia r0!, {d16-d17}
vldmia r5!, {d18-d19}
vstmia r4!, {d18-d19}
vstmia r1!, {d16-d17}
vldmia r2!, {d16-d17}
vstmia r3!, {d16-d17}
cmp r3, r6
bne .L2
Marking them as extern int a[320] goes back to the first form.
Can we always use the second form? What optimisation is preventing it?
-- Michael
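For anyone wanting to reproduce the second form, here is a self-contained sketch of the in-file array variant described above. The initialiser values and the N macro are made up for illustration; only the shape of the loop follows the original snippet.

```c
#define N 320

/* Destinations are arrays defined in this file rather than pointers. */
int a[N], b[N], c[N];

/* Illustrative source data; remaining elements are zero-initialised. */
const int ad[N] = {1, 2, 3};
const int bd[N] = {4, 5, 6};
const int cd[N] = {7, 8, 9};

void fill(void)
{
    for (int i = 0; i < N; i++) {
        a[i] = ad[i];
        b[i] = bd[i];
        c[i] = cd[i];
    }
}
```

Changing the three destination definitions to extern declarations, or to int pointers, gives the other two variants compared in the message.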
On Fri, Sep 2, 2011 at 4:51 AM, David Gilbert <david.gilbert(a)linaro.org> wrote:
> Hi Michael,
> I've just committed a pair of memcpy's into src/linaro-a9 - memcpy.S
> that is armv7
> and memcpy-hybrid.S that is a Neon hybrid which uses neon for non-aligned cases
> and for large (128K or larger) copies. I've also (accidentally)
> wired the memcpy-hybrid
> one into the Makefile.am (I wasn't sure what the right way to do this
> was - the neon_sources
> seemed a good place for it, but there is nothing currently in there
> that turns off the non-neon
> version).
>
> I'd be interested in seeing the results for both; I've got a bit of
> a soft spot for the hybrid
> solution.
>
> On the memset, yes the 'and' that you added is fine - but I started
> having a play and have
> some performance results (on -t 128) that I don't really understand:
>
>
> 1) and r1,#0xff
> orr r1,r1,r1,lsl#8
> orr r1,r1,r1,lsl#16
>
> That's your solution - and fastest at somewhere around 2270MB/s for
> me - by the TRM I reckon
> that should be 3 cycles.
>
> 2) lsls r1,#24
> orr r1,r1,r1,lsr#8
> orr r1,r1,r1,lsr#16
>
> lsl isn't explicitly listed in the TRM, so I assumed that was the
> same as a move with a constant
> shift, which my reading is that it's a single cycle; and the lsls is 2
> bytes - so you would think
> that should be as fast as yours but 2 bytes smaller - except it's
> reliably down at 2228MB/s - so
> it is slower.
>
> 3) Thinking it was an alignment issue I tried adding a mov r5,r1 to
> the front of that, and got 2248MB/s -
> so being faster with an extra instruction it probably was an alignment issue?
>
> 4) I also tried a pair of bfi's:
> bfi r1,r1, #8, #8
> bfi r1,r1, #16, #16
>
> That came out at 2228MB/s - and is 4 cycles by the book.
Unfortunately you can't tell the performance from the latency.
Attached is a micro benchmark that has the three different versions
(and, ubx, lsl). After compensating for the loop time, I got:
* lsl: 1.006 s
* ubx: 0.876 s
* and: 0.918 s
even though ubx has a latency of two cycles.
I then took the AND version and shifted it to the start of the file.
This small change in alignment pushed it up to 1.048 s which is 14 %
slower.
-- Michael
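For reference, the three byte-replication sequences quoted above (1, 2 and 4) can be modelled in C to confirm they all produce the same fill word. This is only a model of the arithmetic; it says nothing about the timing differences, and the function names are made up.

```c
#include <stdint.h>

/* 1) and r1,#0xff ; orr r1,r1,r1,lsl#8 ; orr r1,r1,r1,lsl#16 */
static uint32_t splat_and(uint32_t r1)
{
    r1 &= 0xffu;
    r1 |= r1 << 8;
    r1 |= r1 << 16;
    return r1;
}

/* 2) lsls r1,#24 ; orr r1,r1,r1,lsr#8 ; orr r1,r1,r1,lsr#16 */
static uint32_t splat_lsl(uint32_t r1)
{
    r1 <<= 24;
    r1 |= r1 >> 8;
    r1 |= r1 >> 16;
    return r1;
}

/* 4) bfi r1,r1,#8,#8 ; bfi r1,r1,#16,#16
   BFI copies the low <width> bits of the source into the destination
   starting at bit <lsb>, leaving the other bits unchanged. */
static uint32_t splat_bfi(uint32_t r1)
{
    r1 = (r1 & ~0x0000ff00u) | ((r1 & 0xffu) << 8);
    r1 = (r1 & ~0xffff0000u) | ((r1 & 0xffffu) << 16);
    return r1;
}
```

All three ignore the upper 24 bits of the incoming value, which is the property memset needs.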
* Linaro GCC
Fixed up, committed and posted two bug fixes to my thumb2 constants
patches, found by other people running FSF trunk.
Analysed bug lp:836401 / pr50193, developed a fix, and posted it both
upstream and to launchpad for testing. The launchpad tests have come
back clean, and the patch is approved, but upstream have not approved it
yet.
Posted a query to linaro-dev mailing list asking for ARM CPU ID register
numbers, and got lots back. Entered these into the patch, and began some
test builds. I will post the new version upstream, if all's well, next week.
Started looking at an optimization I discussed with Richard Earnshaw in
Cambourne, in which GCC attempts to synthesize constants more
efficiently by reusing constant values already in registers. I've made a
start, but not much more to say just yet.
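As a rough illustration of the kind of reuse meant here, and not a description of the actual patch: if a register already holds a constant close to the one wanted, a single ADD or SUB with a small immediate may be cheaper than building the new value from scratch. The sketch below assumes only that an immediate in the range 0 to 255 is always encodable; the helper name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: can `wanted` be formed from `available` with one
   ADD or SUB of an immediate in the range 0..255? */
static bool reachable_by_small_delta(uint32_t available, uint32_t wanted)
{
    uint32_t up = wanted - available;    /* delta for ADD, modulo 2^32 */
    uint32_t down = available - wanted;  /* delta for SUB, modulo 2^32 */
    return up <= 0xffu || down <= 0xffu;
}
```

A real pass would consider many more derivations (shifts, inversions, wider immediates) and would need to know which constants are live at each point.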
* Other
- Public holiday Monday.
- Half day leave on Wednesday.
- Internal training session
Continue looking at Richard's micro benchmarks w.r.t SMS.
Examining Ayal's comments to the patch to support instructions with
REG_INC_NOTE in SMS.
(http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01216.html)
Took one day off yesterday (4/9)
Do we know anything about "Csmith"?
Maybe we should try it?
Andrew
-------- Original Message --------
Subject: Re: [PATCH][ARM] pr50193: ICE on a | (b << negative-constant)
Date: Thu, 1 Sep 2011 13:21:38 +0000 (UTC)
From: Joseph S. Myers <joseph(a)codesourcery.com>
To: Andrew Stubbs <ams(a)codesourcery.com>
CC: gcc-patches(a)gcc.gnu.org, patches(a)linaro.org
Newsgroups: gmane.comp.gcc.patches
References: <4E5F6B5F.2020207(a)codesourcery.com>
On Thu, 1 Sep 2011, Andrew Stubbs wrote:
> This patch fixes the problem by merely checking that the constant is positive.
> I've confirmed that values larger than the mode-size are not a problem because
> the compiler optimizes those away earlier, even at -O0.
Do you mean that you have observed for some testcases that they get
optimized away - or do you have reasons (if so, please state them) to
believe that any possible path through the compiler that would result in a
larger constant here (possibly as a result of constant propagation and
other optimizations) will always result in it being optimized away as
well? If it's just observation it would be better to put the complete
check in here.
Quite a few of the Csmith-generated bug reports from John Regehr have
involved constants appearing in unexpected places as a result of
transformations in the compiler. It would probably be a good idea for
someone to try using Csmith to find ARM compiler bugs (both ICEs and
wrong-code); pretty much all the bugs reported have been testing on x86
and x86_64, so it's likely there are quite a few bugs in the ARM back end
that could be found that way.
--
Joseph S. Myers
joseph(a)codesourcery.com
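A sketch of what the "complete check" suggested in the review above might look like, assuming the valid shift counts are 0 up to one less than the mode size in bits. The helper name and signature are hypothetical; this is not the code from the posted patch.

```c
#include <stdbool.h>

/* Hypothetical helper: accept only shift counts in [0, mode_bits). */
static bool shift_count_in_range(long count, unsigned mode_bits)
{
    return count >= 0 && (unsigned long)count < mode_bits;
}
```

Checking both bounds means the pattern no longer relies on earlier passes having removed oversized constants.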
Hi,
libunwind:
* improvements in case the user doesn't use ARM unwind tables but DWARF info
* code used to pick ARM unwind from the crt files which says "cantunwind"
android:
* upgraded working base to 11.08
* continue to port libunwind to android
* noticed a header file clash that causes errors
* finished an android app that uses a native part to crash the process
* as a vehicle to test the modified debuggerd
Regards
Ken
== GDB ==
* Russell King now wants to revert my kernel patch that
fixed #615974; discussed alternative options.
== GCC ==
* Patch review week.
* Analyzed root cause of ICE when building Linux kernel
with mainline GCC (reported by Arnd).
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
== QEmu ==
* Sent 64bit atomic helper fix upstream
* Basic boot time and simple benchmarks v Panda board
* Tested prebuilt images and Peter's latest post-merge QEmu tree
- The full Ubuntu desktop on an emulated Overo is a bit slow -
it's rather short on RAM
- The full Ubuntu desktop on an emulated VExpress isn't bad; it's
got the full 1G (with a particularly grim line of awk to mount
vexpress images, based on Peter's suggestion of the use of 'file')
== String routines ==
* Pushed memcpy and memset up to cortex-strings bzr
* Working through memset issue with Michael
- Made my code a little less sensitive to initial alignment
== Hard float ==
* Testing libffi 3.0.11rc1 - still hasn't got variadic patch in, but
hoping it will land later in the cycle.
== Other ==
* Excavating inbox after week off.
* Built LMbench and kicked a run off on Panda. (Got stuck in some
heuristics under emulation)
Dave
== This week ==
* Looked at the get_arm_condition_code ICE. Seems to be a popular bug:
was reported as #589887, #823708 and #809761 in Launchpad and as PR49030
in bugzilla. Sent a patch upstream.
* Submitted SMS register-dependency patch upstream.
* Reviewed Bernd's new shrink-wrap patch.
* Tried to clean up my microbenchmarks. Found that preloading the caches
at the start of the benchmark fixed the variations I was seeing on a
Beagleboard. (As Dave says, it seems that there's no allocation
on write.) Added code to check the results of each loop. Packaged
it up and pushed into bzr.
* An IBM colleague kindly tried my -fsched-pressure patch/hack
on s390. Although it performed the best of the three runs
(trunk, -fsched-pressure, patch+-fsched-pressure), there were
some disappointing outliers. -fsched-pressure still introduces a 7%
regression in one test (down from a 14% regression without the patch).
Another test benefited from -fsched-pressure without my patch but
regressed with it.
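The cache preloading and result checking mentioned in the microbenchmark item above could look roughly like the following. This is a sketch under assumptions: the buffer size, the loop under test and all names are made up, and the real harness in bzr may be structured differently. The preload reads the destination as well as the source, since writes alone may not allocate cache lines on that board.

```c
#include <stddef.h>
#include <stdint.h>

#define BUF_LEN 4096  /* illustrative size, assumed to fit in the cache */

static uint8_t src[BUF_LEN];
static uint8_t dst[BUF_LEN];

/* Initialise the buffers, then read every byte of both so they are
   resident in the cache before any timed pass starts. */
static void preload_buffers(void)
{
    volatile uint8_t sink = 0;

    for (size_t i = 0; i < BUF_LEN; i++) {
        src[i] = (uint8_t)i;
        dst[i] = 0;
    }
    for (size_t i = 0; i < BUF_LEN; i++)
        sink ^= (uint8_t)(src[i] ^ dst[i]);
    (void)sink;
}

/* Stand-in for a loop being benchmarked. */
static void loop_under_test(void)
{
    for (size_t i = 0; i < BUF_LEN; i++)
        dst[i] = (uint8_t)(src[i] + 1);
}

/* Check the output so a miscompiled loop is not reported as a speed-up. */
static int results_ok(void)
{
    for (size_t i = 0; i < BUF_LEN; i++)
        if (dst[i] != (uint8_t)(src[i] + 1))
            return 0;
    return 1;
}
```

A driver would call preload_buffers once, time repeated calls to loop_under_test, and then call results_ok before reporting a number.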
== Next week ==
* Look more at SMS.
* Look more at the sched-pressure thing (if I get time).
Richard
* SPEC2K week. Experimented with building and running and finally did a full
run on the Panda board.
* Running SPEC2K on the Snowball board as well. It is troublesome to work
with the board because of the ethernet problem that makes the board freeze
after some time. I have to use the SD-card for file transfer. It seems to be
a known issue on the Snowball V3. Tested a V5 board which was much more
stable. Have asked for a V5 board from ST-E Linaro internal project manager.
(Do not know when I will get it.)
* Preliminary travel booking done for Linaro Connect in Orlando.
Best Regards
Åsa