Hi,
the current gcc-4.6 packages build for both softfp and hard, so that the armel
and (not yet existing) armhf packages can be installed together in the system.
To enable multilib, I currently use the rather complicated arm-multilib.diff,
which works, but doesn't seem to be correct. With the much simpler arm-ml2.diff,
the directory for the default multilib is not resolved to . (as done for e.g.
amd64).
[amd64] $ gcc -print-multi-directory
.
[armel] $ gcc -print-multi-directory
sf
If I understand the code correctly, this comes from the hard setting of
MULTILIB_DEFAULTS in the arm target. If you look at mips, you see
  #ifndef MULTILIB_DEFAULTS
  #define MULTILIB_DEFAULTS \
      { MULTILIB_ENDIAN_DEFAULT, MULTILIB_ISA_DEFAULT, MULTILIB_ABI_DEFAULT }
  #endif
which records the proper selected defaults.
Should something similar be done for arm?
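To make the question concrete, here is a minimal sketch of the same pattern in isolation. Every DEMO_* name is hypothetical and is not a real GCC macro; the float-ABI strings are only illustrative. The point is that the defaults list is assembled from whichever default was selected at configure time, instead of from fixed strings.

```c
#include <stddef.h>

/* All DEMO_* names are hypothetical; they are not real GCC macros. */

/* Assumption: a configure-time choice would define DEMO_WITH_HARD_FLOAT. */
#ifdef DEMO_WITH_HARD_FLOAT
#define DEMO_FLOAT_ABI_DEFAULT "mfloat-abi=hard"
#else
#define DEMO_FLOAT_ABI_DEFAULT "mfloat-abi=softfp"
#endif

/* Same shape as the mips definition quoted above: the list is built from
   the selected default rather than from a fixed string, and the #ifndef
   guard lets a more specific header override it. */
#ifndef DEMO_MULTILIB_DEFAULTS
#define DEMO_MULTILIB_DEFAULTS { "marm", DEMO_FLOAT_ABI_DEFAULT }
#endif

static const char *const demo_multilib_defaults[] = DEMO_MULTILIB_DEFAULTS;

/* Number of entries in the defaults list. */
static size_t demo_multilib_defaults_count(void)
{
    return sizeof demo_multilib_defaults / sizeof demo_multilib_defaults[0];
}
```

Compiling with -DDEMO_WITH_HARD_FLOAT switches the recorded default without touching the list definition itself.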
Matthias
Hi; I've just completed a tricky rebase of qemu-linaro on upstream;
there were several invasive upstream changes which have landed
recently and which meant that I had to tweak a lot of the qemu-linaro
patches as I did the rebase. I've tested the results but it's possible
that some breakage may have slipped through...
So if you're a regular user of qemu-linaro's system mode and feel
like checking the sources out of git:
git://git.linaro.org/qemu/qemu-linaro.git
building them and testing that the things you regularly do with it
haven't regressed, then I'd appreciate it, and you can help us avoid
any nasty surprises in the next (2011.09) release.
Thanks!
-- Peter Maydell
See:
http://builds.linaro.org/toolchain/gcc-4.7~svn178154
The problem is -Werror triggering on:
../../../gcc-4.7~/gcc/config/arm/arm.c: In function 'int
optimal_immediate_sequence_1(rtx_code, long long unsigned int,
four_ints*, int)':
../../../gcc-4.7~/gcc/config/arm/arm.c:2690:46: error: comparison
between signed and unsigned integer expressions [-Werror=sign-compare]
../../../gcc-4.7~/gcc/config/arm/arm.c:2690:60: error: comparison
between signed and unsigned integer expressions [-Werror=sign-compare]
../../../gcc-4.7~/gcc/config/arm/arm.c:2691:20: error: comparison
between signed and unsigned integer expressions [-Werror=sign-compare]
../../../gcc-4.7~/gcc/config/arm/arm.c:2691:34: error: comparison
between signed and unsigned integer expressions [-Werror=sign-compare]
../../../gcc-4.7~/gcc/config/arm/arm.c:2701:16: error: comparison
between signed and unsigned integer expressions [-Werror=sign-compare]
../../../gcc-4.7~/gcc/config/arm/arm.c:2702:18: error: comparison
between signed and unsigned integer expressions [-Werror=sign-compare]
../../../gcc-4.7~/gcc/config/arm/arm.c:2703:18: error: comparison
between signed and unsigned integer expressions [-Werror=sign-compare]
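For anyone unfamiliar with this diagnostic, a minimal sketch of the usual way to make such a comparison explicit follows. The helper name is hypothetical and the code is not taken from arm.c; it only illustrates the class of warning, not the fix needed at those lines.

```c
/* Hypothetical helper, not taken from arm.c.

   Comparing an int with an unsigned value directly, e.g. 'a < b' with
   'int a' and 'unsigned int b', is what -Wsign-compare reports: the int
   is converted to unsigned first, so a negative 'a' becomes a very large
   value and the comparison gives a surprising answer.

   Checking the sign first and then casting states the intent and keeps
   the mathematically expected result. */
static int signed_less_than_unsigned(int a, unsigned int b)
{
    return a < 0 || (unsigned int) a < b;
}
```

Where the signed operand is known to be non-negative, changing its type to an unsigned one is the other common fix.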
-- Michael
Booked hotel and travel for Linaro Connect in Orlando.
Fixed a couple of bugs in my thumb2 constants patch and retested. The
test results came back clean, so I've committed it upstream.
Bernd claimed he has found some test failures that might be caused by my
patch, but I couldn't reproduce them at first. I've now got the failure,
but I've not yet investigated the cause. Next week ...
Committed my widening multiplies patches to Linaro GCC, after first
convincing Richard Sandiford that it wasn't totally bonkers.
Started work on ARM GCC tuning options:
* Submitted a patch for -m{arch,cpu,tune}=native
http://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg14225.html
* Submitted a patch for -m{arch,cpu,tune}=generic-armv7-a
http://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg14231.html
Joseph found an issue with those patches, but that was easily resolved
and I've reposted both.
RAG:
Red:
Amber: OMAP3 patch upstreaming is (still) progressing more slowly than hoped
Green:
Current Milestones:
|| || Planned || Estimate || Actual ||
||qemu-linaro 2011-09 || 2011-09-15 || 2011-09-15 || ||
Historical Milestones:
||qemu-linaro 2011-04 || 2011-04-21 || 2011-04-21 || 2011-04-21 ||
||qemu-linaro 2011-05 || 2011-05-19 || 2011-05-19 || n/a ||
||close out 1105 blueprints || 2011-05-28 || 2011-05-28 || 2011-05-19 ||
||complete 1111 planning || 2011-05-28 || 2011-05-28 || 2011-05-27 ||
||qemu-linaro-2011-06 || 2011-06-16 || 2011-06-16 || 2011-06-16 ||
||qemu-linaro-2011-07 || 2011-07-21 || 2011-07-21 || 2011-07-21 ||
||qemu-linaro 2011-08 || 2011-08-18 || 2011-08-18 || 2011-08-18 ||
== upstream-omap3-patches ==
* finished the fairly nasty rebase of qemu-linaro onto upstream master
(several invasive changes went into master that meant a number of
our local patches needed updating)
* omap_gpmc changes cleaned up, updated to use MemoryRegions, and
submitted to upstream (17 patch patchset)
== gsoc-support ==
* final meeting/evaluation writeup now the GSoC project has ended
* we now have a first pass at what some upstream-acceptable versions
of the Android goldfish platform devices might look like, and a
much better idea of the degree of difference between the android
and upstream qemu trees, and where the pitfalls/issues lie
== other ==
* fixed some breakage upstream in n810 and integratorcp models caused
by landing of MemoryRegion changes
* trying to write up what my preferred model of device connections
would look like in concrete C implementation terms
* interesting discussion on boot-architecture list about how boot
loaders should start hypervisor-aware software (xen, kvm kernel):
http://www.mail-archive.com/boot-architecture@lists.linaro.org/msg00053.html
Current qemu patch status is tracked here:
https://wiki.linaro.org/PeterMaydell/QemuPatchStatus
Absences:
Oct 30-Nov 04: Linaro Connect Q4.11
== This week ==
* Wrote some patches to make SMS schedule register moves. They made a
significant difference to some libav loops. I'm running a regression
test on powerpc-ibm-aix5.3.0 and will submit upstream next week if
all goes OK.
* Looked at why mjpegenc was so much worse with SMS. Turned out to be
a register spilling problem. Found that -fira-algorithm=priority
avoids the regression and makes several other tests better too.
(I just tested that to see whether there was a feasible register
allocation for these cases; -fira-algorithm=priority isn't the
way to go.)
* Saw that the register allocator seemed to be tripping over the
XImode "structure" values, and that we still had one vector move
per structure element by the time we get to the scheduling passes.
Eliminated those with a combination of one fix and one hack.
It seemed to avoid the allocation problems.
* Patch review (Linaro and upstream).
* Backported libgcc visibility fix to 4.6 and 4.5.
== Next week ==
* Submit register-scheduling patch.
* Submit memory cost patch (from auto-inc-dec changes)
* Possibly submit the auto-inc-dec changes themselves, depending on
how the rtx cost discussion goes.
Richard
==GCC==
===Progress===
* Looked at the vectorize_with_neon_quad failure again and decided
that I had to handle another case, but I was not convinced that the
extra stall we'd get in this case was worth it. In any case it would
have been a workaround; Richard Sandiford fixed this properly by
getting df to do the right thing.
* Backported tbh patch.
* Backported conditional execution improvements patch from Jiangning
to Linaro 4.6 branch.
* Committed the LTO + Neon / Android intrinsics patch.
* Panda seems more reliable this week, but I suspect that's the room
cooling more.
* Broke up a few blueprints and marked some as done.
* BRANCH_COST results show no huge variation in SPEC, and some of the
results are inconsistent. Need to run a few benchmarks again.
Sigh :(
* Finished the A9 scheduler patch for smull and friends and committed
upstream and into Linaro 4.6.
* Briefly reviewed the shrink-wrapping patch and the widening
multiplies patch.
* Looked at the failures in the "popular embedded benchmark" for
some time with Asa.
* Tried one of the ICE patches and that seemed to work just fine with
bootstrap on FSF trunk. Need to figure out why this was breaking in
the Linaro 4.6 tree. https://bugs.launchpad.net/gcc-linaro/+bug/689887
=== Plans ===
Next Week - Holiday :) Feet not up but walking in what looks like
typical bank holiday weather ... Might check email later in the week.
Meetings:
* 1-1s
* TCWG calls
* Thumb2 performance call.
Absences.
* 29th Aug - Sept. 2 - Holiday booked and approved.
* 31st Oct - 4th Nov - Linaro Summit Orlando - Travel booked - hotel
to be booked.
* Investigated the errors in the automotive test and concluded that they are
CRC errors that do not depend on the test case result (non-intrusive CRC
check). We decided these errors need to be cleared out once and for all.
Michael and Ramana are helping out with the continued investigation.
* EEMBC run on both Panda and Snowball with gcc4.5.2. Results look
reasonable, but Michael will also have a look. I will spend a little more
time comparing the results from the two boards.
* Started to run SPEC2K on the Panda board.
Best Regards
Åsa
Following on from yesterday's call about what it would take to enable
SMS by default: one of the problems I was seeing with the SMS+IV patch
was that we ended up with excessive moves. E.g. a loop such as:
  void
  foo (int *__restrict a, int n)
  {
    int i;

    for (i = 0; i < n; i += 2)
      a[i] = a[i] * a[i + 1];
  }
would end up being scheduled with an ii of 3, which means that in the
ideal case, each loop iteration would take 3 cycles. However, we then
added ~8 register moves to the loop in order to satisfy dependencies.
Obviously those 8 moves add considerably to the iteration time.
I played around with a heuristic to see whether there were enough
free slots in the original schedule to accommodate the moves.
That avoided the problem, but it was a hack: the moves weren't
actually scheduled in those slots. (In current trunk, the moves
generated for an instruction are inserted immediately before that
instruction.)
I mentioned this to Revital, who told me that Mustafa Hagog had
tried a more complete approach that really did schedule the moves.
That patch was quite old, so I ended up reimplementing the same kind
of idea in a slightly different way. (The main functional changes
from Mustafa's version were to schedule from the end of the window
rather than the start, and to use a cyclic window. E.g. moves for
an instruction in row 0 column 0 should be scheduled starting at
row ii-1 downwards.)
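As an illustration of the cyclic window only, here is a sketch of the order in which rows would be tried for a move. The helper name is hypothetical, this is not code from the patch, and treating the instruction's own row as the last candidate is an assumption on my part.

```c
/* Hypothetical helper, not code from the patch.

   Returns the k-th row to try (k = 0 .. ii-1) when placing a move for an
   instruction that sits in 'row' of a schedule with 'ii' rows.  The
   search starts in the row cyclically before the instruction and walks
   downwards, wrapping from row 0 round to row ii-1.  Assumption: the
   instruction's own row is the last candidate. */
static int move_candidate_row(int row, int ii, int k)
{
    /* The double modulo keeps the result in [0, ii) even though C's %
       yields a negative remainder for a negative left operand. */
    return ((row - 1 - k) % ii + ii) % ii;
}
```

For an instruction in row 0 with ii = 3, this tries rows 2, 1, 0 in that order, which matches the "starting at row ii-1 downwards" description.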
The effect on my flawed libav microbenchmarks was much greater
than I imagined. I used the options:
-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
-fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
The "before" code was from trunk, the "after" code was trunk + the
register scheduling patch alone (not the IV patch). Only the tests
that have different "before" and "after" code are run. The results were:
a3dec
  before:  500000 runs take 4.68384s
  after:   500000 runs take 4.61395s
  speedup: x1.02
aes
  before:  500000 runs take 20.0523s
  after:   500000 runs take 16.9722s
  speedup: x1.18
avs
  before:  1000000 runs take 15.4698s
  after:   1000000 runs take 2.23676s
  speedup: x6.92
dxa
  before:  2000000 runs take 18.5848s
  after:   2000000 runs take 4.40607s
  speedup: x4.22
mjpegenc
  before:  500000 runs take 28.6987s
  after:   500000 runs take 7.31342s
  speedup: x3.92
resample
  before:  1000000 runs take 10.418s
  after:   1000000 runs take 1.91016s
  speedup: x5.45
rgb2rgb-rgb24tobgr16
  before:  1000000 runs take 1.60513s
  after:   1000000 runs take 1.15643s
  speedup: x1.39
rgb2rgb-yv12touyvy
  before:  1500000 runs take 3.50122s
  after:   1500000 runs take 3.49887s
  speedup: x1
twinvq
  before:  500000 runs take 0.452423s
  after:   500000 runs take 0.452454s
  speedup: x1
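For reference, the speedup figures are simply the "before" time divided by the "after" time for the same number of runs. A minimal sketch, with a hypothetical function name:

```c
/* Hypothetical helper: speedup as used in the table above, i.e. the
   "before" time divided by the "after" time for the same run count. */
static double speedup(double before_seconds, double after_seconds)
{
    return before_seconds / after_seconds;
}
```

For example, speedup(15.4698, 2.23676) is the avs figure and speedup(10.418, 1.91016) is the resample figure.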
Taking resample as an example: before the patch we had an ii of 27,
stage count of 6, and 12 vector moves. Vector moves can't be dual
issued, and there was only one free slot, so even in theory, this loop
takes 27 + 12 - 1 = 38 cycles. Unfortunately, there were so many new
registers that we spilled quite a few.
After the patch we have an ii of 28, a stage count of 3, and no moves,
so in theory, one iteration should take 28 cycles. We also don't spill.
So I think the difference really is genuine. (The large difference
in moves between ii=27 and ii=28 is because in the ii=27 schedule,
a lot of A--(T,N,0)-->B (intra-cycle true) dependencies were scheduled
with time(B) == time(A) + ii + 1.)
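The theoretical per-iteration estimates used for resample can be written down as a small helper. This is only a sketch of the arithmetic in this mail: the function name is hypothetical, and it assumes each move that cannot be placed in a free slot costs one extra cycle (which is the case described here, since vector moves can't be dual issued). It ignores spills.

```c
/* Hypothetical helper for the back-of-the-envelope estimate above.

   ii          initiation interval of the schedule
   moves       register moves added to satisfy dependencies
   free_slots  issue slots in the schedule that could absorb a move

   Assumption: every move that does not fit in a free slot costs one
   extra cycle.  Spill code is not modelled. */
static int estimated_cycles(int ii, int moves, int free_slots)
{
    int extra = moves > free_slots ? moves - free_slots : 0;

    return ii + extra;
}
```

With these inputs, estimated_cycles(27, 12, 1) gives the 38 cycles quoted for the "before" schedule and estimated_cycles(28, 0, 0) gives the 28 cycles for the "after" schedule.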
I also saw benefits in one test in a "real" benchmark, which I can't
post here.
Richard