Hi,
* got llvm+clang working on ARM:
https://wiki.linaro.org/KenWerner/Sandbox/HowToBuildToolchainComponents#llv…
* checked whether llvm inlines the __sync_* builtins on ARM or not:
https://wiki.linaro.org/WorkingGroups/ToolChain/AtomicMemoryOperations#LLVM
* developed a patch for #681138 (tested with current gcc-linaro)
* spent some time bootstrapping GCC trunk in order to test and post
that patch on the mailing list, but wasn't successful
(ultimately ran into the issues discussed at #659713)
* did some verification work on #674090
* preparing to work on the "investigate current developer tools" item
Regards
Ken
Hi there. Currently you can't use NEON instructions in inline
assembly if the compiler is set to a VFP variant of -mfpu, such as
Ubuntu's -mfpu=vfpv3-d16. Trying code like this:
int main()
{
    asm("veor d1, d2, d3");
    return 0;
}
gives an error message like:
test.s: Assembler messages:
test.s:29: Error: selected processor does not support Thumb mode `veor d1,d2,d3'
The problem is that -mfpu=vfpv3-d16 has two jobs: it tells the
compiler what instructions to use, and also tells the assembler what
instructions are valid. We might want the compiler to use the VFP for
compatibility or power reasons, but still be able to use NEON
instructions in inline assembler without passing extra flags.
Inserting ".fpu neon" to the start of the inline assembly fixes the
problem. Is this valid? Are assembly files with multiple .fpu
statements allowed? Passing '-Wa,-mfpu=neon' to GCC doesn't work as
gas seems to ignore the second -mfpu.
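For instance (a minimal sketch; the file name and flags are only for
illustration, and it assumes gas accepts a mid-stream .fpu directive,
which is exactly the open question above):

/* test.c: compile with e.g.
   gcc -c -mthumb -mfloat-abi=softfp -mfpu=vfpv3-d16 test.c */
int main(void)
{
    /* Tell the assembler to accept NEON for this fragment only. */
    asm(".fpu neon\n\t"
        "veor d1, d2, d3");
    return 0;
}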
What's the best way to handle this? Some options are:
* Add '.fpu neon' directives to the start of any inline assembly
* Separate out the features, so you can specify the capabilities with
one option and restrict the compiler to a subset with another.
Something like '-mfpu=neon -mfpu-tune=vfpv3-d16'
* Relax the assembler so that any instructions are accepted. We'd
lose some checking of GCC's output though.
-- Michael
- Continued looking into NEON special loads and stores.
- Benchmarks: concentrated on EEMBC Telecom:
- autcor gets vectorized
- viterbi, besides strided data accesses, needs us to sink conditional
stores to allow if-conversion and make the main loop vectorizable
(see the sketch after this list). Since the potential here is 4x,
I think it's worthwhile to work on this.
- conven and fbital also have control-flow issues, but much more
complicated than viterbi's
- fft has a problem with its loop count; I would like to investigate
this a bit more
- diffmeasure doesn't seem to have vectorization potential
- Fixed GCC PR 46663 on trunk, testing the fix for 4.3, 4.4, 4.5.
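As an illustration of the viterbi item above, a hypothetical sketch of
conditional-store sinking (not the actual benchmark code):

/* Before: the guarded store blocks if-conversion. */
void before(int *out, const int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++)
        if (a[i] > b[i])
            out[i] = a[i] - b[i];
}

/* After sinking: an unconditional store of a selected value, so the
   loop body is branch-free and a candidate for vectorization. */
void after(int *out, const int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = (a[i] > b[i]) ? a[i] - b[i] : out[i];
}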
Hi,
Here's a work-in-progress patch which fixes many execution failures
seen in big-endian mode when -mvectorize-with-neon-quad is in effect
(which is soon to be the default, if we stick to the current plan).
But, it's pretty hairy, and I'm not at all convinced it's not working
"for the wrong reason" in a couple of places.
I'm mainly posting to gauge opinions on what we should do in big-endian
mode. This patch works with the assumption that quad-word vectors in
big-endian mode are in "vldm" order (i.e. with constituent double-words
in little-endian order: see previous discussions). But, that's pretty
confusing, leads to less than optimal code, and is bound to cause more
problems in the future. So I'm not sure how much effort to expend on
making it work right, given that we might be throwing that vector
ordering away in the future (at least in some cases: see below).
The "problem" patterns are as follows.
* Full-vector shifts: these don't work with big-endian vldm-order quad
vectors. For now, I've disabled them, although they could
potentially be implemented using vtbl (at some cost).
* Widening moves (unpacks) & widening multiplies: when widening from
D-reg to Q-reg size, we must swap double-words in the result (I've
done this with vext). This seems to work fine, but what "hi" and "lo"
refer to is rather muddled (in my head!). Also they should be
expanders instead of emitting multiple assembler insns.
* Narrowing moves: implemented by "open-coded" permute & vmovn (for 2x
D-reg -> D-reg), or 2x vmovn and vrev64.32 for Q-regs (as
suggested by Paul). These seem to work fine.
* Reduction operations: when reducing Q-reg values, GCC currently
tries to extract the result from the "wrong half" of the reduced
vector. The fix in the attached patch is rather dubious, but seems
to work (I'd like to understand why better).
We can sort those bits out, but the question is, do we want to go that
route? Vectors are used in three quite distinct ways by GCC:
1. By the vectorizer.
2. By the NEON intrinsics.
3. By the "generic vector" support.
For the first of these, I think we can get away with changing the
vectorizer to use explicit "array" loads and stores (i.e. vldN/vstN), so
that vector registers will hold elements in memory order -- so, all the
contortions in the attached patch will be unnecessary. ABI issues are
irrelevant, since vectors are "invisible" at the source code layer
generally, including at ABI boundaries.
For the second, intrinsics, we should do exactly what the user
requests: so, vectors are essentially treated as opaque objects. This
isn't a problem as such, but might mean that instruction patterns
written using "canonical" RTL for the vectorizer can't be shared with
intrinsics when the order of elements matters. (I'm not sure how many
patterns this would refer to at present; possibly none.)
The third case would continue to use "vldm" ordering, so if users
inadvertently write code such as:
res = vaddq_u32 (*foo, bar);
instead of writing an explicit vld* intrinsic (for the load of *foo),
the result might be different from what they expect. It'd be nice to
diagnose such code as erroneous, but that's another issue.
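For clarity, a sketch of the intrinsics idiom that avoids the
ambiguity (hypothetical function; the explicit vld1q_u32 load gives a
well-defined element order):

#include <arm_neon.h>

uint32x4_t add_u32(const uint32_t *foo, uint32x4_t bar)
{
    uint32x4_t v = vld1q_u32(foo);  /* explicit, element-ordered load */
    return vaddq_u32(v, bar);
}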
The important observation is that vectors from case 1 and from cases 2/3
never interact: it's quite safe for them to use different element
orderings, without extensive changes to GCC infrastructure (i.e.,
multiple internal representations). I don't think I quite realised this
previously.
So, anyway, back to the patch in question. The choices are, I think:
1. Apply as-is (after I've ironed out the wrinkles), and then remove
the "ugly" bits at a later point when vectorizer "array load/store"
support is implemented.
2. Apply a version which simply disables all the troublesome
patterns until the same support appears.
Apologies if I'm retreading old ground ;-).
(The CANNOT_CHANGE_MODE_CLASS fragment is necessary to generate good
code for the quad-word vec_pack_trunc_<mode> pattern. It would
eventually be applied as a separate patch.)
Thoughts?
Julian
ChangeLog
gcc/
* config/arm/arm.h (CANNOT_CHANGE_MODE_CLASS): Allow changing mode
of vector registers.
* config/arm/neon.md (vec_shr_<mode>, vec_shl_<mode>): Disable in
big-endian mode.
(reduc_splus_<mode>, reduc_smin_<mode>, reduc_smax_<mode>)
(reduc_umin_<mode>, reduc_umax_<mode>)
(neon_vec_unpack<US>_lo_<mode>, neon_vec_unpack<US>_hi_<mode>)
(neon_vec_<US>mult_lo_<mode>, neon_vec_<US>mult_hi_<mode>)
(vec_pack_trunc_<mode>, neon_vec_pack_trunc_<mode>): Handle
big-endian mode for quad-word vectors.
Hi there. I've had a few questions recently about how to build a
cross compiler, so I took a stab at writing the steps down in a
Makefile. See:
https://code.launchpad.net/~michaelh1/+junk/cross-build
Hopefully it's easy to follow. It uses a binary sysroot and gives you
vanilla binutils 2.20 and Linaro GCC 2010.11 in a good enough way that
you can cross-compile for Maverick. The script is minimal and trades
flexibility for readability.
Note that Marcin's cross-compiler packages or the Emdebian toolchains
are a better way to go, but if you want to see the steps involved,
check out the script.
Marcin or Matthias, would you mind reviewing it?
-- Michael
Hi,
- the struggle with the board took a lot of time
- continued to investigate special loads/stores
- looked for benchmarks:
  - EEMBC Consumer filters rgbcmy and rgbyiq should be vectorizable
    once vld3, vst3/4 are supported
  - EEMBC Telecom viterbi is supposed to give 4x on NEON once
    vectorized (according to
    http://www.jp.arm.com/event/pdf/forum2007/t1-5.pdf slide 29). My old
    version of viterbi is not vectorizable because of if-conversion
    problems. I'd be really happy to check the new version (it is supposed
    to be slightly different).
  - looking into other EEMBC benchmarks
  - FFMPEG http://www.ffmpeg.org/ (got this from Rony Nandy from
    User Platforms). It contains hand-vectorized code for NEON.
    Investigating.
I am probably taking a day off on Sunday.
Ira
This wiki page came up during the toolchain call:
https://wiki.linaro.org/Internal/People/KenWerner/AtomicMemoryOperations/
It gives the code generated for __sync_val_compare_and_swap
as including a push {r4} / pop {r4} pair because it uses too many
temporaries to fit them all in the call-clobbered registers. I think
you can tweak it a bit to get rid of that:
# int __sync_val_compare_and_swap (int *mem, int old, int new);
# if the current value of *mem is old, then write new into *mem
# r0: mem, r1: old, r2: new
        mov     r3, r0        # move r0 into r3
        dmb     sy            # full memory barrier
.LSYT7:
        ldrex   r0, [r3]      # load (exclusive) from memory pointed to by r3 into r0
        cmp     r0, r1        # compare contents of r0 (mem) with r1 (old) -> updates the condition flags
        bne     .LSYB7        # branch to .LSYB7 if mem != old
        # This strex trashes the r0 we just loaded, but since we didn't take
        # the branch we know that r0 == r1
        strex   r0, r2, [r3]  # store r2 (new) into memory pointed to by r3 (mem);
                              # r0 contains 0 if the store was successful, otherwise 1
        teq     r0, #0        # compare contents of r0 with zero -> updates the condition flags
        bne     .LSYT7        # branch to .LSYT7 if r0 != 0 (the store wasn't successful)
        # Move the value that was in memory into the right register to return it
        mov     r0, r1
        dmb     sy            # full memory barrier
.LSYB7:
        bx      lr            # return
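For reference, a minimal C sketch of what the builtin being expanded
above does:

#include <stdio.h>

int main(void)
{
    int mem = 5;
    /* Returns the value mem held before the call; 7 is only stored
       because the old value matched 5. */
    int prev = __sync_val_compare_and_swap(&mem, 5, 7);
    printf("prev=%d mem=%d\n", prev, mem);  /* prints prev=5 mem=7 */
    return 0;
}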
I think you can do a similar trick with __sync_fetch_and_add
(although you have to use a subtract to regenerate r0 from
r1 and r2).
On the other hand I just looked at the gcc code that does this
and it's not simply dumping canned sequences out to the
assembler, so maybe it's not worth the effort just to drop a
stack push/pop.
-- PMM
== Linaro GCC ==
* Finished testing the patch for lp675347 (QT inline-asm atomics), and
sent it upstream for comments (no response yet). Suggested reverting a
patch (which enabled -fstrict-volatile-bitfields by default on ARM)
locally for Ubuntu in the bug log.
* Continued working on NEON quad-word vectors/big-endian patch. This
turned out to be slightly fiddlier than I expected: I think I now have
semantics which make sense, though my patch requires (a) slight
middle-end changes, and (b) workarounds for unexpected combiner
behaviour re: subregs & sign/zero-extend ops. I will send a new version
of the patch to linaro-toolchain fairly soon for comments.
Hello,
I have a question about the cbnz/cbz Thumb-2 instruction implementation
in the thumb2.md file.
I have an example where we jump to a label which appears before the branch;
for example:
.L4:
        ...
        cmp     r3, #0
        bne     .L4
It seems that the cbnz instruction should be applied in this loop,
replacing cmp+bne; however, cbnz fails to be applied because diff =
ADDRESS(.L4) - ADDRESS(bne .L4) is negative, while according to
thumb2_cbz in thumb2.md it must satisfy 2 <= diff <= 128 (please see
the snippet below, taken from thumb2_cbz). So I want to double-check
whether the current implementation of thumb2_cbnz in thumb2.md needs
to be changed to enable this case.
The following is from thumb2_cbnz in thumb2.md:
[(set (attr "length")
      (if_then_else
          (and (ge (minus (match_dup 1) (pc)) (const_int 2))
               (le (minus (match_dup 1) (pc)) (const_int 128))
               (eq (symbol_ref ("which_alternative")) (const_int 0)))
          (const_int 2)
          (const_int 8)))]
Thanks,
Revital
== This week ==
* More ARM testing of binutils support for STT_GNU_IFUNC.
* Implemented the GLIBC support for STT_GNU_IFUNC. Simple ARM testcases
seem to run correctly. (For a source-level picture of what an ifunc
is, see the sketch after this list.)
* Ran the GLIBC testsuite -- which includes some ifunc coverage --
but haven't analysed the results yet.
* Started looking at Thumb for STT_GNU_IFUNC. The problem is that
BFD internally represents Thumb symbols with an even value and
a special st_type (STT_ARM_TFUNC); this is also the old, pre-EABI
external representation. We need something different for STT_GNU_IFUNC.
* Tried making BFD represent Thumb symbols as odd-value functions
internally. I got it to "work", but I wasn't really happy with
the results.
* Looked at alternatives, and in the end decided that it would be
better to have extra internal-only information in Elf_Internal_Sym.
This "works" too, and seems cleaner to me. Sent an RFC upstream:
http://sourceware.org/ml/binutils/2010-11/msg00475.html
* Started writing some Thumb tests for STT_GNU_IFUNC.
* Investigated #618684. Turned out to be something that Bernd
had already fixed upstream. Tested a backport.
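A minimal sketch of what an STT_GNU_IFUNC symbol looks like at the
source level, using GCC's ifunc attribute (hypothetical names):

static int impl(void) { return 42; }

/* The resolver runs once, at load time, and returns the address of
   the implementation to use; the dynamic linker binds foo to it. */
static void *resolve_foo(void) { return (void *)impl; }

int foo(void) __attribute__((ifunc("resolve_foo")));

int main(void) { return foo(); }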
== Next week ==
* More IFUNC tests (ARM and Thumb, EGLIBC and binutils).
Richard
== Linaro and upstream GCC ==
* LP #674146, dpkg segfaults during debootstrap on natty armel: analyzed
and found this should be a case of PR44768, backported mainline revision
161947 to fix this.
* LP #641379, bitfields poorly optimized. Discussed some with Andrew Stubbs.
* GCC bugzilla PR45416: code generation regression on ARM. Been looking
at this regression, which started with the expand-from-SSA changes in
4.5-experimental. The problem seems to be TER replacements not being
properly substituted during expand (compared with the prior "convert
to GENERIC, then expand" approach). I now have a patch for this which
fixes the PR's testcase, but when testing current upstream trunk I hit
an assertion-failure ICE in the alias oracle on several testcases; it
does, however, test without regressions on a 4.5-based compiler. I am
still looking at the upstream failures.
== This week ==
* Continue with GCC issues and PRs.
* Think about GCC performance opportunities (Linaro)
== Last Week ==
* Started writing libunwind support for ARM-specific unwinding, but
realized that the native ARM toolchain may be causing problems. Spent time
trying to isolate the issues, but haven't found the culprit yet.
* Began to consider the possibility of developing an ARM-specific
unwinding library which would integrate into libunwind (and be reusable
elsewhere). Basically, the ARM.ex{idx,tbl} sections are unique to ARM.
* Looked at other applications where unwinding functionality already
exists to see how ARM unwinding is done (e.g. GDB).
Could/should that functionality be replaced with calls to libunwind (or
to the aforementioned ARM-specific helper library)?
== This Week ==
* Continue to implement ARM-specific unwinding in libunwind.
--
Zach Welch
CodeSourcery
zwelch(a)codesourcery.com
(650) 331-3385 x743
== Linaro GCC ==
* LP:634738: First fixed this in the combiner; then tried the other
approach (without changes to arm.md) suggested by Andrew S, fixing
arm_gen_constant to generate lsl/lsr in some cases rather than loading
the constant. Some of the code in arm.c was written in 1998 and is
hard to understand, with few comments. Along the way, found that some
lsl/lsr pairs can be replaced by ubfx; use gen_extzv_t2 when
arm_arch_thumb2 is true to transform lsl + lsr into ubfx (see the
sketch after this list). Two patches are ready.
* LP:633243: Got build failures on FSF trunk for arm-none-linux-gnueabi.
Tested the patch on FSF trunk 2010-10-21. No regressions.
* LP:638935: the predicate "vfp_register_operand" should return true for
VFP_D0_D7_REGS registers. Fixed. The predicates
{store,load}_multiple_operation assume the mode is SImode and the data
size is 4; fixed them to accept multiple VFP operations. Wrote three
new test cases for the stm/fldm/fstm patterns. Tested the patch on FSF
trunk 2010-10-21. No regressions.
* SMS on thumb2: discussed the doloop pattern for thumb2 back and forth
with Revital Eres. The doloop pattern is not recognized on thumb2 so
far. Revital has a fix to the thumb2_cbz pattern; after this fix, the
doloop pattern should be recognized.
* Pinged the ARM fix for PR45701 on gcc-patches for the fifth time.
Still no reply.
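A sketch of the lsl/lsr-to-ubfx case mentioned above (hypothetical
example, not taken from the patches):

/* A zero-extended bitfield extract: naively lsl #20 then lsr #26;
   via gen_extzv_t2 on Thumb-2 it becomes a single
   ubfx r0, r0, #6, #6. */
unsigned int extract_bits(unsigned int x)
{
    return (x << 20) >> 26;   /* bits 6..11 of x, zero-extended */
}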
== This Week ==
* Look at regressions of ldm/stm backport on Linaro GCC 4.5.
* Internal review of patch to LP:638935
* Try SMS on thumb2 for EEMBC, if Revital's thumb2_cbz pattern fix is
accepted upstream.
--
Yao (齐尧)
Hi there. Our Versatile Express has been installed in the data centre
and is available for use. See:
https://wiki.linaro.org/WorkingGroups/ToolChain/Hardware
for the details. If you're a member of the Toolchain WG then you
should already have an account.
Dave is currently using this machine for benchmarking. Until we get
more hardware, please use IRC or email to manage access.
-- Michael
Reviewed Yao's patch for AND optimization. Some back and forth on the
best way to tackle this problem.
LP:663939 - thumb2 constant loading
- backported my patches to GCC 4.5
- awaiting review
LP:595479 - .eh_frame broken.
- Discovered this problem had been fixed (with Thomas' patch) since
August, and has also been fixed upstream, albeit with an alternative
patch. Nothing to do here.
LP:641379 - bitfields poorly optimized.
- analysed the problem. The code in cse.c that is supposed to fix this
does not recognise the case.
- created a patch and tested it for both GCC 4.6 and 4.5.
- awaiting review
LP:674146 - dpkg segfault.
- started looking at this, but Chung-Lin took it first.
While trying to reproduce lp:674146, I discovered that my IGEPv2 had a
corrupted rootfs, again. I only fixed it last week, so I looked into it
more deeply. It seems the SD card has developed at least one bad block.
Reformatted, scanned and reinstalled the files from backup. I think the
problem was caused by the daily apt package download (it was always
those files that were corrupt), so I've disabled that. I've also
disabled access-time-stamps. If it happens again I will have to consider
using a different underlying filesystem format.
LP:643479 / CS Issue:8610 - Multiply and accumulate optimization
(see the sketch after this item)
- created patches for both issues.
- both were machine description subtleties.
- backported the patches to 4.5
- the patches apply and work fine, but ...
- found an extra problem with redundant moves
- awaiting review
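A sketch of the multiply-accumulate shape at issue (hypothetical
example, not the actual testcase):

/* Ideally compiles to a single mla (or smlal for the widening 64-bit
   form) instead of separate mul and add instructions. */
int mac(int acc, int a, int b)
{
    return acc + a * b;
}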
GCC 4.6
- Created a new Launchpad series and branch to track GCC 4.6 development.
- Set up the CS internal build config.
- Tried to build the latest checkout and failed
- glibc problem still unfixed - Jie has reported it now.
- libquadmath build fails
Merged FSF GCC trunk (pre-4.5.2) into Linaro GCC 4.5 tree.
Merged the outstanding Launchpad merge requests into GCC 4.5. The
testing showed regressions, so I backed out most of the merges and did
them in smaller batches. Chung-Lin and Richard's patches passed the
testing, so that leaves Yao's as the problem patch. I didn't get time to
test this assertion this week.
----------------------------------
Next week: Vacation.
Andrew Stubbs wrote:
> * Instruction set coverage.
> - Are there any ARM/Thumb2 instructions that we are not taking
> advantage of? [2]
> - Do we test that we use the instructions we do have? [3]
There is no general framework to test instruction set coverage. The
only way to find out, really, is to create some test cases where you
expect the compiler to produce a certain insn.
Is there a list of all ARM/Thumb2 instructions and the ones implemented in
the GCC ARM machine descriptions?
> * Constant pools
> - it might be a very handy space optimization to have small
> functions share one constant pool, but the way the passes work one
> function at a time makes this hard. (LP:625233)
There are also passes working on the entire program, or a partition.
Isn't it more a question of how to group and process functions that are
candidates for sharing a constant pool with a neighbor?
Are there algorithms for this kind of pool sharing in the academic
or ARM-specific literature?
Other suggestions for the discussion:
* Better use of conditional execution.
- No idea how much this really helps for ARM, but there are bug
reports about missed opportunities from time to time, so...
- How to model conditional execution before register allocation?
- How to exploit opportunities better in GCC (ifcvt is inadequate
and too late in the pipeline).
- Also look at LLVM here; it appears to have a better cost model
for if-conversion than GCC (taking into account a target-dependent
branch misprediction penalty, for example).
* Basic block re-ordering for speed/size.
- The existing basic block reordering pass in GCC implements only
a reordering strategy for speed.
- The pass does not run at all for functions optimized for size.
* Comparing ARM cost models and param settings to x86_64
- Compare, for some set of functions/benchmarks, the results of
estimate_num_insns, estimate_operator_cost, and
estimate_move_cost, between ARM and x86_64. Rationalize or fix
any significant differences. See whether heuristics based on
these functions require tuning for ARM.
- Go through params.def and see if there are further ARM tuning
opportunities. There are more than 100 DEFPARAMs and many of
them guide heuristics but have only been tuned on x86_64.
(There is set_default_param_value, but most backends do not
change the defaults.)
Hoping this is helpful,
Ciao!
Steven
David G. requested a few packages be installed on silverbell today (the
quad-A9 VE porter machine we host in the datacenter). We got dchroots
instead:
----- Forwarded message from LaMont Jones via RT <rt(a)admin.canonical.com> -----
Date: Fri, 26 Nov 2010 21:02:22 +0000
Subject: [rt.admin.canonical.com #42662] Simple package installs on silverbell
On Fri Nov 26 17:08:28 2010, kiko(a)canonical.com wrote:
> Hi there,
>
> Could we get installed on silverbell:
>
> build-essential
> debhelper
> fakeroot
>
> And could we get deb-src's added to sources.list and do an apt-get
> update to allow us to apt-get source certain packages? This is for
> simple compilation benchmarks. Thanks!
Dchroot environments have been created on silverbell for both maverick
and natty. Within the chroot, you can sudo apt-get install to install
packages. apt-get update/dist-upgrade and installs that cause package
removal will require a GSA to do them.
To build for maverick: dchroot -c maverick (and then build however you want...)
lamont
----- End forwarded message -----
--
Christian Robottom Reis | [+55] 16 9112 6430 | http://launchpad.net/~kiko
Linaro Engineering VP | [ +1] 612 216 4935 | http://async.com.br/~kiko
Hi,
* the ARM __sync_* glibc-ports patch was accepted upstream
* posted proposal for consolidating sync primitives but stdatomic seems to be
the future
* used my small gcc testsuite patch to verify the __sync_* support of
gcc-linaro
* created:
https://wiki.linaro.org/WorkingGroups/ToolChain/AtomicMemoryOperations
* looked into GOMP support on ARM (see the sketch after this list):
- #pragma omp atomic results in proper asm code (dmb, ldrex, strex, dmb)
- #pragma omp flush results in a DMB instruction
- #pragma omp barrier results in a call to GOMP_barrier (I'm not sure if
this is the desired behavior)
* started to look into #681138
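A minimal sketch of the atomic case above (hypothetical example; build
with gcc -fopenmp):

#include <stdio.h>

int counter;

int main(void)
{
#pragma omp atomic
    counter += 1;   /* on ARM: dmb; ldrex/strex retry loop; dmb */
    printf("%d\n", counter);
    return 0;
}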
Regards
Ken
Hand-crafted a simple strchr and compared it with libc's:
https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialStrchr
It's interesting that it's significantly faster than libc's on A9s, but
slower on A8s for large sizes. I've not really looked into why yet; my
implementation is just the absolute simplest Thumb-2 version
(essentially the C sketch below).
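For reference, the byte-at-a-time algorithm in C (a sketch; the
benchmarked version is hand-written Thumb-2, not this code):

#include <stddef.h>

char *simple_strchr(const char *s, int c)
{
    /* Handles the spec corner case of searching for '\0' by
       returning a pointer to the terminator. */
    while (*s != (char)c) {
        if (*s == '\0')
            return NULL;
        s++;
    }
    return (char *)s;
}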
Did some ltrace profiling to see what typical strchr and strlen sizes
were, and got a bit surprised at some of the typical behaviours: lots
of cases where strchr is used in loops to see if another string
contains any one of a set of characters, a few cases of strchr being
called with NULL strings, and the corner case in the spec that allows
you to call strchr with '\0' as the character to search for.
Trying some other benchmarks (pybench spends very little time in libc;
package builds of simple packages seem to have a more interesting mix
of libc use).
Sorting out some of the red tape for contributing.
Dave
It's a bit of a newbie question, but I've been wondering if you can
intermix hard-float VFPv3-D16 code with VFPv3-D32 code. It turns out
you can. According to the AAPCS:
* d0-d7 (s0-s15) are used for floating-point parameters, no matter if
the callee is D16 or D32
* d0-d7 are not preserved across function calls; d8-d15 must be
preserved
* d16-d31 need not be preserved across function calls
The scenarios are:
A D32 function calls a D16 function:
* The floating-point parameters are passed in d0-d7
* Any remaining are passed on the stack
* The D16 function doesn't know about d16-d31, doesn't use them, and
hence trivially preserves them
A D16 function calls a D32 function:
* Parameters are passed as above
* The D32 function is free to clobber d16-d31, but the D16 caller
holds no live values there, so nothing is lost
A D32 function (A) calls a D16 function (B) which calls a D32 function (C):
* Parameters are OK, as above
* A cannot rely on d16-d31 surviving any call, so it saves whatever
it needs before calling B
* B doesn't use d16-d31; C may clobber them, which is fine from A's
point of view
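For concreteness, a hypothetical pair of translation units exercising
the first scenario:

/* d16.c: built with gcc -mfloat-abi=hard -mfpu=vfpv3-d16 -c d16.c */
double scale(double x, double y)
{
    return x * y;           /* x arrives in d0, y in d1 */
}

/* d32.c: built with gcc -mfloat-abi=hard -mfpu=neon -c d32.c
   (NEON implies the full D32 register file) */
extern double scale(double x, double y);

double run(double v)
{
    return scale(v, 2.0);   /* safe: the D16 callee never touches d16-d31 */
}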
-- Michael
(short week: only three days)
RAG:
Red:
Amber:
Green: qemu: initial pull req sent; vfp-in-sighandlers patchset sent
Milestones:
                             | Planned    | Estimate   | Actual     |
finish virtio-system         | 2010-08-27 | postponed  |            |
get valgrind into linaro PPA | 2010-09-15 | 2010-09-28 | 2010-09-28 |
complete a qemu-maemo update | 2010-09-24 | 2010-09-22 | 2010-09-22 |
finish testing PCI patches   | 2010-10-01 | 2010-10-22 | 2010-10-18 |
Progress:
* qemu: final polish on a patchset for saving/restoring VFP
and iWMMXT registers across linux-user mode signal handlers;
patch series sent to mailing list
* qemu: sent a pull request for a small set of ARM fixes
(make SMC undef; fix PXHxx; fix saturating add/sub; fix VCVT)
* reviewed arm semihosting SYS_GET_CMDLINE patch v2
* I now have enough qemu patches in flight that I'm tracking them
at https://wiki.linaro.org/PeterMaydell/QemuPatchStatus
(simple manual list for now, hopefully will be sufficient)
Meetings: toolchain, pdsw-tools
Plans
- qemu consolidation
Absences: (complete to end of 2010)
Thu/Fri 25-26 Nov; Fri 17 Dec - Tue 4 Jan inclusive.
(Dallas Linaro sprint 9-15 Jan.)
For the record, the thing I half-remembered on the call was:
http://gcc.gnu.org/ml/gcc-patches/2009-08/msg00697.html
and:
http://gcc.gnu.org/ml/gcc-patches/2009-09/msg02112.html
The problem is that all __sync operations besides __sync_lock_test_and_set
and __sync_lock_release are defined to be full barriers. Using something
like __sync_val_compare_and_swap for __arch_compare_and_exchange_val_*_acq
and __arch_compare_and_exchange_val_*_rel may on some architectures be too
heavyweight, since those macros only need acquire/after and release/before
barriers. See in particular:
http://gcc.gnu.org/ml/gcc-patches/2009-08/msg00928.html
from the first thread, where the feeling was that the future wasn't
these __sync builtins, but the new C and C++ atomic memory support.
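To illustrate the finer-grained ordering those provide, here is a
sketch using the C11 <stdatomic.h> names (still draft at this point,
so treat the exact spelling as an assumption):

#include <stdatomic.h>
#include <stdbool.h>

bool cas_acquire(atomic_int *mem, int old, int desired)
{
    /* Acquire ordering on success only, rather than the full barrier
       implied by __sync_val_compare_and_swap. */
    return atomic_compare_exchange_strong_explicit(
        mem, &old, desired,
        memory_order_acquire,    /* success ordering */
        memory_order_relaxed);   /* failure ordering */
}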
Probably already known, sorry. I just wasn't sure that trying to
convert everyone (not just ARM) to __sync_* was necessarily going
to go down well.
Richard
== Last Week ==
* Reached the point with understanding libunwind where I can begin
writing patches for parsing unwind information out of .ARM.exidx and
.ARM.extab ELF sections.
== This Week ==
* Begin writing support for ARM-specific unwind information to libunwind.
--
Zach Welch
CodeSourcery
zwelch(a)codesourcery.com
(650) 331-3385 x743
== Linaro GCC ==
* Continued looking at big-endian/quad-vector patch: attempted to
figure out the proper semantics for vec_extract in big endian mode
(about 1 day). Put on hold temporarily to work on lp675347, QT failing
to build due to constraint failure in inline asm statements used for
atomic operations: found the patch which introduced the failure, and
suggested a workaround to the OP. Came up with a plausible-looking
patch, and started testing it, after spending some time trying to
figure out why ARM Linux mainline doesn't build at present. Patch sent
upstream.
Hi Richard,
As per the discussion at this morning's call: I've reread the TRM and I
agree with you about LSLS being the same speed as TST (1 cycle).
However, as we agreed, uxtb does look like 2 cycles versus 1 cycle for
AND.
On the space-versus-performance theme, one thing that would be
interesting to know is whether there are any icache/issue-stage
limitations; i.e. if I have a stream of 32-bit Thumb-2 instructions
that are all listed as 1 cycle and are all in i-cache, can they be
fetched and issued fast enough, or is there a performance advantage to
short instructions?
Dave