Hi,
* the ARM __sync_* glibc-ports patch was accepted upstream
* posted a proposal for consolidating sync primitives, but stdatomic seems to
be the future
* used my small gcc testsuite patch to verify __sync_* support in gcc-linaro
* created:
https://wiki.linaro.org/WorkingGroups/ToolChain/AtomicMemoryOperations
* looked into GOMP support on ARM:
- #pragma omp atomic results in proper asm code (dmb, ldrex, strex, dmb)
- #pragma omp flush results in a DMB instruction
- #pragma omp barrier results in a call to GOMP_barrier (I'm not sure if
this is the desired behavior)
* started to look into #681138
Regards
Ken
Hand-crafted a simple strchr and compared it with libc's:
https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialStrchr
Interestingly, it's significantly faster than libc's on A9s, but slower on
A8s for large sizes. I've not really looked into why yet; my
implementation is just the absolute simplest Thumb-2 version.
Did some ltrace profiling to see what typical strchr and strlen sizes were,
and got a bit surprised at some of the typical behaviours
(lots of cases where strchr is used in loops to see if another string
contains any one of a set of characters, a few cases
of strchr being called with null strings, and the corner case in the spec
that allows you to call strchr with '\0' as the character
to search for).
Trying some other benchmarks (pybench spends very little time in
libc; package builds of simple packages seem to have a more interesting
mix of libc use).
Sorting out some of the red tape for contributing.
Dave
It's a bit of a newbie question, but I've been wondering if you can
intermix hard-float VFPv3-D16 code with VFPv3-D32 code. It turns out
you can. According to the ABI:
* d0-d15 are used for floating-point parameters, no matter whether you
are D16 or D32
* d0-d15 are not preserved across function calls
* d16-d31 must be preserved across function calls
The scenarios are:
A D32 function calls a D16 function:
* The first 16 (!) parameters are passed in D0-D15
* Any remaining are passed on the stack
* The D16 function doesn't know about D16-D31, doesn't use them, and
hence preserves them
A D16 function calls a D32 function:
* The first 16 parameters are passed in D0-D15
* Any remaining are passed on the stack
* The D32 function preserves any of the D16-D31 registers that it
uses. Redundant, but fine.
A D32 function (A) calls a D16 function (B) which calls a D32 function (C):
* Parameters are OK, as above
* B doesn't use D16-D31 and hence preserves them
* C preserves any of the D16-D31 that it uses, which preserves them
from A's point of view
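A sketch of how the two variants might be built and linked together, assuming a hard-float cross toolchain (the file names are hypothetical; the -mfpu values are the standard GCC ones):

```
# hypothetical sources: d16.c built for VFPv3-D16, d32.c for full VFPv3
arm-linux-gnueabihf-gcc -c -mfloat-abi=hard -mfpu=vfpv3-d16 d16.c
arm-linux-gnueabihf-gcc -c -mfloat-abi=hard -mfpu=vfpv3     d32.c
# per the ABI reasoning above, the two objects link and interoperate
arm-linux-gnueabihf-gcc    -mfloat-abi=hard d16.o d32.o -o mixed
```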
-- Michael
(short week: only three days)
RAG:
Red:
Amber:
Green: qemu: initial pull req sent; vfp-in-sighandlers patchset sent
Milestones:
                               | Planned    | Estimate   | Actual     |
 finish virtio-system          | 2010-08-27 | postponed  |            |
 get valgrind into linaro PPA  | 2010-09-15 | 2010-09-28 | 2010-09-28 |
 complete a qemu-maemo update  | 2010-09-24 | 2010-09-22 | 2010-09-22 |
 finish testing PCI patches    | 2010-10-01 | 2010-10-22 | 2010-10-18 |
Progress:
* qemu: final polish on a patchset for saving/restoring VFP
and iWMMXT registers across linux-user mode signal handlers;
patch series sent to mailing list
* qemu: sent a pull request for a small set of ARM fixes
(make SMC undef; fix PXHxx; fix saturating add/sub; fix VCVT)
* reviewed arm semihosting SYS_GET_CMDLINE patch v2
* I now have enough qemu patches in flight that I'm tracking them
at https://wiki.linaro.org/PeterMaydell/QemuPatchStatus
(simple manual list for now, hopefully will be sufficient)
Meetings: toolchain, pdsw-tools
Plans
- qemu consolidation
Absences: (complete to end of 2010)
Thu/Fri 25-26 Nov; Fri 17 Dec - Tue 4 Jan inclusive.
(Dallas Linaro sprint 9-15 Jan.)
For the record, the thing I half-remembered on the call was:
http://gcc.gnu.org/ml/gcc-patches/2009-08/msg00697.html
and:
http://gcc.gnu.org/ml/gcc-patches/2009-09/msg02112.html
The problem is that all __sync operations besides __sync_lock_test_and_set
and __sync_lock_release are defined to be full barriers. Using something
like __sync_val_compare_and_swap for __arch_compare_and_exchange_val_*_acq
and __arch_compare_and_exchange_val_*_rel may on some architectures be too
heavyweight, since those macros only need acquire/after and release/before
barriers. See in particular:
http://gcc.gnu.org/ml/gcc-patches/2009-08/msg00928.html
from the first thread, where the feeling was that the future wasn't
these __sync builtins, but the new C and C++ atomic memory support.
Probably already known, sorry. I just wasn't sure that trying to
convert everyone (not just ARM) to __sync_* was necessarily going
to go down well.
Richard
== Last Week ==
* Reached the point with understanding libunwind where I can begin
writing patches for parsing unwind information out of .ARM.exidx and
.ARM.extab ELF sections.
== This Week ==
* Begin writing support for ARM-specific unwind information to libunwind.
--
Zach Welch
CodeSourcery
zwelch(a)codesourcery.com
(650) 331-3385 x743
== Linaro GCC ==
* Continued looking at big-endian/quad-vector patch: attempted to
figure out the proper semantics for vec_extract in big endian mode
(about 1 day). Put on hold temporarily to work on lp675347, Qt failing
to build due to a constraint failure in inline asm statements used for
atomic operations: found the patch which introduced the failure, and
suggested a workaround to the OP. Came up with a plausible-looking
patch, and started testing it, after spending some time trying to
figure out why ARM Linux mainline doesn't build at present. Patch sent
upstream.
Hi Richard,
As per the discussion at this morning's call, I've reread the TRM and I
agree with you that LSLS is the same speed as TST (1 cycle).
However, as we agreed, UXTB does look like 2 cycles vs. the AND's 1 cycle.
On the space vs. perf theme, one thing that would be interesting to know
is whether there are any icache/issue-stage limitations;
i.e. if I have a stream of 32-bit Thumb-2 instructions that are all listed
as 1 cycle and are all in i-cache, can they be fetched
and issued fast enough, or is there a performance advantage to short
instructions?
Dave
LP:663939 - Thumb2 constants
* Continued testing, found a few bugs. Tidied a few bits up.
* Wrote some new testcases to go with the patch.
LP:618684 - ICE
* Begun looking at this one. So far I can't reproduce it. I have a
debuggable native toolchain building, but it's been delayed by hardware
issues.
In the course of testing I discovered that the ARM FSF config wasn't
testing the right thing, so I've begun work on a new, more appropriate
FSF build/test config for Linaro work.
Also found that the SD card rootfs in my IGEPv2 board was corrupted. I've
restored it from backup, and now it's working once more.
== Linaro and upstream GCC ==
* Linaro launchpad issues:
- LP #672833, x86-64 varargs regression: after testing, pushed bzr branch
for merging.
- LP #634738, inefficient low bit extraction: some discussion with Yao.
- LP #618684, ICE when building ziproxy: looked into it and quickly found
it is no longer reproducible on Linaro 4.5 trunk.
* Worked on some GCC bugzilla PRs:
- PR44557, ICE in Thumb-1 secondary reload: this should be fixed by a
change of the scratch operand constraint of "reload_inhi" from "r" to
"l". Interesting to note that this was from the
merged-arm-thumb-backend-branch merge, from about 10 years ago.
- PR46508: libffi fails to build because of VFP asm instructions; it seems
to need a '.fpu vfp' directive. Probably missed earlier because my
toolchain was configured with --with-fpu=vfp.
- PR45416: 4.6 code generation regression on ARM, after expand from SSA
changes. Looking at this currently.
== This week ==
* Look at Linaro issues with higher priority.
* Continue working on GCC PRs.