Hi,
I started to look into mixed vector sizes (in the same loop). My main reason
for this was to allow widening and narrowing instructions, which have
different vector sizes for src and dest, to work properly. My example was
widen_mult (int = short * short), I thought its implementation was not
optimal. But now that I have a working GCC mainline for ARM, I see that it
works just fine.
short ub[], uc[];
int c[];
for (i = 0; i < n; i++)
  c[i] = ub[i] * uc[i];
is compiled as:
.L11:
add r1, r1, #1
vldmia r4!, {d18-d19}
cmp r5, r1
vldmia ip!, {d16-d17}
vmull.s16 q10, d18, d16
vstr d20, [r3, #-32]
vstr d21, [r3, #-24]
vmull.s16 q8, d19, d17
vstr d16, [r3, #-16]
vstr d17, [r3, #-8]
add r3, r3, #32
bhi .L11
which looks good to me at least from the vmull point of view.
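For reference, a hand-written NEON intrinsics sketch (my own illustration, not
the vectorizer's actual output) of the same widening multiply could look like
this, assuming arm_neon.h and, for brevity, that n is a multiple of 4:

#include <stdint.h>
#include <arm_neon.h>

void
widen_mult (int32_t *c, const int16_t *ub, const int16_t *uc, int n)
{
  int i;
  for (i = 0; i < n; i += 4)
    {
      int16x4_t b = vld1_s16 (ub + i);   /* load 4 shorts (64 bits)     */
      int16x4_t a = vld1_s16 (uc + i);   /* load 4 shorts (64 bits)     */
      int32x4_t w = vmull_s16 (b, a);    /* widening multiply to 4 ints */
      vst1q_s32 (c + i, w);              /* store 4 ints (128 bits)     */
    }
}

Each vmull.s16 consumes a pair of 64-bit sources and produces a 128-bit
result, which matches the vmull/vstr pairs in the assembly above.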
Does anyone have an example where mixed vector size instructions are not used
properly?
Another reason for mixed sizes could be cases where only part of the loop
can be vectorized with the wider vectors. I don't know how common this is.
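For instance (a made-up sketch, not taken from real code), a loop that mixes
narrow and wide element types might only map part of its work onto the wider
vectors:

void
mixed_widths (unsigned char *a, unsigned char *b, int *c, int *d, int n)
{
  int i;
  for (i = 0; i < n; i++)
    {
      a[i] = a[i] + b[i];   /* 8-bit elements: 16 per 128-bit vector */
      c[i] = c[i] * d[i];   /* 32-bit elements: only 4 per vector    */
    }
}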
Are there any other reasons to implement mixed vector sizes? I understand
that this can be a useful feature, I am just not sure it's the most
important one.
Thanks,
Ira
I've been going through the ChangeLog for the release and am having
trouble justifying some of the changes brought in. In particular:
* -fstrict-volatile-bitfields, which is more appropriate for bare
metal/kernel code
* Cortex-M4 support
* C locale support in libstdc++-v3
The -march/-mcpu clean-up is OK but marginal.
Our focus is time-based performance on the Cortex-A series, with an
implied emphasis on applications over kernel/bare metal. This is a very narrow
view, but every non-performance line of code we bring in can also
bring in a bug.
Any thoughts? For those who are looking at using our toolchain, is
earlier access to other toolchain improvements interesting?
-- Michael
Hi all,
As you may or may not know, upstream GCC has now entered 'stage 3' of
its development cycle. This will last until spring.
This means that they are only accepting bug fixes and documentation
improvements. New features and any performance improvements must wait
until GCC 4.6 branches prior to release and GCC 4.7 development opens.
During this process, our usual preferred work flow (upstream first) will
not work, so we'll have to do something else.
Here's my proposal:
* Create a new Launchpad branch for GCC 4.6.
* Synchronize this branch with upstream regularly
* once per week, perhaps.
* Try to get upstream approval for all new patches in the usual way
* on the understanding that they won't be applied until stage 1
* bug fixes are unaffected and may be committed as usual.
* Commit all pending patches to our own 4.6 branch
* and backport them to our 4.5 branch, of course.
* Usual "no test regressions" policy applies to our own patches
* but beware regressions from merges from upstream.
* we may want to track the clean 4.6 test results for comparison
This is little different to what we do with the 4.5 release branch now.
Thoughts?
Andrew
The Linaro Toolchain Working Group is pleased to announce the latest
release of Linaro GCC 4.5.
Linaro GCC 4.5 is the fourth release in the 4.5 series. Based off the
latest GCC 4.5.1+svn164911, it includes many ARM-focused performance
improvements and bug fixes.
Interesting changes include:
* Various NEON related fixes
* Performance improvements
* A clean up of some of the testsuite test cases
* An updated version of the __sync multicore primitives
* Improvements in data packing when optimising for size
* C locale support in libstdc++-v3
This release adds the new option -fstrict-volatile-bitfields and
enables it by default on ARM. See doc/invoke.texi for more
information.
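As an illustration (my own minimal example, not from the release itself),
-fstrict-volatile-bitfields makes accesses to a volatile bit-field use the
width of its declared type, which is typically what memory-mapped peripheral
registers require:

/* The MMIO address below is hypothetical.  With -fstrict-volatile-bitfields
   (the ARM default in this release), the write to 'enable' is performed as a
   32-bit read-modify-write of the containing word, matching the declared
   'unsigned int' type, rather than a narrower byte or halfword access.  */
struct device_reg
{
  volatile unsigned int mode   : 4;
  volatile unsigned int enable : 1;
  volatile unsigned int        : 27;
};

#define DEV ((struct device_reg *) 0x40000000)

void
enable_device (void)
{
  DEV->enable = 1;
}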
The source tarball is available from:
https://launchpad.net/gcc-linaro/+milestone/4.5-2010.11-0
Downloads are available from the Linaro GCC page on Launchpad:
https://launchpad.net/gcc-linaro
Note that there were no changes to the 4.4 series.
-- Michael
The Linaro Toolchain Working Group is pleased to announce the release
of Linaro GDB 7.2.
Linaro GDB 7.2 2010.11-0 is the second release in the 7.2 series.
Based off the latest GDB 7.2, it includes a number of ARM-focused bug
fixes and enhancements.
This release concentrates on the GDB test suite and tidies up a number
of failures.
The source tarball is available at:
https://launchpad.net/gdb-linaro/+milestone/7.2-2010.11-0
More information on Linaro GDB is available at:
https://launchpad.net/gdb-linaro
-- Michael
Hi,
It looks like it's enough to implement targetm.vectorize.
autovectorize_vector_sizes for NEON in order to enable initial
auto-detection of vector size. With the attached patch and
-mvectorize-with-neon-quad flag, the vectorizer first tries to vectorize
for 128 bit, and if this fails, it tries to vectorize for 64 bit. For
example, in the attached testcase the number of iterations is too small for 128
bit (the first 2 iterations have to be peeled in order to align the array
accesses), but it is sufficient for 64 bit (the accesses are already aligned there).
I'd appreciate your comments on the patch, and I also have a few questions:
1. Why is the default vector size 64?
2. Where do NEON vectorization tests belong? I found NEON tests with
intrinsics in gcc.target/arm; is that the right place?
3. According to gcc.dg/vect/vect.exp, the only flag that is used for NEON
(in addition to the target-independent flags) is -ffast-math. Is that enough?
Thanks,
Ira
ChangeLog:
* config/arm/arm.c (arm_autovectorize_vector_sizes): New
function.
(TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES): Define.
Index: config/arm/arm.c
===================================================================
--- config/arm/arm.c (revision 166032)
+++ config/arm/arm.c (working copy)
@@ -246,6 +246,7 @@ static bool arm_builtin_support_vector_misalignmen
const_tree type,
int misalignment,
bool is_packed);
+static unsigned int arm_autovectorize_vector_sizes (void);
/* Table of machine attributes. */
@@ -391,6 +392,9 @@ static const struct default_options arm_option_opt
#define TARGET_VECTOR_MODE_SUPPORTED_P arm_vector_mode_supported_p
#undef TARGET_VECTORIZE_PREFERRED_SIMD_MODE
#define TARGET_VECTORIZE_PREFERRED_SIMD_MODE arm_preferred_simd_mode
+#undef TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES
+#define TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES \
+ arm_autovectorize_vector_sizes
#undef TARGET_MACHINE_DEPENDENT_REORG
#define TARGET_MACHINE_DEPENDENT_REORG arm_reorg
@@ -23223,6 +23227,12 @@ arm_expand_sync (enum machine_mode mode,
}
}
+static unsigned int
+arm_autovectorize_vector_sizes (void)
+{
+ return TARGET_NEON_VECTORIZE_QUAD ? 16 | 8 : 0;
+}
+
static bool
arm_vector_alignment_reachable (const_tree type, bool is_packed)
{
test:
#define N 5
unsigned int ub[N+2] = {1,1,6,39,12,18,14};
unsigned int uc[N+2] = {2,3,4,11,6,7,1};
void main1 ()
{
  int i;
  unsigned int udiff = 2;
  unsigned int umax = 10;

  for (i = 0; i < N; i++)
    {
      /* Summation. */
      udiff += (ub[i+2] - uc[i]);
      /* Maximum. */
      umax = umax < uc[i+2] ? uc[i+2] : umax;
    }
}
Hi there. Just a reminder that today's call is the first at the new
time of 0900 UTC, which is 9am in the UK, 10am in Germany, 11am in
Israel, and 5pm in China.
I've updated the meetings page at:
https://wiki.linaro.org/WorkingGroups/ToolChain/Meetings
with the new details.
-- Michael
Hi,
I am backporting some patches from FSF mainline which may improve the
speed of Linaro GCC 4.5 on Thumb-2.
The first one is by Richard E., "Improve optimization to transform
TST into LSLS":
http://gcc.gnu.org/ml/gcc-patches/2010-06/msg02518.html
After applying it to the Linaro 4.5 tree, the EEMBC speed numbers degrade,
while code size is reduced to some extent. The code difference looks like
this:
6801 ldr r1, [r0, #0]
f831 3013 ldrh.w r3, [r1, r3, lsl #1]
-f413 6f00 tst.w r3, #2048 ; 0x800
-f43f af41 beq.w cc <t_run_test+0xcc>
+0518 lsls r0, r3, #20
+f57f af44 bpl.w cc <t_run_test+0xcc>
4610 mov r0, r2
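For what it's worth, the two sequences are equivalent: lsls r0, r3, #20 moves
bit 11 (the 0x800 bit) into the sign flag, so bpl branches exactly when beq
would after tst.w r3, #2048. A guess at the kind of source shape involved (the
actual EEMBC code is not reproduced here) is a single-bit flag test such as:

unsigned short table[256];   /* hypothetical; the ldrh.w suggests a halfword table */

int
has_flag (unsigned int idx)
{
  /* The '& 0x800' test is what becomes either tst.w/beq.w or, with the
     backported patch, lsls #20 followed by bpl.w.  */
  return (table[idx] & 0x800) != 0;
}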
After reading the Cortex-A8 TRM, I can't find the exact cycle timing of LSLS.
With Chung-Lin's help, we suspect that LSLS should be slower than TST, but we
don't have any evidence to prove it. If anyone is familiar with the ARM
microarchitecture, help is welcome. If our assumption is correct, we could
change this patch into an optimization applied only when optimizing for size.
The second patch is Bernd's "Fix an if statement in arm_rtx_costs_1"
http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02096.html
After applying this patch, the EEMBC benchmark numbers are unchanged. Shall
we merge this patch into the Linaro 4.5 tree? I am inclined to merge it, but
if you have concerns about this patch, let us discuss them here.
--
Yao Qi
CodeSourcery
yao(a)codesourcery.com
(650) 331-3385 x739