Here's my summary from Monday's meeting on the harder parts of binary
toolchains.
Using a 4.6 compiler against a 4.5 based sysroot such as Natty:
* libgcc and libstdc++ are part of the compiler
* The compiler expects features that are in the corresponding runtime
* You can't reliably run or validate against an earlier runtime
The solution is to upgrade the runtime on the sysroot to 4.6. 4.6 is
backwards compatible. Ubuntu did this with Maverick and it caused no
problems, although problems such as Debian #622783 have been seen.
Multiarch:
* The Ubuntu multiarch patch should work with a sysroot
* Multiarch and multilib should work together
Multilib:
* The current ARM multilib rules are old and not very relevant
* Multilib means you need multiple sysroots as well
* Skip multilib for the first release
Other:
* Anything we support in cross we should support native first
* Check that we don't have to directly supply the source that goes
with the binary sysroot
-- Michael
==Progress===
* Out of office for a day.
* Wrote a quick patch to do vcvt.f32.s32 with fractional bits where we
can. Tested no regressions, need to commit this after review upstream.
* Desk move and packing for that.
* Looked at a bug report LP 836588 appears to go away with fno-gcse.
Needs more digging.
=== Plans ===
* Look at LP 836588.
* Finish auto-inc-dec pipeline scheduling work.
* Clear out some of the old patches (POST_MODIFY_DISP for vfp, BRANCH_COST )
* Settle into new desk.
* Again a short week as I'm away for 2 days for an internal training event.
Meetings:
* 1-1s
* TCWG calls
Absences.
* 5th October - out of office.
* 13th - 14th October - Out of Office - Internal training event.
* 31st Oct - 4th Nov - Linaro Summit Orlando - Travel booked -
* 08 Nov - 11 Nov - Tentatively booked
* Dec 19 - 31st Dec - Tentatively booked
== Last week ==
* Patch review.
* Backported second attempt to fix get_arm_condition_code ICE.
* Worked on -fsched-pressure. Experimented with various combinations
of ideas. This is giving some good results (e.g. a 2x improvement
in libav's put_h264_qpel8_hv_lowpass_8) but needs a bit more work
to fix some outliers.
I'll be away from 18th Oct to 14th Nov.
Richard
Random data for the day: Dave Pigott has installed some new PandaBoard
build machines in the validation lab. They're identical to mine
except that root is on USB Flash instead of NFS, and they have a much
faster flash drive for the build area.
The time taken to bootstrap and test gcc-linaro-2011.09 with C, C++,
Fortran, and LTO is:
* ursa3, ursa4 (Toshiba USB stick): 301 minutes build / 369 test
* ursa2 (no-name USB stick): 324 minutes build / 422 test
* tcpanda (fast USB stick): 274 minutes build / 265 test
So the new combo gives a 1.38 x faster build. I'm surprised as I
though the build was CPU bound. I'd hate to see what building on an
SD card is like.
Note that /tmp is in RAM, /scratch is ext4, the new boards use
noatime, and the kernel doesn't have the new USB performance fix.
-- Michael
Continue working on estimating register pressure with SMS:
- Discussed current approach with Richard which gave useful leads.
- Started to implement this approach.
- Doing experiments on libav microbench.
The unaligned accesses in libpng are, for the large copies, a bug. Our attempt to align the row buffer to a 16 byte boundary was off-by-one so we end up always mis-aligning it. I've posted a patch on the png-mng-implement list:
http://sourceforge.net/mailarchive/message.php?msg_id=28194444
The time spent in memcpy() is probably an illusion. The data out of zlib gets copied to one row buffer where it is unfiltered (if necessary) then a copy is made in a separate buffer that is only used for the filter handling. If you test using images with large rows (I don't know what pngbench does) the copy buffer may well get flushed out of the second level cache between each row, then the memcpy will stall bringing it back in.
If you have machine level profiling you may see this as a massive time spike on some probably unrelated instruction which just happens to be in the PC when the stall stops everything.
Anyway, I have several ideas of how to avoid the copy when it isn't required.
John Bowler <jbowler(a)acm.org>
-----Original Message-----
From: Glenn Randers-Pehrson [mailto:glennrp@gmail.com]
Sent: Monday, October 03, 2011 1:15 PM
To: PNG/MNG implementation discussion list
Subject: [png-mng-implement] Use of memcpy() in libpng [Fwd from linaro-toolchain list]
Re: Use of memcpy() in libpng
David Gilbert
Tue, 27 Sep 2011 06:20:14 -0700
On 27 September 2011 14:16, Christian Robottom Reis <k...(a)linaro.org> wrote:
> On Tue, Sep 27, 2011 at 09:47:33AM +0100, Ramana Radhakrishnan wrote:
>> On 26 September 2011 21:51, Michael Hope <michael.h...(a)linaro.org> wrote:
>> > Saw this on the linaro-multimedia list:
>> >
>> > http://lists.linaro.org/pipermail/linaro-multimedia/2011-September/
>> > 000074.html
>> >
>> > libpng spends a significant amount of time in memcpy(). This might
>> > tie in with Ramana's investigation or the unaligned access work by
>> > allowing more memcpy()s to be inlined.
>>
>> It's the unaligned access and the change / improvements to the memcpy
>> that *might* help in this case. But that ofcourse depends on the
>> compiler knowing when it can do such a thing. Ofcourse what might be
>> more interesting is the kind of workload analysis that Dave's done in
>> the past with memcpy to know what the alignment and size of the
>> buffer being copied is.
>
> If you guys could take a look at this there is a potential requirement
> for the MMWG around libpng optimization; we could fit this in along
> with other work (possible vectorizing, etc) on that component.
It wouldn't take long to analyse the memcpy calls - life would be easier if we had the test program and some details on things like what size of images were used in these benchmarks.
Dave
Continued work on my constant reuse optimizations. Not too much this
week though. I've now fixed some issues with the ARM size-costs code
that was causing it to wildly over-estimate the cost of a MOVT
instruction. I'll have to post this upstream sometime soon.
Took another look at the shift-amount bug. Discussed the issue with Paul
Brook. I've now fixed the original bug, and fixed the new bug introduced
by Paul's original fix, and committed that upstream. I still need to
backport it to Linaro GCC though, and the latent bug that Richard S
spotted is still being analysed.
Did a merge from FSF 4.5 & 4.6 to Linaro, and pushed them the Launchpad
branches for testing.
Begun work benchmarking different setups for the generic tuning patches.
I had a lot of trouble trying to set up SPEC2000 though. Hopefully these
issues are now resolved, with some help from Michael, and I have
established some baseline figures on both A8 and A9 to work from.
No progress on native tuning. I'm still waiting for upstream review.
In other news: Mentor's contract with Linaro has now been extended for
another 6 months. :)
== String Routines ==
* Built and tested a newlib with my memchr in - ready to go with a
bit of tidy up.
* Followed up on my eglibc patch submission by a comment suggesting
the use of --with-cpu pointing back at the previous discussion.
== 64 Bit atomics ==
* Updated gcc patch based on Ramana's comments, retested and posted
new version
- Lost half a day to a failing SD card in our panda.
== QEMU ==
* Posted a patch that made one variable thread local using __thread
that fixes multi threaded user mode ARM programs (e.g. firefox); this
seems to have mutated on the list into a patch for more general thread
local support.
Dave
== GDB ==
* Reimplemented patch to disable address space randomization
in gdbserver to respect the "set disable-randomization"
command, and checked it in to mainline and Linaro GDB 7.3
* Worked on support for cross-platform core file generation.
== GCC ==
* Checked in mainline fix for PR 50305.
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
--
Dr. Ulrich Weigand | Phone: +49-7031/16-3727
STSM, GNU compiler and toolchain for Linux on System z and Cell/B.E.
IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Martin Jetter | Geschäftsführung: Dirk
Wittkopp
Sitz der Gesellschaft: Böblingen | Registergericht: Amtsgericht
Stuttgart, HRB 243294
Hi,
- worked on the RTL part of the widen-shift patch
- backported to linaro 2/3 of the SLP patches, and proposed the third one
- worked on additional SLP improvements:
- swap operands to make statements isomorphic
- support load with offset 1 (after load from 0)
- started working on presentation for NEON forum
Upcoming holidays:
Oct 12, Wed - half day
Oct 13, Thu
Oct 16-19, Sun-Wed - half day
Oct 20, Thu
Ira
Hi,
So one of the things Michael pointed out in today's call was that the
ARM backend doesn't generate vcvt.f32.s<type> where you have an idiom
conversion from fixed to floating point as in the example below. I've
chosen to implement this in the following manner in the backend using
these interfaces from real.c . The reason I've chosen to not allow
this transformation in case flag_rounding_math is true is because this
instruction always ends up rounding using round-to-nearest rather than
obeying whats in the FPSCR and thus is not safe for programs that want
to dynamically set their rounding modes.
The benefits are quite obvious in that we eliminate a load from the
constant pool and a floating point multiply and thus essentially
shaving off a floating point multiply + Load latency off these
sequences. This instruction can only write the output into the same
register as the input register which is why I've modelled it as below
by tying op1 into op0.
If there's a simpler way of using the interfaces into real.c then I'm all ears ?
Thoughts ? I believe such idioms are used in libav from where the
original report appears to have come and thus it's a worthwhile gain
where we can have it. Any other places where folks might have noticed
this.
I will post upstream as well once I finish testing this patch. I'm
posting this here to get some feedback as well to let anyone who is
really really keen about trying this out have a go given I'm out
tomorrow.
( I took a quick look at the short -> f32 case as well but the fact
remains that loads either zero or sign extend anyway so there's
probably not much gain in modelling that right away and the win really
is in getting rid of that fp mul and the constant pool load. There's
probably some gain in going from i64-> f64 as well so those patterns
need to be written up at some point for completeness )
cheers
Ramana
2011-10-04 Ramana Radhakrishnan <ramana.radhakrishnan(a)linaro.org>
* config/arm/arm.c (vfp3_const_double_for_fract_bits): Define.
* config/arm/arm-protos.h (vfp3_const_double_for_fract_bits): Declare.
* config/arm/constraints.md ("Dt"): New constraint.
* config/arm/predicates.md (const_double_vcvt_power_of_two_reciprocal):
New.
* config/arm/vfp.md (*arm_combine_vcvt_f32_s32): New.
(*arm_combine_vcvt_f32_u32): New.
For the following testcases I see the code as follows with
-mfloat-abi=hard -mfpu=vfpv3 and -mcpu=cortex-a9
float foo (int i)
{
float v = (float)i / (1 << 11);
return v;
}
float foa_unsigned (unsigned int i)
{
float v = (float)i / (1 << 5);
return v;
}
After patch .
foo:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
fmsr s0, r0 @ int
vcvt.f32.s32 s0, s0, #11
bx lr
.size foo, .-foo
.align 2
.global foa_unsigned
.type foa_unsigned, %function
foa_unsigned:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
fmsr s0, r0 @ int
vcvt.f32.u32 s0, s0, #5
bx lr
.size foa_unsigned, .-foa_unsigned
.align 2
.global foo1
.type foo1, %function
rather than
.type foo, %function
foo:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
fmsr s15, r0 @ int
fsitos s0, s15
flds s15, .L2
fmuls s0, s0, s15
bx lr
.L3:
.align 2
.L2:
.word 973078528
.size foo, .-foo
.align 2
.global foa_unsigned
.type foa_unsigned, %function
foa_unsigned:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
fmsr s15, r0 @ int
fuitos s0, s15
flds s15, .L5
fmuls s0, s0, s15
bx lr
.L6:
.align 2
.L5:
.word 1023410176
* Vacation
Monday, Tuesday, and Wednesday.
* GCC
Continued work on my constant reuse optimizations. Disappointingly, I've
found that there are very few optimization opportunities in EEMBC
(ARM/Thumb V7-A), although it's not difficult to write testcases that
the optimization could improve. I also discovered that the data-flow
chains don't work exactly how I thought (with respect to if-then-else
cases) so I need to do a little more work.
Pinged the native tuning patches; they're still waiting for upstream review.
Still can't get the generic tuning work done as the CodeSourcery panda
boards appear to be still offline.
Committed to mainline the patch to support instructions with auto-inc
operations in SMS after addressing Ayal's comments. The patch contains
two parts; one of them fixes a bug revealed during bootstrapping with
the patch and SMS flags.
http://gcc.gnu.org/ml/gcc-patches/2011-09/msg01988.htmlhttp://gcc.gnu.org/ml/gcc-patches/2011-09/msg01987.html
Looking at estimating register pressure with SMS: based on previous
discussion with Richard the current approach is to try and use the
register pressure estimation in loop invariant pass.
I'm compiling an application built with TI's DVSDK 3 *[0].
/home/user/ti/dvsdk/dvsdk_3_01_00_10/linuxutils_2_25_02_08/packages/ti/sdo/linuxutils/cmem/lib/cmem.a470MV(cmem.o470MV):(.ARM.exidx+0x0):
undefined reference to `__aeabi_unwind_cpp_pr0'
arm-linux-gnueabi-gcc --version
arm-linux-gnueabi-gcc (Ubuntu/Linaro 4.5.2-5ubuntu2~ppa1) 4.5.2
arm-linux-gnueabi-ld --version
GNU ld (GNU Binutils for Ubuntu) 2.21.0.20110302
More full output is here (but it isn't particularly helpful due to TI's RTSC
make system's black-magic)
https://gist.github.com/925674
FYI: the MV in cmem.a470MV stands for MontaVista.
This name is hard-coded somewhere even though it's not being linked against
a MontaVista system.
I believe the 470 means that it should work with ARMv4 through ARMv7, but
I'm not positive.
My googling suggest that this is a toolchain bug and that the best way
around the issue is to create a file which defines the function as a void
dummy and include it.
http://www.codesourcery.com/archives/arm-gnu/msg03604.htmlhttp://comments.gmane.org/gmane.comp.boot-loaders.u-boot/78649http://www.cs.fsu.edu/~baker/devices/lxr/http/ident?i=__aeabi_unwind_cpp_pr0
I have a script that I'll post shortly with instructions as to how to setup
TI's DVSDK with Linaro
AJ ONeal
[0] I'm not using the latest DVSDK version 4 because the paths and such are
so hard-coded for the 2009q3 version of codesourcery on ubuntu 10.04 LTS
that I don't know where to start fixing it.
===Progress===
* Patch review week.
* Looked at bootstrap issue for a while but Richard Sandiford picked
it up and sorted it out (Thanks Richard).
* Fun and games with some paperwork.
* Some backporting and testing patches. (50099 and 50186) underway.
=== Plans ===
* Clear out some of the old patches (POST_MODIFY_DISP for vfp,
BRANCH_COST ) and finish on auto-inc-dec patch from last week.
* Away for 1 day next week.
Meetings:
* 1-1s
* TCWG calls
Absences.
* 5th October -Out of Office.
* 13th - 14th October - Internal training.
* 31st Oct - 4th Nov - Linaro Summit Orlando.
* 08 Nov - 11 Nov - Tentatively booked
* Dec 19 - 31st Dec - Tentatively booked
== This week ==
* Applied patch for doing NEON high/low extraction using subregs.
Ramana pointed out that we do the same thing for insertion,
so I wrote a patch to handle that too. Both now merged into
Linaro sources.
* Looked at ARM bootstrap problem on trunk. Turned out to be
an aliasing problem. Submitted and applied patch.
* Reworked part of my SMS register-scheduling patch after feedback
from Ayal. Submitted new version upstream.
* Got SPEC2006 running on the powerpc boxes and tested one part
of my -fsched-pressure patch. Bit of a mixed bag. h264ref was
one of the worst sufferers, which was a bit worrying. I think
I'll need to make a third change too.
To recap, there are two pieces now:
1) Make -fsched-pressure honour the DFA
2) Make -fsched-pressure allow values that are live across a
loop to be spilled.
I naively hoped that (1) would be OK on its own, but h264 shows
that the current -fsched-pressure code is very conservative
when it comes to large blocks. It only considers register
deaths once there is a single remaining use; if there are two
unscheduled uses, it assumes that the register remains live
for the rest of the block.
So the problem that (1) was fixing was that -fsched-pressure was too
optimistic in terms of what it could schedule in a cycle. But with
that fixed, we seem to have too many sources of pessimism...
Richard
== GDB ==
* Worked on support for cross-platform core file generation.
* Followed up on patch to support disabling address space
randomization in gdbserver.
== GCC ==
* Followed up on patch for PR 50305.
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
--
Dr. Ulrich Weigand | Phone: +49-7031/16-3727
STSM, GNU compiler and toolchain for Linux on System z and Cell/B.E.
IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Martin Jetter | Geschäftsführung: Dirk
Wittkopp
Sitz der Gesellschaft: Böblingen | Registergericht: Amtsgericht
Stuttgart, HRB 243294
* Working on croos-compiling Firefox. Getting dependencies in place and
setting up the configuration file (.mozconfig). I have had the strategy to
fix one dependency at a time, picking prebuilt packages or building my self.
Michael told me at yesterday's meeting about multistrap, that could possibly
be used for fixing all dependencies at once. I will look into that next
week.
* During this process I have also spent some time reading up on cross
compilation in general and also on autoconf and the GNU build system.
Best Regards
Åsa
== String routines ==
* Got eglibc testing setup happy at last
- Note that -O3 builds generally seem to give a few more errors
that are probably worth looking at
- -march=armv6 -mthumb hit some non-thumb1 instructions (normally
non-lo registers), again worth looking at
- Cross testing to Qemu user mode often stalls, mostly on nptl
tests that abort/fail when run in system/natively
* Sent new version of eglibc/memchr patch upstream
* Now have working newlib test setup and reference set
- next step is to try adding my memchr there
== Other ==
* Testing a QEmu patch with Peter
* Looking at bug 861296 (difference in mmap layouts)
* Adding a few suggestions to the set of cpu hotplug tests.
* Dealing with the Manchester lab cold.
Short week; back on Monday
Dave
== GDB ==
* Committed hardware watchpoint support for gdbserver to mainline,
including two minor changes resulting from review comments;
backported those fixes to Linaro GDB as well.
* Implemented and tested support for disabling address space
randomization in gdbserver; patch posted for review.
* Investigated support for cross-platform core file generation.
== GCC ==
* Patch review week.
* Posted updated patch for PR 50305.
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
--
Dr. Ulrich Weigand | Phone: +49-7031/16-3727
STSM, GNU compiler and toolchain for Linux on System z and Cell/B.E.
IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Martin Jetter | Geschäftsführung: Dirk
Wittkopp
Sitz der Gesellschaft: Böblingen | Registergericht: Amtsgericht
Stuttgart, HRB 243294
Arnd Bergmann <arnd(a)arndb.de> wrote on 08/26/2011 04:44:26 PM:
> On Thursday 25 August 2011, Russell King - ARM Linux wrote:
> >
> > Arnd, can you test this to make sure your gdb test case still works,
and
> > Mark, can you test this to make sure it fixes your problem please?
>
> Hi Russell,
>
> The patch in question was not actually from me but from Ulrich Weigand,
> so he's probably the right person to test your patch.
> I'm forwarding it in full to Uli for reference.
Hi Arnd, hi Russell,
sorry for the late reply, I've just returned from vacation today ...
I've not yet run the test, but just from reading through the patch
it seems that this will at least partially re-introduce the problem
my original patch was trying to fix.
The situation here is about GDB performing an "inferior function
call", e.g. via the GDB "call" command. To do so, GDB will:
0. [ Have gotten control of the target process via some ptrace
intercept previously, and then ... ]
1. Save the register state
2. Create a dummy frame on the stack and set up registers (PC, SP,
argument registers, ...) as appropriate for a function call
3. Restart via PTRACE_CONTINUE
[ ... at this point, the target process runs the function until
it returns to a breakpoint instruction and GDB gets control
again via another ptrace intercept ... ]
4. Restore the register state saved in [1.]
5. At some later point, continue the target process [at its
original location] with PTRACE_CONTINUE
The problem now occurs if at point [0.] the target process just
happened to be blocked in a restartable system call. For this
sequence to then work as expected, two things have to happen:
- at point [3.], the kernel must *not* attempt to restart a
system call, even though it thinks we're stopped in a
restartable system call
- at point [5.], the kernel now *must* restart the originally
interrupted system call, even though it thinks we're stopped
at some breakpoint, and not within a system call
My patch achieved both these goals, while it would seem your
patch only solves the first issue, not the second one. In
fact, since any interaction with ptrace will always cause the
TIF_SYS_RESTART flag to be *reset*, and there is no way at all
to *set* it, there doesn't appear to be any way for GDB to
achive that second goal.
[ With my patch, that second goal was implicitly achieved by
the fact that at [1.] GDB would save a register state that
already corresponds to the way things should be for restarting
the system call. When that register set is then restored in [4.],
restart just happens automatically without any further kernel
intervention. ]
One way to fix this might be to make the TIF_SYS_RESTART flag
itself visible to ptrace, so the GDB could save/restore it
along with the rest of the register set; this would be similar
to how that problem is handled on other platforms. However,
there doesn't appear to be an obvious place for the flag in
the ptrace register set ...
Bye,
Ulrich