2011/5/6 Christian Robottom Reis email@example.com:
On Thu, May 05, 2011 at 04:08:01PM +0100, Måns Rullgård wrote:
Incidentally, this ties into the question sent earlier this week which had to do with Nico's work item in:
Which IIRC Nico says probably isn't worth it, right?
I thought dmart had done a lot of that?
I don't see the connection between Thumb2 and memcpy performance. Thumb2 can do anything 32-bit ARM can.
Well, the work item above is also about providing optimized memory routines coming out of the TCWG; if NEON isn't interesting, are any of the optimized Thumb2 versions the toolchain team worked on worth looking at?
I don't think there are many things that are vastly useful for the kernel, but here is a summary. (I intend to write a full report at some point but am still fighting SPEC for some benchmark stats and for some of the corner cases of these routines.)
Note also that these graphs were put together as I was working on the routines and aren't consistently from the same machine/libc etc. - when I write the report up I'll gather a full, consistent set.
On A9 the kernel's memset is pretty good - at around 64 bytes or so it beats everything else, and it stays at the top at larger sizes on the Exynos I tried it on (on a 400 MHz Vexpress there were some points where my own implementation beat it). On A8, interestingly, the NEON version I wrote is really much faster than the ARM versions - maybe this is actually worth a try even with the context-switching costs for page clearing?
Note that the kernel's memset takes a shortcut by not returning the correct result as per the C spec, which probably helps it in the short cases (and frankly I don't think anyone ever actually cares about the return value).
I've got a nice fast strlen that uses uadd8 - it's only really of benefit though if there are lots of longer strings.
I've got two strchr implementations; one is the absolutely simplest one I could write (but taking advantage of a modern cbz), and another also uses uadd8 to do a few bytes per iteration. The really simple one is at least as good as the libc and kernel ones for short (<50 byte) cases and no worse on A9 for longer cases; on A8 the more complex libc one wins out after about 16 characters. The uadd8 version is much faster on longer strchrs, but I think those are so rare it's not worth it.
The 'simple strchr' is now in Ubuntu Natty's eglibc.
memchr is my best performance win - but it's probably not that heavily used; again it uses uadd8 (you can tell I like that) and it's much faster on longer runs. Where the length parameter is small it falls back to a simple loop; its worst case is where you pass a large block of memory (and hence it uses the more complex loop) but the result is found in the first few bytes.
This is in Ubuntu Natty's eglibc.
As discussed previously, see the memcpy charts above - I've added a new 2x2 set at the bottom comparing aligned/misaligned (only by 1 byte), and also added a non-NEON memcpy I just wrote.
My non-NEON memcpy is similar to the kernel's/libc's - it's a bit less smart about copying n*32+1 bytes (which are the spiky bits you can see) but seems a little faster at the start and end of the ranges - nothing really to distinguish it. (It doesn't know about co-misalignment yet - as in both source and destination misaligned by 1 byte, which you can see in the lower-right graph.) It is, however, abysmal in the non-aligned case - hint: don't bother taking advantage of v7's non-aligned loads/stores.
For non-aligned copies NEON wins, with either mine or Bionic's NEON routines on top - mine seems to prefer a non-aligned source, Bionic's seems to prefer a non-aligned destination; and Bionic's really drops off when it runs out of cache.
(I have a run cooking at the moment with a much wider set of misalignments but it takes ages)