On Thu, 5 May 2011, David Gilbert wrote:
> Yes. While I haven't actually looked at coding CRC32 or the crypto routines, I agree that they seem to offer much more room for improvement; that is outside the scope of what I was asked to look at, however.
Well, you said that the current memcpy code in the kernel is quite good, which is nice not only because I wrote it :-) but also because it suggests that the Neon optimization efforts might have a bigger return on investment elsewhere.
The memcpy case is not interesting. Not at all. Most kernel memcpy calls are for small copies. The large copy instances are just bad and misdesigned in the first place if they rely on memcpy (maybe they should simply have a custom copy function, maybe implemented with Neon).
> Even outside the kernel, very large memcpys are fairly rare as far as I can tell: everyone knows they're going to hurt, so people try to avoid them. The other thing is that people have been optimising ARM memcpy for decades, and it appears to me to be hitting cache/bus bandwidth limits somewhere (although I don't have any figures for what those bandwidths are). There may be some scope for optimising the smaller memcpy cases (e.g. taking advantage of newer instructions such as cbz to cut a few instructions out). From my graphs, the slope up to the point at which the non-Neon code plateaus is quite gradual, which suggests it might be possible to optimise it a bit.
Indeed.
Nicolas