Scalable memory operations with multiple cores
david.gilbert at linaro.org
Tue Dec 7 11:11:55 UTC 2010
On 7 December 2010 07:34, Robert Fekete <robert.fekete at linaro.org> wrote:
> Hi everyone who knows an awful lot about memory management and multicore operation,
> My name is Robert Fekete and I work in the ST-Ericsson Landing Team.
> I have a question regarding multicore SMP-aware memory operations in libc, and
> I hope you have an answer or can at least direct me to someone who might
> know better. I have already mailed Michael Hope, who recommended that I mail
> this list (of course).
> The issue I am experiencing is that when I am running a memset operation (on
> different, non-overlapping memory areas) on both cores simultaneously, they
> will affect each other and prolong the work of each process/CPU by a
> factor of ~3-4 (measured with Lauterbach),
> i.e. assume a memset op that takes around 1ms (some really big buffer in a
> for loop many times):
> Non-simultaneous memsets: (total time 2ms, each process execution time 1ms,
> x == running)
> core1 :xxxxoooo
> core2 :ooooxxxx
> Simultaneous memsets: (total time 3ms, each process 3ms)
> core1 :xxxxxxxxxxxx
> core2 :xxxxxxxxxxxx
> Well, there are some factors that can explain parts of it.
> There is a bandwidth limitation which will peak with one memset, meaning that
> two memsets cannot possibly go faster, and a proper value should be around
> 2ms (not 3-4ms). The L2$ is 512k, which may lead to severe L2$ thrashing since
> the scheduler is extremely fair and fine-grained between the cores. L1$ will
> also get hurt.
Since I already had a memset bandwidth test and access to a 4-core A9
vexpress, I thought I'd recreate it; the results are kind of fun. These are
loops writing 256k at a time with memset (which may be dangerously close to
the L2 size) in a tight loop, printing out a MB/s figure every few seconds:
1xmemset loop - ~1900MB/s
2xmemset loop - 2x1340MB/s -> 2680MB/s total
3xmemset loop - 2x450MB/s + 1x270 (!!!) -> 1170MB/s total
4xmemset loop - 3x160MB/s + 1x280 (!!!) -> 760MB/s total
This is using the libc memset. So I guess what we're seeing there is that one
CPU isn't managing to saturate the bus, whereas two between them are
saturating it and getting a fair share each.
What's happening at 3 and 4 CPUs seems similar to your case where things are
dropping badly; I really don't know why the behaviour is so asymmetric on
the 3 and 4 core case.
(When using a simple armv7 routine I wrote, the results were similar except
in the 2-core case, where I didn't seem to get the benefit; an interesting
indicator that how the code is written might leave it more or less
susceptible to some of these interactions.)
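The armv7 routine itself isn't included in the mail; as a hypothetical stand-in, a plain word-at-a-time fill in portable C illustrates the kind of simply written loop meant, as opposed to the heavily optimised libc memset:

```c
/* Hypothetical stand-in for the hand-written routine mentioned above:
 * a straightforward word-at-a-time fill, not the actual armv7 code. */
#include <stddef.h>
#include <stdint.h>

void *simple_memset(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    uint32_t word = 0x01010101u * (unsigned char)c;  /* replicate byte */

    /* Head: byte stores until 4-byte aligned. */
    while (n && ((uintptr_t)p & 3)) {
        *p++ = (unsigned char)c;
        n--;
    }

    /* Bulk: one 32-bit store per word. */
    uint32_t *w = (uint32_t *)p;
    while (n >= 4) {
        *w++ = word;
        n -= 4;
    }

    /* Tail: remaining bytes. */
    p = (unsigned char *)w;
    while (n--)
        *p++ = (unsigned char)c;

    return dst;
}
```

A real armv7 version would typically use store-multiple or NEON stores for the bulk loop; the store width and access pattern are exactly the sort of thing that could change how two cores interact on the bus.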
> But the real issue is that two totally different memops (big ones, though)
> in parallel will mess things up for each other. Not good if one of the processes
> on core 1 (for example) is a high-prio compositor and the second process on
> core 2 is a low-prio crap application. Then the low-prio crap app will
> severely disturb the compositor. OS priority does not propagate to bus
> access priority.
I don't think it's unusual to have this problem; if you are sharing a
limited resource, be it disk bandwidth, RAM bandwidth or any bus in between,
you're going to hit the same problem.
I also don't think it's too unusual to find the non-linearity where, when
lots of things fight over the same resource, the total drops - although I am
surprised it's that bad.
> Is there a way to make this behaviour a bit more deterministic... for the
> high-prio op at least, like a scalable memory operation libc variant?
I guess one thing to understand is where the bottlenecks are and whether
there are counters on each bus to see what is happening. I can see that a
system could be designed to prioritise memory ops from one core over
others; I don't know if any ARMs do that or not.
> I read an article about a similar issue on Intel CPUs, and the solution in
> this case is a scalable memory allocator lib.
> More academic reading on the phenomenon: