On 7 December 2010 07:34, Robert Fekete <robert.fekete@linaro.org> wrote:
Hi everyone who knows an awful lot about memory management and multicore operation,

My name is Robert Fekete and I work in the ST-Ericsson Landing Team.

I have a question regarding multicore/SMP-aware memory operations in libc, and I hope you have an answer or can at least direct me to someone who might know better. I have already mailed Michael Hope, who recommended that I mail this list (of course).

The issue I am experiencing is that when I run a memset operation (on different, non-overlapping memory areas) on both cores simultaneously, they affect each other and prolong the work of each process/CPU by a factor of ~3-4 (measured with Lauterbach).

I.e. assume a memset op that takes around 1 ms (some really big buffer, memset in a for loop many times):

Non-simultaneous memsets (total time 2 ms, each process execution time 1 ms, x == running, o == idle):
core1 :xxxxoooo
core2 :ooooxxxx

Simultaneous memsets (total time 3 ms, each process 3 ms):
core1 :xxxxxxxxxxxx
core2 :xxxxxxxxxxxx

Well, there are some factors that can explain parts of it.
There is a bandwidth limitation that already peaks with one memset, meaning that two memsets cannot possibly go faster in total, so a proper value would be around 2 ms (not 3-4 ms). The L2$ is 512k, which may lead to severe L2$ thrashing, since the scheduler is extremely fair and fine-grained between the cores. The L1$ will also get hurt.
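(For concreteness, here is a minimal sketch of this kind of interference test: two threads pinned to separate cores, each memsetting its own private buffer while being timed. The buffer size, iteration count and core numbers are illustrative assumptions, not the original test.)

/* Minimal sketch of the interference test: two threads pinned to
 * different cores, each memsetting its own, non-overlapping buffer.
 * Buffer size, iteration count and core numbers are assumptions. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE   (4 * 1024 * 1024)   /* per-thread buffer, well above L2 */
#define ITERATIONS 100

struct task { int cpu; double seconds; };

static void *worker(void *arg)
{
    struct task *t = arg;

    /* Pin this thread to its own core so the two memsets really run
     * in parallel on separate CPUs. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(t->cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    char *buf = malloc(BUF_SIZE);

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++)
        memset(buf, i & 0xff, BUF_SIZE);
    clock_gettime(CLOCK_MONOTONIC, &end);

    t->seconds = (end.tv_sec - start.tv_sec)
               + (end.tv_nsec - start.tv_nsec) / 1e9;
    free(buf);
    return NULL;
}

int main(void)
{
    struct task a = { .cpu = 0 }, b = { .cpu = 1 };
    pthread_t ta, tb;

    pthread_create(&ta, NULL, worker, &a);
    pthread_create(&tb, NULL, worker, &b);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);

    printf("core %d: %.3f s, core %d: %.3f s\n",
           a.cpu, a.seconds, b.cpu, b.seconds);
    return 0;
}

Running the two workers one after the other instead of in parallel gives the "non-simultaneous" baseline above.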

Since I already had a memset bandwidth test and access to a 4-core A9 vexpress, I thought I'd recreate it; the results are kind of fun. These are loops writing 256k at a time with memset (which may be dangerously close to the L2 size) in a tight loop, printing out a MB/s figure every few seconds (a rough sketch of the loop follows the numbers below):

  1xmemset loop - ~1900MB/s
  2xmemset loop - 2x1340MB/s -> 2680MB/s total
  3xmemset loop - 2x450MB/s + 1x270 (!!!) -> 1170MB/s total
  4xmemset loop - 3x160MB/s + 1x280 (!!!) -> 760MB/s total
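The loop is roughly the following (a sketch only; the reporting interval and exact accounting are guesses, not the original test code). One copy is started per core to get the 1x/2x/3x/4x figures.

/* Sketch of the bandwidth loop: memset 256 KB repeatedly and print a
 * MB/s figure every few seconds.  Run one instance per core. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define CHUNK        (256 * 1024)   /* 256 KB per memset, close to L2 size */
#define REPORT_EVERY 5.0            /* seconds between MB/s printouts */

int main(void)
{
    char *buf = malloc(CHUNK);
    long long bytes = 0;
    struct timespec t0, now;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (;;) {
        memset(buf, 0, CHUNK);
        bytes += CHUNK;

        clock_gettime(CLOCK_MONOTONIC, &now);
        double elapsed = (now.tv_sec - t0.tv_sec)
                       + (now.tv_nsec - t0.tv_nsec) / 1e9;
        if (elapsed >= REPORT_EVERY) {
            printf("%.1f MB/s\n", bytes / elapsed / (1024.0 * 1024.0));
            bytes = 0;
            t0 = now;
        }
    }
    return 0;   /* not reached */
}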

This is using libc memset. So I guess what we're seeing there is that 1 CPU isn't managing to saturate the bus, whereas two between them are saturating it and getting a fair share each.
What's happening at 3 and 4 CPUs seems similar to your case, where things drop badly; I really don't know why the behaviour is so asymmetric in the 3- and 4-core cases.

(When using a simple armv7 routine I wrote, the results were similar except in the 2-core case, where I didn't seem to get the benefit; that is an interesting indicator that how the code is written might leave it more or less susceptible to some of these interactions.)
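(To illustrate what "simple" means here: not the routine from that test, and in C rather than armv7 assembly, but something in the same spirit. A plain word-at-a-time store loop with none of the alignment or cache handling a tuned libc implementation has.)

/* Illustrative only: a naive word-at-a-time memset with no alignment,
 * prefetch or cache-line tuning.  Assumes dst is word-aligned and n is
 * a multiple of 4. */
#include <stddef.h>
#include <stdint.h>

void *memset_simple(void *dst, int c, size_t n)
{
    uint32_t v = (uint8_t)c;
    v |= v << 8;
    v |= v << 16;                 /* replicate the byte across a word */

    uint32_t *p = dst;
    for (size_t i = 0; i < n / 4; i++)
        p[i] = v;                 /* one 32-bit store per iteration */

    return dst;
}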

But the real issue is that two totally different memops (big ones, though) run in parallel will mess things up for each other. That is not good if one of the processes, on core 1 for example, is a high-prio compositor and the second process, on core 2, is a low-prio crap application: the low-prio crap app will severely mess things up for the compositor. OS priority does not propagate to bus access priority.


I don't think it's unusual to have this problem; if you are sharing a limited resource, be it disk bandwidth, RAM bandwidth or any bus in between, you're going to hit the same problem.
I also don't think it's too unusual to find the non-linearity where, when lots of things fight over the same resource, the total drops, although I am surprised it's that bad.

Is there a way to make this behaviour a bit more deterministic... for the high-prio op at least, like a scalable memory operation variant of libc?

I guess one thing to understand is where the bottlenecks are and whether there are counters on each bus to see what is happening. I can see that a system could be designed to prioritise memory ops from one core over others; I don't know whether any ARMs do that or not.
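(On the counter side, here is a minimal sketch of reading a hardware event around a memset burst via the Linux perf_event_open syscall. The generic cache-miss event is an assumption; whether it maps onto anything useful on this A9, and whether any interconnect-level counters are exposed at all, is platform-specific.)

/* Minimal sketch (untested on this platform): count a generic hardware
 * cache-miss event around a burst of memsets using perf_event_open.
 * Real bus/interconnect counters, if any, would need platform-specific
 * raw events instead of PERF_COUNT_HW_CACHE_MISSES. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    /* Count for this thread only, on whichever CPU it runs. */
    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    size_t len = 256 * 1024;
    char *buf = malloc(len);

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (int i = 0; i < 1000; i++)
        memset(buf, i & 0xff, len);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long misses = 0;
    read(fd, &misses, sizeof(misses));
    printf("cache misses during memset burst: %lld\n", misses);

    free(buf);
    close(fd);
    return 0;
}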



Dave